Pivoting data

Jeff Stevens

2023-02-24

Review

How do I wrangle that?

return subset of rows based on position in data frame
return subset of rows based on column values
return subset of columns based on position in data frame
return subset of columns based on name
reorder rows by column values
reorder columns manually
create new columns
aggregate rows with summary functions

Mental model of data analysis

The problem

What’s different between these data sets?

What needs to happen to create data2 from data1?

data1

# A tibble: 4 × 4
     id cond1  cond2 cond3
  <int> <dbl>  <dbl> <dbl>
1     1 0.104 0.808  0.898
2     2 0.584 0.661  0.997
3     3 0.675 0.891  0.293
4     4 0.687 0.0970 0.436

data2

   id condition   response
1   1     cond1 0.10362418
2   1     cond2 0.80773459
3   1     cond3 0.89752165
4   2     cond1 0.58419946
5   2     cond2 0.66144323
6   2     cond3 0.99686881
7   3     cond1 0.67497439
8   3     cond2 0.89147710
9   3     cond3 0.29255574
10  4     cond1 0.68710397
11  4     cond2 0.09700144
12  4     cond3 0.43590274

Tidy data

Mental model of tidy data

Tidying data with {tidyr} and {dplyr}

library(tidyr)
library(dplyr)

What is tidy data?

Each variable has its own column
Each observation has its own row
Each value has its own cell

Tidy data

Every variable is a column, every observation is a row, and every value is a cell

table1

# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

table2

# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

Tidy data

Every variable is a column, every observation is a row, and every value is a cell

table3

# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

Tidy data

Every variable is a column, every observation is a row, and every value is a cell

table4a

# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

table4b

# A tibble: 3 × 3
  country         `1999`     `2000`
  <chr>            <dbl>      <dbl>
1 Afghanistan   19987071   20595360
2 Brazil       172006362  174504898
3 China       1272915272 1280428583

Tidy data

Think about tidy from a model perspective
Tidyverse assumes tidy data
Easier to analyze and plot tidy data
But sometimes easier to store non-tidy data

Pivoting data

Wide data

table4a

# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

Why is `table4a` not tidy?

Wide data

Use `pivot_longer()`

pivot_longer(table4a, cols = c(`1999`, `2000`), 
             names_to = "year", values_to = "cases")

# A tibble: 6 × 3
  country     year   cases
  <chr>       <chr>  <dbl>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766

Long data

table2

# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

Why is `table2` not tidy?

Long data

Use `pivot_wider()`

pivot_wider(table2, id_cols = c("country", "year"), 
            names_from = type, values_from = count)

# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Solving the problem

What code turns data1 into data2? And vice versa?

data1

# A tibble: 4 × 4
     id cond1  cond2 cond3
  <int> <dbl>  <dbl> <dbl>
1     1 0.104 0.808  0.898
2     2 0.584 0.661  0.997
3     3 0.675 0.891  0.293
4     4 0.687 0.0970 0.436

data2

   id condition   response
1   1     cond1 0.10362418
2   1     cond2 0.80773459
3   1     cond3 0.89752165
4   2     cond1 0.58419946
5   2     cond2 0.66144323
6   2     cond3 0.99686881
7   3     cond1 0.67497439
8   3     cond2 0.89147710
9   3     cond3 0.29255574
10  4     cond1 0.68710397
11  4     cond2 0.09700144
12  4     cond3 0.43590274

Let’s code!

Pivoting data [Rmd]

Pivoting data

Review

How do I wrangle that?

Mental model of data analysis

The problem

Tidy data

Mental model of tidy data

Tidying data with {tidyr} and {dplyr}

What is tidy data?

Tidy data

Every variable is a column, every observation is a row, and every value is a cell

Tidy data

Every variable is a column, every observation is a row, and every value is a cell

Tidy data

Every variable is a column, every observation is a row, and every value is a cell

Tidy data

Pivoting data

Pivoting data

Pivoting data

Wide data

Why is table4a not tidy?

Wide data

Use pivot_longer()

Long data

Why is table2 not tidy?

Long data

Use pivot_wider()

Solving the problem

Let’s code!

Why is `table4a` not tidy?

Use `pivot_longer()`

Why is `table2` not tidy?

Use `pivot_wider()`