Pivoting data

Jeff Stevens

2023-02-24

Review

How do I wrangle that?

  • return subset of rows based on position in data frame
  • return subset of rows based on column values
  • return subset of columns based on position in data frame
  • return subset of columns based on name
  • reorder rows by column values
  • reorder columns manually
  • create new columns
  • aggregate rows with summary functions

Mental model of data analysis

The problem

What’s different between these data sets?

What needs to happen to create data2 from data1?

data1
# A tibble: 4 × 4
     id cond1 cond2 cond3
  <int> <dbl> <dbl> <dbl>
1     1 0.935 0.896 0.422
2     2 0.599 0.493 0.539
3     3 0.933 0.798 0.724
4     4 0.337 0.890 0.705
data2
   id condition  response
1   1     cond1 0.9349457
2   1     cond2 0.8955915
3   1     cond3 0.4221161
4   2     cond1 0.5992728
5   2     cond2 0.4928767
6   2     cond3 0.5392186
7   3     cond1 0.9325012
8   3     cond2 0.7983170
9   3     cond3 0.7238427
10  4     cond1 0.3368520
11  4     cond2 0.8904755
12  4     cond3 0.7046291

Tidy data

Mental model of tidy data

Tidying data with {tidyr} and {dplyr}

What is tidy data?

  1. Each variable has its own column

  2. Each observation has its own row

  3. Each value has its own cell

Tidy data

Every variable is a column, every observation is a row, and every value is a cell

table1
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583
table2
# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

Tidy data

Every variable is a column, every observation is a row, and every value is a cell

table3
# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

Tidy data

Every variable is a column, every observation is a row, and every value is a cell

table4a
# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766
table4b
# A tibble: 3 × 3
  country         `1999`     `2000`
  <chr>            <dbl>      <dbl>
1 Afghanistan   19987071   20595360
2 Brazil       172006362  174504898
3 China       1272915272 1280428583

Tidy data

  • Think about tidy from a model perspective

  • Tidyverse assumes tidy data

  • Easier to analyze and plot tidy data

  • But sometimes easier to store non-tidy data

Pivoting data

Pivoting data

Pivoting data

Wide data

table4a
# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

Why is table4a not tidy?

Wide data

Use pivot_longer()

pivot_longer(table4a, cols = c(`1999`, `2000`), 
             names_to = "year", values_to = "cases")
# A tibble: 6 × 3
  country     year   cases
  <chr>       <chr>  <dbl>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766

Long data

table2
# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

Why is table2 not tidy?

Long data

Use pivot_wider()

pivot_wider(table2, id_cols = c("country", "year"), 
            names_from = type, values_from = count)
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Solving the problem

What code turns data1 into data2? And vice versa?

data1
# A tibble: 4 × 4
     id cond1 cond2 cond3
  <int> <dbl> <dbl> <dbl>
1     1 0.935 0.896 0.422
2     2 0.599 0.493 0.539
3     3 0.933 0.798 0.724
4     4 0.337 0.890 0.705
data2
   id condition  response
1   1     cond1 0.9349457
2   1     cond2 0.8955915
3   1     cond3 0.4221161
4   2     cond1 0.5992728
5   2     cond2 0.4928767
6   2     cond3 0.5392186
7   3     cond1 0.9325012
8   3     cond2 0.7983170
9   3     cond3 0.7238427
10  4     cond1 0.3368520
11  4     cond2 0.8904755
12  4     cond3 0.7046291

Let’s code!

Pivoting data [Rmd]