Pivoting data

Jeff Stevens

2025-02-24

Review

Name that function

  • return subset of rows based on position in data frame
  • return subset of rows based on column values
  • return subset of columns based on position in data frame
  • return subset of columns based on name
  • reorder rows by column values
  • reorder columns manually
  • create new columns
  • aggregate rows with summary functions

Mental model of data analysis

The problem

What’s different between these data sets?

What needs to happen to create data2 from data1?

data1
   id condition   response
1   1     cond1 0.20826607
2   1     cond2 0.98142701
3   1     cond3 0.08186109
4   2     cond1 0.96427203
5   2     cond2 0.07697982
6   2     cond3 0.67866334
7   3     cond1 0.22544114
8   3     cond2 0.43792412
9   3     cond3 0.62341431
10  4     cond1 0.18529764
11  4     cond2 0.07576630
12  4     cond3 0.94208550
data2
# A tibble: 4 × 4
     id cond1  cond2  cond3
  <int> <dbl>  <dbl>  <dbl>
1     1 0.208 0.981  0.0819
2     2 0.964 0.0770 0.679 
3     3 0.225 0.438  0.623 
4     4 0.185 0.0758 0.942 

Today we …

Tidy data

Mental model of tidy data

Tidying data with {tidyr} and {dplyr}

What is tidy data?

  1. Each variable has its own column

  2. Each observation has its own row

  3. Each value has its own cell

Tidy data

Every variable is a column, every observation is a row, and every value is a cell

Is table1 tidy? Why/why not?

table1
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Is table2 tidy? Why/why not?

table2
# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

Tidy data

Every variable is a column, every observation is a row, and every value is a cell

Is table3 tidy? Why/why not?

table3
# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

Tidy data

Every variable is a column, every observation is a row, and every value is a cell

Is table4a tidy? Why/why not?

table4a
# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

Is table4b tidy? Why/why not?

table4b
# A tibble: 3 × 3
  country         `1999`     `2000`
  <chr>            <dbl>      <dbl>
1 Afghanistan   19987071   20595360
2 Brazil       172006362  174504898
3 China       1272915272 1280428583

Tidy data

  • Think about tidy from a model perspective

  • Tidyverse assumes tidy data

  • Easier to analyze and plot tidy data

  • But sometimes easier to store non-tidy data

Pivoting data

Pivoting data

Pivoting data

Wide data

table4a
# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

Why is table4a not tidy?

Wide data

Use pivot_longer()

pivot_longer(table4a, cols = c(`1999`, `2000`), 
             names_to = "year", values_to = "cases")
# A tibble: 6 × 3
  country     year   cases
  <chr>       <chr>  <dbl>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766

Long data

table2
# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

Why is table2 not tidy?

Long data

Use pivot_wider()

pivot_wider(table2, id_cols = c("country", "year"), 
            names_from = type, values_from = count)
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Solving the problem

What code turns data1 into data2? And vice versa?

set.seed(20250224)
data1 <- data.frame(id = rep(1:4, each = 3), condition = paste0("cond", rep(1:3, times = 4)), response = runif(12)
)
data1
   id condition   response
1   1     cond1 0.20826607
2   1     cond2 0.98142701
3   1     cond3 0.08186109
4   2     cond1 0.96427203
5   2     cond2 0.07697982
6   2     cond3 0.67866334
7   3     cond1 0.22544114
8   3     cond2 0.43792412
9   3     cond3 0.62341431
10  4     cond1 0.18529764
11  4     cond2 0.07576630
12  4     cond3 0.94208550
data2
# A tibble: 4 × 4
     id cond1  cond2  cond3
  <int> <dbl>  <dbl>  <dbl>
1     1 0.208 0.981  0.0819
2     2 0.964 0.0770 0.679 
3     3 0.225 0.438  0.623 
4     4 0.185 0.0758 0.942 

Let’s code!

Pivoting data