data1
# A tibble: 4 × 4
id cond1 cond2 cond3
<int> <dbl> <dbl> <dbl>
1 1 0.935 0.896 0.422
2 2 0.599 0.493 0.539
3 3 0.933 0.798 0.724
4 4 0.337 0.890 0.705
2023-02-24
What’s different between these data sets?
What needs to happen to create data2
from data1
?
data1
# A tibble: 4 × 4
id cond1 cond2 cond3
<int> <dbl> <dbl> <dbl>
1 1 0.935 0.896 0.422
2 2 0.599 0.493 0.539
3 3 0.933 0.798 0.724
4 4 0.337 0.890 0.705
data2
id condition response
1 1 cond1 0.9349457
2 1 cond2 0.8955915
3 1 cond3 0.4221161
4 2 cond1 0.5992728
5 2 cond2 0.4928767
6 2 cond3 0.5392186
7 3 cond1 0.9325012
8 3 cond2 0.7983170
9 3 cond3 0.7238427
10 4 cond1 0.3368520
11 4 cond2 0.8904755
12 4 cond3 0.7046291
Each variable has its own column
Each observation has its own row
Each value has its own cell
table1
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
table2
# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
table3
# A tibble: 6 × 3
country year rate
<chr> <dbl> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
table4a
# A tibble: 3 × 3
country `1999` `2000`
<chr> <dbl> <dbl>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
table4b
# A tibble: 3 × 3
country `1999` `2000`
<chr> <dbl> <dbl>
1 Afghanistan 19987071 20595360
2 Brazil 172006362 174504898
3 China 1272915272 1280428583
Think about tidy from a model perspective
Tidyverse assumes tidy data
Easier to analyze and plot tidy data
But sometimes easier to store non-tidy data
table4a
# A tibble: 3 × 3
country `1999` `2000`
<chr> <dbl> <dbl>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
table4a
not tidy?pivot_longer()
pivot_longer(table4a, cols = c(`1999`, `2000`),
names_to = "year", values_to = "cases")
# A tibble: 6 × 3
country year cases
<chr> <chr> <dbl>
1 Afghanistan 1999 745
2 Afghanistan 2000 2666
3 Brazil 1999 37737
4 Brazil 2000 80488
5 China 1999 212258
6 China 2000 213766
table2
# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
table2
not tidy?pivot_wider()
pivot_wider(table2, id_cols = c("country", "year"),
names_from = type, values_from = count)
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
What code turns data1
into data2
? And vice versa?
data1
# A tibble: 4 × 4
id cond1 cond2 cond3
<int> <dbl> <dbl> <dbl>
1 1 0.935 0.896 0.422
2 2 0.599 0.493 0.539
3 3 0.933 0.798 0.724
4 4 0.337 0.890 0.705
data2
id condition response
1 1 cond1 0.9349457
2 1 cond2 0.8955915
3 1 cond3 0.4221161
4 2 cond1 0.5992728
5 2 cond2 0.4928767
6 2 cond3 0.5392186
7 3 cond1 0.9325012
8 3 cond2 0.7983170
9 3 cond3 0.7238427
10 4 cond1 0.3368520
11 4 cond2 0.8904755
12 4 cond3 0.7046291