For these exercises, we’ll use the dog breed traits and dog breed popularity rankings data sets.
Load tidyverse, import dog_breed_traits_clean.csv
to traits
, import dog_breed_ranks.csv
to ranks
, and import dog_breed_ranks.csv
to popularity
.
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.0 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
traits <- read_csv (here:: here ("data/dog_breed_traits_clean.csv" ), show_col_types = FALSE )
ranks <- read_csv (here:: here ("data/dog_breed_ranks.csv" ), show_col_types = FALSE )
popularity <- read_csv (here:: here ("data/dog_breed_popularity.csv" ), show_col_types = FALSE )
First, set a random seed by using set.seed(2)
. Then create a subset of ranks
that is a random selection of 10% of the rows, sort by breed name, and assign to ranks2
.
set.seed (2 )
ranks2 <- slice_sample (ranks, prop = 0.1 ) |>
arrange (breed)
Use a filtering join to return the subset of traits
that matches the breeds in ranks2
and assign this to traits2
.
traits2 <- traits |>
semi_join (ranks2, by = "breed" )
Use a filtering join to return the subset of traits
that excludes the breeds in ranks2
.
traits |>
anti_join (ranks2, by = "breed" )
# A tibble: 178 × 8
breed affectionate children other_dogs shedding grooming coat_type
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 Retrievers (Lab… 5 5 5 4 2 Double
2 French Bulldogs 5 5 4 3 1 Smooth
3 German Shepherd… 5 5 3 4 2 Double
4 Retrievers (Gol… 5 5 5 4 2 Double
5 Bulldogs 4 3 3 3 3 Smooth
6 Poodles 5 5 3 1 4 Curly
7 Beagles 3 5 5 3 2 Smooth
8 Rottweilers 5 3 3 3 1 Smooth
9 Pointers (Germa… 5 5 4 3 2 Smooth
10 Dachshunds 5 3 4 2 2 Smooth
# ℹ 168 more rows
# ℹ 1 more variable: coat_length <chr>
Now we want to filter traits
based on breeds in popularity
. Notice that the breeds column in popularity
is called Breed . This is problematic because the breed column in traits
is called breed and names are case-sensitive. Use join_by()
to filter traits
by breeds in popularity
. How many rows are there?
traits |>
semi_join (popularity, by = join_by (breed == Breed))
# A tibble: 195 × 8
breed affectionate children other_dogs shedding grooming coat_type
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 Retrievers (Lab… 5 5 5 4 2 Double
2 French Bulldogs 5 5 4 3 1 Smooth
3 German Shepherd… 5 5 3 4 2 Double
4 Retrievers (Gol… 5 5 5 4 2 Double
5 Bulldogs 4 3 3 3 3 Smooth
6 Poodles 5 5 3 1 4 Curly
7 Beagles 3 5 5 3 2 Smooth
8 Rottweilers 5 3 3 3 1 Smooth
9 Pointers (Germa… 5 5 4 3 2 Smooth
10 Dachshunds 5 3 4 2 2 Smooth
# ℹ 185 more rows
# ℹ 1 more variable: coat_length <chr>
Use filter()
(not joins) to return the subset of traits
that excludes the breeds in ranks2
.
traits |>
filter (! breed %in% ranks2$ breed)
# A tibble: 178 × 8
breed affectionate children other_dogs shedding grooming coat_type
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 Retrievers (Lab… 5 5 5 4 2 Double
2 French Bulldogs 5 5 4 3 1 Smooth
3 German Shepherd… 5 5 3 4 2 Double
4 Retrievers (Gol… 5 5 5 4 2 Double
5 Bulldogs 4 3 3 3 3 Smooth
6 Poodles 5 5 3 1 4 Curly
7 Beagles 3 5 5 3 2 Smooth
8 Rottweilers 5 3 3 3 1 Smooth
9 Pointers (Germa… 5 5 4 3 2 Smooth
10 Dachshunds 5 3 4 2 2 Smooth
# ℹ 168 more rows
# ℹ 1 more variable: coat_length <chr>
Append traits2
to the bottom of itself.
bind_rows (traits2, traits2)
# A tibble: 38 × 8
breed affectionate children other_dogs shedding grooming coat_type
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 Yorkshire Terri… 5 5 3 1 5 Silky
2 Miniature Schna… 5 5 3 3 4 Wiry
3 Cane Corso 4 3 3 2 1 Smooth
4 Weimaraners 5 5 3 3 2 Smooth
5 Bullmastiffs 4 3 3 3 1 Smooth
6 Scottish Terrie… 5 3 2 2 3 Wiry
7 Chinese Shar-Pei 4 3 3 3 1 Smooth
8 Cardigan Welsh … 4 4 3 3 2 Double
9 Lhasa Apsos 5 3 3 1 3 Silky
10 Coton de Tulear 5 5 5 2 4 Double
# ℹ 28 more rows
# ℹ 1 more variable: coat_length <chr>
Append traits2
to the right of itself.
bind_cols (traits2, traits2)
New names:
• `breed` -> `breed...1`
• `affectionate` -> `affectionate...2`
• `children` -> `children...3`
• `other_dogs` -> `other_dogs...4`
• `shedding` -> `shedding...5`
• `grooming` -> `grooming...6`
• `coat_type` -> `coat_type...7`
• `coat_length` -> `coat_length...8`
• `breed` -> `breed...9`
• `affectionate` -> `affectionate...10`
• `children` -> `children...11`
• `other_dogs` -> `other_dogs...12`
• `shedding` -> `shedding...13`
• `grooming` -> `grooming...14`
• `coat_type` -> `coat_type...15`
• `coat_length` -> `coat_length...16`
# A tibble: 19 × 16
breed...1 affectionate...2 children...3 other_dogs...4 shedding...5
<chr> <dbl> <dbl> <dbl> <dbl>
1 Yorkshire Terriers 5 5 3 1
2 Miniature Schnauze… 5 5 3 3
3 Cane Corso 4 3 3 2
4 Weimaraners 5 5 3 3
5 Bullmastiffs 4 3 3 3
6 Scottish Terriers 5 3 2 2
7 Chinese Shar-Pei 4 3 3 3
8 Cardigan Welsh Cor… 4 4 3 3
9 Lhasa Apsos 5 3 3 1
10 Coton de Tulear 5 5 5 2
11 Anatolian Shepherd… 1 3 3 3
12 Basenjis 3 3 3 2
13 Miniature Bull Ter… 5 3 3 2
14 Setters (Gordon) 5 3 3 3
15 Spaniels (Clumber) 4 3 3 3
16 Pulik 5 3 3 1
17 Salukis 5 3 3 2
18 Barbets 4 5 5 1
19 Pharaoh Hounds 5 3 5 3
# ℹ 11 more variables: grooming...6 <dbl>, coat_type...7 <chr>,
# coat_length...8 <chr>, breed...9 <chr>, affectionate...10 <dbl>,
# children...11 <dbl>, other_dogs...12 <dbl>, shedding...13 <dbl>,
# grooming...14 <dbl>, coat_type...15 <chr>, coat_length...16 <chr>
Append traits2
to the right of ranks2
.
bind_cols (ranks2, traits2)
New names:
• `breed` -> `breed...1`
• `breed` -> `breed...10`
# A tibble: 19 × 17
breed...1 `2013 Rank` `2014 Rank` `2015 Rank` `2016 Rank` `2017 Rank`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Anatolian Shephe… 93 94 92 84 86
2 Barbets NA NA NA NA NA
3 Basenjis 85 86 87 88 84
4 Bullmastiffs 41 45 43 48 51
5 Cane Corso 50 48 35 40 37
6 Cardigan Welsh C… 75 78 76 69 68
7 Chinese Shar-Pei 54 58 59 61 64
8 Coton de Tulear NA 31 85 80 81
9 Lhasa Apsos 63 67 65 71 77
10 Miniature Bull T… 125 129 121 120 115
11 Miniature Schnau… 17 16 16 17 18
12 Pharaoh Hounds 160 164 171 168 174
13 Pulik 136 151 154 159 142
14 Salukis 115 134 132 125 123
15 Scottish Terriers 55 59 58 58 58
16 Setters (Gordon) 105 100 105 104 104
17 Spaniels (Clumbe… 131 143 134 144 140
18 Weimaraners 33 35 34 34 34
19 Yorkshire Terrie… 6 6 7 9 9
# ℹ 11 more variables: `2018 Rank` <dbl>, `2019 Rank` <dbl>, `2020 Rank` <dbl>,
# breed...10 <chr>, affectionate <dbl>, children <dbl>, other_dogs <dbl>,
# shedding <dbl>, grooming <dbl>, coat_type <chr>, coat_length <chr>
Why is this not a good idea? What would be a better way to achieve this?
left_join (ranks2, traits2)
Joining with `by = join_by(breed)`
# A tibble: 19 × 16
breed `2013 Rank` `2014 Rank` `2015 Rank` `2016 Rank` `2017 Rank` `2018 Rank`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Anat… 93 94 92 84 86 90
2 Barb… NA NA NA NA NA NA
3 Base… 85 86 87 88 84 87
4 Bull… 41 45 43 48 51 51
5 Cane… 50 48 35 40 37 32
6 Card… 75 78 76 69 68 68
7 Chin… 54 58 59 61 64 64
8 Coto… NA 31 85 80 81 81
9 Lhas… 63 67 65 71 77 71
10 Mini… 125 129 121 120 115 110
11 Mini… 17 16 16 17 18 19
12 Phar… 160 164 171 168 174 172
13 Pulik 136 151 154 159 142 160
14 Salu… 115 134 132 125 123 120
15 Scot… 55 59 58 58 58 57
16 Sett… 105 100 105 104 104 115
17 Span… 131 143 134 144 140 143
18 Weim… 33 35 34 34 34 36
19 York… 6 6 7 9 9 10
# ℹ 9 more variables: `2019 Rank` <dbl>, `2020 Rank` <dbl>, affectionate <dbl>,
# children <dbl>, other_dogs <dbl>, shedding <dbl>, grooming <dbl>,
# coat_type <chr>, coat_length <chr>