Merging rows

Author

Jeffrey R. Stevens

Published

March 3, 2023

For these exercises, we’ll use the dog breed traits and dog breed popularity rankings data sets.

Load tidyverse, import dog_breed_traits_clean.csv to traits, import dog_breed_ranks.csv to ranks, and import dog_breed_ranks.csv to popularity.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

traits <- read_csv(here::here("data/dog_breed_traits_clean.csv"), show_col_types = FALSE)
ranks <- read_csv(here::here("data/dog_breed_ranks.csv"), show_col_types = FALSE)
popularity <- read_csv(here::here("data/dog_breed_popularity.csv"), show_col_types = FALSE)

First, set a random seed by using set.seed(2). Then create a subset of ranks that is a random selection of 10% of the rows, sort by breed name, and assign to ranks2.

set.seed(2)
ranks2 <- slice_sample(ranks, prop = 0.1) |> 
  arrange(breed)

Use a filtering join to return the subset of traits that matches the breeds in ranks2 and assign this to traits2.

traits2 <- traits |> 
  semi_join(ranks2, by = "breed")

Use a filtering join to return the subset of traits that excludes the breeds in ranks2.

traits |> 
  anti_join(ranks2, by = "breed")

# A tibble: 178 × 8
   breed            affectionate children other_dogs shedding grooming coat_type
   <chr>                   <dbl>    <dbl>      <dbl>    <dbl>    <dbl> <chr>    
 1 Retrievers (Lab…            5        5          5        4        2 Double   
 2 French Bulldogs             5        5          4        3        1 Smooth   
 3 German Shepherd…            5        5          3        4        2 Double   
 4 Retrievers (Gol…            5        5          5        4        2 Double   
 5 Bulldogs                    4        3          3        3        3 Smooth   
 6 Poodles                     5        5          3        1        4 Curly    
 7 Beagles                     3        5          5        3        2 Smooth   
 8 Rottweilers                 5        3          3        3        1 Smooth   
 9 Pointers (Germa…            5        5          4        3        2 Smooth   
10 Dachshunds                  5        3          4        2        2 Smooth   
# ℹ 168 more rows
# ℹ 1 more variable: coat_length <chr>

Now we want to filter traits based on breeds in popularity. Notice that the breeds column in popularity is called Breed. This is problematic because the breed column in traits is called breed and names are case-sensitive. Use join_by() to filter traits by breeds in popularity. How many rows are there?

traits |> 
  semi_join(popularity, by = join_by(breed == Breed))

# A tibble: 195 × 8
   breed            affectionate children other_dogs shedding grooming coat_type
   <chr>                   <dbl>    <dbl>      <dbl>    <dbl>    <dbl> <chr>    
 1 Retrievers (Lab…            5        5          5        4        2 Double   
 2 French Bulldogs             5        5          4        3        1 Smooth   
 3 German Shepherd…            5        5          3        4        2 Double   
 4 Retrievers (Gol…            5        5          5        4        2 Double   
 5 Bulldogs                    4        3          3        3        3 Smooth   
 6 Poodles                     5        5          3        1        4 Curly    
 7 Beagles                     3        5          5        3        2 Smooth   
 8 Rottweilers                 5        3          3        3        1 Smooth   
 9 Pointers (Germa…            5        5          4        3        2 Smooth   
10 Dachshunds                  5        3          4        2        2 Smooth   
# ℹ 185 more rows
# ℹ 1 more variable: coat_length <chr>

Use filter() (not joins) to return the subset of traits that excludes the breeds in ranks2.

traits |> 
  filter(!breed %in% ranks2$breed)

# A tibble: 178 × 8
   breed            affectionate children other_dogs shedding grooming coat_type
   <chr>                   <dbl>    <dbl>      <dbl>    <dbl>    <dbl> <chr>    
 1 Retrievers (Lab…            5        5          5        4        2 Double   
 2 French Bulldogs             5        5          4        3        1 Smooth   
 3 German Shepherd…            5        5          3        4        2 Double   
 4 Retrievers (Gol…            5        5          5        4        2 Double   
 5 Bulldogs                    4        3          3        3        3 Smooth   
 6 Poodles                     5        5          3        1        4 Curly    
 7 Beagles                     3        5          5        3        2 Smooth   
 8 Rottweilers                 5        3          3        3        1 Smooth   
 9 Pointers (Germa…            5        5          4        3        2 Smooth   
10 Dachshunds                  5        3          4        2        2 Smooth   
# ℹ 168 more rows
# ℹ 1 more variable: coat_length <chr>

Append traits2 to the bottom of itself.

bind_rows(traits2, traits2)

# A tibble: 38 × 8
   breed            affectionate children other_dogs shedding grooming coat_type
   <chr>                   <dbl>    <dbl>      <dbl>    <dbl>    <dbl> <chr>    
 1 Yorkshire Terri…            5        5          3        1        5 Silky    
 2 Miniature Schna…            5        5          3        3        4 Wiry     
 3 Cane Corso                  4        3          3        2        1 Smooth   
 4 Weimaraners                 5        5          3        3        2 Smooth   
 5 Bullmastiffs                4        3          3        3        1 Smooth   
 6 Scottish Terrie…            5        3          2        2        3 Wiry     
 7 Chinese Shar-Pei            4        3          3        3        1 Smooth   
 8 Cardigan Welsh …            4        4          3        3        2 Double   
 9 Lhasa Apsos                 5        3          3        1        3 Silky    
10 Coton de Tulear             5        5          5        2        4 Double   
# ℹ 28 more rows
# ℹ 1 more variable: coat_length <chr>

Append traits2 to the right of itself.

bind_cols(traits2, traits2)

New names:
• `breed` -> `breed...1`
• `affectionate` -> `affectionate...2`
• `children` -> `children...3`
• `other_dogs` -> `other_dogs...4`
• `shedding` -> `shedding...5`
• `grooming` -> `grooming...6`
• `coat_type` -> `coat_type...7`
• `coat_length` -> `coat_length...8`
• `breed` -> `breed...9`
• `affectionate` -> `affectionate...10`
• `children` -> `children...11`
• `other_dogs` -> `other_dogs...12`
• `shedding` -> `shedding...13`
• `grooming` -> `grooming...14`
• `coat_type` -> `coat_type...15`
• `coat_length` -> `coat_length...16`

# A tibble: 19 × 16
   breed...1           affectionate...2 children...3 other_dogs...4 shedding...5
   <chr>                          <dbl>        <dbl>          <dbl>        <dbl>
 1 Yorkshire Terriers                 5            5              3            1
 2 Miniature Schnauze…                5            5              3            3
 3 Cane Corso                         4            3              3            2
 4 Weimaraners                        5            5              3            3
 5 Bullmastiffs                       4            3              3            3
 6 Scottish Terriers                  5            3              2            2
 7 Chinese Shar-Pei                   4            3              3            3
 8 Cardigan Welsh Cor…                4            4              3            3
 9 Lhasa Apsos                        5            3              3            1
10 Coton de Tulear                    5            5              5            2
11 Anatolian Shepherd…                1            3              3            3
12 Basenjis                           3            3              3            2
13 Miniature Bull Ter…                5            3              3            2
14 Setters (Gordon)                   5            3              3            3
15 Spaniels (Clumber)                 4            3              3            3
16 Pulik                              5            3              3            1
17 Salukis                            5            3              3            2
18 Barbets                            4            5              5            1
19 Pharaoh Hounds                     5            3              5            3
# ℹ 11 more variables: grooming...6 <dbl>, coat_type...7 <chr>,
#   coat_length...8 <chr>, breed...9 <chr>, affectionate...10 <dbl>,
#   children...11 <dbl>, other_dogs...12 <dbl>, shedding...13 <dbl>,
#   grooming...14 <dbl>, coat_type...15 <chr>, coat_length...16 <chr>

Append traits2 to the right of ranks2.

bind_cols(ranks2, traits2)

New names:
• `breed` -> `breed...1`
• `breed` -> `breed...10`

# A tibble: 19 × 17
   breed...1         `2013 Rank` `2014 Rank` `2015 Rank` `2016 Rank` `2017 Rank`
   <chr>                   <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
 1 Anatolian Shephe…          93          94          92          84          86
 2 Barbets                    NA          NA          NA          NA          NA
 3 Basenjis                   85          86          87          88          84
 4 Bullmastiffs               41          45          43          48          51
 5 Cane Corso                 50          48          35          40          37
 6 Cardigan Welsh C…          75          78          76          69          68
 7 Chinese Shar-Pei           54          58          59          61          64
 8 Coton de Tulear            NA          31          85          80          81
 9 Lhasa Apsos                63          67          65          71          77
10 Miniature Bull T…         125         129         121         120         115
11 Miniature Schnau…          17          16          16          17          18
12 Pharaoh Hounds            160         164         171         168         174
13 Pulik                     136         151         154         159         142
14 Salukis                   115         134         132         125         123
15 Scottish Terriers          55          59          58          58          58
16 Setters (Gordon)          105         100         105         104         104
17 Spaniels (Clumbe…         131         143         134         144         140
18 Weimaraners                33          35          34          34          34
19 Yorkshire Terrie…           6           6           7           9           9
# ℹ 11 more variables: `2018 Rank` <dbl>, `2019 Rank` <dbl>, `2020 Rank` <dbl>,
#   breed...10 <chr>, affectionate <dbl>, children <dbl>, other_dogs <dbl>,
#   shedding <dbl>, grooming <dbl>, coat_type <chr>, coat_length <chr>

Why is this not a good idea? What would be a better way to achieve this?

left_join(ranks2, traits2)

Joining with `by = join_by(breed)`

# A tibble: 19 × 16
   breed `2013 Rank` `2014 Rank` `2015 Rank` `2016 Rank` `2017 Rank` `2018 Rank`
   <chr>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
 1 Anat…          93          94          92          84          86          90
 2 Barb…          NA          NA          NA          NA          NA          NA
 3 Base…          85          86          87          88          84          87
 4 Bull…          41          45          43          48          51          51
 5 Cane…          50          48          35          40          37          32
 6 Card…          75          78          76          69          68          68
 7 Chin…          54          58          59          61          64          64
 8 Coto…          NA          31          85          80          81          81
 9 Lhas…          63          67          65          71          77          71
10 Mini…         125         129         121         120         115         110
11 Mini…          17          16          16          17          18          19
12 Phar…         160         164         171         168         174         172
13 Pulik         136         151         154         159         142         160
14 Salu…         115         134         132         125         123         120
15 Scot…          55          59          58          58          58          57
16 Sett…         105         100         105         104         104         115
17 Span…         131         143         134         144         140         143
18 Weim…          33          35          34          34          34          36
19 York…           6           6           7           9           9          10
# ℹ 9 more variables: `2019 Rank` <dbl>, `2020 Rank` <dbl>, affectionate <dbl>,
#   children <dbl>, other_dogs <dbl>, shedding <dbl>, grooming <dbl>,
#   coat_type <chr>, coat_length <chr>