Separating and uniting data

Author

Jeffrey R. Stevens

Published

February 27, 2023

For these exercises, we’ll use a new clean version of the dog breed traits data set.

Load tidyverse, import data from https://jeffreyrstevens.quarto.pub/dpavir/data/dog_breed_traits_clean.csv, and assign it to traits.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

traits <- read_csv("https://jeffreyrstevens.quarto.pub/dpavir/data/dog_breed_traits_clean.csv", show_col_types = FALSE)

Create traits2 which adds a coat column that combines coat_type and coat_length into single column delimited by “-”.

traits2 <- traits |> 
  unite("coat", contains("coat_"), sep = "-")

Split the coat column into type and length and keep the original coat column.

traits2 |> 
  separate(coat, into = c("type", "length"), remove = FALSE)

# A tibble: 197 × 9
   breed   affectionate children other_dogs shedding grooming coat  type  length
   <chr>          <dbl>    <dbl>      <dbl>    <dbl>    <dbl> <chr> <chr> <chr> 
 1 Retrie…            5        5          5        4        2 Doub… Doub… Short 
 2 French…            5        5          4        3        1 Smoo… Smoo… Short 
 3 German…            5        5          3        4        2 Doub… Doub… Medium
 4 Retrie…            5        5          5        4        2 Doub… Doub… Medium
 5 Bulldo…            4        3          3        3        3 Smoo… Smoo… Short 
 6 Poodles            5        5          3        1        4 Curl… Curly Long  
 7 Beagles            3        5          5        3        2 Smoo… Smoo… Short 
 8 Rottwe…            5        3          3        3        1 Smoo… Smoo… Short 
 9 Pointe…            5        5          4        3        2 Smoo… Smoo… Short 
10 Dachsh…            5        3          4        2        2 Smoo… Smoo… Short 
# ℹ 187 more rows

Create traits3 from traits that (1) removes the coat columns, (2) turns the ratings columns into long format, and (3) removes the children row for Bulldogs.

traits3 <- traits |> 
  select(-contains("coat_")) |> 
  pivot_longer(affectionate:grooming, names_to = "scale", values_to = "rating") |> 
  filter(breed != "Bulldogs" | scale != "children")

Create traits4 from traits3 that ensures a complete data set with all five ratings for all breeds (and fills in missing combinations with NA) and check for the missing Bulldog children row.

traits4 <- traits3 |> 
  complete(breed, scale)

How could we copy the rating from the previous row into the Bulldog children row to replace the NA? (Note this is not a good idea in this case!)

traits4 |> 
  fill(rating) |> 
  filter(breed == "Bulldogs")

# A tibble: 5 × 3
  breed    scale        rating
  <chr>    <chr>         <dbl>
1 Bulldogs affectionate      4
2 Bulldogs children          4
3 Bulldogs grooming          3
4 Bulldogs other_dogs        3
5 Bulldogs shedding          3

From traits, generate all combinations of coat type and length observed in the data, excluding NA.

traits |> 
  expand(nesting(coat_type, coat_length)) |> 
  drop_na()

# A tibble: 18 × 2
   coat_type coat_length
   <chr>     <chr>      
 1 Corded    Long       
 2 Corded    Medium     
 3 Curly     Long       
 4 Curly     Medium     
 5 Double    Long       
 6 Double    Medium     
 7 Double    Short      
 8 Hairless  Short      
 9 Rough     Long       
10 Rough     Medium     
11 Silky     Long       
12 Silky     Medium     
13 Smooth    Medium     
14 Smooth    Short      
15 Wavy      Long       
16 Wavy      Medium     
17 Wiry      Medium     
18 Wiry      Short

From traits, generate all possible combinations of coat type and length, excluding NA.

traits |> 
  expand(coat_type, coat_length) |> 
  drop_na()

# A tibble: 27 × 2
   coat_type coat_length
   <chr>     <chr>      
 1 Corded    Long       
 2 Corded    Medium     
 3 Corded    Short      
 4 Curly     Long       
 5 Curly     Medium     
 6 Curly     Short      
 7 Double    Long       
 8 Double    Medium     
 9 Double    Short      
10 Hairless  Long       
# ℹ 17 more rows