Piping

Jeff Stevens

2025-02-17

Review

Data wrangling

Piping

Set-up

Rows: 336,776
Columns: 19
$ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
$ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
$ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
$ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
$ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
$ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
$ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
$ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
$ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
$ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…

Without pipes

myflights <- flights[c("year", "month", "day", "air_time", "distance", 
                       "hour", "minute")]
myflights$month <- ifelse(myflights$month < 10, 
                          paste0("0", myflights$month), myflights$month)
myflights$day <- ifelse(myflights$day < 10, 
                        paste0("0", myflights$day), myflights$day)
myflights$date <- paste(myflights$year, myflights$month, myflights$day, 
                        sep = "-")
myflights$speed <- myflights$distance / myflights$air_time * 60
myflights <- myflights[c("year", "month", "day", "date", "air_time", 
                         "distance", "speed", "hour", "minute")]

What do you like and dislike about this?

Pipelines

myflights2 <- flights |> 
  select(year:day, air_time, distance, hour, minute) |> 
  mutate(month = if_else(month < 10, paste0("0", month), as.character(month)),
         day = if_else(day < 10, paste0("0", day), as.character(day)),
         date = paste(year, month, day, sep = "-"),
         speed = distance / air_time * 60) |> 
  select(year:day, date, air_time, distance, speed, everything())

What do you like and dislike about this?

Pipelines

myflights3 <- flights |> 
  select(year:day, air_time, distance, hour, minute) |> 
  mutate(month = if_else(month < 10, paste0("0", month), as.character(month)),
         day = if_else(day < 10, paste0("0", day), as.character(day)),
         date = paste(year, month, day, sep = "-"),
         .after = day) |> 
  mutate(speed = distance / air_time * 60,
         .after = distance)

What do you like and dislike about this?
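One thing to dislike in all three versions is the hand-rolled zero-padding. As a minimal sketch of an alternative (not part of the original comparison), base R's sprintf() can pad and paste in one step. The toy data frame below is a hypothetical stand-in for flights: the first row copies values shown in the set-up output and the second row is made up to exercise a two-digit month.

```r
# Hypothetical two-row stand-in for the flights columns used above
toy <- data.frame(year = c(2013L, 2013L), month = c(1L, 11L), day = c(1L, 9L),
                  air_time = c(227, 160), distance = c(1400, 1089))

toy_dates <- toy |>
  # "%02d" pads an integer to two digits with a leading zero
  transform(date = sprintf("%d-%02d-%02d", year, month, day),
            # distance is in miles and air_time in minutes, so this is miles per hour
            speed = distance / air_time * 60)

toy_dates$date
```

The format string only works here because year, month, and day are integers; character columns would need converting first.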

Pipeline comparison

identical(myflights, myflights2)
[1] TRUE
identical(myflights, myflights3)
[1] TRUE
identical(myflights2, myflights3)
[1] TRUE

Character counts

Pipeline     Characters
myflights    566
myflights2   423
myflights3   406

Pipes

Base R pipe

|>

  • added in R 4.1.0; the _ placeholder followed in R 4.2.0
  • available in any R session without loading a package
  • works with most base R and tidyverse functions
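A small sketch of the base pipe on its own, assuming R 4.1.0 or later. It uses only base functions and no library() call, which is the point of the second bullet above; the numbers are arbitrary.

```r
# |> is part of the R language, so nothing needs to be loaded first
total <- c(1, 4, 9) |>
  sqrt() |>  # square roots of each element
  sum()      # add them up

total
```

Each step receives the previous result as its first argument, so this reads top to bottom as "take the vector, take square roots, sum".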

Tidyverse pipe %>%

  • from {magrittr} package
  • loaded with library(tidyverse) and re-exported by core packages such as {dplyr}
  • works with tidyverse verbs
  • Hadley Wickham recommends using the base R pipe |>, so we’ll use that here.

Piping basics

Start with the data object…

flights |> 
  select(year:dep_delay, origin) |> # include these columns
  select(!sched_dep_time) # exclude this column

Or pass the data object as the first argument…

select(flights, year:dep_delay, origin) |> # include these columns
  select(!sched_dep_time) # exclude this column

But don’t repeat the data object after the first pipe (the pipe already supplies it as the first argument)

select(flights, year:dep_delay, origin) |> # include these columns
  select(flights, !sched_dep_time) # exclude this column
Error in `select()`:
! Can't select columns with `flights`.
✖ `flights` must be numeric or character, not a <tbl_df/tbl/data.frame> object.
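The error above follows from how R rewrites a pipe before running it. The sketch below uses quote() to capture the rewritten call without evaluating it, so it runs without {dplyr} or the flights data; the object names good and bad are only for illustration.

```r
# quote() returns the parsed expression instead of running it
good <- quote(flights |> select(!sched_dep_time))
bad  <- quote(flights |> select(flights, !sched_dep_time))

deparse(good)  # the piped data becomes the first argument
deparse(bad)   # flights now appears twice, so select() treats the second one as a column selection
```

In the second call the extra flights lands in the column-selection position, which is why select() complains that it cannot select columns with a data frame.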

Piping basics

As with any assignment, saving the pipeline’s result to an object prints nothing to the console

myflights <- flights |> 
  select(year:dep_delay, origin) |>
  select(!sched_dep_time)

But omitting assignment does

flights |> 
  select(year:dep_delay, origin) |>
  select(!sched_dep_time)
# A tibble: 336,776 × 6
    year month   day dep_time dep_delay origin
   <int> <int> <int>    <int>     <dbl> <chr> 
 1  2013     1     1      517         2 EWR   
 2  2013     1     1      533         4 LGA   
 3  2013     1     1      542         2 JFK   
 4  2013     1     1      544        -1 JFK   
 5  2013     1     1      554        -6 LGA   
 6  2013     1     1      554        -4 EWR   
 7  2013     1     1      555        -5 EWR   
 8  2013     1     1      557        -3 LGA   
 9  2013     1     1      557        -3 JFK   
10  2013     1     1      558        -2 LGA   
# ℹ 336,766 more rows

Piping basics

As does wrapping the whole pipeline in parentheses

(myflights <- flights |> 
  select(year:dep_delay, origin) |>
  select(!sched_dep_time))
# A tibble: 336,776 × 6
    year month   day dep_time dep_delay origin
   <int> <int> <int>    <int>     <dbl> <chr> 
 1  2013     1     1      517         2 EWR   
 2  2013     1     1      533         4 LGA   
 3  2013     1     1      542         2 JFK   
 4  2013     1     1      544        -1 JFK   
 5  2013     1     1      554        -6 LGA   
 6  2013     1     1      554        -4 EWR   
 7  2013     1     1      555        -5 EWR   
 8  2013     1     1      557        -3 LGA   
 9  2013     1     1      557        -3 JFK   
10  2013     1     1      558        -2 LGA   
# ℹ 336,766 more rows
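Why the parentheses matter can be checked directly. This is a sketch using base R's withVisible(), which reports whether a value would be printed at the console; the vector and the names quiet and loud are arbitrary.

```r
# A plain assignment returns its value invisibly
quiet <- withVisible(y <- c(1, 2, 3) |> rev())

# Wrapping the same assignment in () makes the value visible again
loud <- withVisible((y <- c(1, 2, 3) |> rev()))

quiet$visible  # FALSE: nothing would be printed
loud$visible   # TRUE: the value would be printed
```

Both versions still create y; the parentheses only change whether the result is echoed.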

Piping function order

flights |> 
  select(month:day, contains("_time")) |> 
  mutate(across(contains("_time"), as.character)) |> 
  head(n = 2)
# A tibble: 2 × 7
  month   day dep_time sched_dep_time arr_time sched_arr_time air_time
  <int> <int> <chr>    <chr>          <chr>    <chr>          <chr>   
1     1     1 517      515            830      819            227     
2     1     1 533      529            850      830            227     

OR

flights |> 
  mutate(across(contains("_time"), as.character)) |> 
  select(month:day, contains("_time")) |> 
  head(n = 2)
# A tibble: 2 × 7
  month   day dep_time sched_dep_time arr_time sched_arr_time air_time
  <int> <int> <chr>    <chr>          <chr>    <chr>          <chr>   
1     1     1 517      515            830      819            227     
2     1     1 533      529            850      830            227     
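The two orders agree here because the column selection keeps every column that the transformation touches. The same idea can be sketched with base R functions on mtcars (an assumption made so the example runs without {dplyr}): subset() stands in for select() and transform() for mutate().

```r
# Select first, then transform
a <- mtcars |>
  subset(select = c(mpg, cyl)) |>
  transform(mpg = round(mpg)) |>
  head(n = 2)

# Transform first, then select
b <- mtcars |>
  transform(mpg = round(mpg)) |>
  subset(select = c(mpg, cyl)) |>
  head(n = 2)

identical(a, b)
```

Order starts to matter as soon as an earlier step removes a column that a later step needs: dropping mpg before transforming it would fail.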

What is a pipe doing?

flights |> 
  select(month:day, contains("_time")) |> 
  mutate(across(contains("_time"), as.character)) |> 
  head(n = 2)
# A tibble: 2 × 7
  month   day dep_time sched_dep_time arr_time sched_arr_time air_time
  <int> <int> <chr>    <chr>          <chr>    <chr>          <chr>   
1     1     1 517      515            830      819            227     
2     1     1 533      529            850      830            227     

is equivalent to

head(mutate(select(flights, month:day, contains("_time")), across(contains("_time"), as.character)), n = 2)
# A tibble: 2 × 7
  month   day dep_time sched_dep_time arr_time sched_arr_time air_time
  <int> <int> <chr>    <chr>          <chr>    <chr>          <chr>   
1     1     1 517      515            830      819            227     
2     1     1 533      529            850      830            227     
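The equivalence can also be checked in code. The sketch below uses base R functions on mtcars (an assumption, so that it runs without {dplyr}) and compares three ways of writing the same two steps: piped, nested, and with an intermediate object.

```r
# Piped: read top to bottom
piped <- mtcars |>
  subset(cyl == 4) |>
  head(n = 2)

# Nested: read inside out
nested <- head(subset(mtcars, cyl == 4), n = 2)

# Stepwise: one intermediate object per step
four_cyl <- subset(mtcars, cyl == 4)
stepwise <- head(four_cyl, n = 2)

identical(piped, nested)
identical(piped, stepwise)
```

All three produce the same data frame; the pipe only changes how the code reads, not what it computes.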

Advanced piping

  • Sometimes, non-tidyverse functions don’t take the data object as the first argument

  • This requires a “placeholder” signaling where the data object goes

  • The placeholder for the |> pipe is _

  • The placeholder for the %>% pipe is .

Advanced piping

Base R pipe

mtcars |> 
  select(mpg, cyl) |> 
  lm(mpg ~ cyl)
Error in as.data.frame.default(data): cannot coerce class '"formula"' to a data.frame

mtcars |> 
  select(mpg, cyl) |> 
  lm(mpg ~ cyl, data = _)

Call:
lm(formula = mpg ~ cyl, data = select(mtcars, mpg, cyl))

Coefficients:
(Intercept)          cyl  
     37.885       -2.876  

  • To use the placeholder, you must name the argument it fills (here, data = _)
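The same rule applies to any function whose data argument is not first. A minimal sketch with base R's gsub(pattern, replacement, x), where the data goes third; it assumes R 4.2.0 or later for the placeholder, and the names slashed and slashed_lambda are only for illustration.

```r
# Name the argument and pass _ to route the piped value to x
slashed <- "2025-02-17" |>
  gsub("-", "/", x = _)

# Alternative that avoids the placeholder: pipe into an anonymous function
slashed_lambda <- "2025-02-17" |>
  (\(d) gsub("-", "/", d))()

slashed
```

The anonymous-function form is wordier, but it also lets the piped value appear more than once or in an unnamed position, which the _ placeholder does not allow.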

Advanced piping

tidyverse pipe

mtcars %>% 
  select(mpg, cyl) %>% 
  lm(mpg ~ cyl, data = .)

Call:
lm(formula = mpg ~ cyl, data = .)

Coefficients:
(Intercept)          cyl  
     37.885       -2.876  

Rename with a vector

Base R

new_names <- letters[1:ncol(flights)]
flights2 <- flights
colnames(flights2) <- new_names
head(flights2)
# A tibble: 6 × 19
      a     b     c     d     e     f     g     h     i j         k l      m    
  <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr>  <chr>
1  2013     1     1   517   515     2   830   819    11 UA     1545 N14228 EWR  
2  2013     1     1   533   529     4   850   830    20 UA     1714 N24211 LGA  
3  2013     1     1   542   540     2   923   850    33 AA     1141 N619AA JFK  
4  2013     1     1   544   545    -1  1004  1022   -18 B6      725 N804JB JFK  
5  2013     1     1   554   600    -6   812   837   -25 DL      461 N668DN LGA  
6  2013     1     1   554   558    -4   740   728    12 UA     1696 N39463 EWR  
# ℹ 6 more variables: n <chr>, o <dbl>, p <dbl>, q <dbl>, r <dbl>, s <dttm>

Rename with a vector

Tidyverse

rename_with(flights, ~ new_names)
# A tibble: 336,776 × 19
       a     b     c     d     e     f     g     h     i j         k l     m    
   <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr>
 1  2013     1     1   517   515     2   830   819    11 UA     1545 N142… EWR  
 2  2013     1     1   533   529     4   850   830    20 UA     1714 N242… LGA  
 3  2013     1     1   542   540     2   923   850    33 AA     1141 N619… JFK  
 4  2013     1     1   544   545    -1  1004  1022   -18 B6      725 N804… JFK  
 5  2013     1     1   554   600    -6   812   837   -25 DL      461 N668… LGA  
 6  2013     1     1   554   558    -4   740   728    12 UA     1696 N394… EWR  
 7  2013     1     1   555   600    -5   913   854    19 B6      507 N516… EWR  
 8  2013     1     1   557   600    -3   709   723   -14 EV     5708 N829… LGA  
 9  2013     1     1   557   600    -3   838   846    -8 B6       79 N593… JFK  
10  2013     1     1   558   600    -2   753   745     8 AA      301 N3AL… LGA  
# ℹ 336,766 more rows
# ℹ 6 more variables: n <chr>, o <dbl>, p <dbl>, q <dbl>, r <dbl>, s <dttm>

Rename with a function

rename_with(flights2, toupper)
# A tibble: 336,776 × 19
       A     B     C     D     E     F     G     H     I J         K L     M    
   <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr>
 1  2013     1     1   517   515     2   830   819    11 UA     1545 N142… EWR  
 2  2013     1     1   533   529     4   850   830    20 UA     1714 N242… LGA  
 3  2013     1     1   542   540     2   923   850    33 AA     1141 N619… JFK  
 4  2013     1     1   544   545    -1  1004  1022   -18 B6      725 N804… JFK  
 5  2013     1     1   554   600    -6   812   837   -25 DL      461 N668… LGA  
 6  2013     1     1   554   558    -4   740   728    12 UA     1696 N394… EWR  
 7  2013     1     1   555   600    -5   913   854    19 B6      507 N516… EWR  
 8  2013     1     1   557   600    -3   709   723   -14 EV     5708 N829… LGA  
 9  2013     1     1   557   600    -3   838   846    -8 B6       79 N593… JFK  
10  2013     1     1   558   600    -2   753   745     8 AA      301 N3AL… LGA  
# ℹ 336,766 more rows
# ℹ 6 more variables: N <chr>, O <dbl>, P <dbl>, Q <dbl>, R <dbl>, S <dttm>
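rename_with() can also restrict the function to some columns through its .cols argument. The sketch below assumes {dplyr} is installed and uses mtcars rather than flights so that it runs without {nycflights13}.

```r
library(dplyr)

renamed <- mtcars |>
  # apply toupper() only to columns whose names start with "c"
  rename_with(toupper, .cols = starts_with("c"))

names(renamed)
```

Only cyl and carb are renamed; every other column keeps its original name.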

Let’s code!

Piping