Matching patterns

Jeff Stevens

2025-03-10

Introduction

The problem

What’s different between these data sets?

What is needed to create data2 from data1?

data1
# A tibble: 12 × 3
   time      species  resp 
   <chr>     <chr>    <chr>
 1 early-day dogfish  yes  
 2 mid-day   bear dog no   
 3 late-day  dog      yes  
 4 daytime   dogfish  no   
 5 early-day cat      yes  
 6 mid-day   cat      no   
 7 late-day  dogfish  no   
 8 daytime   bear dog no   
 9 early-day dogfish  <NA> 
10 mid-day   catfish  yes  
11 late-day  cat      yes  
12 daytime   bear dog yes  
data2
# A tibble: 8 × 3
  time      species  resp   
  <chr>     <chr>    <chr>  
1 early-Day dogfish  yes    
2 mid-Day   bear dog no     
3 late-Day  dog      yes    
4 daytime   dogfish  no     
5 late-Day  dogfish  no     
6 daytime   bear dog no     
7 early-Day dogfish  no data
8 daytime   bear dog yes    

Set-up

Mental model

Strings with {stringr}

Patterns

Regular expressions

Concise and powerful language for describing patterns within strings

(regex for short)

Regular expressions

Here’s the regex I used to detect IP addresses in excluder:

^(?:(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])(\.(?!$)|$)){4}$

Matching strings

View string patterns with stringr::str_view()

(x <- c("apple", "banana", "pear", NA))
[1] "apple"  "banana" "pear"   NA      
str_view(x, "a")
[1] │ <a>pple
[2] │ b<a>n<a>n<a>
[3] │ pe<a>r

Regex 101

. is wildcard

str_view(x, ".a.")
[2] │ <ban>ana
[3] │ p<ear>

Regex 101

^ to match the start of the string (like starts_with())

$ to match the end of the string (like ends_with())

str_view(x, "^a")
[1] │ <a>pple
str_view(x, "a$")
[2] │ banan<a>

Regex 101

| matches one pattern OR another (e.g., this|that)

str_view(x, "ap|an|ar")
[1] │ <ap>ple
[2] │ b<an><an>a
[3] │ pe<ar>

Wrap character groups in ()

str_view("Are you here or are you there?", "(A|a)re")  
[1] │ <Are> you here or <are> you there?

Regex 101

\d matches any digit

# view digits
str_view("March 10, 2025", "\\d")
[1] │ March <1><0>, <2><0><2><5>

Regex 101

[abc] matches individual characters (a, b, or c)

# view everything with ab or a<space>
str_view(c("abc", "a.c", "a*c", "a c"), "a[b ]")
[1] │ <ab>c
[4] │ <a >c

Regex 101

[^abc] matches individual characters except a, b, or c

# view everything except ab and a<space>
str_view(c("abc", "a.c", "a*c", "a c"), "a[^b ]")
[2] │ <a.>c
[3] │ <a*>c
# view everything except digits
str_view("March 10, 2020", "[^\\d]")  
[1] │ <M><a><r><c><h>< >10<,>< >2020

Detecting and extracting patterns

Detecting pattern matches

Detect matching elements with stringr::str_detect()

Returns logical vector of elements that match pattern

x
[1] "apple"  "banana" "pear"   NA      
str_detect(x, "e")  # grepl() in base R
[1]  TRUE FALSE  TRUE    NA
sum(str_detect(x, "e"), na.rm = TRUE)  # sum matching elements
[1] 2
mean(str_detect(x, "e"), na.rm = TRUE)  # calculate proportion of matches
[1] 0.6666667

Extracting pattern matches

Extract observations matching pattern with filter() and str_detect()

penguins |>
  filter(str_detect(sex, "male")) |>  # select observations that include "male"
  select(species, island, sex)
# A tibble: 333 × 3
   species island    sex   
   <fct>   <fct>     <fct> 
 1 Adelie  Torgersen male  
 2 Adelie  Torgersen female
 3 Adelie  Torgersen female
 4 Adelie  Torgersen female
 5 Adelie  Torgersen male  
 6 Adelie  Torgersen female
 7 Adelie  Torgersen male  
 8 Adelie  Torgersen female
 9 Adelie  Torgersen male  
10 Adelie  Torgersen male  
# ℹ 323 more rows

Extracting pattern matches

Extract elements that match a pattern with stringr::str_subset()

Returns elements that match pattern

head(words, n = 20)
 [1] "a"         "able"      "about"     "absolute"  "accept"    "account"  
 [7] "achieve"   "across"    "act"       "active"    "actual"    "add"      
[13] "address"   "admit"     "advertise" "affect"    "afford"    "after"    
[19] "afternoon" "again"    
str_subset(words, "^rec")  # select elements starting with "rec"
[1] "receive"   "recent"    "reckon"    "recognize" "recommend" "record"   
str_subset(words, "ing$")  # select elements ending with "ing"
[1] "bring"   "during"  "evening" "king"    "meaning" "morning" "ring"   
[8] "sing"    "thing"  

Replacing patterns

Replacing pattern matches

Replace matches with new strings with stringr::str_replace() and stringr::str_replace_all()

str_replace(x, "[aeiou]", "-")  # replace only first instance of match
[1] "-pple"  "b-nana" "p-ar"   NA      
str_replace_all(x, "[aeiou]", "-")  # replace all matches
[1] "-ppl-"  "b-n-n-" "p--r"   NA      
str_replace_all(x, "[^aeiou]", "-")  # replace all matches
[1] "a---e"  "-a-a-a" "-ea-"   NA      

How do we do this based on position instead of pattern?

Replacing pattern matches

You can use this to recode character variables, but…

set.seed(50)
penguins |>
  mutate(new_island = str_replace(island, "Torgersen", "Party")) |> 
  select(species, island, new_island) |> 
  slice_sample(n = 6)
# A tibble: 6 × 3
  species   island    new_island
  <fct>     <fct>     <chr>     
1 Adelie    Torgersen Party     
2 Chinstrap Dream     Dream     
3 Adelie    Dream     Dream     
4 Chinstrap Dream     Dream     
5 Gentoo    Biscoe    Biscoe    
6 Gentoo    Biscoe    Biscoe    

It coerces to character data types

I use this A LOT to clean up text data

Replacing NA

Replace NA with another value with dplyr::replace_na()

x
[1] "apple"  "banana" "pear"   NA      
replace_na(x, "Missing")
[1] "apple"   "banana"  "pear"    "Missing"

Splitting strings

Splitting strings

Split a string up into pieces with str_split()

head(sentences, n = 2)
[1] "The birch canoe slid on the smooth planks." 
[2] "Glue the sheet to the dark blue background."
sentences |>
  head(2) |>
  str_split(" ")
[[1]]
[1] "The"     "birch"   "canoe"   "slid"    "on"      "the"     "smooth" 
[8] "planks."

[[2]]
[1] "Glue"        "the"         "sheet"       "to"          "the"        
[6] "dark"        "blue"        "background."

Notice this produces a list. Why?

Splitting strings

Convert to matrix with simplify

sentences[c(1:2, 5)] |>
  str_split(" ", simplify = TRUE)
     [,1]   [,2]    [,3]    [,4]     [,5]  [,6]    [,7]     [,8]         
[1,] "The"  "birch" "canoe" "slid"   "on"  "the"   "smooth" "planks."    
[2,] "Glue" "the"   "sheet" "to"     "the" "dark"  "blue"   "background."
[3,] "Rice" "is"    "often" "served" "in"  "round" "bowls." ""           

Solving the problem

library(tidyverse)
nrows <- 12
set.seed(100)
data1 <- tibble(time = rep(c("early-day", "mid-day", "late-day", "daytime"), 
                           times = 3), 
                species = sample(c("dog", "dogfish", "bear dog", "cat", "catfish"), 
                                 nrows, replace = TRUE), 
                resp = sample(c("yes", "no", "yes", "no", NA), nrows, 
                              replace = TRUE))

Solving the problem

data1
# A tibble: 12 × 3
   time      species  resp 
   <chr>     <chr>    <chr>
 1 early-day dogfish  yes  
 2 mid-day   bear dog no   
 3 late-day  dog      yes  
 4 daytime   dogfish  no   
 5 early-day cat      yes  
 6 mid-day   cat      no   
 7 late-day  dogfish  no   
 8 daytime   bear dog no   
 9 early-day dogfish  <NA> 
10 mid-day   catfish  yes  
11 late-day  cat      yes  
12 daytime   bear dog yes  
data2
# A tibble: 8 × 3
  time      species  resp   
  <chr>     <chr>    <chr>  
1 early-Day dogfish  yes    
2 mid-Day   bear dog no     
3 late-Day  dog      yes    
4 daytime   dogfish  no     
5 late-Day  dogfish  no     
6 daytime   bear dog no     
7 early-Day dogfish  no data
8 daytime   bear dog yes    

Let’s code!

Matching patterns