Matching patterns

Jeff Stevens

2025-03-10

Introduction

The problem

What’s different between these data sets?

What is needed to create data2 from data1?

data1

# A tibble: 12 × 3
   time      species  resp 
   <chr>     <chr>    <chr>
 1 early-day dogfish  yes  
 2 mid-day   bear dog no   
 3 late-day  dog      yes  
 4 daytime   dogfish  no   
 5 early-day cat      yes  
 6 mid-day   cat      no   
 7 late-day  dogfish  no   
 8 daytime   bear dog no   
 9 early-day dogfish  <NA> 
10 mid-day   catfish  yes  
11 late-day  cat      yes  
12 daytime   bear dog yes

data2

# A tibble: 8 × 3
  time      species  resp   
  <chr>     <chr>    <chr>  
1 early-Day dogfish  yes    
2 mid-Day   bear dog no     
3 late-Day  dog      yes    
4 daytime   dogfish  no     
5 late-Day  dogfish  no     
6 daytime   bear dog no     
7 early-Day dogfish  no data
8 daytime   bear dog yes

Set-up

library(tidyverse)
library(palmerpenguins)

Mental model

Strings with {stringr}

library(stringr)

Patterns

Regular expressions

Concise and powerful language for describing patterns within strings

(regex for short)

Regular expressions

Here’s the regex I used to detect IP addresses in excluder:

^(?:(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])(\.(?!$)|$)){4}$
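A quick way to try this pattern, as a minimal sketch: the test strings and the names `ip_pattern` and `candidates` are made up for illustration and are not taken from excluder. Inside an R string the backslash has to be doubled, so `\.` is written `"\\."`.

```r
library(stringr)

# Same regex as above, with the backslash doubled for the R string
ip_pattern <- "^(?:(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])(\\.(?!$)|$)){4}$"

# Hypothetical inputs: one valid address, one octet out of range, one too short
candidates <- c("192.168.0.1", "256.1.1.1", "1.2.3")

is_ip <- str_detect(candidates, ip_pattern)
is_ip
# [1]  TRUE FALSE FALSE
```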

Matching strings

View string patterns with stringr::str_view()

(x <- c("apple", "banana", "pear", NA))

[1] "apple"  "banana" "pear"   NA

str_view(x, "a")

[1] │ <a>pple
[2] │ b<a>n<a>n<a>
[3] │ pe<a>r

Regex 101

. is wildcard

str_view(x, ".a.")

[2] │ <ban>ana
[3] │ p<ear>

Regex 101

^ to match the start of the string (like starts_with())

$ to match the end of the string (like ends_with())

str_view(x, "^a")

[1] │ <a>pple

str_view(x, "a$")

[2] │ banan<a>

Regex 101

| matches one pattern OR another (e.g., this|that)

str_view(x, "ap|an|ar")

[1] │ <ap>ple
[2] │ b<an><an>a
[3] │ pe<ar>

Wrap character groups in ()

str_view("Are you here or are you there?", "(A|a)re")

[1] │ <Are> you here or <are> you there?
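Why the parentheses matter, as a small sketch: without them, `|` splits the whole pattern, so `"A|are"` means "A" or "are" rather than "Are" or "are". The object names here are only for illustration.

```r
library(stringr)

s <- "Are you here or are you there?"

# Grouped: the alternation applies only to the first letter
grouped <- str_extract_all(s, "(A|a)re")[[1]]
grouped
# [1] "Are" "are"

# Ungrouped: the alternation applies to the whole pattern
ungrouped <- str_extract_all(s, "A|are")[[1]]
ungrouped
# [1] "A"   "are"
```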

Regex 101

\d matches any digit

# view digits
str_view("March 10, 2025", "\\d")

[1] │ March <1><0>, <2><0><2><5>

Regex 101

[abc] matches individual characters (a, b, or c)

# view a followed by b or a space
str_view(c("abc", "a.c", "a*c", "a c"), "a[b ]")

[1] │ <ab>c
[4] │ <a >c

Regex 101

[^abc] matches individual characters except a, b, or c

# view a followed by any character except b or a space
str_view(c("abc", "a.c", "a*c", "a c"), "a[^b ]")

[2] │ <a.>c
[3] │ <a*>c

# view everything except digits
str_view("March 10, 2020", "[^\\d]")

[1] │ <M><a><r><c><h>< >10<,>< >2020

Detecting and extracting patterns

Detecting pattern matches

Detect matching elements with stringr::str_detect()

Returns a logical vector indicating which elements match the pattern

[1] "apple"  "banana" "pear"   NA

str_detect(x, "e")  # grepl() in base R

[1]  TRUE FALSE  TRUE    NA

sum(str_detect(x, "e"), na.rm = TRUE)  # sum matching elements

[1] 2

mean(str_detect(x, "e"), na.rm = TRUE)  # calculate proportion of matches

[1] 0.6666667
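One caveat on the `grepl()` comparison, shown as a short sketch: the two functions treat `NA` differently, which changes sums and proportions. The object names are only for illustration.

```r
library(stringr)

x <- c("apple", "banana", "pear", NA)

# stringr keeps the missing value missing
detected <- str_detect(x, "e")
detected
# [1]  TRUE FALSE  TRUE    NA

# base R reports FALSE for the missing value
base_detected <- grepl("e", x)
base_detected
# [1]  TRUE FALSE  TRUE FALSE

# The proportion of matches therefore differs: 2 of 3 versus 2 of 4
mean(detected, na.rm = TRUE)
mean(base_detected)
```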

Extracting pattern matches

Extract observations matching pattern with filter() and str_detect()

penguins |>
  filter(str_detect(sex, "male")) |>  # select observations that include "male"
  select(species, island, sex)

# A tibble: 333 × 3
   species island    sex   
   <fct>   <fct>     <fct> 
 1 Adelie  Torgersen male  
 2 Adelie  Torgersen female
 3 Adelie  Torgersen female
 4 Adelie  Torgersen female
 5 Adelie  Torgersen male  
 6 Adelie  Torgersen female
 7 Adelie  Torgersen male  
 8 Adelie  Torgersen female
 9 Adelie  Torgersen male  
10 Adelie  Torgersen male  
# ℹ 323 more rows
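Note that the output above includes both sexes, because "female" contains "male". A minimal sketch of one way to match only "male", using a small stand-in vector instead of the penguins data: anchor the pattern to the start of the string. `filter()` also drops rows where the condition is `NA`, which is why only 333 rows are returned.

```r
library(stringr)

# Stand-in for the sex column (illustrative values only)
sex <- c("male", "female", NA)

unanchored <- str_detect(sex, "male")
unanchored
# [1] TRUE TRUE   NA

anchored <- str_detect(sex, "^male")
anchored
# [1]  TRUE FALSE    NA
```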

Extracting pattern matches

Extract elements that match a pattern with stringr::str_subset()

Returns elements that match pattern

head(words, n = 20)

 [1] "a"         "able"      "about"     "absolute"  "accept"    "account"  
 [7] "achieve"   "across"    "act"       "active"    "actual"    "add"      
[13] "address"   "admit"     "advertise" "affect"    "afford"    "after"    
[19] "afternoon" "again"

str_subset(words, "^rec")  # select elements starting with "rec"

[1] "receive"   "recent"    "reckon"    "recognize" "recommend" "record"

str_subset(words, "ing$")  # select elements ending with "ing"

[1] "bring"   "during"  "evening" "king"    "meaning" "morning" "ring"   
[8] "sing"    "thing"

Replacing patterns

Replacing pattern matches

Replace matches with new strings with stringr::str_replace() and stringr::str_replace_all()

str_replace(x, "[aeiou]", "-")  # replace only first instance of match

[1] "-pple"  "b-nana" "p-ar"   NA

str_replace_all(x, "[aeiou]", "-")  # replace all matches

[1] "-ppl-"  "b-n-n-" "p--r"   NA

str_replace_all(x, "[^aeiou]", "-")  # replace all non-vowels

[1] "a---e"  "-a-a-a" "-ea-"   NA

How do we do this based on position instead of pattern?
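One possible answer, as a sketch rather than the only approach: `stringr::str_sub()` extracts by position, and its assignment form replaces by position. The vector here omits the `NA` to keep the example small.

```r
library(stringr)

x <- c("apple", "banana", "pear")

# Extract the first character of each element
first_char <- str_sub(x, 1, 1)
first_char
# [1] "a" "b" "p"

# Replace the first character of each element, whatever it is
y <- x
str_sub(y, 1, 1) <- "-"
y
# [1] "-pple"  "-anana" "-ear"
```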

Replacing pattern matches

You can use this to recode character variables, but…

set.seed(50)
penguins |>
  mutate(new_island = str_replace(island, "Torgersen", "Party")) |> 
  select(species, island, new_island) |> 
  slice_sample(n = 6)

# A tibble: 6 × 3
  species   island    new_island
  <fct>     <fct>     <chr>     
1 Adelie    Torgersen Party     
2 Chinstrap Dream     Dream     
3 Adelie    Dream     Dream     
4 Chinstrap Dream     Dream     
5 Gentoo    Biscoe    Biscoe    
6 Gentoo    Biscoe    Biscoe

It coerces factors to character vectors

I use this A LOT to clean up text data
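A small sketch of the coercion, using a stand-in factor instead of the penguins data. If the column needs to stay a factor, `forcats::fct_recode()` is one alternative (it is loaded with the tidyverse); this is offered as an option, not as the approach used in these slides.

```r
library(stringr)
library(forcats)

# Stand-in for the island column (illustrative values only)
island <- factor(c("Torgersen", "Dream", "Biscoe"))

# str_replace() returns a character vector
replaced <- str_replace(island, "Torgersen", "Party")
is.factor(replaced)
# [1] FALSE

# fct_recode() renames the level and keeps the factor (new name = "old name")
recoded <- fct_recode(island, Party = "Torgersen")
is.factor(recoded)
# [1] TRUE
```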

Replacing `NA`

Replace NA with another value with tidyr::replace_na()

[1] "apple"  "banana" "pear"   NA

replace_na(x, "Missing")

[1] "apple"   "banana"  "pear"    "Missing"

Splitting strings

Split a string up into pieces with str_split()

head(sentences, n = 2)

[1] "The birch canoe slid on the smooth planks." 
[2] "Glue the sheet to the dark blue background."

sentences |>
  head(2) |>
  str_split(" ")

[[1]]
[1] "The"     "birch"   "canoe"   "slid"    "on"      "the"     "smooth" 
[8] "planks."

[[2]]
[1] "Glue"        "the"         "sheet"       "to"          "the"        
[6] "dark"        "blue"        "background."

Notice this produces a list. Why?
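A hint, as a minimal sketch with two hand-typed sentences: each string can split into a different number of pieces, and a list is the structure that can hold vectors of different lengths.

```r
library(stringr)

s <- c("The birch canoe slid on the smooth planks.",
       "Rice is often served in round bowls.")

pieces <- str_split(s, " ")

# Number of pieces per sentence
n_pieces <- lengths(pieces)
n_pieces
# [1] 8 7
```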

Splitting strings

Convert to matrix with simplify

sentences[c(1:2, 5)] |>
  str_split(" ", simplify = TRUE)

     [,1]   [,2]    [,3]    [,4]     [,5]  [,6]    [,7]     [,8]         
[1,] "The"  "birch" "canoe" "slid"   "on"  "the"   "smooth" "planks."    
[2,] "Glue" "the"   "sheet" "to"     "the" "dark"  "blue"   "background."
[3,] "Rice" "is"    "often" "served" "in"  "round" "bowls." ""

Solving the problem

library(tidyverse)
nrows <- 12
set.seed(100)
data1 <- tibble(time = rep(c("early-day", "mid-day", "late-day", "daytime"), 
                           times = 3), 
                species = sample(c("dog", "dogfish", "bear dog", "cat", "catfish"), 
                                 nrows, replace = TRUE), 
                resp = sample(c("yes", "no", "yes", "no", NA), nrows, 
                              replace = TRUE))

Solving the problem

data1

# A tibble: 12 × 3
   time      species  resp 
   <chr>     <chr>    <chr>
 1 early-day dogfish  yes  
 2 mid-day   bear dog no   
 3 late-day  dog      yes  
 4 daytime   dogfish  no   
 5 early-day cat      yes  
 6 mid-day   cat      no   
 7 late-day  dogfish  no   
 8 daytime   bear dog no   
 9 early-day dogfish  <NA> 
10 mid-day   catfish  yes  
11 late-day  cat      yes  
12 daytime   bear dog yes

data2

# A tibble: 8 × 3
  time      species  resp   
  <chr>     <chr>    <chr>  
1 early-Day dogfish  yes    
2 mid-Day   bear dog no     
3 late-Day  dog      yes    
4 daytime   dogfish  no     
5 late-Day  dogfish  no     
6 daytime   bear dog no     
7 early-Day dogfish  no data
8 daytime   bear dog yes
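One possible route from data1 to data2, sketched with the functions from this deck. It assumes the three differences are: only species containing "dog" are kept, "-day" becomes "-Day", and missing responses become "no data". Here data1 is typed in from the printed table rather than regenerated with `sample()`, and `data2_sketch` is a name chosen for illustration.

```r
library(tidyverse)

# data1 copied from the printed table above
data1 <- tibble(
  time = rep(c("early-day", "mid-day", "late-day", "daytime"), times = 3),
  species = c("dogfish", "bear dog", "dog", "dogfish", "cat", "cat",
              "dogfish", "bear dog", "dogfish", "catfish", "cat", "bear dog"),
  resp = c("yes", "no", "yes", "no", "yes", "no",
           "no", "no", NA, "yes", "yes", "yes")
)

data2_sketch <- data1 |>
  filter(str_detect(species, "dog")) |>           # keep dog, dogfish, bear dog
  mutate(
    time = str_replace(time, "-day", "-Day"),     # "daytime" has no hyphen, so it is unchanged
    resp = replace_na(resp, "no data")            # recode missing responses
  )

data2_sketch
```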

Let’s code!
