[SOLVED] R large data.table: why is extracting a word with regex faster than stringr::word()?


I have a large data.table with over 7 million rows and 38 columns. One of the columns is a character vector, which contains a long descriptive sentence. I know that the first word of each sentence is a category and the second word is a name, both of which I need to put in two new columns for later analysis.

This probably won’t illustrate the time differences too well as it is too small (actually system.time() on this example gives 0), but here is a toy character string to illustrate what I’m trying to do:

# Load libraries:
library(data.table)
library(stringr)
# Create an example character vector:
x <- c("spicy apple cream", "mild peach melba", "juicy strawberry tart")
id <- c(1,2,3)

# Create dt:
mydt <- data.table(id = id, desert = x)

Suppose, as in my real data, I want to extract the first word from each string and put it in a new variable called category, then the second word from each string and put it in a new variable called fruit_name.

The simplest way seems to be stringr::word(), which is appealing because it avoids the need to work out complicated regular expressions:

# Add a new category column:
mydt[, category := stringr::word(desert, 1)]

# Add a new fruit name column:
mydt[, fruit_name := stringr::word(desert, 2)]

While this works fine on small data sets, on my real data set it was taking forever (I suspect it was hanging; I killed it and restarted R after 10 minutes). For context, other character-vector operations on this data set take approximately 20 seconds, so something about this function seems especially labour intensive and computationally expensive.

In contrast, if I use a regular expression with sub() it doesn’t hang and seems to work at about the same speed as other character vector operations:

# Create category column with regex:
mydt[, category := sub("(^\\w+).*", "\\1", desert)]

# Create fruit name column with regex:
mydt[, fruit_name := sub("^\\w+\\s+(\\w+).*", "\\1", desert)]
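For what it's worth, a quick sanity check on the toy data confirms that the two approaches agree (both comparisons return TRUE):

```r
library(stringr)

x <- c("spicy apple cream", "mild peach melba", "juicy strawberry tart")

# Both extraction methods should give the same first and second words:
identical(word(x, 1), sub("(^\\w+).*", "\\1", x))
identical(word(x, 2), sub("^\\w+\\s+(\\w+).*", "\\1", x))
```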

Can anyone shed any light on the speed difference between these two approaches? Interestingly, even with this toy example, system.time() with stringr::word() hangs for a couple of seconds before returning a result, though that could just be because my real (large) data set is loaded in my environment.

Is stringr::word() somehow breaking the data.table convention of modifying by reference (creating a new column without copying the whole table)? I thought sub() would be worse for this, since it presumably copies each string and then replaces the part matching the regex pattern, yet it is in fact much faster.
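A quick check with data.table::address(), if I'm reading it right, suggests := isn't copying the table in either case (toy data again; name2 is just an arbitrary column name for this check):

```r
library(data.table)

dt <- data.table(desert = c("spicy apple cream", "mild peach melba"))
before <- address(dt)

dt[, category := sub("(^\\w+).*", "\\1", desert)]
dt[, name2 := stringr::word(desert, 2)]

# Same address: both columns were added by reference, no table copy
identical(before, address(dt))
```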

Any insights much appreciated!

> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] officer_0.4.1     flextable_0.6.9   data.table_1.14.2 lubridate_1.8.0  
 [5] forcats_0.5.1     stringr_1.4.0     dplyr_1.0.7       purrr_0.3.4      
 [9] readr_2.1.0       tidyr_1.1.4       tibble_3.1.6      ggplot2_3.3.5    
[13] tidyverse_1.3.1  

loaded via a namespace (and not attached):
 [1] xfun_0.28         tidyselect_1.1.1  haven_2.4.3       colorspace_2.0-2 
 [5] vctrs_0.3.8       generics_0.1.1    htmltools_0.5.2   base64enc_0.1-3  
 [9] utf8_1.2.2        rlang_0.4.12      pillar_1.6.4      glue_1.5.0       
[13] withr_2.4.2       DBI_1.1.1         gdtools_0.2.3     dbplyr_2.1.1     
[17] uuid_1.0-3        modelr_0.1.8      readxl_1.3.1      lifecycle_1.0.1  
[21] munsell_0.5.0     gtable_0.3.0      cellranger_1.1.0  zip_2.2.0        
[25] rvest_1.0.2       evaluate_0.14     knitr_1.36        fastmap_1.1.0    
[29] tzdb_0.2.0        fansi_0.5.0       broom_0.7.10      Rcpp_1.0.7       
[33] scales_1.1.1      backports_1.3.0   jsonlite_1.7.2    fs_1.5.0         
[37] systemfonts_1.0.3 digest_0.6.28     hms_1.1.1         stringi_1.7.5    
[41] grid_4.1.2        cli_3.1.0         tools_4.1.2       magrittr_2.0.1   
[45] crayon_1.4.2      pkgconfig_2.0.3   ellipsis_0.3.2    xml2_1.3.2       
[49] reprex_2.0.1      rmarkdown_2.11    assertthat_0.2.1  httr_1.4.2       
[53] rstudioapi_0.13   R6_2.5.1          compiler_4.1.2   


This isn't related to data.table.

sub relies on a single internal C call:

function (pattern, replacement, x, ignore.case = FALSE, perl = FALSE, 
    fixed = FALSE, useBytes = FALSE) 
{
    if (is.factor(x) && length(levels(x)) < length(x)) {
        sub(pattern, replacement, levels(x), ignore.case, perl, 
            fixed, useBytes)[x]
    }
    else {
        if (!is.character(x)) 
            x <- as.character(x)
        .Internal(sub(as.character(pattern), as.character(replacement), 
            x, ignore.case, perl, fixed, useBytes))
    }
}

whereas stringr::word relies on multiple lapply/vapply/mapply calls:

function (string, start = 1L, end = start, sep = fixed(" ")) 
{
    n <- max(length(string), length(start), length(end))
    string <- rep(string, length.out = n)
    start <- rep(start, length.out = n)
    end <- rep(end, length.out = n)
    breaks <- str_locate_all(string, sep)
    words <- lapply(breaks, invert_match)
    len <- vapply(words, nrow, integer(1))
    neg_start <- !is.na(start) & start < 0L
    start[neg_start] <- start[neg_start] + len[neg_start] + 1L
    neg_end <- !is.na(end) & end < 0L
    end[neg_end] <- end[neg_end] + len[neg_end] + 1L
    start[start > len] <- NA
    end[end > len] <- NA
    starts <- mapply(function(word, loc) word[loc, "start"], 
        words, start)
    ends <- mapply(function(word, loc) word[loc, "end"], 
        words, end)
    str_sub(string, starts, ends)
}
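The cost is visible in that source: str_locate_all materialises one start/end match matrix per input element, which word then post-processes with R-level lapply/vapply/mapply calls, whereas sub hands the whole vector to C in one pass. A small illustration of those intermediate objects:

```r
library(stringr)

x <- c("spicy apple cream", "mild peach melba")

# str_locate_all allocates a two-column matrix of match positions for
# every element of x; word then inverts and indexes each one in R:
breaks <- str_locate_all(x, fixed(" "))
length(breaks)  # one matrix per input string
```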

For a single string, there’s not much difference:

library(microbenchmark)
desert <- "spicy apple cream"
microbenchmark(
  stringr::word(desert, 1),
  sub("(^\\w+).*", "\\1", desert))

Unit: microseconds
                                expr  min    lq   mean median     uq   max neval
            stringr::word(desert, 1) 50.3 58.35 95.816  71.80 115.35 323.8   100
 sub("(^\\\\w+).*", "\\\\1", desert) 46.3 51.05 68.810  53.85  63.20 265.1   100

But if you replicate the string 10^6 times, sub is about 20 times faster:

desert <- rep("spicy apple cream", 10^6)
microbenchmark(
  stringr::word(desert, 1),
  sub("(^\\w+).*", "\\1", desert), times = 5)

Unit: milliseconds
                                 expr        min        lq       mean     median         uq        max neval
             stringr::word(desert, 1) 11605.1720 13724.731 14484.9069 14043.3454 16066.1067 16985.1798     5
  sub("(^\\\\w+).*", "\\\\1", desert)   696.2793   752.516   771.5857   797.5788   803.7969   807.7577     5
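For completeness, since the data is in a data.table anyway, another option worth trying (untimed here, so treat it as a sketch) is to split each string once with data.table's tstrsplit and assign both new columns in a single by-reference call:

```r
library(data.table)

mydt <- data.table(id = 1:3,
                   desert = c("spicy apple cream", "mild peach melba",
                              "juicy strawberry tart"))

# Split on a literal space once, keep only fields 1 and 2, and assign
# both columns by reference in one step:
mydt[, c("category", "fruit_name") := tstrsplit(desert, " ", fixed = TRUE, keep = 1:2)]
```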

Answered By – Waldi

Answer Checked By – Robin (BugsFixing Admin)
