Issue
I have a large data.table with over 7 million rows and 38 columns. One of the columns is a character vector, which contains a long descriptive sentence. I know that the first word of each sentence is a category and the second word is a name, both of which I need to put in two new columns for later analysis.
This probably won’t illustrate the time differences too well as it is too small (actually system.time() on this example gives 0), but here is a toy character string to illustrate what I’m trying to do:
# Load libraries:
library(data.table)
library(stringr)
# Create example character string:
x <- c("spicy apple cream", "mild peach melba", "juicy strawberry tart")
id <- c(1,2,3)
# Create dt:
mydt <- data.table(id = id, desert = x)
As in my real data, suppose I want to extract the first word from each string and put it in a new variable called category, then extract the second word from each string and put it in a new variable called fruit_name.
The lexically simplest way seems to be to use stringr::word(), which is appealing because it avoids the need for working out complicated regular expressions:
# Add a new category column:
mydt[, category := stringr::word(desert, 1)]
# Add a new fruit name column:
mydt[, fruit_name := stringr::word(desert, 2)]
While this works fine on small data sets, on my real data set it was taking forever (and I suspect was hanging although I killed it and restarted R after 10 minutes). For context, other character vector type operations in this data set were taking approximately 20 seconds to run, so it seems there is something especially labour intensive and computationally consuming about this function.
In contrast, if I use a regular expression with sub(), it doesn’t hang and seems to work at about the same speed as other character vector operations:
# Create category column with regex:
mydt[, category := sub("(^\\w+).*", "\\1", desert)]
# Create fruit name column with regex:
mydt[, fruit_name := sub("^\\w+\\s+(\\w+).*", "\\1", desert)]
Can anyone shed any light on the speed differences between these two approaches? Interestingly, even with this toy example, running system.time() with stringr::word() hangs for a couple of seconds before giving the result, but that could just be because my real (large) data set is loaded in my environment.
Is stringr::word() somehow breaking the data.table convention of replacing by reference (creating a new column without copying the whole table)? Somehow I thought sub() would be worse for this as it is presumably copying the whole string and then replacing with the bit that matches the regex pattern, but it is in fact much faster.
Any insights much appreciated!
> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] officer_0.4.1 flextable_0.6.9 data.table_1.14.2 lubridate_1.8.0
[5] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4
[9] readr_2.1.0 tidyr_1.1.4 tibble_3.1.6 ggplot2_3.3.5
[13] tidyverse_1.3.1
loaded via a namespace (and not attached):
[1] xfun_0.28 tidyselect_1.1.1 haven_2.4.3 colorspace_2.0-2
[5] vctrs_0.3.8 generics_0.1.1 htmltools_0.5.2 base64enc_0.1-3
[9] utf8_1.2.2 rlang_0.4.12 pillar_1.6.4 glue_1.5.0
[13] withr_2.4.2 DBI_1.1.1 gdtools_0.2.3 dbplyr_2.1.1
[17] uuid_1.0-3 modelr_0.1.8 readxl_1.3.1 lifecycle_1.0.1
[21] munsell_0.5.0 gtable_0.3.0 cellranger_1.1.0 zip_2.2.0
[25] rvest_1.0.2 evaluate_0.14 knitr_1.36 fastmap_1.1.0
[29] tzdb_0.2.0 fansi_0.5.0 broom_0.7.10 Rcpp_1.0.7
[33] scales_1.1.1 backports_1.3.0 jsonlite_1.7.2 fs_1.5.0
[37] systemfonts_1.0.3 digest_0.6.28 hms_1.1.1 stringi_1.7.5
[41] grid_4.1.2 cli_3.1.0 tools_4.1.2 magrittr_2.0.1
[45] crayon_1.4.2 pkgconfig_2.0.3 ellipsis_0.3.2 xml2_1.3.2
[49] reprex_2.0.1 rmarkdown_2.11 assertthat_0.2.1 httr_1.4.2
[53] rstudioapi_0.13 R6_2.5.1 compiler_4.1.2
Solution
This isn’t linked to data.table. sub() relies on a call to internal C code:
function (pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
{
if (is.factor(x) && length(levels(x)) < length(x)) {
sub(pattern, replacement, levels(x), ignore.case, perl,
fixed, useBytes)[x]
}
else {
if (!is.character(x))
x <- as.character(x)
.Internal(sub(as.character(pattern), as.character(replacement),
x, ignore.case, perl, fixed, useBytes))
}
}
whereas stringr::word() relies on multiple lapply()/vapply()/mapply() calls:
function (string, start = 1L, end = start, sep = fixed(" "))
{
n <- max(length(string), length(start), length(end))
string <- rep(string, length.out = n)
start <- rep(start, length.out = n)
end <- rep(end, length.out = n)
breaks <- str_locate_all(string, sep)
words <- lapply(breaks, invert_match)
len <- vapply(words, nrow, integer(1))
neg_start <- !is.na(start) & start < 0L
start[neg_start] <- start[neg_start] + len[neg_start] +
1L
neg_end <- !is.na(end) & end < 0L
end[neg_end] <- end[neg_end] + len[neg_end] + 1L
start[start > len] <- NA
end[end > len] <- NA
starts <- mapply(function(word, loc) word[loc, "start"],
words, start)
ends <- mapply(function(word, loc) word[loc, "end"], words,
end)
str_sub(string, starts, ends)
}
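To make the contrast concrete, here is a minimal base R sketch (not stringr’s actual code) that runs the same regex once as a single vectorised call and once element by element through vapply(). The size n and the names vectorised and looped are illustrative assumptions, and timings will vary by machine, so none are quoted.

```r
# Toy input: the example string repeated; n is an arbitrary illustrative size.
n <- 1e4
desert <- rep("spicy apple cream", n)

# One vectorised call: the loop over elements happens inside C code.
vectorised <- sub("(^\\w+).*", "\\1", desert)

# The same work done element by element: one R function call per string,
# similar in spirit to the lapply()/mapply() steps inside stringr::word().
looped <- vapply(
  desert,
  function(s) sub("(^\\w+).*", "\\1", s),
  character(1),
  USE.NAMES = FALSE
)

# Both approaches return the same values; only the R-level overhead differs.
stopifnot(identical(vectorised, looped))

# Compare the elapsed times on your own machine.
system.time(sub("(^\\w+).*", "\\1", desert))
system.time(vapply(desert, function(s) sub("(^\\w+).*", "\\1", s),
                   character(1), USE.NAMES = FALSE))
```

The per-element version pays the cost of an R function call for every string, which is the kind of overhead that adds up over millions of rows.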
For a single string, there’s not much difference:
desert <- "spicy apple cream"
microbenchmark::microbenchmark(
stringr::word(desert, 1),
sub("(^\\w+).*", "\\1", desert))
Unit: microseconds
expr min lq mean median uq max neval
stringr::word(desert, 1) 50.3 58.35 95.816 71.80 115.35 323.8 100
sub("(^\\\\w+).*", "\\\\1", desert) 46.3 51.05 68.810 53.85 63.20 265.1 100
But if you replicate the string 10^6 times, sub() is roughly 20 times faster:
desert <- rep("spicy apple cream",10^6)
microbenchmark::microbenchmark(
stringr::word(desert, 1),
sub("(^\\w+).*", "\\1", desert),times=5)
Unit: milliseconds
expr min lq mean median uq max
stringr::word(desert, 1) 11605.1720 13724.731 14484.9069 14043.3454 16066.1067 16985.1798
sub("(^\\\\w+).*", "\\\\1", desert) 696.2793 752.516 771.5857 797.5788 803.7969 807.7577
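If you would rather avoid regular expressions altogether, one option worth benchmarking on your own data is data.table’s tstrsplit(), which splits the column once and can fill both new columns in a single step. This is only a sketch: it assumes the words are separated by exactly one literal space (repeated spaces would yield empty strings), and it has not been timed against the results above.

```r
library(data.table)

# Same toy table as in the question.
mydt <- data.table(
  id = c(1, 2, 3),
  desert = c("spicy apple cream", "mild peach melba", "juicy strawberry tart")
)

# Split each string on a literal space and keep only the first two pieces,
# assigning them by reference to two new columns in one step.
mydt[, c("category", "fruit_name") :=
       tstrsplit(desert, " ", fixed = TRUE, keep = 1:2)]

print(mydt)
```

Using fixed = TRUE treats the separator as a plain string rather than a regex, and keep = 1:2 drops the remaining words instead of creating columns for them.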
Answered By – Waldi