[SOLVED] fast replacement of data.table values by labels stored in another data.table


It is related to this question and this other one, although to a larger scale.
I have two data.tables:

  • The first one with market research data, containing answers stored as integers;
  • The second one being what can be called a dictionary, with category labels associated to the integers mentioned above.

See reproducible example :

EDIT: Addition of a new variable to include the ‘0’ case.

EDIT 2: Modification of ‘age_group’ variable to include cases where all unique levels of a factor do not appear in data.


# Table with survey data :
# - each observation contains the answers of a person
# - variables describe the sample population characteristics (gender, age...)
# - numeric variables (like age) are also stored as character vectors
repex_DT <- data.table (
  country = as.character(c(1,3,4,2,NA,1,2,2,2,4,NA,2,1,1,3,4,4,4,NA,1)),
  gender = as.character(c(NA,2,2,NA,1,1,1,2,2,1,NA,2,1,1,1,2,2,1,2,NA)),
  age = as.character(c(18,40,50,NA,NA,22,30,52,64,24,NA,38,16,20,30,40,41,33,59,NA)),
  age_group = as.character(c(2,2,2,NA,NA,2,2,2,2,2,NA,2,2,2,2,2,2,2,2,NA)),
  status = as.character(c(1,NA,2,9,2,1,9,2,2,1,9,2,1,1,NA,2,2,1,2,9)),
  children = as.character(c(0,2,3,1,6,1,4,2,4,NA,NA,2,1,1,NA,NA,3,5,2,1))

# Table of the labels associated to categorical variables, plus 'label_id' to match the values
labels_DT <- data.table (
  label_id = as.character(c(1:9)),
  country = as.character(c("COUNTRY 1","COUNTRY 2","COUNTRY 3","COUNTRY 4",NA,NA,NA,NA,NA)),
  gender = as.character(c("Male","Female",NA,NA,NA,NA,NA,NA,NA)),
  age_group = as.character(c("Less than 35","35 and more",NA,NA,NA,NA,NA,NA,NA)),
  status = as.character(c("Employed","Unemployed",NA,NA,NA,NA,NA,NA,"Do not want to say")),
  children = as.character(c("0","1","2","3","4","5 and more",NA,NA,NA))

# Identification of the variable nature (numeric or character)
var_type <- c("character","character","numeric","character","character","character")

# Identification of the categorical variable names
categorical_var <- names(repex_DT)[which(var_type == "character")]

You can see that the dictionary table is smaller to the survey data table, this is expected.
Also, despite all variables being stored as character, some are true numeric variables like age, and consequently do not appear in the dictionary table.
My objective is to replace the values of all variables of the first data.table with a matching name in the dictionary table by its corresponding label.

I have actually achieved it using a loop, like the one below:

result_DT1 <- copy(repex_DT) 
for (x in categorical_var){
    values_vector <- labels_DT$label_id
    labels_vector <- labels_DT[[x]]
    values_vector <- c("0",labels_DT$label_id)
    labels_vector <- c(labels_DT[[x]][1:(length(labels_DT[[x]])-1)], NA, labels_DT[[x]][length(labels_DT[[x]])])}
  result_DT1[, (c(x)) := plyr::mapvalues(x=get(x), from=values_vector, to=labels_vector, warn_missing = F)]

What I want is a faster method (the fastest if one exists), since I have thousands of variables to qualify for dozens of thousands of records.
Any performance improvements would be more than welcome. I battled with stringi but could not have the function running without errors unless using hard-coded variable names. See example:

test_stringi <- copy(repex_DT) %>% 
  .[, (c("country")) := lapply(.SD, function(x) stringi::stri_replace_all_fixed(
    str=x, pattern=unique(labels_DT$label_id)[!is.na(labels_DT[["country"]])],
    replacement=unique(na.omit(labels_DT[["country"]])), vectorize_all=FALSE)),
    .SDcols = c("country")]


I finally found time to work on an answer to this matter.
I changed my approach and used fastmatch::fmatch to identify labels to update.
As pointed out by @det, it is not possible to consider variables with a starting ‘0’ label in the same loop than other standard categorical variables, so the instruction is basically repeated twice.
Still, this is much faster than my initial for loop approach.

The answer below:


#Selection of variable names depending on the presence of '0' labels
same_cols_with0 <- intersect(names(repex_DT), names(labels_DT))[
  which(intersect(names(repex_DT), names(labels_DT)) %fin% 
          names(repex_DT)[which(unlist(lapply(repex_DT, function(x) 
            sum(stri_detect_regex(x, pattern="^0$", negate=FALSE), na.rm=TRUE)),

same_cols_standard <- intersect(names(repex_DT), names(labels_DT))[
  which(!(intersect(names(repex_DT), names(labels_DT)) %fin% same_cols_with0))]

labels_std <- labels_DT[, same_cols_standard, with=FALSE]
labels_0   <- labels_DT[, same_cols_with0, with=FALSE]
levels_id  <- as.integer(labels_DT$label_id)

#Update joins via matching IDs (credit to @det for mapply syntax).
result_DT <- data.table::copy(repex_DT) %>% 
  .[, (same_cols_standard) := mapply(
    function(x, y) y[fastmatch::fmatch(x=as.integer(x), table=levels_id, nomatch=NA)],
    repex_DT[, same_cols_standard, with=FALSE], labels_std, SIMPLIFY=FALSE)] %>% 
  .[, (same_cols_with0) := mapply(
    function(x, y) y[fastmatch::fmatch(x=as.integer(x), table=(levels_id - 1), nomatch=NA)],
    repex_DT[, same_cols_with0, with=FALSE], labels_0, SIMPLIFY=FALSE)]

Answered By – Maxence Dum.

Answer Checked By – Willingham (BugsFixing Volunteer)

Leave a Reply

Your email address will not be published. Required fields are marked *