[SOLVED] Turning a subset based on multiple parameters into a looping function

Issue

I have a dataframe with 3000+ items (each row is a new item with the item_id in the first column) with 2500+ attributes in each subsequent column.

item_id    attribute_1    attribute_2 ... category_1 ... category_2
x3902      1              5               1              0
x1402      2              0               1              1

I am able to successfully grab a random list of 10 items that fulfil a given attribute and category:

subset_of_items = sample(subset(df$item_id, (df$attribute1 == 3 & df$category_2 == 1 )),10)

I need to repeat this for 30+ combinations of attribute + category. So I was hoping to create a function that loops and saves the output into a table where each table column is a list of 10 items that fulfils a set of conditions.

This is as far as I could get, but can’t get it to successfully run:

df <- read.csv ("data.csv",sep=",")

attribute <- c("attribute_1", "attribute_2", "attribute_3")
value <- c(1, 2, 5)
category <- c("category_1", "category_1", "category_2")

function1 <- function(x, attribute_x, value_x, category_x)
{
  dftemp = subset(x[["item_id"]], (x[[attribute_x]] == value_x & x[[category_x]] == 1 ))
  output <- cbind(output, dftemp)
}

function1(df, attribute, value, category)

The cherry on the cake would be to name each of the columns as attribute_x .

Solution

You can do the following (this uses a data.table approach)

  1. Set your df to data.table
library(data.table)
df <- read.csv("data.csv",sep=",")
setDT(df)
  1. Make a revised function, called get_subset() that returns a vector of item_ids that meet the attribute and category requirements, using get(). It
    also checks if you have size items return, and pads with as.character(NA) to ensure that the returned vector has length = size
get_subset <- function(x, attribute_x, value_x, category_x, size=10)
{
  subset = x[get(attribute_x)==value_x & get(category_x) == 1, .(item_id)] %>% 
    .[sample(1:.N, size=min(size, .N)), item_id]
  if(length(subset)<size) subset=c(subset,rep(as.character(NA),size-length(subset)))
  return(subset)
}
  1. Put your combinations of interest into a small data.table
combinations = data.table(
  attribute = c("attribute_1", "attribute_2"),
  value = c(1, 5),
  category = c("category_1", "category_2")
)
  1. use lapply, to loop over the rows of your combinations, each time feeding the attribute, value, and category value of the combination to the get_subset function
result = lapply(seq(1,nrow(combinations)), function(i) {
    items = data.table(get_subset(
      x=df,
      attribute_x = combinations[i,attribute],
      value_x = combinations[i,value],
      category_x = combinations[i, category],
      size=10))
    setNames(items, combinations[i,attribute])
})

That returns a list of data.tables with a single column, that is named after the attribute for that contribution

  1. Use do.call with cbind to bind them into a single data.table
result = do.call(cbind, result)

If this is your original df

df = data.table(
  item_id = paste0("x", stringr::str_pad(sample(1:1000,100),width = 4,side = "left",pad = "0")),
  attribute_1=sample(1:10, 100, replace=T),
  attribute_2=sample(1:10, 100, replace=T),
  category_1=sample(c(0,1),100, replace=T),
  category_2=sample(c(0,1),100, replace=T)
)

then the above combinations, will return something like the following in result (Note, that my df example uses sample without seed, so your results will differ)

    attribute_1 attribute_2
 1:       x7511       x7302
 2:       x0325       x0956
 3:       x6748       x1893
 4:       x4462       x6958
 5:       x4146        <NA>
 6:       x5407        <NA>
 7:        <NA>        <NA>
 8:        <NA>        <NA>
 9:        <NA>        <NA>
10:        <NA>        <NA>

Answered By – langtang

Answer Checked By – Jay B. (BugsFixing Admin)

Leave a Reply

Your email address will not be published.