[SOLVED] Applying a function to a dataframe where another dataframe contains the criteria for filtering and the number of records to filter

Issue

I searched over the weekend for a solution to this problem but was unable to find one. I did end up making a script which I believe is much longer than it needs to be and surely there is a quicker way with a loop or function.

I would like to create a function that loops through each row of data frame 1 (which has been derived from a frequency table). The function will use filter() and sample_n() to select records from data frame 2. Therefore, data frame 1 will act as criteria for filtering and sampling for data frame 2.

Please see my code below which does not return the records I am searching for.

The correct result would be to return 1 record from group A (1910), 3 from group A (1930), 2 from group B (1930), 1 from group C (1940), 1 from group D (1940) and 2 from group C (1930) at random.

Cheers,

Daniel

require(dplyr)

FilterA <- c("A","B","A","B","C","D","C")
FilterB <- c(1910,1920,1930,1930,1940,1940,1930)
Frequency <- c(1,0,3,2,1,1,2)

df1 <- data.frame(FilterA, FilterB, Frequency)

df1$num <- paste(FilterA, FilterB, sep=" ")

ID <- c("A", "A", "A", "A", "A", "A", "A", "A",
        "B", "B", "B", "B",
        "C", "C", "C", "C", "C", "C",
        "D", "D", "D")

Year <- c(1910, 1910, 1920, 1930, 1930, 1930, 1940, 1940,
          1930, 1920, 1930, 1930,
          1910, 1940, 1910, 1910, 1930, 1930,
          1930, 1940, 1940)

df2 <- data.frame(ID, Year)

case.control <- function(datF1, datF2, na.rm=TRUE, ...){
  
  ID_list <- unique(datF1$num)
  
  for (i in seq_along(ID_list)){
    
    func <- filter(datF2, Year == datF1$FilterB & ID == datF1$FilterA) %>% sample_n(datF1$Frequency)

    func
    
    }
  
}

x <- case.control(df1, df2)

Solution

Thanks for that reproducible example – if you did want to go row-by-row:

First, I’ve tidied some of your code, there are no real changes here:

# Your code----
id <- c("A", "B", "A", "B", "C", "D", "C")
year <- c(1910, 1920, 1930, 1930, 1940, 1940, 1930)
frequency <- c(1, 0, 3, 2, 1, 1, 2)

df_1 <- data.frame(id,
  year,
  frequency,
  row.names = NULL
)

df_1$num <- paste(id, year)

df_1 <- df_1 %>%
  filter(frequency != 0)

id <- c(
  "A", "A", "A", "A", "A", "A", "A", "A",
  "B", "B", "B", "B",
  "C", "C", "C", "C", "C", "C",
  "D", "D", "D"
)

year <- c(
  1910, 1910, 1920, 1930, 1930, 1930, 1940, 1940,
  1930, 1920, 1930, 1930,
  1910, 1940, 1910, 1910, 1930, 1930,
  1930, 1940, 1940
)

df_2 <- data.frame(id, year)

Now, moving on to actually obtaining random samples, using lapply(). You could use apply() to iterate over rows of a dataframe, but I personally find working with apply() confusing, so instead I’m turning df_1 into a list, making each row an object in it.

library(dplyr)
list_1 <- split(df_1, seq(nrow(df_1)))

I then use lapply() to iterate over each row with the following function:

# Option 1: lapply()----
random_records <- lapply(list_1, function(x) {
  df_records <- df_2 %>%
   # Matching up the years and id in df_2
    filter(year == x$year & id == x$id) %>%
    # Using the frequency with slice_sample(), sample_n() is also fine
    slice_sample(n = x$frequency)
})

# Then bind the list back together again into a dataframe
random_records <- bind_rows(random_records)

Or, another option that personally I prefer is using purrr‘s map_df(), because it returns a dataframe straight away.

# Option 2: purrr's map_df()
# I think this option is the neatest, because it returns a df immediately
library(purrr)
random_records <- map_df(list_1, function(x) {
  df_records <- df_2 %>%
    filter(year == x$year & id == x$id) %>%
    slice_sample(n = x$frequency)
})

Answered By – ThomasZ

Answer Checked By – Pedro (BugsFixing Volunteer)

Leave a Reply

Your email address will not be published. Required fields are marked *