Issue
I searched over the weekend for a solution to this problem but was unable to find one. I did end up making a script which I believe is much longer than it needs to be and surely there is a quicker way with a loop or function.
I would like to create a function that loops through each row of data frame 1 (which has been derived from a frequency table). The function will use filter() and sample_n() to select records from data frame 2. Therefore, data frame 1 will act as criteria for filtering and sampling for data frame 2.
Please see my code below which does not return the records I am searching for.
The correct result would be to return 1 record from group A (1910), 3 from group A (1930), 2 from group B (1930), 1 from group C (1940), 1 from group D (1940) and 2 from group C (1930) at random.
Cheers,
Daniel
require(dplyr)
FilterA <- c("A","B","A","B","C","D","C")
FilterB <- c(1910,1920,1930,1930,1940,1940,1930)
Frequency <- c(1,0,3,2,1,1,2)
df1 <- data.frame(FilterA, FilterB, Frequency)
df1$num <- paste(FilterA, FilterB, sep=" ")
ID <- c("A", "A", "A", "A", "A", "A", "A", "A",
"B", "B", "B", "B",
"C", "C", "C", "C", "C", "C",
"D", "D", "D")
Year <- c(1910, 1910, 1920, 1930, 1930, 1930, 1940, 1940,
1930, 1920, 1930, 1930,
1910, 1940, 1910, 1910, 1930, 1930,
1930, 1940, 1940)
df2 <- data.frame(ID, Year)
case.control <- function(datF1, datF2, na.rm=TRUE, ...){
ID_list <- unique(datF1$num)
for (i in seq_along(ID_list)){
func <- filter(datF2, Year == datF1$FilterB & ID == datF1$FilterA) %>% sample_n(datF1$Frequency)
func
}
}
x <- case.control(df1, df2)
Solution
Thanks for that reproducible example – if you did want to go row-by-row:
First, I’ve tidied some of your code, there are no real changes here:
# Your code----
id <- c("A", "B", "A", "B", "C", "D", "C")
year <- c(1910, 1920, 1930, 1930, 1940, 1940, 1930)
frequency <- c(1, 0, 3, 2, 1, 1, 2)
df_1 <- data.frame(id,
year,
frequency,
row.names = NULL
)
df_1$num <- paste(id, year)
df_1 <- df_1 %>%
filter(frequency != 0)
id <- c(
"A", "A", "A", "A", "A", "A", "A", "A",
"B", "B", "B", "B",
"C", "C", "C", "C", "C", "C",
"D", "D", "D"
)
year <- c(
1910, 1910, 1920, 1930, 1930, 1930, 1940, 1940,
1930, 1920, 1930, 1930,
1910, 1940, 1910, 1910, 1930, 1930,
1930, 1940, 1940
)
df_2 <- data.frame(id, year)
Now, moving on to actually obtaining random samples, using lapply()
. You could use apply()
to iterate over rows of a dataframe, but I personally find working with apply()
confusing, so instead I’m turning df_1
into a list, making each row an object in it.
library(dplyr)
list_1 <- split(df_1, seq(nrow(df_1)))
I then use lapply()
to iterate over each row with the following function:
# Option 1: lapply()----
random_records <- lapply(list_1, function(x) {
df_records <- df_2 %>%
# Matching up the years and id in df_2
filter(year == x$year & id == x$id) %>%
# Using the frequency with slice_sample(), sample_n() is also fine
slice_sample(n = x$frequency)
})
# Then bind the list back together again into a dataframe
random_records <- bind_rows(random_records)
Or, another option that personally I prefer is using purrr
‘s map_df()
, because it returns a dataframe straight away.
# Option 2: purrr's map_df()
# I think this option is the neatest, because it returns a df immediately
library(purrr)
random_records <- map_df(list_1, function(x) {
df_records <- df_2 %>%
filter(year == x$year & id == x$id) %>%
slice_sample(n = x$frequency)
})
Answered By – ThomasZ
Answer Checked By – Pedro (BugsFixing Volunteer)