I have two very large dataframes containing names of people. The two dataframes report different information on these people (e.g. df1 reports data on health status and df2 on socio-economic status). A subset of people appears in both dataframes. This is the sample I am interested in.
I need to create a new dataframe which includes only those people appearing in both datasets. There are, however, small differences in the names, mostly due to typos.
My data is as follows:
df1
name | smoker | age
"Joe Smith" | Yes | 43
"Michael Fagin" | Yes | 35
"Ellen McFarlan" | No | 55
...
df2
name | occupation | location
"Joe Smit" | Postdoc | London
"Joan Evans" | IT consultant | Bristol
"Michael Fegin" | Lawyer | Liverpool
...
What I need is a third dataframe, df3, with the following information:
df3
name1 | name2 | distance | smoker | age | occupation | location
"Joe Smith" | "Joe Smit" | a measure of their Jaro distance | Yes | 43 | Postdoc | London
"Michael Fagin" | "Michael Fegin" | a measure of their Jaro distance | Yes | 35 | Lawyer | Liverpool
...
So far I have worked with the stringdist package to get a vector of possible matches, but I am struggling to use this information to create a new dataframe with the information I need. Many thanks in advance should anyone have an idea for this!
library(tidyverse)
library(fuzzyjoin)

df1 <- tibble(
  name = c("Joe Smith", "Michael Fagin"),
  smoker = c("yes", "yes")
)

df2 <- tibble(
  name = c("Joe Smit", "Michael Fegin"),
  occupation = c("post doc", "IT consultant")
)

df1 %>%
  # max 3 chars different
  stringdist_inner_join(df2, max_dist = 3)
#> Joining by: "name"
#> # A tibble: 2 × 4
#>   name.x        smoker name.y        occupation
#>   <chr>         <chr>  <chr>         <chr>
#> 1 Joe Smith     yes    Joe Smit      post doc
#> 2 Michael Fagin yes    Michael Fegin IT consultant
Created on 2022-03-01 by the reprex package (v2.0.0)
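The join above uses the default edit-distance method and does not report the distance itself. If the Jaro distance column and the name1/name2 layout from the question are needed, something along these lines might work. This is only a sketch: it uses the sample rows from the question, and the 0.1 threshold is an assumption chosen for illustration that would need tuning on the real data. In stringdist, method = "jw" with the default p = 0 gives the plain Jaro distance, scaled between 0 (identical) and 1.

```r
library(dplyr)
library(fuzzyjoin)

# Sample rows taken from the question
df1 <- tibble(
  name   = c("Joe Smith", "Michael Fagin", "Ellen McFarlan"),
  smoker = c("Yes", "Yes", "No"),
  age    = c(43, 35, 55)
)

df2 <- tibble(
  name       = c("Joe Smit", "Joan Evans", "Michael Fegin"),
  occupation = c("Postdoc", "IT consultant", "Lawyer"),
  location   = c("London", "Bristol", "Liverpool")
)

df3 <- df1 %>%
  stringdist_inner_join(
    df2,
    by           = "name",
    method       = "jw",       # Jaro-Winkler; with default p = 0 this is plain Jaro
    max_dist     = 0.1,        # assumed threshold, tune for your data
    distance_col = "distance"  # keep the computed distance as a column
  ) %>%
  # rename the two name columns and put the columns in the requested order
  select(name1 = name.x, name2 = name.y, distance,
         smoker, age, occupation, location)

df3
```

Because this is an inner join, people without a sufficiently close match in the other dataframe (here "Ellen McFarlan" and "Joan Evans") are dropped. On very large dataframes every pair of names gets compared, so it may be worth joining in chunks or blocking on something like the first letter first.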
Answered By – danlooo
Answer Checked By – Willingham (BugsFixing Volunteer)