Issue
I have a table with people and their hometown names, but there are same cities with different written, see:
Name | Hometown |
---|---|
João | São Paulo |
Maria | Sao Paul |
Pedro | São Paulo. |
Maria | S. Paulo |
And I need to process this in order to formalize that data to be like this:
Name | Hometown |
---|---|
João | São Paulo |
Maria | São Paulo |
Pedro | São Paulo |
Maria | São Paulo |
- The dataset has more than 2400 distinct values so I can’t hard code.
- I have a Country table dimension with all cities and their correct names.
I tried this stack and would it be exactly what I need but does not work with my entire dataset.
Solution
Consider below approach (considering you have lookup table with all proper cities names) for purpose of example – I have it as CTE with just few ones
with cities as (
select 'São Paulo' as city union all
select 'Los Angeles' union all
select 'Dnipro' union all
select 'Kyiv'
)
select Name, City as Hometown
from your_table
left join cities
on soundex(Hometown) = soundex(city)
if applied to sample data in your question – output is
Note: you obviously need to take care of potential duplication in case if some cities sounds similar, in this case adding country constraints might help …
Answered By – Mikhail Berlyant
Answer Checked By – Terry (BugsFixing Volunteer)