[SOLVED] Fetch only the column content which has Latin or special characters in DT25 column

Issue

I am trying to reduce a CSV file based on the first-name column (DT25). By "reduce" I mean keeping only the rows whose DT25 value contains accented Latin or other special characters.

My data looks like this:

A_ID    ID_NUMBER    DT25                                  DT45
abcd    0001         Condé and Geoff Shallard   
abcd    555248817    Rändi & John Fay   
abcd    54786        Randy john
abcd    006299       László and Virginia Csernohorszky-Hope 
abcd    000323       Kim Jonh
abcd    01012        Larry will

I am just trying to create a DataFrame containing all the rows that have special/accented Latin characters in DT25.

The expected output is something like:

A_ID    ID_NUMBER    DT25                                  DT45
abcd    0001         Condé and Geoff Shallard   
abcd    555248817    Rändi & John Fay   
abcd    006299       László and Virginia Csernohorszky-Hope 

I tried this:

import string
import pandas as pd
df = pd.read_csv(filename)
pattern = "^[a-zA-Z-'&.]*$"
alphabet = string.ascii_letters+string.punctuation
#first_name_df = df[~df['DT25'].str.contains(alphabet, na = False)]
first_name_df = df[~df['DT25'].str.contains(pattern, na = False)]
print(first_name_df)

This still returns the original DataFrame. Could a pandas expert help me with this, please?

Solution

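Your pattern returns every row because its character class does not include a space, so no multi-word name matches it and the negated mask keeps everything. Instead, you can use the regular expression [^\t-\r -~], which matches any character outside ASCII whitespace (tab through carriage return) and printable ASCII (space through tilde):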

# na=False treats missing DT25 values as non-matches so the mask stays boolean
filtered = df[df['DT25'].str.contains(r'[^\t-\r -~]', na=False)]

Output:

>>> filtered
   A_ID  ID_NUMBER                                    DT25
0  abcd          1                Condé and Geoff Shallard
1  abcd  555248817                        Rändi & John Fay
3  abcd       6299  László and Virginia Csernohorszky-Hope
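
For anyone who wants to reproduce this without the CSV file, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the sample rows in the question (ID_NUMBER is kept as strings, which is an assumption, to preserve the leading zeros), and the names question_mask, non_ascii_mask and alt_mask are only illustrative. It also shows a possible alternative based on Python's built-in str.isascii (Python 3.7+), which gives the same result on this sample but is not the answer's method.

```python
import pandas as pd

# Sample rows copied from the question; ID_NUMBER stored as strings
# (an assumption) so the leading zeros are not lost.
df = pd.DataFrame({
    "A_ID": ["abcd"] * 6,
    "ID_NUMBER": ["0001", "555248817", "54786", "006299", "000323", "01012"],
    "DT25": [
        "Condé and Geoff Shallard",
        "Rändi & John Fay",
        "Randy john",
        "László and Virginia Csernohorszky-Hope",
        "Kim Jonh",
        "Larry will",
    ],
})

# The question's pattern: its character class has no space, so no
# multi-word name matches and the negated mask selects every row.
question_pattern = r"^[a-zA-Z-'&.]*$"
question_mask = ~df["DT25"].str.contains(question_pattern, na=False)

# The answer's pattern: match any character outside ASCII whitespace
# (tab..carriage return) and printable ASCII (space..tilde).
non_ascii_mask = df["DT25"].str.contains(r"[^\t-\r -~]", na=False)
filtered = df[non_ascii_mask]

# Alternative sketch: flag strings that are not pure ASCII.
# Unlike the regex, this does not flag ASCII control characters.
alt_mask = df["DT25"].map(lambda s: isinstance(s, str) and not s.isascii())

print(filtered)
```

If the data really comes from read_csv, passing dtype={"ID_NUMBER": str} should keep the leading zeros in ID_NUMBER as well.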

Answered By – richardec

Answer Checked By – Gilberto Lyons (BugsFixing Admin)
