Issue
I have a column of strings, some of which contain delimiters. I would like to write a function that extracts a substring only from the strings that contain the delimiters.
Current
EMAIL              TITLE
[email protected]  Marketing Analyst
[email protected]  501.Software Engineer.MG3
[email protected]  Product Researcher
[email protected]  Managing Director
[email protected]  64.Legal Consultant.I44
[email protected]  Hardware Analyst.
I would like to extract the substring between the "." delimiters, only for strings that contain them. Otherwise, the text should remain the same.
EMAIL              TITLE                      NEW_TITLE
[email protected]  Marketing Analyst          Marketing Analyst
[email protected]  501.Software Engineer.MG3  Software Engineer
[email protected]  Product Researcher         Product Researcher
[email protected]  Managing Director          Managing Director
[email protected]  64.Legal Consultant.I44    Legal Consultant
[email protected]  Hardware Analyst.          Hardware Analyst.
I tried the following function, but it does not seem to work:
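import re

def clean_title(text):
    # Capture the text between the first dot and the next dot, if both exist
    match = re.search(r"\.(.*?)\.", text)
    if match:
        return match.group(1)
    else:
        # No pair of dots found: keep the original text
        return text

df['NEW_TITLE'] = df['TITLE'].apply(clean_title)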
I would appreciate any help. Thank you!
Solution
You can use a replacing approach:
df['NEW_TITLE'] = df['TITLE'].str.replace(r'^[^.]*\.([^.]+)\..*', r'\1', regex=True)
The regex matches:
^ – start of string
[^.]* – zero or more non-dot chars
\. – a dot
([^.]+) – Group 1: one or more non-dot chars
\. – a dot
.* – the rest of the line (any zero or more chars other than line break chars, as many as possible)
The whole match is then replaced with the Group 1 value. Strings that do not match the pattern are left unchanged.
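To check the pattern's behavior without a DataFrame, here is a minimal sketch that applies the same regex with the standard-library re.sub to the sample titles from the question. The names PATTERN, titles and new_titles are made up for this illustration. It assumes that pandas' str.replace with regex=True substitutes in the same way as re.sub.

```python
import re

# Same pattern as in the answer; PATTERN is a name chosen for this sketch.
PATTERN = re.compile(r'^[^.]*\.([^.]+)\..*')

# Sample titles copied from the question.
titles = [
    "Marketing Analyst",
    "501.Software Engineer.MG3",
    "Product Researcher",
    "Managing Director",
    "64.Legal Consultant.I44",
    "Hardware Analyst.",
]

# Replace the whole match with Group 1; non-matching strings pass through as-is.
new_titles = [PATTERN.sub(r'\1', title) for title in titles]

for old, new in zip(titles, new_titles):
    print(f"{old!r} -> {new!r}")
```

"Hardware Analyst." is left untouched because ([^.]+) requires at least one non-dot character after the first dot, and nothing follows the trailing dot.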
Answered By – Wiktor Stribiżew
Answer Checked By – David Marino (BugsFixing Volunteer)