I have a column of strings, some of which contain delimiters, and I would like to create a function that extracts a substring only from the strings that contain the delimiters.
EMAIL              TITLE
[email protected]  Marketing Analyst
[email protected]  501.Software Engineer.MG3
[email protected]  Product Researcher
[email protected]  Managing Director
[email protected]  64.Legal Consultant.I44
[email protected]  Hardware Analyst.
I would like to extract the substring between the "." delimiters, but only for strings that contain two of them. Otherwise, the text should remain the same.
EMAIL              TITLE                      NEW_TITLE
[email protected]  Marketing Analyst          Marketing Analyst
[email protected]  501.Software Engineer.MG3  Software Engineer
[email protected]  Product Researcher         Product Researcher
[email protected]  Managing Director          Managing Director
[email protected]  64.Legal Consultant.I44    Legal Consultant
[email protected]  Hardware Analyst.          Hardware Analyst.
I have tried to create a function with the following code, but it does not seem to be working:
import re

def clean_title(text):
    # Look for the first substring enclosed between two dots
    match = re.search(r"\.(.*?)\.", text)
    if match:
        return match.group(1)
    else:
        return text

df['NEW_TITLE'] = df['TITLE'].apply(clean_title)
I would appreciate any form of help, thank you!
You can use a replacing approach:
df['NEW_TITLE'] = df['TITLE'].str.replace(r'^[^.]*\.([^.]+)\..*', r'\1', regex=True)
The regex matches, in order:
^ - start of string
[^.]* - zero or more non-dot chars
\. - a dot
([^.]+) - Group 1: one or more non-dot chars
\. - a dot
.* - the rest of the line (zero or more chars other than line break chars, as many as possible)
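The whole match is then replaced with the Group 1 value. Strings that do not match the pattern, such as those with no dot or only one dot, are left unchanged.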
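To see the effect of the pattern without a DataFrame, here is a minimal sketch that applies the same regex with the standard-library `re.sub` to the sample titles from the question. The names `PATTERN`, `extract_title`, `titles` and `new_titles` are made up for illustration, and the non-string guard is an assumption about what the column might contain (for example missing values).

```python
import re

# Same pattern as in the answer: text before the first dot, the part between
# the first two dots (Group 1), then the second dot and the rest of the line.
PATTERN = re.compile(r'^[^.]*\.([^.]+)\..*')

def extract_title(text):
    """Return the text between the first two dots, or the input unchanged."""
    if not isinstance(text, str):  # assumption: the column may hold non-strings
        return text
    # When the pattern does not match, sub() returns the string untouched.
    return PATTERN.sub(r'\1', text)

# Sample titles taken from the question.
titles = [
    "Marketing Analyst",
    "501.Software Engineer.MG3",
    "Product Researcher",
    "Managing Director",
    "64.Legal Consultant.I44",
    "Hardware Analyst.",
]

new_titles = [extract_title(t) for t in titles]
for old, new in zip(titles, new_titles):
    print(f"{old!r} -> {new!r}")
```

For string values, `df['TITLE'].apply(extract_title)` should give the same result as the `str.replace` line above. The vectorized `str.replace` is usually the more idiomatic choice in pandas.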
Answered By – Wiktor Stribiżew
Answer Checked By – David Marino (BugsFixing Volunteer)