[SOLVED] Function to extract substring from a string with multiple delimiters – python


I have a column of strings, some of which contain delimiters. I would like to create a function that extracts a substring only from the strings that contain the delimiters.


EMAIL               TITLE
[email protected]   Marketing Analyst
[email protected]     501.Software Engineer.MG3 
[email protected]     Product Researcher
[email protected]    Managing Director
[email protected]    64.Legal Consultant.I44
[email protected]    Hardware Analyst.

I would like to extract the substring between the "." delimiters, but only for strings that contain them. Otherwise, the text should remain the same.

EMAIL               TITLE                       NEW_TITLE
[email protected]   Marketing Analyst           Marketing Analyst
[email protected]     501.Software Engineer.MG3   Software Engineer
[email protected]     Product Researcher          Product Researcher
[email protected]    Managing Director           Managing Director 
[email protected]    64.Legal Consultant.I44     Legal Consultant
[email protected]    Hardware Analyst.           Hardware Analyst.

I have tried to create a function with the following code, but it does not seem to be working:

def clean_title(text):
    match = re.search(r"\.(.*?)\.", text)
    if match:
        return match.group(1)
        return text

df['NEW_TITLE'] = df['TITLE'].apply(clean_title)

appreciate any form of help, thank you!


You can use a replacing approach:

df['NEW_TITLE'] = df['TITLE'].str.replace(r'^[^.]*\.([^.]+)\..*', r'\1', regex=True)

The regex is anchored at the start of the string, so it matches at most once per value. It matches, in order:

  • ^ – start of string
  • [^.]* – zero or more non-dot chars
  • \. – a dot
  • ([^.]+) – Group 1: one or more non-dot chars
  • \. – a dot
  • .* – the rest of the line (zero or more chars other than line break chars, as many as possible)

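The whole match is then replaced with the Group 1 value. Strings that do not match the pattern (no dots, or only a single dot) are left unchanged.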
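
To check the pattern without a DataFrame, here is a minimal sketch using only the standard `re` module on the sample titles from the question. `re.sub` with a pattern and a `\1` replacement does the same per-string substitution that `str.replace(..., regex=True)` applies to each element of the column. The helper name `clean_title_search` is hypothetical. It is the function from the question with `return text` dedented, on the assumption that the indentation shown there is what the asker actually ran (as posted, that line sits after the first `return` and is never reached, so non-matching titles come back as `None`).

```python
import re

# Pattern from the answer: capture the text between the first and second dot.
PATTERN = re.compile(r'^[^.]*\.([^.]+)\..*')


def clean_title(text):
    """Return the text between the first two dots, or the input unchanged."""
    # sub() returns the original string when the pattern does not match.
    return PATTERN.sub(r'\1', text)


def clean_title_search(text):
    """The question's function, with the fallback return moved out of the if."""
    match = re.search(r"\.(.*?)\.", text)
    if match:
        return match.group(1)
    return text  # reached only when fewer than two dots are present


titles = [
    "Marketing Analyst",
    "501.Software Engineer.MG3",
    "Product Researcher",
    "Managing Director",
    "64.Legal Consultant.I44",
    "Hardware Analyst.",
]

for title in titles:
    print(f"{title!r} -> {clean_title(title)!r}")
```

On these six titles both functions give the same result, so either `df['TITLE'].apply(clean_title_search)` or the `str.replace` call above should produce the expected NEW_TITLE column.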

Answered By – Wiktor Stribiżew

