[SOLVED] Function to create new column with substring based on regex condition of another string column – python

Issue

I have a string column and would like to write a function that extracts part of each string, depending on a condition on the original value.

EMAIL               NUM_ID    
[email protected]       a9927345A
[email protected]     B2722144X 
[email protected]   A2822876H 
[email protected]    b6722111A
[email protected]      X8923314X

I would like to create NEW_NUM_ID based on the first letter of NUM_ID.
If NUM_ID starts with "A" or "a", then NEW_NUM_ID is the last 5 characters.
If NUM_ID starts with "B" or "b", then NEW_NUM_ID is the original string.
If NUM_ID starts with "X" or "x", then NEW_NUM_ID is the original string.

EMAIL               NUM_ID        NEW_NUM_ID
[email protected]       a9927345A     7345A 
[email protected]     B2722144X     B2722144X
[email protected]   A2822876H     2876H
[email protected]    b6722111A     b6722111A
[email protected]      X8923314X     X8923314X

I have written the following code, but it does not produce the expected result.

#Function to create the NEW_NUM_ID

def create_new_id(number_id):
    match = re.findall(r'^[a-zA-Z].*', number_id)
    if match[0] == 'A':
        return number_id[-5:]
    elif match[0] == 'B':
        return number_id
    elif match[0] == 'X':
        return number_id
    else:
        return 'NA'

df['NEW_NUM_ID'] = df['NUM_ID'].apply(create_new_id)

Appreciate any form of help I can get, thank you.

Solution

You can use Series.str.replace with a regex:

df["NEW_NUM_ID"] = df["NUM_ID"].str.replace(r'(?i)^a.*(\S{5})\s*$', r'\1', regex=True)
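
To see what the pattern does to individual values, here is a minimal sketch that applies the same regex with the standard re module to the NUM_ID values from the question. The helper name shorten and the trailing spaces on two of the samples are assumptions added for illustration; they are not part of the original answer.

```python
import re

# Same pattern as in the str.replace call above
PATTERN = re.compile(r"(?i)^a.*(\S{5})\s*$")

def shorten(num_id):
    # Hypothetical helper: replace the whole match with Group 1.
    # Strings that do not match the pattern are returned unchanged.
    return PATTERN.sub(r"\1", num_id)

# NUM_ID values from the question; two carry a trailing space (assumption)
samples = ["a9927345A", "B2722144X ", "A2822876H ", "b6722111A", "X8923314X"]
results = [shorten(s) for s in samples]
print(results)
```

Only the values starting with a or A are shortened. The other values come back exactly as they went in, including any trailing whitespace.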

If there can be leading whitespace, use (?i)^\s*a.*(\S{5})\s*$ as the regex. Details:

  • (?i) – case insensitive search and replace is enabled
  • ^ – start of string
  • a – the letter a or A (both match because of (?i))
  • .* – zero or more chars other than line break chars, as many as possible
  • (\S{5}) – Group 1: five non-whitespace chars
  • \s* – zero or more whitespaces
  • $ – end of string.

The replacement, \1, is a backreference to the Group 1 value. Strings that do not match the pattern (those not starting with a or A) are left unchanged.
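
If you would rather keep an apply-based function as in the question, the sketch below shows one possible repair. The function in the question fails because re.findall(r'^[a-zA-Z].*', number_id) returns the whole string as its single match, so match[0] never equals 'A', 'B' or 'X', and lowercase letters are not handled at all. Stripping surrounding whitespace is an assumption made here, not something the question asks for.

```python
def create_new_id(number_id):
    # Assumption: surrounding whitespace is not part of the ID
    number_id = number_id.strip()
    # Compare only the first character, case-insensitively
    first = number_id[:1].upper()
    if first == "A":
        return number_id[-5:]   # last 5 characters
    if first in ("B", "X"):
        return number_id        # keep the original string
    return "NA"                 # same fallback as in the question

# NUM_ID values from the question (one with a trailing space)
ids = ["a9927345A", "B2722144X ", "A2822876H", "b6722111A", "X8923314X"]
new_ids = [create_new_id(s) for s in ids]
print(new_ids)
```

With a DataFrame this would be used as in the question: df['NEW_NUM_ID'] = df['NUM_ID'].apply(create_new_id). Unlike the regex replacement, this version returns 'NA' for IDs that start with any other character.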

Answered By – Wiktor Stribiżew

Answer Checked By – Timothy Miller (BugsFixing Admin)
