Issue
I have a string column and would like to create a function to extract parts of the string based on some conditions of the original string column
EMAIL NUM_ID
[email protected] a9927345A
[email protected] B2722144X
[email protected] A2822876H
[email protected] b6722111A
[email protected] X8923314X
Would like to create NEW_NUM_ID based on the first letter of NUM_ID.
If NUM_ID has "A" or "a" for the first alphabet, then the NEW_NUM_ID would pick the last 5 characters
If NUM_ID has "B" or "b" for the first alphabet, then the NEW_NUM_ID would be the original string
If NUM_ID has "X or "x" for the first alphabet, then the NEW_NUM_ID would be the original string
EMAIL NUM_ID NEW_NUM_ID
[email protected] a9927345A 7345A
[email protected] B2722144X B2722144X
[email protected] A2822876H 2876H
[email protected] b6722111A B6722111A
[email protected] X8923314X X8923314X
I have created the following code but can’t seem to get it.
#Function to create the NEW_NUM_ID
def create_new_id(number_id):
match = re.findall(r'^[a-zA-Z].*', number_id)
if match[0] == 'A':
return number_id[-5:]
elif match[0] == 'B':
return number_id
elif match[0] == 'X':
return number_id
else:
return 'NA'
df['NEW_NUM_ID'] = df['NUM_ID'].apply(create_new_id)
Appreciate any form of help I can get, thank you.
Solution
You can use a str.replace
like
df["NEW_NUM_ID"] = df["NUM_ID"].str.replace(r'(?i)^a.*(\S{5})\s*$', r'\1', regex=True)
See the regex demo.
If there can be leading whitespace, use a (?i)^\s*a.*(\S{5})\s*$
as regex. Details:
(?i)
– case insensitive search and replace is enabled^
– start of stringa
– ana
orA
.*
– any zero or more chars other than line break chars as many as possible(\S{5})
– Group 1: five non-whitespace chars\s*
– zero or more whitespaces$
– end of string.
The replacement is the backreference to Group 1 value.
Answered By – Wiktor Stribiżew
Answer Checked By – Timothy Miller (BugsFixing Admin)