[SOLVED] How to create a specific dummy variable using a regular expression?

Issue

I have a pandas dataframe:

col1
johns id is 81245678316
eric bought 82241624316 yesterday
mine is87721624316
frank is a genius
i accepted new 82891224316again

I want to create a new column with dummy variables (0/1) based on col1: if the text contains a run of 11 consecutive digits starting with 8, the value should be 1, otherwise 0.

So I wrote this code:

df["is_number"] = df.col1.str.contains(r"\b8\d{10}").map({True: 1, False: 0})

However, the output is:

col1                                         is_number
johns id is 81245678316                        1
eric bought 82241624316 yesterday              1
mine is87721624316                             0
frank is a genius                              0
i accepted new 82891224316again                0      

As you can see, the third and fifth rows have 0 in "is_number", but I want them to be 1, even though the space between the words and the numbers is missing in those places. How can I do that? I want:

col1                                         is_number
johns id is 81245678316                        1
eric bought 82241624316 yesterday              1
mine is87721624316                             1
frank is a genius                              0
i accepted new 82891224316again                1      

Solution

You can use digit boundaries instead of `\b`, because the numbers in your input can be "glued" to letters: letters are word characters, so there is no word boundary between a letter and the 8. The lookarounds `(?<!\d)` and `(?!\d)` only require that the 11-digit run is not preceded or followed by another digit, so adjacent letters are fine:

df["is_number"] = df['col1'].str.contains(r"(?<!\d)8\d{10}(?!\d)").map({True: 1, False: 0})

Output:

>>> df
                                col1  is_number
0            johns id is 81245678316          1
1  eric bought 82241624316 yesterday          1
2                 mine is87721624316          1
3                  frank is a genius          0
4    i accepted new 82891224316again          1

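For completeness, a self-contained version of the answer that can be run as-is (the dataframe is rebuilt from the question's rows; `.astype(int)` is used here as an equivalent shorthand for the `.map({True: 1, False: 0})` call):

```python
import pandas as pd

# Rebuild the example dataframe from the question
df = pd.DataFrame({"col1": [
    "johns id is 81245678316",
    "eric bought 82241624316 yesterday",
    "mine is87721624316",
    "frank is a genius",
    "i accepted new 82891224316again",
]})

# Digit boundaries instead of \b, so numbers glued to letters still match;
# str.contains returns booleans, astype(int) turns them into 0/1 dummies.
df["is_number"] = df["col1"].str.contains(r"(?<!\d)8\d{10}(?!\d)").astype(int)

print(df)  # is_number column: 1, 1, 1, 0, 1
```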

Answered By – Wiktor Stribiżew

Answer Checked By – Pedro (BugsFixing Volunteer)
