Issue
I have a pandas dataframe:
col1
johns id is 81245678316
eric bought 82241624316 yesterday
mine is87721624316
frank is a genius
i accepted new 82891224316again
I want to create a new column of dummy variables (0, 1) based on col1. If the text contains 11 consecutive digits starting with 8, the value must be 1, otherwise 0.
So I wrote this code:
df["is_number"] = df.col1.str.contains(r"\b8\d{10}").map({True: 1, False: 0})
However output is:
col1 is_number
johns id is 81245678316 1
eric bought 82241624316 yesterday 1
mine is87721624316 0
frank is a genius 0
i accepted new 82891224316again 0
As you can see, the third and fifth rows have 0 in "is_number", but I want them to have 1, even though the space between the words and the numbers is missing in some places. How can I do that? I want:
col1 is_number
johns id is 81245678316 1
eric bought 82241624316 yesterday 1
mine is87721624316 1
frank is a genius 0
i accepted new 82891224316again 1
Solution
You can use numeric (digit) boundaries, since the numbers in your input can be "glued" to letters. Letters and digits are both word characters, so there is no word boundary between a letter and the 8:
df["is_number"] = df['col1'].str.contains(r"(?<!\d)8\d{10}(?!\d)").map({True: 1, False: 0})
Output:
>>> df
col1 is_number
0 johns id is 81245678316 1
1 eric bought 82241624316 yesterday 1
2 mine is87721624316 1
3 frank is a genius 0
4 i accepted new 82891224316again 1
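To see why the boundaries matter, here is a minimal sketch using only Python's standard `re` module on the sample strings from the question (`str.contains` performs a regex search in the same way). The labels in `patterns` and the helper `flags` are illustrative names, not part of the original answer. Note that with only a leading `\b`, as the pattern is written in the question, the fifth row would actually match because a space precedes the 8; it is a trailing `\b` that would reject it.

```python
import re

# Sample strings copied from col1 in the question.
rows = [
    "johns id is 81245678316",
    "eric bought 82241624316 yesterday",
    "mine is87721624316",
    "frank is a genius",
    "i accepted new 82891224316again",
]

# Candidate patterns; the dictionary keys are illustrative labels only.
patterns = {
    "leading_word_boundary": r"\b8\d{10}",
    "both_word_boundaries": r"\b8\d{10}\b",
    "digit_boundaries": r"(?<!\d)8\d{10}(?!\d)",
}


def flags(pattern, texts):
    """Return 1 for each text that contains a match of pattern, else 0."""
    compiled = re.compile(pattern)
    return [int(compiled.search(text) is not None) for text in texts]


results = {name: flags(pattern, rows) for name, pattern in patterns.items()}
for name, values in results.items():
    print(name, values)
# leading_word_boundary [1, 1, 0, 0, 1]
# both_word_boundaries [1, 1, 0, 0, 0]
# digit_boundaries [1, 1, 1, 0, 1]
```

The lookarounds `(?<!\d)` and `(?!\d)` only require that no digit sits immediately before or after the 11-digit run, so letters glued to either side are accepted, while a longer run such as `812456783169` is still rejected.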
Answered By – Wiktor Stribiżew
Answer Checked By – Pedro (BugsFixing Volunteer)