Issue
I have a large dataset with 250,000 entries, and the text column I am processing contains a sentence in each row.
import pandas as pd
import spacy
nlp = spacy.load('en_core_web_sm')
from faker import Faker
fake = Faker()
df = pd.read_csv('my/huge/dataset.csv')
e.g. df = pd.DataFrame({'text': ['Michael Jackson was a famous singer and songwriter.']})
So, from the text column, I am trying to find the names of people and replace them with fake names from the faker library, adding the result to a new column, as follows:
# collect the person entities per document (en_core_web_sm labels people 'PERSON', not 'PER')
person_list = [[ent.text for ent in doc.ents if ent.label_ == 'PERSON'] for doc in nlp.pipe(df.text.values)]
flat_person_list = list(set([item for sublist in person_list for item in sublist]))
fake_person_name = [fake.name() for n in range(len(flat_person_list))]
name_dict = dict(zip(flat_person_list, fake_person_name))
df['name'] = df.text.replace(name_dict, regex=True)  # df.name = ... would not create a column
The problem is that it takes forever to run, and I am not sure how to improve the performance of the code so it runs faster.
Solution
OK, I think I found a better way of doing text replacement in pandas, thanks to Florian C's comment.
The spaCy model still takes a lot of time, and that part I cannot change; however, instead of replace, I decided to use map with a lambda, so the last line is now:
df['name'] = df.text.map(lambda x: name_dict.get(x, x))
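One caveat worth noting (not part of the original answer, so treat this as a hedged sketch): name_dict.get(x, x) only swaps a cell whose entire value equals a detected name, so names embedded inside a sentence are left untouched. A single compiled regular expression over the dictionary keys keeps the speed of a one-pass replacement while still matching names mid-sentence. The name_dict below is a stand-in for the faker-built mapping from the question:

```python
import re

# Stand-in mapping; in the post this comes from zipping spaCy's PERSON
# entities with names generated by faker.
name_dict = {'Michael Jackson': 'John Doe'}

# Compile one alternation of the (escaped) real names. Unlike dict.get,
# which only replaces whole-cell matches, this rewrites names anywhere
# in the text in a single pass.
pattern = re.compile('|'.join(re.escape(name) for name in name_dict))

def replace_names(text):
    # Look up the fake name for whichever real name matched.
    return pattern.sub(lambda m: name_dict[m.group(0)], text)

print(replace_names('Michael Jackson was a famous singer and songwriter.'))
# -> John Doe was a famous singer and songwriter.
```

With pandas this would plug in as df['name'] = df.text.map(replace_names), reusing the single compiled pattern across all 250,000 rows.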
Answered By – zara kolagar
Answer Checked By – Marilyn (BugsFixing Volunteer)