[SOLVED] pandas performance: text column replacement is slow

Issue

I have a large dataset with 250,000 entries, and the text column that I am processing contains a sentence in each row.

import pandas as pd
import spacy
from faker import Faker

nlp = spacy.load('en_core_web_sm')
fake = Faker()

df = pd.read_csv('my/huge/dataset.csv')
# e.g. df = pd.DataFrame({'text': ['Michael Jackson was a famous singer and songwriter.']})

From the text column, I am trying to find the names of people, replace them with fake names from the Faker library, and add the result to a new column, as follows.

# Collect person entities from each row (en_core_web_sm labels them 'PERSON').
person_list = [[ent.text for ent in doc.ents if ent.label_ == 'PERSON'] for doc in nlp.pipe(df.text.values)]
flat_person_list = list(set(item for sublist in person_list for item in sublist))
fake_person_name = [fake.name() for _ in range(len(flat_person_list))]
name_dict = dict(zip(flat_person_list, fake_person_name))

# Use bracket assignment so the new column is actually created on the DataFrame.
df['name'] = df.text.replace(name_dict, regex=True)
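For reference, on the one-row example DataFrame the steps above behave roughly like this (a throwaway df_example is used here, and the fake name is random, so the exact output will vary):

df_example = pd.DataFrame({'text': ['Michael Jackson was a famous singer and songwriter.']})
doc = nlp(df_example.text[0])
print([(ent.text, ent.label_) for ent in doc.ents])
# expected: [('Michael Jackson', 'PERSON')]
name_dict = {'Michael Jackson': fake.name()}
# e.g. {'Michael Jackson': 'Laura Bennett'} -- the generated name differs on every run
df_example['name'] = df_example.text.replace(name_dict, regex=True)
print(df_example['name'][0])
# e.g. 'Laura Bennett was a famous singer and songwriter.'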

The problem is that this takes forever to run, and I am not sure how to improve the performance of the code so that it runs faster.

Solution

OK, I think I found a better way of doing text replacement in pandas, thanks to Florian C's comment.
The spaCy model still takes a lot of time, but that part I cannot change. However, instead of Series.replace, I decided to use map with a lambda, so the last line now looks like this:

df['name'] = df.text.map(lambda x: name_dict.get(x, x))
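For reference, map applies the lookup to each whole cell value, so a row only changes when the entire cell exactly equals a key in name_dict. A minimal sketch of that per-cell behaviour, using a hypothetical one-entry name_dict:

name_dict = {'Michael Jackson': 'Laura Bennett'}  # hypothetical mapping for illustration
s = pd.Series(['Michael Jackson',
               'Michael Jackson was a famous singer and songwriter.'])
replaced = s.map(lambda x: name_dict.get(x, x))
# replaced[0] == 'Laura Bennett'        (the cell exactly equals a key, so it is swapped)
# replaced[1] is the original sentence  (no exact key match, so the fallback x is returned)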

Answered By – zara kolagar

Answer Checked By – Marilyn (BugsFixing Volunteer)
