Issue
I have a DataFrame that looks like this:
import numpy as np
import pandas as pd

rows = [
    ('chocolate', 'choco'),
    ('banana', np.nan),
    ('hello world', 'world'),
    ('hello you', 'world'),
    ('hello you choco', 'world'),
    ('this is a very long sentence', 'very long')
]
data = pd.DataFrame.from_records(rows, columns=['origin', 'to_find'])
                         origin    to_find
0                     chocolate      choco
1                        banana        NaN
2                   hello world      world
3                     hello you      world
4               hello you choco      world
5  this is a very long sentence  very long
My goal is to search for the second column's string within the first column and remove it. If I do not find the substring to_find within origin, I replace to_find with NaN. Because this is a string operation that needs to be done row by row, I chose the apply way. I need to extract entire words.
This is my function, which works almost* as expected, and how I apply it:
def find_word(row):
    # Handle the case where to_find is already NaN
    if row.to_find is np.nan:
        return row
    if row.to_find in row.origin:
        row.origin = row.origin.replace(row.to_find, '').strip()
    else:
        row.to_find = np.nan
    return row
new_df = data.apply(find_word, axis=1)
* this code returns two spaces instead of one between "this is a" and "sentence", which is not desirable.
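(That leftover double space comes from removing an inner word; one quick way to normalize it, as a side note, is to split on whitespace and re-join:)

```python
s = 'this is a  sentence'          # double space left after removing 'very long'
normalized = ' '.join(s.split())   # split() collapses any run of whitespace
print(normalized)                  # this is a sentence
```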
The new_df is expected to look like this:
               origin    to_find
0                late      choco
1              banana        NaN
2               hello      world
3           hello you        NaN
4     hello you choco        NaN
5  this is a sentence  very long
My problem is that my original df has millions of lines, and this particular operation takes forever on a huge DataFrame.
What would be a more performant, maybe vectorized, way to tackle this?
(The .contains method seems to work only for finding one specific string in a vector, not pairwise. It was my best lead, but I couldn't make it work.)
Solution
Update
Reading this thread and this one, I managed to ridiculously reduce the processing time using list comprehensions. Here comes method_3:
def method_3(df):
    df["to_find"] = df["to_find"].fillna('')
    df['temp_origin'] = df['origin'].copy()
    df['origin'] = [' '.join([x for x in a.split() if x not in set(b.split())])
                    for a, b in zip(df['origin'], df['to_find'])]
    df['temp_origin'] = [' '.join([x for x in a.split(' ') if x not in set(b.split(' '))])
                         for a, b in zip(df['temp_origin'], df['origin'])]
    df['temp_origin'] = df['temp_origin'].replace('', np.nan)
    del df['to_find']
    df.rename(columns={'temp_origin': 'to_find'}, inplace=True)
    return df
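The core of method_3 is the pairwise list comprehension over the two columns; stripped of the DataFrame bookkeeping, the idea looks like this (a minimal sketch on plain lists):

```python
origins = ['hello world', 'this is a very long sentence']
to_find = ['world', 'very long']

# For each row pair, keep only the words of origin that are not in to_find
cleaned = [' '.join(w for w in a.split() if w not in set(b.split()))
           for a, b in zip(origins, to_find)]
print(cleaned)  # ['hello', 'this is a sentence']
```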
Now with new timings:
Method 1 took 13.820100281387568 sec.
Method 2 took 2.89176794141531 sec.
Method 3 took 0.26977075077593327 sec.
All three approaches are O(n), but method_3 is up to 50x faster.
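Note that because method_3 compares whole words, a partial match such as choco inside chocolate is no longer removed. If whole-word matching with multi-word phrases and clean spacing is wanted, a per-row regex is another possible variant (a sketch, not part of the original answer; remove_phrase is a hypothetical helper name):

```python
import re

def remove_phrase(origin, to_find):
    # \b anchors restrict the match to whole words; re.escape guards
    # against regex metacharacters appearing in to_find
    cleaned = re.sub(r'\b' + re.escape(to_find) + r'\b', '', origin)
    return ' '.join(cleaned.split())  # collapse any leftover double spaces

print(remove_phrase('this is a very long sentence', 'very long'))  # this is a sentence
print(remove_phrase('chocolate', 'choco'))  # chocolate (whole words only)
```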
Original post
Largely inspired by @sygneto's answer, I managed to improve the speed by almost 5 times.
Two different methods
I put my first method in a function called method_1 and the other in method_2:
def find_word(row):
    if row.to_find is np.nan:
        return row
    if row.to_find in row.origin:
        row.origin = row.origin.replace(row.to_find, '').strip()
    else:
        row.to_find = np.nan
    return row

def method_1(df):
    return df.apply(find_word, axis=1)

def method_2(df):
    df = df.fillna('')
    df['temp_origin'] = df['origin']
    df["origin"] = df.apply(lambda x: x["origin"].replace(x["to_find"], ""), axis=1)
    df["to_find"] = df.apply(lambda x: np.nan if x["origin"] == x["temp_origin"] else x["to_find"], axis=1)
    del df['temp_origin']
    return df
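The trick in method_2 is that str.replace leaves origin unchanged exactly when to_find was absent, so comparing against the saved temp_origin tells us where to write NaN back. On a single pair of strings (a minimal illustration):

```python
origin, to_find = 'hello you', 'world'
new_origin = origin.replace(to_find, '')
was_found = new_origin != origin  # an unchanged string means to_find was absent
print(was_found)  # False, so to_find becomes NaN for this row
```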
Measure speed for both methods
To compare the time spent, I took my initial DataFrame and concatenated it 10,000 times:
from timeit import default_timer
df = pd.concat([data] * 10000)
t0 = default_timer()
new_df_1 = method_1(df)
t1 = default_timer()
df = pd.concat([data] * 10000)
t2 = default_timer()
new_df_2 = method_2(df)
t3 = default_timer()
print(f"Method 1 took {t1-t0} sec.")
print(f"Method 2 took {t3-t2} sec.")
which outputs:
Method 1 took 11.803373152390122 sec.
Method 2 took 2.362371975556016 sec.
There is probably still room for improvement, but a big step has been taken.
Answered By – pierre_loic
Answer Checked By – Mildred Charles (BugsFixing Admin)