[SOLVED] Finding substring in string within a pandas DataFrame is very slow


I have a DataFrame that looks like this:

import pandas as pd

rows = [
    ('chocolate', 'choco'),
    ('banana', pd.np.nan),
    ('hello world', 'world'),
    ('hello you', 'world'),
    ('hello you choco', 'world'),
    ('this is a very long sentence', 'very long')
data = pd.DataFrame.from_records(rows, columns=['origin', 'to_find'])
                         origin    to_find
0                     chocolate      choco
1                        banana        NaN
2                   hello world      world
3                     hello you      world
4               hello you choco      world
5  this is a very long sentence  very long

My goal is to search the second column string within the first column, and remove it. If I do not find the substring to_find within origin, I replace the to_find with a NaN. Because this is a string operation that needs to be done row by row, I chose the apply way. I need to extract entire words.

This is my function, that works almost* as expected, and how I apply it:

def find_word(row):
    # Handle the case where to_find is already NaN
    if row.to_find is pd.np.nan:
        return row

    if row.to_find in row.origin:
        row.origin = row.origin.replace(row.to_find, '').strip()
        row.to_find = pd.np.nan

    return row

new_df = data.apply(find_word, axis=1)

* this code returns two spaces instead of one between this is a and sentence, which is not desirable.

The new_df is expected to look like this:

                origin    to_find
0                 late      choco
1               banana        NaN
2                hello      world
3            hello you        NaN
4      hello you choco        NaN
5  this is a sentence   very long

My problem is, that my original df has millions of lines, and this particular operation takes forever with a huge DataFrame.

What would be a more performant, maybe vectorized way to tackle this?

(The .contains method seems to work only to find one specific string in a vector, not pairwise. It was my best lead but couldn’t make it work.)



Reading this thread and this one, I managed to ridiculously reduced the process time using list comprehensions. Here comes method_3:

def method_3(df):
    df["to_find"] = df["to_find"].fillna('')
    df['temp_origin'] = df['origin'].copy()
    df['origin'] = [' '.join([x for x in a.split() if x not in set(b.split())]) for a, b in zip(df['origin'], df['to_find'])]

    df['temp_origin'] = [' '.join([x for x in a.split(' ') if x not in set(b.split(' '))]) for a, b in zip(df['temp_origin'], df['origin'])]
    df['temp_origin'] = df['temp_origin'].replace('', pd.np.nan)
    del df['to_find']
    df.rename(columns={'temp_origin': 'to_find'}, inplace=True)
    return df

Now with new timings:

Method 1 took 13.820100281387568 sec.
Method 2 took 2.89176794141531 sec.
Method 3 took 0.26977075077593327 sec.

The three approaches are O(n), but it’s up to 50x faster using method_3.

Original post

Largely inspired by @sygneto’s answer, I managed to improve speed by almost 5 times.

Two different methods

I put my first method in a function called method_1 and the other in method_2:

def find_word(row):
    if row.to_find is pd.np.nan:
        return row

    if row.to_find in row.origin:
        row.origin = row.origin.replace(row.to_find, '').strip()
        row.to_find = pd.np.nan

    return row

def method_1(df):
    return df.apply(find_word, axis=1)

def method_2(df):
    df = df.fillna('')
    df['temp_origin'] = df['origin']
    df["origin"] = df.apply(lambda x: x["origin"].replace(x["to_find"], ""), axis=1)
    df["to_find"] = df.apply(lambda x: pd.np.nan if x["origin"] == (x["temp_origin"]) else x["to_find"], axis=1)
    del df['temp_origin']
    return df

Measure speed for both methods

To compare the time spent, I took my initial DataFrame and concated it 10000 times:

from timeit import default_timer

df = pd.concat([data] * 10000)

t0 = default_timer()
new_df_1 = method_1(df)
t1 = default_timer()

df = pd.concat([data] * 10000)

t2 = default_timer()
new_df_2 = method_2(df)
t3 = default_timer()

print(f"Method 1 took {t1-t0} sec.")
print(f"Method 2 took {t3-t2} sec.")

which outputs:

Method 1 took 11.803373152390122 sec.
Method 2 took 2.362371975556016 sec.

There is probably some space of improvements, but still a big step has been taken.

Answered By – pierre_loic

Answer Checked By – Mildred Charles (BugsFixing Admin)

Leave a Reply

Your email address will not be published. Required fields are marked *