Table of Contents

## Issue

I have a DataFrame that looks like this:

```
import pandas as pd
rows = [
('chocolate', 'choco'),
('banana', pd.np.nan),
('hello world', 'world'),
('hello you', 'world'),
('hello you choco', 'world'),
('this is a very long sentence', 'very long')
]
data = pd.DataFrame.from_records(rows, columns=['origin', 'to_find'])
```

```
origin to_find
0 chocolate choco
1 banana NaN
2 hello world world
3 hello you world
4 hello you choco world
5 this is a very long sentence very long
```

My goal is to search the second column string within the first column, and remove it. If I do not find the substring `to_find`

within `origin`

, I replace the `to_find`

with a `NaN`

. Because this is a string operation that needs to be done row by row, I chose the `apply`

way. I need to extract entire words.

This is my function, that works *almost** as expected, and how I `apply`

it:

```
def find_word(row):
# Handle the case where to_find is already NaN
if row.to_find is pd.np.nan:
return row
if row.to_find in row.origin:
row.origin = row.origin.replace(row.to_find, '').strip()
else:
row.to_find = pd.np.nan
return row
new_df = data.apply(find_word, axis=1)
```

* this code returns two spaces instead of one between `this is a`

and `sentence`

, which is not desirable.

The `new_df`

is expected to look like this:

```
origin to_find
0 late choco
1 banana NaN
2 hello world
3 hello you NaN
4 hello you choco NaN
5 this is a sentence very long
```

My problem is, that my original `df`

has millions of lines, and this particular operation takes forever with a huge DataFrame.

What would be a more performant, maybe vectorized way to tackle this?

(The `.contains`

method seems to work only to find one specific string in a vector, not pairwise. It was my best lead but couldn’t make it work.)

## Solution

# Update

Reading this thread and this one, I managed to ridiculously reduced the process time using list comprehensions. Here comes `method_3`

:

```
def method_3(df):
df["to_find"] = df["to_find"].fillna('')
df['temp_origin'] = df['origin'].copy()
df['origin'] = [' '.join([x for x in a.split() if x not in set(b.split())]) for a, b in zip(df['origin'], df['to_find'])]
df['temp_origin'] = [' '.join([x for x in a.split(' ') if x not in set(b.split(' '))]) for a, b in zip(df['temp_origin'], df['origin'])]
df['temp_origin'] = df['temp_origin'].replace('', pd.np.nan)
del df['to_find']
df.rename(columns={'temp_origin': 'to_find'}, inplace=True)
return df
```

Now with new timings:

```
Method 1 took 13.820100281387568 sec.
Method 2 took 2.89176794141531 sec.
Method 3 took 0.26977075077593327 sec.
```

The three approaches are `O(n)`

, but it’s up to 50x faster using `method_3`

.

# Original post

Largely inspired by @sygneto’s answer, I managed to improve speed by almost 5 times.

## Two different methods

I put my first method in a function called `method_1`

and the other in `method_2`

:

```
def find_word(row):
if row.to_find is pd.np.nan:
return row
if row.to_find in row.origin:
row.origin = row.origin.replace(row.to_find, '').strip()
else:
row.to_find = pd.np.nan
return row
def method_1(df):
return df.apply(find_word, axis=1)
def method_2(df):
df = df.fillna('')
df['temp_origin'] = df['origin']
df["origin"] = df.apply(lambda x: x["origin"].replace(x["to_find"], ""), axis=1)
df["to_find"] = df.apply(lambda x: pd.np.nan if x["origin"] == (x["temp_origin"]) else x["to_find"], axis=1)
del df['temp_origin']
return df
```

## Measure speed for both methods

To compare the time spent, I took my initial DataFrame and `concat`

ed it 10000 times:

```
from timeit import default_timer
df = pd.concat([data] * 10000)
t0 = default_timer()
new_df_1 = method_1(df)
t1 = default_timer()
df = pd.concat([data] * 10000)
t2 = default_timer()
new_df_2 = method_2(df)
t3 = default_timer()
print(f"Method 1 took {t1-t0} sec.")
print(f"Method 2 took {t3-t2} sec.")
```

which outputs:

```
Method 1 took 11.803373152390122 sec.
Method 2 took 2.362371975556016 sec.
```

There is probably some space of improvements, but still a big step has been taken.

Answered By – pierre_loic

Answer Checked By – Mildred Charles (BugsFixing Admin)