## Issue

Say I have two pandas dataframes (df_a & df_b), where each row represents a toy and features about that toy. Some pretend features:

- Was_Sold (Y/N)
- Color
- Size_Group
- Shape
- Date_Made

Say df_a is relatively small (10s of thousands of rows) and df_b is relatively large (>1 million rows).

Then for *every row in df_a*, I want to:

- Find all the toys from df_b with the same type as the one from df_a (e.g. the same color group)
- The df_b toys must also be made before the given df_a toy
- Then find the ratio of those sold (So count sold / count all matched)

**What is the most efficient means to make those per-row calculations above?**

The best I’ve came up with so far is something like the below.

(Note code might have an error or two as I’m rough typing from a different use case)

```
cols = ['Color', 'Size_Group', 'Shape']
# Run this calculation for multiple features
for col in cols:
print(col + ' - Started')
# Empty list to build up the calculation in
ratio_list = []
# Start the iteration
for row in df_a.itertuples(index=False):
# Relevant values from df_a
relevant_val = getattr(row, col)
created_date = row.Date_Made
# df to keep the overall prior toy matches
prior_toys = df_b[(df_b.Date_Made < created_date) & (df_b[col] == relevant_val)]
prior_count = len(prior_toys)
# Now find the ones that were sold
prior_sold_count = len(prior_toys[prior_toys.Was_Sold == "Y"])
# Now make the calculation and append to the list
if prior_count == 0:
ratio = 0
else:
ratio = prior_sold_count / prior_count
ratio_list.append(ratio)
# Store the calculation in the original df_a
df_a[col + '_Prior_Sold_Ratio'] = ratio_list
print(col + ' - Finished')
```

Using `.itertuples()`

is useful, but this is still pretty slow. Is there a more efficient method or something I’m missing?

**EDIT**

Added the below script which will emulated data for the above scenario:

```
import numpy as np
import pandas as pd
colors = ['red', 'green', 'yellow', 'blue']
sizes = ['small', 'medium', 'large']
shapes = ['round', 'square', 'triangle', 'rectangle']
sold = ['Y', 'N']
size_df_a = 200
size_df_b = 2000
date_start = pd.to_datetime('2015-01-01')
date_end = pd.to_datetime('2021-01-01')
def random_dates(start, end, n=10):
start_u = start.value//10**9
end_u = end.value//10**9
return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
df_a = pd.DataFrame(
{
'Color': np.random.choice(colors, size_df_a),
'Size_Group': np.random.choice(sizes, size_df_a),
'Shape': np.random.choice(shapes, size_df_a),
'Was_Sold': np.random.choice(sold, size_df_a),
'Date_Made': random_dates(date_start, date_end, n=size_df_a)
}
)
df_b = pd.DataFrame(
{
'Color': np.random.choice(colors, size_df_b),
'Size_Group': np.random.choice(sizes, size_df_b),
'Shape': np.random.choice(shapes, size_df_b),
'Was_Sold': np.random.choice(sold, size_df_b),
'Date_Made': random_dates(date_start, date_end, n=size_df_b)
}
)
```

## Solution

First of all, I think your computation would be much more efficient using *relational database* and SQL query. Indeed, the filters can be done by indexing columns, performing a database join, some advance filtering and count the result. An optimized relational database can generate an efficient algorithm based on a simple SQL query (hash-based row grouping, binary search, fast intersection of sets, etc.). Pandas is sadly not very good to perform *efficiently* advanced requests like this. It is also very slow to iterate over pandas dataframe although I am not sure this can be alleviated in this case using only pandas. Hopefully you can use some Numpy and Python tricks and (partially) implement what fast relational database engines would do.

Additionally, pure-Python **object types are slow**, especially (unicode) strings. Thus, **converting column types to efficient ones in a first place can save a lot of time (and memory). For example, there is no need for the `Was_Sold`

column to contains "Y"/"N" string objects: a *boolean* can just be used in that case. Thus let us convert that:

```
df_b.Was_Sold = df_b.Was_Sold == "Y"
```

Finally, the current algorithm has a **bad complexity**: `O(Na * Nb)`

where `Na`

is the number of rows in `df_a`

and `Nb`

is the number of rows in `df_b`

. This is not easy to improve though due to the non-trivial conditions. A first solution is to group `df_b`

by `col`

column ahead-of-time so to avoid an expensive complete iteration of `df_b`

(previously done with `df_b[col] == relevant_val`

). Then, the date of the precomputed groups can be sorted so to perform a fast binary search later. Then you can use Numpy to count boolean values efficiently (using `np.sum`

).

Note that doing `prior_toys['Was_Sold']`

is a bit faster than `prior_toys.Was_Sold`

.

Here is the resulting code:

```
cols = ['Color', 'Size_Group', 'Shape']
# Run this calculation for multiple features
for col in cols:
print(col + ' - Started')
# Empty list to build up the calculation in
ratio_list = []
# Split df_b by col and sort each (indexed) group by date
colGroups = {grId: grDf.sort_values('Date_Made') for grId, grDf in df_b.groupby(col)}
# Start the iteration
for row in df_a.itertuples(index=False):
# Relevant values from df_a
relevant_val = getattr(row, col)
created_date = row.Date_Made
# df to keep the overall prior toy matches
curColGroup = colGroups[relevant_val]
prior_count = np.searchsorted(curColGroup['Date_Made'], created_date)
prior_toys = curColGroup[:prior_count]
# Now find the ones that were sold
prior_sold_count = prior_toys['Was_Sold'].values.sum()
# Now make the calculation and append to the list
if prior_count == 0:
ratio = 0
else:
ratio = prior_sold_count / prior_count
ratio_list.append(ratio)
# Store the calculation in the original df_a
df_a[col + '_Prior_Sold_Ratio'] = ratio_list
print(col + ' - Finished')
```

This is **5.5 times faster** on my machine.

The iteration of the pandas dataframe is a major source of slowdown. Indeed, `prior_toys['Was_Sold']`

takes half the computation time because of the huge overhead of pandas internal function calls repeated `Na`

times… Using **Numba** may help to reduce the cost of the slow iteration. Note that the complexity can be increased by splitting `colGroups`

in subgroups ahead of time (`O(Na log Nb)`

). This should help to completely remove the overhead of `prior_sold_count`

. The resulting program should be about **10 time faster** than the original one.

Answered By – Jérôme Richard

Answer Checked By – Dawn Plyler (BugsFixing Volunteer)