Table of Contents

## Issue

# Ultimate Question

**Is there a way to do a general, performant groupby-operation that does not rely on pd.groupby?**

# Input

```
pd.DataFrame([[1, '2020-02-01', 'a'], [1, '2020-02-10', 'b'], [1, '2020-02-17', 'c'], [2, '2020-02-02', 'd'], [2, '2020-03-06', 'b'], [2, '2020-04-17', 'c']], columns=['id', 'begin_date', 'status'])`
```

```
id begin_date status
0 1 2020-02-01 a
1 1 2020-02-10 b
2 1 2020-02-17 c
3 2 2020-02-02 d
4 2 2020-03-06 b
```

# Desired Output

```
id status count uniquecount
0 1 a 1 1
1 1 b 1 1
2 1 c 1 1
3 2 b 1 1
4 2 c 1 1
```

# Problem

Now, there is an easy way to do that in Python, using Pandas.

```
df = df.groupby(["id", "status"]).agg(count=("begin_date", "count"), uniquecount=("begin_date", lambda x: x.nunique())).reset_index()
# As commented, omitting the lambda and replacing it with "begin_date", "nunique" will be faster. Thanks!
```

**This operation is slow for larger datasets, I’d take a guess and say O(n²).**

# Existent solutions that lack the desired general applicability

Now, after some googling, there are some alternative solutions on StackOverflow, either using numpy, iterrows, or different other ways.

Faster alternative to perform pandas groupby operation

Pandas fast weighted random choice from groupby

And an excellent one:

Groupby in python pandas: Fast Way

These solutions generally aim to create the "count" or "uniquecount" in my example, basically the aggregated value. But, unfortunately, always only one aggregation and not with multiple groupby columns.

Also, they unfortunately never explain how to merge them into the grouped dataframe.

Is there a way to use itertools (Like this answer: Faster alternative to perform pandas groupby operation, or even better this answer: Groupby in python pandas: Fast Way) that do not only return the series "count", but the whole dataframe in grouped form?

# Ultimate Question

**Is there a way to do a general, performant groupby-operation that does not rely on pd.groupby?**

This would look something like this:

```
from typing import List
def fastGroupby(df, groupbyColumns: List[str], aggregateColumns):
# numpy / iterrow magic
return df_grouped
df = fastGroupby(df, ["id", "status"], {'status': 'count',
'status': 'count'}
```

And return the desired output.

## Solution

Before ditching `groupby`

I’d suggest first evaluating whether you are truly taking advantage of what `groupby`

has to offer.

### Do away with `lambda`

in favor of built-in `pd.DataFrameGroupBy`

methods.

Many of the `Series`

and `DataFrame`

methods are implemented as `pd.DataFrameGroupBy`

methods. You should use those directly as opposed to calling them with a `groupby`

+ `apply(lambda x: ...)`

Further, for many calculations you can re-frame the problem as some vectorized operation on an entire DataFrame that then uses a groupby method implemented in cython. This will be fast.

A common example of this would be finding the proportion of `'Y'`

answers within a group. A straight-forward approach would be to check the condition within each group then get the proportion:

```
N = 10**6
df = pd.DataFrame({'grp': np.random.choice(range(10000), N),
'answer': np.random.choice(['Y', 'N'], N)})
df.groupby('grp')['answer'].apply(lambda x: x.eq('Y').mean())
```

Thinking about the problem this way requires the `lambda`

, because we do two operations within the groupby; check the condition then average. This exact same calculation can be thought of as first checking the condition on the entire DataFrame then calculating average within group:

```
df['answer'].eq('Y').groupby(df['grp']).mean()
```

This is a very minor change yet the consequences are huge, and the gains will become greater as the number of groups increases.

```
%timeit df.groupby('grp')['answer'].apply(lambda x: x.eq('Y').mean())
#2.32 s ± 99.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df['answer'].eq('Y').groupby(df['grp']).mean()
#82.8 ms ± 995 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

### Add `sort=False`

as an argument

By default `groupby`

sorts the output on the keys. If there is no reason to have a sorted output you can get a slight gain specifying `sort=False`

### Add `observed=True`

as an argument

If grouping keys are categorical it will reindex to all possible combinations, even for groups that never appear in your DataFrame. If these are not important, removing them from the output will greatly improve the speed.

For your example we can examine the difference. There’s an enormous gain switching to `pd.DataFrameGroupBy.nunique`

and removing the sorting adds a little extra speed. The combination of both gives an "identical" solution (up to sorting), and is nearly 100x faster for many groups.

```
import perfplot
import pandas as pd
import numpy
def agg_lambda(df):
return df.groupby(['id', 'status']).agg(uniquecount=('Col4', lambda x: x.nunique()))
def agg_nunique(df):
return df.groupby(['id', 'status']).agg(uniquecount=('Col4', 'nunique'))
def agg_nunique_nosort(df):
return df.groupby(['id', 'status'], sort=False).agg(uniquecount=('Col4', 'nunique'))
perfplot.show(
setup=lambda N: pd.DataFrame({'Col1': range(N),
'status': np.random.choice(np.arange(N), N),
'id': np.random.choice(np.arange(N), N),
'Col4': np.random.choice(np.arange(N), N)}),
kernels=[
lambda df: agg_lambda(df),
lambda df: agg_nunique(df),
lambda df: agg_nunique_nosort(df),
],
labels=['Agg Lambda', 'Agg Nunique', 'Agg Nunique, No sort'],
n_range=[2 ** k for k in range(20)],
# Equality check same data, just allow for different sorting
equality_check=lambda x,y: x.sort_index().compare(y.sort_index()).empty,
xlabel="~ Number of Groups"
)
```

Answered By – ALollz

Answer Checked By – Timothy Miller (BugsFixing Admin)