[SOLVED] How to speed up pandas apply?

Issue

I have a large DataFrame of 3.14 million rows and I need to group them using 1 column then calculate the std dev and mean by groups. I have done the following and it works as expected, but it is very slow (20000 rows alone take 1min+). I am pretty sure there are much faster ways but I could not find any when I tried to google.

def gen_time_feats(df):
    feats = df.loc[:, 'f_0':'f_299']
    for col in feats.columns:
        df[f'{col}_avg'] = np.mean(feats[col].values)
        df[f'{col}_std'] = np.std(feats[col].values)
    return df

df_grouped = df.groupby('group_id')
df_with_time_feat = df_grouped.apply(gen_time_feats)

The DataFrame has 300 columns and I want to append the std dev and mean for each column as additional columns, so in the end it will become 900 columns.

Solution

Use filter to select the f_xxx columns then groupby group_id column then use a list of functions to apply to each columns. Finally, flat your column index.

out = df.filter(regex='f_\d+').groupby(df['group_id']).agg(['mean', 'std'])
out.columns = out.columns.to_flat_index().str.join('_')
out = df.set_index('group_id').join(out).sort_index(axis=1).reset_index()
print(out)

# Output:
  group_id  f_100  f_100_mean  f_100_std  f_200  f_200_mean  f_200_std
0        A      1         3.0   2.828427      5         7.0   2.828427
1        A      5         3.0   2.828427      9         7.0   2.828427
2        B      2         3.0   1.414214      6         7.0   1.414214
3        B      4         3.0   1.414214      8         7.0   1.414214

Setup:

df = pd.DataFrame({'group_id': list('ABBA'), 'f_100': [1, 2, 4, 5], 'f_200': [5, 6, 8, 9]})
print(df)

# Output
  group_id  f_100  f_200
0        A      1      5
1        B      2      6
2        B      4      8
3        A      5      9

Note if you want to have the name ‘_avg’ instead of ‘_mean’, use the modified version of agg: .agg([('avg', 'mean'), 'std'])

Answered By – Corralien

Answer Checked By – Marilyn (BugsFixing Volunteer)

Leave a Reply

Your email address will not be published. Required fields are marked *