Issue
I have a large DataFrame of 3.14 million rows and I need to group it by one column, then calculate the standard deviation and mean for each group. I have done the following and it works as expected, but it is very slow (20,000 rows alone take over a minute). I am fairly sure there are much faster ways, but I could not find any when I searched.
import numpy as np

def gen_time_feats(df):
    # select the 300 feature columns f_0 ... f_299
    feats = df.loc[:, 'f_0':'f_299']
    for col in feats.columns:
        # write the group-wide mean/std into every row of the group
        df[f'{col}_avg'] = np.mean(feats[col].values)
        df[f'{col}_std'] = np.std(feats[col].values)
    return df

df_grouped = df.groupby('group_id')
df_with_time_feat = df_grouped.apply(gen_time_feats)
The DataFrame has 300 feature columns, and I want to append the standard deviation and mean of each column as additional columns, so in the end it will have 900 columns.
Solution
Use filter to select the f_xxx columns, then groupby on the group_id column and pass a list of functions to agg so each one is applied to every column. Finally, flatten the column index.
out = df.filter(regex=r'f_\d+').groupby(df['group_id']).agg(['mean', 'std'])
# collapse the (column, stat) MultiIndex into flat names like 'f_100_mean'
out.columns = out.columns.to_flat_index().str.join('_')
# attach the per-group stats back to the original rows
out = df.set_index('group_id').join(out).sort_index(axis=1).reset_index()
print(out)
# Output:
group_id f_100 f_100_mean f_100_std f_200 f_200_mean f_200_std
0 A 1 3.0 2.828427 5 7.0 2.828427
1 A 5 3.0 2.828427 9 7.0 2.828427
2 B 2 3.0 1.414214 6 7.0 1.414214
3 B 4 3.0 1.414214 8 7.0 1.414214
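The flattening step is needed because agg with a list of functions returns a MultiIndex on the columns; to_flat_index() converts it to an index of tuples, and .str.join('_') merges each tuple into a single name. A minimal sketch of just that step, with hypothetical column names:
import pandas as pd

cols = pd.MultiIndex.from_product([['f_100', 'f_200'], ['mean', 'std']])
print(cols.to_flat_index().str.join('_'))
# Index(['f_100_mean', 'f_100_std', 'f_200_mean', 'f_200_std'], dtype='object')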
Setup:
df = pd.DataFrame({'group_id': list('ABBA'), 'f_100': [1, 2, 4, 5], 'f_200': [5, 6, 8, 9]})
print(df)
# Output
group_id f_100 f_200
0 A 1 5
1 B 2 6
2 B 4 8
3 A 5 9
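If you want the statistics broadcast back to every row (as in the question's original output), groupby.transform is another vectorized option: it returns a frame aligned with the original rows, so no join is needed. A sketch under the same setup; the '_avg'/'_std' suffixes are chosen here to match the question's naming:
import pandas as pd

feats = df.filter(regex=r'f_\d+')
grouped = feats.groupby(df['group_id'])
# transform broadcasts each group's statistic to every row of that group
avg = grouped.transform('mean').add_suffix('_avg')
std = grouped.transform('std').add_suffix('_std')
out2 = pd.concat([df, avg, std], axis=1)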
Note: if you want the suffix '_avg' instead of '_mean', use a (name, function) tuple in agg: .agg([('avg', 'mean'), 'std'])
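For example, applied to the setup above, this naming variant should yield columns like the following (a sketch, not output from the original post):
out = df.filter(regex=r'f_\d+').groupby(df['group_id']).agg([('avg', 'mean'), 'std'])
out.columns = out.columns.to_flat_index().str.join('_')
print(out.columns.tolist())
# ['f_100_avg', 'f_100_std', 'f_200_avg', 'f_200_std']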
Answered By – Corralien