Issue
Assume there are two pandas DataFrames, df1 and df2. df1 is a square DataFrame such as the following:
import numpy as np
import pandas as pd
item_names = [2,7,9,10,11,13,14,21,24]
np.random.seed(123)
nums = np.round(np.random.random(size=(9,9)),2)
df1 = pd.DataFrame(nums, index=item_names, columns=item_names)
df1 output:
       2     7     9    10    11    13    14    21    24
2   0.70  0.29  0.23  0.55  0.72  0.42  0.98  0.68  0.48
7   0.39  0.34  0.73  0.44  0.06  0.40  0.74  0.18  0.18
9   0.53  0.53  0.63  0.85  0.72  0.61  0.72  0.32  0.36
10  0.23  0.29  0.63  0.09  0.43  0.43  0.49  0.43  0.31
11  0.43  0.89  0.94  0.50  0.62  0.12  0.32  0.41  0.87
13  0.25  0.48  0.99  0.52  0.61  0.12  0.83  0.60  0.55
14  0.34  0.30  0.42  0.68  0.88  0.51  0.67  0.59  0.62
21  0.67  0.84  0.08  0.76  0.24  0.19  0.57  0.10  0.89
24  0.63  0.72  0.02  0.59  0.56  0.16  0.15  0.70  0.32
df2 stores each item and its corresponding group, for example:
df2 = pd.DataFrame({'item': item_names,
                    'group': ['a1', 'a1', 'a1', 'a2',
                              'a2', 'a2', 'a2', 'a3', 'a3']})
df2 output:
   item group
0     2    a1
1     7    a1
2     9    a1
3    10    a2
4    11    a2
5    13    a2
6    14    a2
7    21    a3
8    24    a3
The goal is to write a function that, for a specific row of df1 (the query item), selects the top N items with the largest values, using the information in both DataFrames. However, the returned top N items and the query item must all be from different groups. For example, the query item 10 is in the 4th row of df1. The top 2 returned items should be [9, 21], not [9, 14], since item 10 is in group a2 and none of the returned top N items may be from group a2. I have checked Scott Boston's solution for a similar problem, but it does not prevent the top N items from being in the same group as the query item. Any suggestions? Many thanks.
Solution
IIUC, you want to select the N largest values excluding the values from the same group.
Here is a function that does this:
def get_top_N(idx, N=2):
    # map each item to its group
    group = df2.set_index('item')['group']
    # items whose group differs from the query item's group
    incl = group[group.ne(group[idx])].index
    return df1.loc[idx, incl].nlargest(N).index.to_list()
get_top_N(10)
# [9, 21]
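The function above reads df1 and df2 from the enclosing scope. As a minimal, self-contained sketch of the same idea with both frames passed in as arguments (the name top_n_other_groups and its parameters are illustrative, not part of the original answer), something like this should work:

```python
import numpy as np
import pandas as pd

# Rebuild the example data from the question
item_names = [2, 7, 9, 10, 11, 13, 14, 21, 24]
np.random.seed(123)
df1 = pd.DataFrame(np.round(np.random.random(size=(9, 9)), 2),
                   index=item_names, columns=item_names)
df2 = pd.DataFrame({'item': item_names,
                    'group': ['a1', 'a1', 'a1', 'a2',
                              'a2', 'a2', 'a2', 'a3', 'a3']})

def top_n_other_groups(scores, groups, item, n=2):
    """Hypothetical helper: top-n columns of row `item` outside its own group."""
    group = groups.set_index('item')['group']
    # Candidate columns: items whose group differs from the query item's group
    candidates = group.index[group != group.loc[item]]
    return scores.loc[item, candidates].nlargest(n).index.to_list()

print(top_n_other_groups(df1, df2, 10))        # [9, 21]
print(top_n_other_groups(df1, df2, 10, n=3))   # [9, 21, 24]
```

Passing the frames explicitly makes the function easier to reuse and test, at the cost of a slightly longer call.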
If you additionally want to ensure that all returned items are from different groups (it was unclear whether this is a requirement, since it happens to hold in your example), you can do the following:
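def get_top_N_diff(idx, N=2):
    group = df2.set_index('item')['group']
    incl = group[group.ne(group[idx])].index
    s = df1.loc[idx, incl]
    # sort=False keeps the groups in order of their best value,
    # so [:N] takes the N best groups rather than the first N group labels
    return (s.sort_values(ascending=False)
             .groupby(group, sort=False).idxmax().to_list()[:N])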
get_top_N(11) # same group
# [9, 7]
get_top_N_diff(11) # different groups
# [9, 24]
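Another way to pick at most one item per group is to rank the candidates by value first and then keep only the first item seen from each group, so that the result stays ordered by value. The sketch below assumes the same example data; the name top_n_distinct_groups and its parameters are hypothetical and not from the original answer:

```python
import numpy as np
import pandas as pd

# Rebuild the example data from the question
item_names = [2, 7, 9, 10, 11, 13, 14, 21, 24]
np.random.seed(123)
df1 = pd.DataFrame(np.round(np.random.random(size=(9, 9)), 2),
                   index=item_names, columns=item_names)
df2 = pd.DataFrame({'item': item_names,
                    'group': ['a1', 'a1', 'a1', 'a2',
                              'a2', 'a2', 'a2', 'a3', 'a3']})

def top_n_distinct_groups(scores, groups, item, n=2):
    """Hypothetical helper: best item of each other group, ordered by value."""
    group = groups.set_index('item')['group']
    # Drop every item that shares the query item's group
    row = scores.loc[item, group.index[group != group.loc[item]]]
    ranked = row.sort_values(ascending=False)
    # Keep only the first (highest-valued) item seen for each group
    first_of_group = ~ranked.index.map(group).duplicated()
    return ranked[first_of_group].head(n).index.to_list()

print(top_n_distinct_groups(df1, df2, 11))   # [9, 24]
print(top_n_distinct_groups(df1, df2, 14))   # [24, 9]
```

For item 14 the best candidate is item 24 from group a3 (0.62), ahead of item 9 from group a1 (0.42), which is why 24 comes first.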
Answered By – mozway
Answer Checked By – Clifford M. (BugsFixing Volunteer)