[SOLVED] Select top-N from two pandas DataFrames

Issue

Assume, there are two pandas DataFrame: df1 & df2. The df1 is a square data frame such as following

import numpy as np 
import pandas as pd
item_names = [2,7,9,10,11,13,14,21,24]
np.random.seed(123)
nums = np.round(np.random.random(size=(9,9)),2)
df1 = pd.DataFrame(nums, index=item_names, columns=item_names)

df1 output:

      2     7     9     10    11    13    14    21    24
2   0.70  0.29  0.23  0.55  0.72  0.42  0.98  0.68  0.48
7   0.39  0.34  0.73  0.44  0.06  0.40  0.74  0.18  0.18
9   0.53  0.53  0.63  0.85  0.72  0.61  0.72  0.32  0.36
10  0.23  0.29  0.63  0.09  0.43  0.43  0.49  0.43  0.31
11  0.43  0.89  0.94  0.50  0.62  0.12  0.32  0.41  0.87
13  0.25  0.48  0.99  0.52  0.61  0.12  0.83  0.60  0.55
14  0.34  0.30  0.42  0.68  0.88  0.51  0.67  0.59  0.62
21  0.67  0.84  0.08  0.76  0.24  0.19  0.57  0.10  0.89
24  0.63  0.72  0.02  0.59  0.56  0.16  0.15  0.70  0.32

The df2 stores item and its corresponding group information such as

df2 = pd.DataFrame({'item': item_names,
                    'group':['a1','a1','a1','a2',
                             'a2','a2','a2','a3','a3']})

df2 output:

   item group
0     2    a1
1     7    a1
2     9    a1
3    10    a2
4    11    a2
5    13    a2
6    14    a2
7    21    a3
8    24    a3

The goal is to write a function which can select top N items in a specific row (item name) based on the corresponding values (largest) using these two DataFrames’ information. However, the returned top N items and query item ALL MUST from ‘different groups‘. Such as

A query item (item = 10) is in the 4th row of df1 (item = 10). The top 2 returned items will be [9, 21] not [9, 14]. Since, item 10 is from group = a2 and any of returned items (top N) should not from a2 group. I have checked Scott Boston solution for a similar problem but it can’t avoid the top N items and query item are from same group. Any suggestions? many thanks

Solution

IIUC, you want to select the N largest values excluding the values from the same group.

Here is a function that does this:

def get_top_N(idx, N=2):
    group = df2.set_index('item')['group']
    incl = group[group.ne(group[idx])].index
    return df1.loc[idx, incl].nlargest(2).index.to_list()
    
get_top_N(10)
# [9, 21]

If you additionally want to ensure that all values are from different groups (this was unclear if a requirement, as this is the case for your example). You can additionally do:

def get_top_N_diff(idx, N=2):
    group = df2.set_index('item')['group']
    incl = group[group.ne(group[idx])].index
    s = df1.loc[idx, incl]
    return s.sort_values(ascending=False).groupby(group).idxmax().to_list()[:N]

get_top_N(11)      # same group
# [9, 7]

get_top_N_diff(11) # different groups
# [9, 24]

Answered By – mozway

Answer Checked By – Clifford M. (BugsFixing Volunteer)

Leave a Reply

Your email address will not be published. Required fields are marked *