[SOLVED] Quickly replacing information of a pandas dataframe using other dataframe and series

Issue

I am currently trying to replace information of a DataFrame using another DataFrame and a series for my simulation analysis.

Toy example is as follows

A is a user info DataFrame, B is a service info DataFrame, and C is series information about whether the user changed service.

TableA (user's current service info):
        cost   location
John    100    Tokyo
Tom     50     Seoul
Andy    50     Seoul
Mark    80     Seoul

TableB (service info):
             cost    location
premium_T    100     Tokyo
basic_T      60      Tokyo
premium_S    80      Seoul
basic_S      50      Seoul

Table C (service change info):
        change        
John    no  
Tom     no     
Andy    premium_S      
Mark    basic_S  

using above data, I’d like to change information in Table A, using data in Table B and C. In other words, I desire:

TableA' (modified user's service info):
        cost   location
John    100    Tokyo
Tom     50     Seoul
Andy    80     Seoul
Mark    50     Seoul

The code I used is:

TableA = pd.DataFrame(index = ['John', 'Tom', 'Andy', 'Mark'], 
                      data = {'cost': [100,50,50,80],
                     'location': ['Tokyo', 'Seoul', 'Seoul', 'Seoul']})

TableB = pd.DataFrame(index = ['premium_T', 'basic_T', 'premium_S', 'basic_S'],
                      data = {'cost': [100, 60, 80, 50],
                     'location': ['Tokyo','Tokyo','Seoul','Seoul']})  
    
TableC = pd.Series( ['no', 'no', 'premium_S', 'basic_S'], index = ['John', 'Tom', 'Andy', 'Mark'])
    
customer_list = TableA.index.tolist()

for k in customer_list:
    if TableC.loc[k] != 'no':
        TableA.loc[k] = TableB.loc[TableC.loc[k]] 

The code works, and provides the results that I desire.

However, I have to do such work for a very big dataset repeatedly, and I need faster method to do such replacements.

Any ideas? I think repeated use of .loc is the problem, but I have not found probable solution yet. I have looked at pd.update() or pd.replace(), but it does not seem to be what I am looking for.

Solution

First calculate in-scope customers from TableC using reindex and Boolean indexing:

idx = TableC.reindex(TableA.index & TableC.index)
idx = idx[idx != 'no']

Then update TableA via loc:

TableA.loc[np.in1d(TableA.index, idx.index)] = TableB.reindex(idx.values).values

Result:

       cost location
John  100.0    Tokyo
Tom    50.0    Seoul
Andy   80.0    Seoul
Mark   50.0    Seoul

Answered By – jpp

Answer Checked By – Gilberto Lyons (BugsFixing Admin)

Leave a Reply

Your email address will not be published. Required fields are marked *