Issue
I am working with medical history data and need to figure out to find the series of diagnoses a patient has had.
I have a large database which includes data on unique ID, diagnosis, time of contact with healthcare, and so on.
I made some dummy data here to illustrate:
import pandas as pd
import numpy as np
columns = ["ID","DIAG","TYPE","IN","OUT","GENDER","DOB"]
diags = pd.DataFrame(np.random.randint(0,100,size=(2000,7)),columns=columns)
diags_counter = diags.groupby("ID")["DIAG"].count().to_frame()
diags_counter.reset_index(level=0,inplace=True)
I reset the index, since the IDs in the database are more complex, and using .loc wouldn’t work.
My idea was to make a list or dictionary of a list of dataframes for each patient, i.e. a patient might have only one diagnosis, while another might have 2 or more.
Following code is working, but extremely slow, and since I have over half a million patients, this is not going to work:
diags_seq = []
for i in range(0,len(diags_counter)):
X= []
for j in range(0,len(diags)):
if diags_counter.ID.iloc[i] == diags.ID.iloc[j]:
X.append(diags.iloc[j])
diags_seq.append(X)
print(f"\r{i+1} of {len(diags_counter)} found", end="")
Any help to how to approach this otherwise would be greatly appreciated 🙂
Solution
I think this will be fine:
unique_id = diags.ID.unique()
dict_of_specifics_id = {}
for id in unique_id:
dict_of_specifics_id[id] = {}
dict_of_specifics_id[id]['id_counter'] = 0
dict_of_specifics_id[id]['diag_list'] = []
for index, row in diags.iterrows():
dict_of_specifics_id[row.ID]['id_counter']+=1
dict_of_specifics_id[row.ID]['diag_list'].append(row.DIAG)
dict_of_specifics_id
Output:
{21: {'id_counter': 16,
'diag_list': [45, 41, 92, 91, 62, 54, 16, 18, 23, 18, 0, 47, 9, 45, 2, 61]},
Where 21 is ID and diag_list is list of diagnosis to this ID. id_counter is actualy len of diag_list.
Answered By – Dawid Cimoch
Answer Checked By – Senaida (BugsFixing Volunteer)