[SOLVED] What is the fastest approach to combine two large dictionaries by key?

Issue

I have two large dictionaries. Here is a small example to demonstrate, but imagine each dictionary having close to 100k records:

d1 = {
    '0001': [('skiing', 0.789), ('snow', 0.65), ('winter', 0.56)],
    '0002': [('drama', 0.89), ('comedy', 0.678), ('action', -0.42),
             ('winter', -0.12), ('kids', 0.12)]
}

d2 = {
    '0001': [('action', 0.89), ('funny', 0.58), ('sports', 0.12)],
    '0002': [('dark', 0.89), ('Mystery', 0.678), ('crime', 0.12), ('adult', -0.423)]
}

I want to have a dictionary that has combined values by key from each dictionary:

{
    '0001': [
        ('skiing', 0.789), ('snow', 0.65), ('winter', 0.56),
        [('action', 0.89), ('funny', 0.58), ('sports', 0.12)]
    ],
    '0002': [
        ('drama', 0.89), ('comedy', 0.678), ('action', -0.42),
        ('winter', -0.12), ('kids', 0.12),
        [('dark', 0.89), ('Mystery', 0.678), ('crime', 0.12), ('adult', -0.423)]
    ]
}

The way I would achieve this is:

for key, value in d1.iteritems():
    if key in d2:
        d1[key].append(d2[key])

But after reading in many places, I found that iteritems() is said to be slow because it doesn't use C-level data structures but instead goes through Python-level function calls. How can I do this combine/merge process quickly and efficiently?

Solution

If your input has to be dicts, I don’t think you will find anything faster than iteritems. If one dict has notably fewer keys than the other, you should iterate over the smaller one to cut down on iterations.
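
A minimal sketch of that, assuming Python 2 as in your code (in Python 3, use items() instead of iteritems()):

# Drive the loop with the smaller dict; the merged result ends up in d1 either way.
if len(d1) <= len(d2):
    for key, value in d1.iteritems():
        if key in d2:
            value.append(d2[key])
else:
    for key, value in d2.iteritems():
        if key in d1:
            d1[key].append(value)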

Any solution that involves converting your dict into a different data type will cost way more in creation time than it could ever save you through efficient loops. Instead of iterating once, you have to iterate three times (twice to create the two new collections and once to run your merge).

If you do have the option of using a different data type when initially creating your collections, iteration over lists (using indices instead of keys) will be slightly faster. Of course, this means you lose all other advantages that dicts might offer you.
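
For illustration only, such a layout might keep two index-aligned lists, so the merge needs no key lookups at all (records1 and records2 are hypothetical names, not part of the question):

# Index i refers to the same record in both lists.
records1 = [[('skiing', 0.789), ('snow', 0.65)],
            [('drama', 0.89), ('comedy', 0.678)]]
records2 = [[('action', 0.89), ('funny', 0.58)],
            [('dark', 0.89), ('Mystery', 0.678)]]

for i in xrange(len(records1)):  # range() in Python 3
    records1[i].append(records2[i])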

timeit gives the following timings (in seconds) for the suggested approaches, given two dicts of 200 keys each. To give set intersection its best chance, only half the keys of d2 actually match those of d1:

  • iteritems: 15.5433290005
  • set intersection: 22.1796240807

Even with only half the keys matching, the cost of creating two sets outweighs any potential benefit (at least in Python 2, where dict.keys() gives a list rather than a set-operation-compatible view).
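
For reference, the set-intersection variant timed above would look roughly like this (a sketch; the exact benchmarked code isn't shown here):

# Build the two key sets, intersect them, then merge only the shared keys.
# Constructing the sets is the up-front cost that dominates in Python 2.
for key in set(d1) & set(d2):
    d1[key].append(d2[key])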

Side note: Append vs Extend

In your current code example

for key, value in d1.iteritems():
    if key in d2:
        d1[key].append(d2[key])

you are appending the entire list from d2 as a single new item of the list in d1, rather than merging the two lists: appending [3,4] to [1,2] gives you [1,2,[3,4]], not [1,2,3,4]. You could loop over the list from d2 and call append once per element, but extend() will be faster.
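
A quick sketch of the difference:

a = [1, 2]
a.append([3, 4])  # a is now [1, 2, [3, 4]] (the list is nested as one item)

b = [1, 2]
b.extend([3, 4])  # b is now [1, 2, 3, 4] (the elements are merged in)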

Answered By – Marc Schulder

Answer Checked By – Senaida (BugsFixing Volunteer)
