skip to Main Content

I have a dictionary as follows.

{'data mining': ['data', 'text mining', 'artificial intelligence'],
 'neural networks': ['cnn', 'rnn', 'artificial intelligence'],
 'data': [ 'text mining', 'artificial intelligence', 'data']}

I want to re-arrange the dictionary in the following way. i.e. remove entries that has similar values by considering the longest key.

{'data mining': ['data', 'text mining', 'artificial intelligence'],
 'neural networks': ['cnn', 'rnn', 'artificial intelligence']}

In other words, both data mining and data have similar values. Therefore, I remove one entry and make the longest word as the key of the new enrty. i.e. 'data mining': ['data', 'text mining', 'artificial intelligence'].

My current code is as follows.

import collections
compare = lambda x, y: collections.Counter(x) == collections.Counter(y)
myresults = {}
mydata = {'data mining': ['data', 'text mining', 'artificial intelligence'],
          'neural networks': ['cnn', 'rnn', 'artificial intelligence'],
          'data': [ 'text mining', 'artificial intelligence','data']}

    for key1, value1 in mydata.items():
        for key2, value2 in mydata.items():
            if compare(value1,value2):
                mykeys = [key1, key2]
                temp = {max((mykeys), key=len): value1}
                myresults.update(temp)

    print(myresults)

However, my real dictionary dataset has around 4 million of entries. So, I am wondering if there is an efficient way of doing this in python.

I am happy to provide more details if needed 🙂

3

Answers


  1. This should be faster than comparing each element to each other as in your current code.

    mydata = {'data mining': ['data', 'text mining', 'artificial intelligence'], 'neural networks': ['cnn', 'rnn', 'artificial intelligence'], 'data': [ 'text mining', 'artificial intelligence','data']}
    
    compared_values = set()
    referencekeys = {}
    myresults = {}
    
    comparator = lambda x : ''.join(sorted(x))
    
    for key, value in mydata.items():
        compvalue = comparator(value)
        if not set([compvalue]).issubset(compared_values):
            compared_values.update([compvalue])
            referencekeys[compvalue] = key
            myresults[key] = value
        else:
            if len(key) > len(referencekeys[compvalue]):
                myresults[key] = myresults.pop(referencekeys[compvalue])
                referencekeys[compvalue] = key
    
    print(myresults)
    

    Here i define a comparator sorting the strings in the list values and joining them. Not sure if it is more efficient than yours which use Counters.

    I loop over the dictionary once and store the strings generated by the comparator in a set(). Each iteration of the loop I check if the new comparator string is in the set. If not, I add it to the set for future reference and add the key – value pair to the final result dictionary. Otherwise I check the key lengths and change the key of the dict as shown here if the new key is longer. I also need another dictionary in which I switch the key – compvalue (compvalue are the keys and key are the values) in order to keep track of which is the key of each compared value.

    Should be faster (I did not check the time) because I have a single loop. The equivalent of your second loop is set([compvalue]).issubset(compared_values) and set is more efficient than a for loop for this kind of jobs.

    Have a try and see if it helps.

    EDIT

    Another similar idea which does not use set just came to my mind.

    referencekeys = {}
    myresults = {}
    
    comparator = lambda x : ''.join(sorted(x))
    
    for key, value in mydata.items():
        compvalue = comparator(value)
        try:
            if len(key) > len(referencekeys[compvalue]):
                myresults[key] = myresults.pop(referencekeys[compvalue])
                referencekeys[compvalue] = key
        except KeyError:
            referencekeys[compvalue] = key
            myresults[key] = value
    
    print(myresults)
    

    Here I just try the if statement. If referencekeys[compvalue] throws a KeyError means that the code has not found yet a similar value. Otherwise check for the key length.

    Again I did not check the execution times, so I am not sure which is more efficient. But the result is the same.

    EDIT 2

    Following comment request, to keep empty lists as they are is enough to wrap the body of the loop in an if statement (here I use the first code, but the same idea can be implemented for the second).

    for key, value in mydata.items():
        if len(value) > 0:
            compvalue = comparator(value)
            if not set([compvalue]).issubset(compared_values):
                compared_values.update([compvalue])
                referencekeys[compvalue] = key
                myresults[key] = value
            else:
                if len(key) > len(referencekeys[compvalue]):
                    myresults[key] = myresults.pop(referencekeys[compvalue])
                    referencekeys[compvalue] = key
        else:
            myresults[key] = value
    

    No need to store the key in referencekeys if len(value) == 0. If the original data mydata is a single dictionary, keys are unique. So it’s guaranteed that you are not overwriting anything.

    For example, if you have mydata = {'data mining': ['data', 'text mining', 'artificial intelligence'], 'neural networks': ['cnn', 'rnn', 'artificial intelligence'], 'data': [ 'text mining', 'artificial intelligence','data'], 'data bis':[], 'neural link':[]} you will get: myresults = {'data mining': ['data', 'text mining', 'artificial intelligence'], 'neural networks': ['cnn', 'rnn', 'artificial intelligence'], 'data bis': [], 'neural link': []}

    Login or Signup to reply.
  2. You can first sort the dictionary by length, so the longer keys are guaranteed to occur first.

    from itertools import groupby
    
    d = {
        "data mining": ["data", "text mining", "artificial intelligence"],
        "neural networks": ["cnn", "rnn", "artificial intelligence"],
        "data": ["text mining", "artificial intelligence", "data"],
    }
    
    result = dict(
        g
        for k, (g, *_) in groupby(
            sorted(d.items(), key=lambda x: len(x[0]), reverse=True),
            key=lambda x: sorted(x[1]),
        )
    )
    

    It’s also only one line, which is always nice! 🙂

    Printing result yields:

    {'neural networks': ['cnn', 'rnn', 'artificial intelligence'],
     'data mining': ['data', 'text mining', 'artificial intelligence']}
    
    Login or Signup to reply.
  3. Python built-in types to the rescue!

    tmp = dict()
    for topic, words in data.items():
        ww = frozenset(words)
        tmp[ww] = max(tmp.get(ww, topic), topic, key=len)
    result = {topic: list(ww) for ww, topic in tmp.items()}
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search