I have a dictionary as follows.
{'data mining': ['data', 'text mining', 'artificial intelligence'],
'neural networks': ['cnn', 'rnn', 'artificial intelligence'],
'data': [ 'text mining', 'artificial intelligence', 'data']}
I want to re-arrange the dictionary in the following way. i.e. remove entries that has similar values
by considering the longest key
.
{'data mining': ['data', 'text mining', 'artificial intelligence'],
'neural networks': ['cnn', 'rnn', 'artificial intelligence']}
In other words, both data mining
and data
have similar values. Therefore, I remove one entry and make the longest word as the key of the new enrty. i.e. 'data mining': ['data', 'text mining', 'artificial intelligence']
.
My current code is as follows.
import collections
compare = lambda x, y: collections.Counter(x) == collections.Counter(y)
myresults = {}
mydata = {'data mining': ['data', 'text mining', 'artificial intelligence'],
'neural networks': ['cnn', 'rnn', 'artificial intelligence'],
'data': [ 'text mining', 'artificial intelligence','data']}
for key1, value1 in mydata.items():
for key2, value2 in mydata.items():
if compare(value1,value2):
mykeys = [key1, key2]
temp = {max((mykeys), key=len): value1}
myresults.update(temp)
print(myresults)
However, my real dictionary dataset has around 4 million of entries. So, I am wondering if there is an efficient way of doing this in python.
I am happy to provide more details if needed 🙂
3
Answers
This should be faster than comparing each element to each other as in your current code.
Here i define a comparator sorting the strings in the list values and joining them. Not sure if it is more efficient than yours which use Counters.
I loop over the dictionary once and store the strings generated by the comparator in a
set()
. Each iteration of the loop I check if the new comparator string is in the set. If not, I add it to the set for future reference and add the key – value pair to the final result dictionary. Otherwise I check the key lengths and change the key of the dict as shown here if the new key is longer. I also need another dictionary in which I switch the key – compvalue (compvalue are the keys and key are the values) in order to keep track of which is the key of each compared value.Should be faster (I did not check the time) because I have a single loop. The equivalent of your second loop is
set([compvalue]).issubset(compared_values)
andset
is more efficient than afor
loop for this kind of jobs.Have a try and see if it helps.
EDIT
Another similar idea which does not use
set
just came to my mind.Here I just try the
if
statement. Ifreferencekeys[compvalue]
throws aKeyError
means that the code has not found yet a similar value. Otherwise check for the key length.Again I did not check the execution times, so I am not sure which is more efficient. But the result is the same.
EDIT 2
Following comment request, to keep empty lists as they are is enough to wrap the body of the loop in an
if
statement (here I use the first code, but the same idea can be implemented for the second).No need to store the key in
referencekeys
iflen(value)
== 0. If the original datamydata
is a single dictionary, keys are unique. So it’s guaranteed that you are not overwriting anything.For example, if you have
mydata = {'data mining': ['data', 'text mining', 'artificial intelligence'], 'neural networks': ['cnn', 'rnn', 'artificial intelligence'], 'data': [ 'text mining', 'artificial intelligence','data'], 'data bis':[], 'neural link':[]}
you will get:myresults = {'data mining': ['data', 'text mining', 'artificial intelligence'], 'neural networks': ['cnn', 'rnn', 'artificial intelligence'], 'data bis': [], 'neural link': []}
You can first sort the dictionary by length, so the longer keys are guaranteed to occur first.
It’s also only one line, which is always nice! 🙂
Printing
result
yields:Python built-in types to the rescue!