
I have the following list of tuples.

[('0', 'Hadoop'), ('0', 'Big Data'), ('0', 'HBas'), ('0', 'Java'), ('0', 'Spark'), ('0', 'Storm'), ('0', 'Cassandra'), ('1', 'NoSQL'), ('1', 'MongoDB'), ('1', 'Cassandra'), ('1', 'HBase'), ('1', 'Postgres'), ('2', 'Python'), ('2', 'skikit-learn'), ('2', 'scipy'), ('2', 'numpy'), ('2', 'statsmodels'), ('2', 'pandas'), ('3', 'R'), ('3', 'Python'), ('3', 'statistics'), ('3', 'regression'), ('3', 'probability'), ('4', 'machine learning'), ('4', 'regression'), ('4', 'decision trees'), ('4', 'libsvm'), ('5', 'Python'), ('5', 'R'), ('5', 'Java'), ('5', 'C++'), ('5', 'Haskell'), ('5', 'programming languages'), ('6', 'statistics'), ('6', 'probability'), ('6', 'mathematics'), ('6', 'theory'), ('7', 'machine learning'), ('7', 'scikit-learn'), ('7', 'Mahout'), ('7', 'neural networks'), ('8', 'neural networks'), ('8', 'deep learning'), ('8', 'Big Data'), ('8', 'artificial intelligence'), ('9', 'Hadoop'), ('9', 'Java'), ('9', 'MapReduce'), ('9', 'Big Data')]

The values on the left are “employee id numbers” while the values on the right are “interests”. I have to turn these into dictionaries in two different ways: I have to make the employee id number the key and the interests the value, then I have to make the interests the key and the employee id number the value. Basically, as a quick example, I need one of the elements of my end result to look like this:

{'0': ['Hadoop', 'Big Data', 'HBas', 'Java', 'Spark', 'Storm', 'Cassandra'],
 '1' ... etc]}

Then the next would look like this:

{'Hadoop': [0,9]...}

I tried defaultdict but couldn’t get it to work. Any suggestions?

7 Answers


  1. You can use collections.defaultdict

    Ex:

    from collections import defaultdict
    
    lst = [('0', 'Hadoop'),
    ('0', 'Big Data'),
    ('0', 'HBas'),
    ('0', 'Java'),.....]
    
    result = defaultdict(list)
    for idVal, interest in lst:
        result[idVal].append(interest)
    print(result)
    
    result = defaultdict(list)
    for idVal, interest in lst:
        result[interest].append(idVal)
    print(result)
    

    Output:

    defaultdict(<type 'list'>, {'1': ['NoSQL', 'MongoDB', 'Cassandra', 'HBase', 'Postgres'], '0': ['Hadoop', 'Big Data', 'HBas', 'Java', 'Spark', 'Storm', 'Cassandra'], '3': ['R', 'Python', 'statistics', 'regression', 'probability'], '2': ['Python', 'skikit-learn', 'scipy', 'numpy', 'statsmodels', 'pandas'], '5': ['Python', 'R', 'Java', 'C++', 'Haskell', 'programming languages'], '4': ['machine learning', 'regression', 'decision trees', 'libsvm'], '7': ['machine learning', 'scikit-learn', 'Mahout', 'neural networks'], '6': ['statistics', 'probability', 'mathematics', 'theory'], '9': ['Hadoop', 'Java', 'MapReduce', 'Big Data'], '8': ['neural networks', 'deep learning', 'Big Data', 'artificial intelligence']})
    defaultdict(<type 'list'>, {'Java': ['0', '5', '9'], 'neural networks': ['7', '8'], 'NoSQL': ['1'], 'Hadoop': ['0', '9'], 'Mahout': ['7'], 'Storm': ['0'], 'regression': ['3', '4'], 'statistics': ['3', '6'], 'probability': ['3', '6'], 'programming languages': ['5'], 'Python': ['2', '3', '5'], 'deep learning': ['8'], 'Haskell': ['5'], 'mathematics': ['6'], 'HBas': ['0'], 'numpy': ['2'], 'pandas': ['2'], 'artificial intelligence': ['8'], 'theory': ['6'], 'libsvm': ['4'], 'C++': ['5'], 'R': ['3', '5'], 'HBase': ['1'], 'Spark': ['0'], 'Postgres': ['1'], 'decision trees': ['4'], 'Big Data': ['0', '8', '9'], 'MongoDB': ['1'], 'scikit-learn': ['7'], 'MapReduce': ['9'], 'machine learning': ['4', '7'], 'scipy': ['2'], 'skikit-learn': ['2'], 'statsmodels': ['2'], 'Cassandra': ['0', '1']})
    
  2. collections.defaultdict is indeed the right way to go about this. Create one for each dictionary you want, then loop over the list and add each pair to both dictionaries.

    import collections
    
    ids = collections.defaultdict(list)
    interests = collections.defaultdict(list)
    
    for ident,interest in data:
        ids[ident].append(interest)
        interests[interest].append(ident)
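
For a self-contained version of the single-pass approach above, here is a minimal sketch. The shortened `data` list is an assumption standing in for the full list from the question, and converting the ids with `int()` is only a guess at what the `[0,9]` in the question's second example was asking for.

```python
from collections import defaultdict

# Shortened sample; stands in for the full list from the question (assumption).
data = [('0', 'Hadoop'), ('0', 'Big Data'), ('1', 'NoSQL'),
        ('8', 'Big Data'), ('9', 'Hadoop'), ('9', 'Big Data')]

ids = defaultdict(list)        # employee id -> list of interests
interests = defaultdict(list)  # interest -> list of employee ids

for ident, interest in data:
    ids[ident].append(interest)
    # int() is optional; it matches the [0,9] style shown in the question.
    interests[interest].append(int(ident))

# Convert to plain dicts so that looking up a missing key raises KeyError
# instead of silently creating an empty list.
ids = dict(ids)
interests = dict(interests)

print(ids)
print(interests)
```

Dropping the `int()` call keeps the ids as strings, which is what the other answers here produce.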
    
  3. You can also do this using a set and dict comprehension.

    data = [('0', 'Hadoop'),
    ('0', 'Big Data'),
    ('0', 'HBas'),
    ('0', 'Java'),
    ...]
    
    # Collect the distinct ids, then scan the data once per id.
    ids = {id_ for id_, _ in data}
    d = {id_: [interest for i, interest in data if i == id_] for id_ in ids}
    

    This results in:

    {'9': ['Hadoop', 'Java', 'MapReduce', 'Big Data'], '8': ['neural networks', 'deep learning', 'Big Data', 'artificial intelligence'], '6': ['statistics', 'probability', 'mathematics', 'theory'], '3': ['R', 'Python', 'statistics', 'regression', 'probability'], '2': ['Python', 'skikit-learn', 'scipy', 'numpy', 'statsmodels', 'pandas'], '5':['Python', 'R', 'Java', 'C++', 'Haskell', 'programming languages'],'4': ['machine learning', 'regression', 'decision trees', 'libsvm'], '0': ['Hadoop', 'Big Data', 'HBas', 'Java', 'Spark', 'Storm', 'Cassandra'], '1': ['NoSQL', 'MongoDB', 'Cassandra', 'HBase', 'Postgres'], '7': ['machine learning', 'scikit-learn', 'Mahout', 'neural networks']}
    

    Edit

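    This is more efficient with itertools.groupby, which makes a single pass over the data. Note that groupby only groups consecutive items, so the data must already be sorted by id (as it is here).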

    from itertools import groupby
    from operator import itemgetter
    
    id_interests = groupby(data, key=itemgetter(0))
    d = {id_: [interest for _, interest in group] for id_, group in id_interests}
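
One caveat with `groupby`: it only merges consecutive items that share a key, so it works on the id column here because the list is already ordered by id. Grouping by interest needs a sort first. A small sketch of the difference, using a shortened stand-in for `data` (an assumption, not the full list from the question):

```python
from itertools import groupby
from operator import itemgetter

# Shortened sample (assumption); ordered by id, but not by interest.
data = [('0', 'Hadoop'), ('0', 'Java'), ('5', 'Java'), ('9', 'Hadoop')]

# Without sorting, 'Hadoop' appears in two separate runs, and the later
# run overwrites the earlier one in the dict comprehension.
unsorted_result = {k: [i for i, _ in g]
                   for k, g in groupby(data, key=itemgetter(1))}

# Sorting by the grouping key first makes equal keys adjacent.
by_interest = sorted(data, key=itemgetter(1))
sorted_result = {k: [i for i, _ in g]
                 for k, g in groupby(by_interest, key=itemgetter(1))}

print(unsorted_result)  # 'Hadoop' has lost id '0'
print(sorted_result)    # every interest keeps all of its ids
```

The sort makes this O(n log n), so for the interest-to-ids direction a plain loop with defaultdict is likely the simpler choice.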
    
  4. How about pandas?

    data = [('0', 'Hadoop'),
    ('0', 'Big Data'),
    ('0', 'HBas'),...]
    
    import pandas as pd
    df = pd.DataFrame(data)
    df_1 = df.groupby(0)[1].apply(list)
    df_2 = df.groupby(1)[0].apply(list)
    
    print( df_1.to_dict() )
    print( df_2.to_dict() )
    

    Outcome:

    {'0': ['Hadoop', 'Big Data', 'HBas', 'Java', 'Spark', '...
    {'Big Data': ['0', '8', '9'], 'C++' ...
    
  5. Another approach would be to use itertools.groupby:

    import itertools
    
    tups = [('0', 'Hadoop'),
    ('0', 'Big Data'),
    ('0', 'HBas'),
    ...]
    
    {k:list(zip(*v))[1] for k, v in itertools.groupby(tups, key=lambda x:x[0])}
    
    {'0': ('Hadoop', 'Big Data', 'HBas', 'Java', 'Spark', 'Storm', 'Cassandra'),
    ...
     '9': ('Hadoop', 'Java', 'MapReduce', 'Big Data')}
    
    {k:list(zip(*v))[0] for k, v in itertools.groupby(sorted(tups, key=lambda x:x[1]), key=lambda x:x[1])}
    
    {'Big Data': ('0', '8', '9'),
     ...
     'theory': ('6',)}
    
  6. The shortest, most Pythonic version I can think of that needs no imports:

    alist = [('0', 'Hadoop'),
    ('0', 'Big Data'),
    ('0', 'HBas'),
    ('0', 'Java'),
    ('0', 'Spark'),
    ('0', 'Storm'),...]
    
    adict = {}
    bdict = {}
    for key, value in alist:
        adict[key] = adict.get(key, []) + [value]
        bdict[value] = bdict.get(value, []) + [key]
    

    Outputs:

    print(adict)
    #{'0': ['Hadoop', 'Big Data', 'HBas', 'Java', 'Spark', 'Storm', 'Cassandra'], '1': ['NoSQL', 'MongoDB', 'Cassandra', 'HBase', 'Postgres'],...}
    
    print(bdict)
    #{'Hadoop': ['0', '9'], 'Big Data': ['0', '8', '9'], 'HBas': ['0'], 'Java': ['0', '5', '9'], 'Spark': ['0'], 'Storm': ['0'],...}
    
  7. defaultdict is the faster option, but you could also group with dict.setdefault() in a single pass through the list (here l is the list of tuples from the question):

    d1 = {}
    d2 = {}
    for fst, snd in l:
        d1.setdefault(fst, []).append(snd)
        d2.setdefault(snd, []).append(fst)
    
    print(d1)
    print(d2)
    

    Which outputs:

    {'0': ['Hadoop', 'Big Data', 'HBas', 'Java', 'Spark', 'Storm', 'Cassandra'],
     '1': ['NoSQL', 'MongoDB', 'Cassandra', 'HBase', 'Postgres'],
     '2': ['Python', 'skikit-learn', 'scipy', 'numpy', 'statsmodels', 'pandas'],
     '3': ['R', 'Python', 'statistics', 'regression', 'probability'],
     '4': ['machine learning', 'regression', 'decision trees', 'libsvm'],
     '5': ['Python', 'R', 'Java', 'C++', 'Haskell', 'programming languages'],
     '6': ['statistics', 'probability', 'mathematics', 'theory'],
     '7': ['machine learning', 'scikit-learn', 'Mahout', 'neural networks'],
     '8': ['neural networks',
           'deep learning',
           'Big Data',
           'artificial intelligence'],
     '9': ['Hadoop', 'Java', 'MapReduce', 'Big Data']}
    {'Big Data': ['0', '8', '9'],
     'C++': ['5'],
     'Cassandra': ['0', '1'],
     'HBas': ['0'],
     'HBase': ['1'],
     'Hadoop': ['0', '9'],
     'Haskell': ['5'],
     'Java': ['0', '5', '9'],
     'Mahout': ['7'],
     'MapReduce': ['9'],
     'MongoDB': ['1'],
     'NoSQL': ['1'],
     'Postgres': ['1'],
     'Python': ['2', '3', '5'],
     'R': ['3', '5'],
     'Spark': ['0'],
     'Storm': ['0'],
     'artificial intelligence': ['8'],
     'decision trees': ['4'],
     'deep learning': ['8'],
     'libsvm': ['4'],
     'machine learning': ['4', '7'],
     'mathematics': ['6'],
     'neural networks': ['7', '8'],
     'numpy': ['2'],
     'pandas': ['2'],
     'probability': ['3', '6'],
     'programming languages': ['5'],
     'regression': ['3', '4'],
     'scikit-learn': ['7'],
     'scipy': ['2'],
     'skikit-learn': ['2'],
     'statistics': ['3', '6'],
     'statsmodels': ['2'],
     'theory': ['6']}
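
To check the speed claim on your own machine, a rough `timeit` sketch like the following may help. The synthetic `pairs` list and the two function names are illustrative assumptions, and timings vary with Python version and data shape, so no numbers are shown here.

```python
import timeit
from collections import defaultdict

# Synthetic input (assumption): 1000 ids with 10 interests each,
# drawn from 50 distinct interest names.
pairs = [(str(i), 'interest%d' % (j % 50))
         for i in range(1000) for j in range(i, i + 10)]

def with_defaultdict(items):
    d1, d2 = defaultdict(list), defaultdict(list)
    for fst, snd in items:
        d1[fst].append(snd)
        d2[snd].append(fst)
    return dict(d1), dict(d2)

def with_setdefault(items):
    d1, d2 = {}, {}
    for fst, snd in items:
        d1.setdefault(fst, []).append(snd)
        d2.setdefault(snd, []).append(fst)
    return d1, d2

# Both approaches should build identical dictionaries.
assert with_defaultdict(pairs) == with_setdefault(pairs)

# Time 20 runs of each function; compare the two printed numbers.
for fn in (with_defaultdict, with_setdefault):
    seconds = timeit.timeit(lambda: fn(pairs), number=20)
    print(fn.__name__, round(seconds, 4))
```

Whichever is faster for you, the difference is a constant factor: both versions make one pass over the list.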
    