skip to Main Content

I have automated identification system (AIS) data as nested dictionaries in a pandas dataframe. Here is an example:

dfAIS
Out[18]:
                                                  Message  ...                                           MetaData
0       {'ShipStaticData': {'AisVersion': 1, 'CallSign...  ...  {'MMSI': 255814000, 'MMSI_String': 255814000, ...
1       {'MultiSlotBinaryMessage': {'ApplicationID': {...  ...  {'MMSI': 2276003, 'MMSI_String': 2276003, 'Shi...
2       {'StandardClassBPositionReport': {'AssignedMod...  ...  {'MMSI': 503760500, 'MMSI_String': 503760500, ...
3       {'PositionReport': {'Cog': 25.2, 'Communicatio...  ...  {'MMSI': 211648000, 'MMSI_String': 211648000, ...
4       {'StaticDataReport': {'MessageID': 24, 'PartNu...  ...  {'MMSI': 338467989, 'MMSI_String': 338467989, ...
                                                  ...  ...                                                ...
139625  {'PositionReport': {'Cog': 360, 'Communication...  ...  {'MMSI': 244730300, 'MMSI_String': 244730300, ...
139626  {'PositionReport': {'Cog': 231.5, 'Communicati...  ...  {'MMSI': 219025528, 'MMSI_String': 219025528, ...
139627  {'PositionReport': {'Cog': 360, 'Communication...  ...  {'MMSI': 273252100, 'MMSI_String': 273252100, ...
139628                             {'UnknownMessage': {}}  ...  {'MMSI': 244730043, 'MMSI_String': 244730043, ...
139629  {'ShipStaticData': {'AisVersion': 1, 'CallSign...  ...  {'MMSI': 211666470, 'MMSI_String': 211666470, ...

[139630 rows x 3 columns]

The core data resides in column "Message". Each element therein is a dictionary with only 1 key and 1 value. As shown above, the one key might be "ShipStaticData", "MultiSlotBinaryMessage", etc., which we can think of as message types. Let’s call this dictionary the level-1 dictionary.

The level-1 dictionary is an extraneous layer of mapping because the desired data resides in the corresponding value, which itself is an entire dictionary. Let’s call the latter the level-2 dictionary. As shown above, the fields in the level-2 dictionary can be "AisVersion", "ApplicationID", "Cog", etc. The valid fields are specified here.

I don’t need the key for the level-1 dictionary because the level-2 dictionary contains a MesssageID field that much less ambiguously maps to the message type. Furthermore, dataframe dfAIS also has a MessageType column not shown above that contains the same label as the sole key in the level-1 dictionary.

I’ve been educating myself on dataframe manipulations using apply to extract the MessageID from the level-2 dictionary. I also found that json_normalize is a great option. Unfortunately, in nested dictionaries, it requires that the hierarchical path to common fields have the same path components. I cannot use it for the above scenario because Message.ShipStaticData.MessageID is a different path from Message.MultiSlotBinaryMessage.MessageID. In both of these paths, the 2nd of the 3 path components is the extraneous mapping layer that I wish didn’t exist.

How can I un-nest the nested dictionary so that I can access the MessageID field?

I tried using pandas.DataFrame.apply to apply the dictionary method items() to the elements of dfAIS column Message. It yields a series of nested tuples and
the extraneous message type key is still present:

df = dfAIS.Message.apply(dict.items)

0         ((ShipStaticData, {'AisVersion': 1, 'CallSign'...
1         ((MultiSlotBinaryMessage, {'ApplicationID': {'...
2         ((StandardClassBPositionReport, {'AssignedMode...
3         ((PositionReport, {'Cog': 25.2, 'Communication...
4         ((StaticDataReport, {'MessageID': 24, 'PartNum...
                       
139625    ((PositionReport, {'Cog': 360, 'CommunicationS...
139626    ((PositionReport, {'Cog': 231.5, 'Communicatio...
139627    ((PositionReport, {'Cog': 360, 'CommunicationS...
139628                               ((UnknownMessage, {}))
139629    ((ShipStaticData, {'AisVersion': 1, 'CallSign'...
Name: Message, Length: 139630, dtype: object

P.S. I was also wrestling with how to refer to the labels "ShipStaticData", "MultiSlotBinaryMessage", etc. They are the keys of the level-1 dictionary. It would have been convenient to talk about the "value" taken on by the level-1 key, but "value" already refers to the thing that is mapped to by the key. We can very carefully say that the latter is the value associated with the key, or for the key, but it is still asking for confusion. Is there a clearer (and succinct) way to refer to "ShipStaticData", "MultiSlotBinaryMessage", etc.?

2

Answers


  1. Chosen as BEST ANSWER

    In the posted question, the dataframe dfAIS contains a Message field and a MetaData field, as well as a MessageType field that I described as redundant.

    Each of the Message and MessageType fields contains a (possibly nested) dictionary. I want the fields of the inner most dictionaries to be the columns of the dataframe. Here is an example record of the source JSON file:

    {"Message": {"ShipStaticData": {"AisVersion": 1,
                                    "CallSign": "CQUR   ",
                                    "Destination": "DELBC               ",
                                    <...snip...>
                                   }
                },
     "MessageType": "ShipStaticData",
     "MetaData": {"MMSI": 255814000,
                  "MMSI_String": 255814000,
                  "ShipName": "CORONA SEA          ",
                  "latitude": 53.89848666666667,
                  "longitude": 10.781958333333334,
                  "time_utc": "2024-04-15 23:00:24.770932839 +0000 UTC"
                 }
    }
    

    The ShipStaticData and MetaData fields above contain the two inner-most dictionaries whose fields I want to become the new columns of the dataframe. The challenging dictionary is ShipStaticData because (i) it is nested within a 1-item Message dictionary, i.e., more deeply than the MetaData dictionary and (ii) it goes by names other than ShipStaticData. I just want the fields and values of the dictionary -- I don't care whether it is called ShipStaticData, nor that it sometimes goes by different names. Nor do I care about the name MetaData -- I just want the dictionaries fields and values.

    In my case, the JSON file contains multiple lines/records and furthermore, is gzipped. The first half of the code below reads all the records into a list of dictionaries while the 2nd half takes the fields of the ShipStaticData and MetaData dictionaries and turns the fields into dataframe columns:

    AISmsgs = [] # List of dictionaries, 1 per record
    fpINgz="data/20240415/ais_messages_185928.662241.json.gz"
    with gzip.open(fpINgz,'rt',encoding='UTF-8') as zipfile:
       for JsonObj in zipfile:
          AISmsg1 = json.loads(JsonObj) # 1 record
          AISmsgs.append(AISmsg1)
    
    dfMsg = pd.DataFrame([
       next(iter( rec1['Message'].values() )) | rec1['MetaData']
       for rec1 in AISmsgs # Each record
    ])
    

    The | operator concatenates the ShipStaticData and MetaData dictionaries. Expressions rec1['Message'] and rec1['MetaData'] simply accesses the two dictionaries for each record. The Message dictionary requires more processing it is the inner ShipStaticData dictionary that I want:

    • rec1['Message'] returns a dictionary in which ShipStaticData is the sole key
    • rec1['Message'].values() returns the "value" for the ShipStaticData key, which itself is one of the two innermost dictionaries whose keys I want to turn into dataframe columns dictionary
    • However, rec1['Message'].values() is of <class 'dict_values'>, "which does not support indexing [so I can't actually access the value]. But it's iterable, it can be converted to list, which supports indexing" (see here)

    The following has the effect of converting rec1['Message'].values() to a single-item list containing the value corresponding to the key ShipStaticData, i.e., the dictionary whose fields I want to become dataframe columns:

    list( AISmsgs[0]['Message'].values() )[0] # Simplest
    next(iter( AISmsgs[0]['Message'].values() )) # Most efficient
    

    Here is a sanity check of the correctness of converting data from the list of nested dictionaries AISmsgs to dataframe dfMsg based on the first record:

    AISmsgs[0]
    
       {'Message': {'ShipStaticData': {'AisVersion': 1,
          'CallSign': 'CQUR   ',
          'Destination': 'DELBC               ',
          'Dimension': {'A': 26, 'B': 161, 'C': 13, 'D': 13},
          'Dte': False,
          'Eta': {'Day': 15, 'Hour': 16, 'Minute': 0, 'Month': 4},
          'FixType': 1,
          'ImoNumber': 9357597,
          'MaximumStaticDraught': 6.3,
          'MessageID': 5,
          'Name': 'CORONA SEA          ',
          'RepeatIndicator': 0,
          'Spare': False,
          'Type': 71,
          'UserID': 255814000,
          'Valid': True}},
        'MessageType': 'ShipStaticData',
        'MetaData': {'MMSI': 255814000,
         'MMSI_String': 255814000,
         'ShipName': 'CORONA SEA          ',
         'latitude': 53.89848666666667,
         'longitude': 10.781958333333334,
         'time_utc': '2024-04-15 23:00:24.770932839 +0000 UTC'}}
    
    dfMsg.iloc[0]
    
       AisVersion                                          1.0
       CallSign                                        CQUR
       Destination                        DELBC
       Dimension         {'A': 26, 'B': 161, 'C': 13, 'D': 13}
       Dte                                               False
       Eta    {'Day': 15, 'Hour': 16, 'Minute': 0, 'Month': 4}
       FixType                                             1.0
       ImoNumber                                     9357597.0
       MaximumStaticDraught                                6.3
       MessageID                                           5.0
       Name                               CORONA SEA
       RepeatIndicator                                     0.0
       Spare                                             False
       Type                                               71.0
       UserID                                      255814000.0
       Valid                                              True
       MMSI                                          255814000
       MMSI_String                                   255814000
       ShipName                           CORONA SEA
       latitude                                      53.898487
       longitude                                     10.781958
       time_utc        2024-04-15 23:00:24.770932839 +0000 UTC
    
       ApplicationID                                       NaN
       ApplicationIDValid                                  NaN
       CommunicationState                                  NaN
        <...snip...>
       AltFromBaro                                         NaN
       Altitude                                            NaN
       Text                                                NaN
       Name: 0, dtype: object
    

    At the very bottom is a long series of mysterious fields containing NaN. The first record in the source data file contains none of these fields. I tracked this down to the fact that not all of the records follow the data model shown at the top of this answer. To avoid this inconsistency, I need to find a way to add a column to the original data set containing MessageID, which I will filter on. That has strong prospects of leaving records that have the same data model.


  2. People often forget that JSON can be transformed into a dictionary, which is an ordinary data construct in Python and that they can manipulate it before initializing the dataframe. Not everything needs to be handled by pandas.

    A little loop before sending it to pandas can make your life a lot easier.

    with open("data.json") as fp:
        data = json.load(fp)
    
    messages = [
        value 
        for message in data["Message"]
        for _, value in message.items()
    ]
    df = pd.DataFrame(messages)
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search