Question posted in Json
Our archive of expertly curated questions and answers provides insights and solutions to common problems related to this popular data interchange format. From parsing and manipulating JSON data to integrating it with various programming languages and web services, our archive has got you covered. Start exploring today and take your JSON skills to the next level

Use json_normalize with extraneous dictionary layer with single key?

user2153235
May 6, 2024
188 views
0 votes
2 Answers

I have automated identification system (AIS) data as nested dictionaries in a pandas dataframe. Here is an example:

dfAIS
Out[18]:
                                                  Message  ...                                           MetaData
0       {'ShipStaticData': {'AisVersion': 1, 'CallSign...  ...  {'MMSI': 255814000, 'MMSI_String': 255814000, ...
1       {'MultiSlotBinaryMessage': {'ApplicationID': {...  ...  {'MMSI': 2276003, 'MMSI_String': 2276003, 'Shi...
2       {'StandardClassBPositionReport': {'AssignedMod...  ...  {'MMSI': 503760500, 'MMSI_String': 503760500, ...
3       {'PositionReport': {'Cog': 25.2, 'Communicatio...  ...  {'MMSI': 211648000, 'MMSI_String': 211648000, ...
4       {'StaticDataReport': {'MessageID': 24, 'PartNu...  ...  {'MMSI': 338467989, 'MMSI_String': 338467989, ...
                                                  ...  ...                                                ...
139625  {'PositionReport': {'Cog': 360, 'Communication...  ...  {'MMSI': 244730300, 'MMSI_String': 244730300, ...
139626  {'PositionReport': {'Cog': 231.5, 'Communicati...  ...  {'MMSI': 219025528, 'MMSI_String': 219025528, ...
139627  {'PositionReport': {'Cog': 360, 'Communication...  ...  {'MMSI': 273252100, 'MMSI_String': 273252100, ...
139628                             {'UnknownMessage': {}}  ...  {'MMSI': 244730043, 'MMSI_String': 244730043, ...
139629  {'ShipStaticData': {'AisVersion': 1, 'CallSign...  ...  {'MMSI': 211666470, 'MMSI_String': 211666470, ...

[139630 rows x 3 columns]

The core data resides in column "Message". Each element therein is a dictionary with only 1 key and 1 value. As shown above, the one key might be "ShipStaticData", "MultiSlotBinaryMessage", etc., which we can think of as message types. Let’s call this dictionary the level-1 dictionary.

The level-1 dictionary is an extraneous layer of mapping because the desired data resides in the corresponding value, which itself is an entire dictionary. Let’s call the latter the level-2 dictionary. As shown above, the fields in the level-2 dictionary can be "AisVersion", "ApplicationID", "Cog", etc. The valid fields are specified here.

I don’t need the key for the level-1 dictionary because the level-2 dictionary contains a MesssageID field that much less ambiguously maps to the message type. Furthermore, dataframe dfAIS also has a MessageType column not shown above that contains the same label as the sole key in the level-1 dictionary.

I’ve been educating myself on dataframe manipulations using apply to extract the MessageID from the level-2 dictionary. I also found that json_normalize is a great option. Unfortunately, in nested dictionaries, it requires that the hierarchical path to common fields have the same path components. I cannot use it for the above scenario because Message.ShipStaticData.MessageID is a different path from Message.MultiSlotBinaryMessage.MessageID. In both of these paths, the 2nd of the 3 path components is the extraneous mapping layer that I wish didn’t exist.

How can I un-nest the nested dictionary so that I can access the MessageID field?

I tried using pandas.DataFrame.apply to apply the dictionary method items() to the elements of dfAIS column Message. It yields a series of nested tuples and
the extraneous message type key is still present:

df = dfAIS.Message.apply(dict.items)

0         ((ShipStaticData, {'AisVersion': 1, 'CallSign'...
1         ((MultiSlotBinaryMessage, {'ApplicationID': {'...
2         ((StandardClassBPositionReport, {'AssignedMode...
3         ((PositionReport, {'Cog': 25.2, 'Communication...
4         ((StaticDataReport, {'MessageID': 24, 'PartNum...
                       
139625    ((PositionReport, {'Cog': 360, 'CommunicationS...
139626    ((PositionReport, {'Cog': 231.5, 'Communicatio...
139627    ((PositionReport, {'Cog': 360, 'CommunicationS...
139628                               ((UnknownMessage, {}))
139629    ((ShipStaticData, {'AisVersion': 1, 'CallSign'...
Name: Message, Length: 139630, dtype: object

P.S. I was also wrestling with how to refer to the labels "ShipStaticData", "MultiSlotBinaryMessage", etc. They are the keys of the level-1 dictionary. It would have been convenient to talk about the "value" taken on by the level-1 key, but "value" already refers to the thing that is mapped to by the key. We can very carefully say that the latter is the value associated with the key, or for the key, but it is still asking for confusion. Is there a clearer (and succinct) way to refer to "ShipStaticData", "MultiSlotBinaryMessage", etc.?

Answers

Chosen as BEST ANSWER

In the posted question, the dataframe dfAIS contains a Message field and a MetaData field, as well as a MessageType field that I described as redundant.

Each of the Message and MessageType fields contains a (possibly nested) dictionary. I want the fields of the inner most dictionaries to be the columns of the dataframe. Here is an example record of the source JSON file:

{"Message": {"ShipStaticData": {"AisVersion": 1,
                                "CallSign": "CQUR   ",
                                "Destination": "DELBC               ",
                                <...snip...>
                               }
            },
 "MessageType": "ShipStaticData",
 "MetaData": {"MMSI": 255814000,
              "MMSI_String": 255814000,
              "ShipName": "CORONA SEA          ",
              "latitude": 53.89848666666667,
              "longitude": 10.781958333333334,
              "time_utc": "2024-04-15 23:00:24.770932839 +0000 UTC"
             }
}

The ShipStaticData and MetaData fields above contain the two inner-most dictionaries whose fields I want to become the new columns of the dataframe. The challenging dictionary is ShipStaticData because (i) it is nested within a 1-item Message dictionary, i.e., more deeply than the MetaData dictionary and (ii) it goes by names other than ShipStaticData. I just want the fields and values of the dictionary -- I don't care whether it is called ShipStaticData, nor that it sometimes goes by different names. Nor do I care about the name MetaData -- I just want the dictionaries fields and values.

In my case, the JSON file contains multiple lines/records and furthermore, is gzipped. The first half of the code below reads all the records into a list of dictionaries while the 2nd half takes the fields of the ShipStaticData and MetaData dictionaries and turns the fields into dataframe columns:

AISmsgs = [] # List of dictionaries, 1 per record
fpINgz="data/20240415/ais_messages_185928.662241.json.gz"
with gzip.open(fpINgz,'rt',encoding='UTF-8') as zipfile:
   for JsonObj in zipfile:
      AISmsg1 = json.loads(JsonObj) # 1 record
      AISmsgs.append(AISmsg1)

dfMsg = pd.DataFrame([
   next(iter( rec1['Message'].values() )) | rec1['MetaData']
   for rec1 in AISmsgs # Each record
])

The | operator concatenates the ShipStaticData and MetaData dictionaries. Expressions rec1['Message'] and rec1['MetaData'] simply accesses the two dictionaries for each record. The Message dictionary requires more processing it is the inner ShipStaticData dictionary that I want:

rec1['Message'] returns a dictionary in which ShipStaticData is the sole key
rec1['Message'].values() returns the "value" for the ShipStaticData key, which itself is one of the two innermost dictionaries whose keys I want to turn into dataframe columns dictionary
However, rec1['Message'].values() is of <class 'dict_values'>, "which does not support indexing [so I can't actually access the value]. But it's iterable, it can be converted to list, which supports indexing" (see here)

The following has the effect of converting rec1['Message'].values() to a single-item list containing the value corresponding to the key ShipStaticData, i.e., the dictionary whose fields I want to become dataframe columns:

list( AISmsgs[0]['Message'].values() )[0] # Simplest
next(iter( AISmsgs[0]['Message'].values() )) # Most efficient

Here is a sanity check of the correctness of converting data from the list of nested dictionaries AISmsgs to dataframe dfMsg based on the first record:

AISmsgs[0]

   {'Message': {'ShipStaticData': {'AisVersion': 1,
      'CallSign': 'CQUR   ',
      'Destination': 'DELBC               ',
      'Dimension': {'A': 26, 'B': 161, 'C': 13, 'D': 13},
      'Dte': False,
      'Eta': {'Day': 15, 'Hour': 16, 'Minute': 0, 'Month': 4},
      'FixType': 1,
      'ImoNumber': 9357597,
      'MaximumStaticDraught': 6.3,
      'MessageID': 5,
      'Name': 'CORONA SEA          ',
      'RepeatIndicator': 0,
      'Spare': False,
      'Type': 71,
      'UserID': 255814000,
      'Valid': True}},
    'MessageType': 'ShipStaticData',
    'MetaData': {'MMSI': 255814000,
     'MMSI_String': 255814000,
     'ShipName': 'CORONA SEA          ',
     'latitude': 53.89848666666667,
     'longitude': 10.781958333333334,
     'time_utc': '2024-04-15 23:00:24.770932839 +0000 UTC'}}

dfMsg.iloc[0]

   AisVersion                                          1.0
   CallSign                                        CQUR
   Destination                        DELBC
   Dimension         {'A': 26, 'B': 161, 'C': 13, 'D': 13}
   Dte                                               False
   Eta    {'Day': 15, 'Hour': 16, 'Minute': 0, 'Month': 4}
   FixType                                             1.0
   ImoNumber                                     9357597.0
   MaximumStaticDraught                                6.3
   MessageID                                           5.0
   Name                               CORONA SEA
   RepeatIndicator                                     0.0
   Spare                                             False
   Type                                               71.0
   UserID                                      255814000.0
   Valid                                              True
   MMSI                                          255814000
   MMSI_String                                   255814000
   ShipName                           CORONA SEA
   latitude                                      53.898487
   longitude                                     10.781958
   time_utc        2024-04-15 23:00:24.770932839 +0000 UTC

   ApplicationID                                       NaN
   ApplicationIDValid                                  NaN
   CommunicationState                                  NaN
    <...snip...>
   AltFromBaro                                         NaN
   Altitude                                            NaN
   Text                                                NaN
   Name: 0, dtype: object

At the very bottom is a long series of mysterious fields containing NaN. The first record in the source data file contains none of these fields. I tracked this down to the fact that not all of the records follow the data model shown at the top of this answer. To avoid this inconsistency, I need to find a way to add a column to the original data set containing MessageID, which I will filter on. That has strong prospects of leaving records that have the same data model.

(Edit)

- CodeDifferent
- May 4, 2024 at 2:36 am
- 0 votes
0
People often forget that JSON can be transformed into a dictionary, which is an ordinary data construct in Python and that they can manipulate it before initializing the dataframe. Not everything needs to be handled by pandas.

A little loop before sending it to pandas can make your life a lot easier.
```
with open("data.json") as fp:
    data = json.load(fp)

messages = [
    value 
    for message in data["Message"]
    for _, value in message.items()
]
df = pd.DataFrame(messages)
```
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.