I have automated identification system (AIS) data as nested dictionaries in a pandas dataframe. Here is an example:
dfAIS
Out[18]:
Message ... MetaData
0 {'ShipStaticData': {'AisVersion': 1, 'CallSign... ... {'MMSI': 255814000, 'MMSI_String': 255814000, ...
1 {'MultiSlotBinaryMessage': {'ApplicationID': {... ... {'MMSI': 2276003, 'MMSI_String': 2276003, 'Shi...
2 {'StandardClassBPositionReport': {'AssignedMod... ... {'MMSI': 503760500, 'MMSI_String': 503760500, ...
3 {'PositionReport': {'Cog': 25.2, 'Communicatio... ... {'MMSI': 211648000, 'MMSI_String': 211648000, ...
4 {'StaticDataReport': {'MessageID': 24, 'PartNu... ... {'MMSI': 338467989, 'MMSI_String': 338467989, ...
... ... ...
139625 {'PositionReport': {'Cog': 360, 'Communication... ... {'MMSI': 244730300, 'MMSI_String': 244730300, ...
139626 {'PositionReport': {'Cog': 231.5, 'Communicati... ... {'MMSI': 219025528, 'MMSI_String': 219025528, ...
139627 {'PositionReport': {'Cog': 360, 'Communication... ... {'MMSI': 273252100, 'MMSI_String': 273252100, ...
139628 {'UnknownMessage': {}} ... {'MMSI': 244730043, 'MMSI_String': 244730043, ...
139629 {'ShipStaticData': {'AisVersion': 1, 'CallSign... ... {'MMSI': 211666470, 'MMSI_String': 211666470, ...
[139630 rows x 3 columns]
The core data resides in column "Message". Each element therein is a dictionary with only 1 key and 1 value. As shown above, the one key might be "ShipStaticData", "MultiSlotBinaryMessage", etc., which we can think of as message types. Let’s call this dictionary the level-1 dictionary.
The level-1 dictionary is an extraneous layer of mapping because the desired data resides in the corresponding value, which itself is an entire dictionary. Let’s call the latter the level-2 dictionary. As shown above, the fields in the level-2 dictionary can be "AisVersion", "ApplicationID", "Cog", etc. The valid fields are specified here.
I don’t need the key for the level-1 dictionary because the level-2 dictionary contains a MesssageID field that much less ambiguously maps to the message type. Furthermore, dataframe dfAIS
also has a MessageType column not shown above that contains the same label as the sole key in the level-1 dictionary.
I’ve been educating myself on dataframe manipulations using apply to extract the MessageID from the level-2 dictionary. I also found that json_normalize is a great option. Unfortunately, in nested dictionaries, it requires that the hierarchical path to common fields have the same path components. I cannot use it for the above scenario because Message.ShipStaticData.MessageID
is a different path from Message.MultiSlotBinaryMessage.MessageID
. In both of these paths, the 2nd of the 3 path components is the extraneous mapping layer that I wish didn’t exist.
How can I un-nest the nested dictionary so that I can access the MessageID field?
I tried using pandas.DataFrame.apply
to apply the dictionary method items()
to the elements of dfAIS
column Message
. It yields a series of nested tuples and
the extraneous message type key is still present:
df = dfAIS.Message.apply(dict.items)
0 ((ShipStaticData, {'AisVersion': 1, 'CallSign'...
1 ((MultiSlotBinaryMessage, {'ApplicationID': {'...
2 ((StandardClassBPositionReport, {'AssignedMode...
3 ((PositionReport, {'Cog': 25.2, 'Communication...
4 ((StaticDataReport, {'MessageID': 24, 'PartNum...
139625 ((PositionReport, {'Cog': 360, 'CommunicationS...
139626 ((PositionReport, {'Cog': 231.5, 'Communicatio...
139627 ((PositionReport, {'Cog': 360, 'CommunicationS...
139628 ((UnknownMessage, {}))
139629 ((ShipStaticData, {'AisVersion': 1, 'CallSign'...
Name: Message, Length: 139630, dtype: object
P.S. I was also wrestling with how to refer to the labels "ShipStaticData", "MultiSlotBinaryMessage", etc. They are the keys of the level-1 dictionary. It would have been convenient to talk about the "value" taken on by the level-1 key, but "value" already refers to the thing that is mapped to by the key. We can very carefully say that the latter is the value associated with the key, or for the key, but it is still asking for confusion. Is there a clearer (and succinct) way to refer to "ShipStaticData", "MultiSlotBinaryMessage", etc.?
2
Answers
In the posted question, the dataframe
dfAIS
contains aMessage
field and aMetaData
field, as well as aMessageType
field that I described as redundant.Each of the
Message
andMessageType
fields contains a (possibly nested) dictionary. I want the fields of the inner most dictionaries to be the columns of the dataframe. Here is an example record of the source JSON file:The
ShipStaticData
andMetaData
fields above contain the two inner-most dictionaries whose fields I want to become the new columns of the dataframe. The challenging dictionary isShipStaticData
because (i) it is nested within a 1-itemMessage
dictionary, i.e., more deeply than theMetaData
dictionary and (ii) it goes by names other thanShipStaticData
. I just want the fields and values of the dictionary -- I don't care whether it is calledShipStaticData
, nor that it sometimes goes by different names. Nor do I care about the nameMetaData
-- I just want the dictionaries fields and values.In my case, the JSON file contains multiple lines/records and furthermore, is gzipped. The first half of the code below reads all the records into a list of dictionaries while the 2nd half takes the fields of the
ShipStaticData
andMetaData
dictionaries and turns the fields into dataframe columns:The
|
operator concatenates theShipStaticData
andMetaData
dictionaries. Expressionsrec1['Message']
andrec1['MetaData']
simply accesses the two dictionaries for each record. TheMessage
dictionary requires more processing it is the innerShipStaticData
dictionary that I want:rec1['Message']
returns a dictionary in whichShipStaticData
is the sole keyrec1['Message'].values()
returns the "value" for theShipStaticData
key, which itself is one of the two innermost dictionaries whose keys I want to turn into dataframe columns dictionaryrec1['Message'].values()
is of<class 'dict_values'>
, "which does not support indexing [so I can't actually access the value]. But it's iterable, it can be converted to list, which supports indexing" (see here)The following has the effect of converting
rec1['Message'].values()
to a single-item list containing the value corresponding to the keyShipStaticData
, i.e., the dictionary whose fields I want to become dataframe columns:Here is a sanity check of the correctness of converting data from the list of nested dictionaries
AISmsgs
to dataframedfMsg
based on the first record:At the very bottom is a long series of mysterious fields containing
NaN
. The first record in the source data file contains none of these fields. I tracked this down to the fact that not all of the records follow the data model shown at the top of this answer. To avoid this inconsistency, I need to find a way to add a column to the original data set containingMessageID
, which I will filter on. That has strong prospects of leaving records that have the same data model.People often forget that JSON can be transformed into a dictionary, which is an ordinary data construct in Python and that they can manipulate it before initializing the dataframe. Not everything needs to be handled by pandas.
A little loop before sending it to pandas can make your life a lot easier.