Question posted in Json
Our archive of expertly curated questions and answers provides insights and solutions to common problems related to this popular data interchange format. From parsing and manipulating JSON data to integrating it with various programming languages and web services, our archive has got you covered. Start exploring today and take your JSON skills to the next level

Partially flatten nested JSON and pivot longer

prayner
June 1, 2023
279 views
0 votes
2 Answers

I have many JSON files with the following structure:

{
  "requestId": "test",
  "executionDate": "2023-05-10",
  "executionTime": "12:02:22",
  "request": {
    "fields": [{
      "geometry": {
        "type": "Point",
        "coordinates": [-90, 41]
      },
      "colour": "blue",
      "bean": "blaCk",
      "birthday": "2021-01-01",
      "arst": "111",
      "arstg": "rst",
      "fct": {
        "start": "2011-01-10",
        "end": "2012-01-10"
      }
    }]
  },
  "response": {
    "results": [{
        "geom": {
          "type": "geo",
          "coord": [-90, 41]
        },
        "md": {
          "type": "arstat",
          "mdl": "trstr",
          "vs": "v0",
          "cal": {
            "num": 4,
            "comment": "message"
          },
          "bean": ["blue", "green"],
          "result_time": 12342
        },
        "predictions": [{
            "date": "2004-05-19",
            "day": 0,
            "count": 0,
            "eating_stage": "trt"
          }, {
            "date": "2002-01-20",
            "day": 1,
            "count": 0,
            "eating_stage": "arstg"
          }, {
            "date": "2004-05-21",
            "day": 2,
            "count": 0,
            "eating_stage": "strg"
          }, {
            "date": "2004-05-22",
            "day": 3,
            "count": 0,
            "eating_stage": "rst"
          }
        }
      }
    }

The predictions part can be very deep. I want to convert this JSON to a CSV with the following structure:

requestId	executionDate	executionTime	colour	predictions_date	predictions_day	predictions_eating_stage
test	2023-05-10	12:02:22	blue	2004-05-19	0	trt
test	2023-05-10	12:02:22	blue	2002-01-20	1	astrg
test	2023-05-10	12:02:22	blue	2004-05-21	2	strg
test	2023-05-10	12:02:22	blue	2004-05-22	3	rst

I tried the following code:

flat_json = pd.DataFrame(
    flatten(json_data), index=[0]
)

The code results in every data point becoming a column, and I am not sure how to pivot longer where at the ‘predictions’ key using JSON functions in Python. I recognise that at this stage I could pivot longer using column names, but I feel like there is a cleaner way to achieve this.

Answers

I would suggest simply extracting what you need. It seems very specific for it to be solved using specific parsing. Therefore I would start by creating two dataframes:

df_prediction = pd.DataFrame(example['response']['results'][0]['predictions'])
df_data = pd.DataFrame({x:y for x,y in example.items() if type(y)==str},index=[0])

Renaming columns in predictions:

df_prediction.columns = ['prediction_'+x for x in df_prediction]

Joining and adding the last piece of data (colour):

output = df_data.assign(colour = example['request']['fields'][0]['colour']).join(df_prediction,how='right').ffill()

Outputting:

  requestId executionDate  ... prediction_count prediction_eating_stage
0      test    2023-05-10  ...                0                     trt
1      test    2023-05-10  ...                0                   arstg
2      test    2023-05-10  ...                0                    strg
3      test    2023-05-10  ...                0                     rst

You can also use json_normalize to extract the array of records that you want to normalize into a csv.

>>> df_predictions = pd.json_normalize(json_data,record_path=['response', 'results','predictions'], record_prefix='predictions.', meta=['requestId', 'executionDate', 'executionTime']).assign(colour = json_data['request']['fields'][0]['colour'])
>>> df_predictions
  predictions.date  predictions.day  predictions.count  ... executionDate executionTime colour
0       2004-05-19                0                  0  ...    2023-05-10      12:02:22   blue
1       2002-01-20                1                  0  ...    2023-05-10      12:02:22   blue
2       2004-05-21                2                  0  ...    2023-05-10      12:02:22   blue
3       2004-05-22                3                  0  ...    2023-05-10      12:02:22   blue

[4 rows x 8 columns]

It is just unfortunate that there is a limitation on the meta fields as it is throwing an exception for a path that includes an array / list so the "colour" column was added separately. If the order is important, then you can rearrange the columns as needed.

Please signup or login to give your own answer.

Click here to cancel reply.