skip to Main Content

I have many JSON files with the following structure:

{
  "requestId": "test",
  "executionDate": "2023-05-10",
  "executionTime": "12:02:22",
  "request": {
    "fields": [{
      "geometry": {
        "type": "Point",
        "coordinates": [-90, 41]
      },
      "colour": "blue",
      "bean": "blaCk",
      "birthday": "2021-01-01",
      "arst": "111",
      "arstg": "rst",
      "fct": {
        "start": "2011-01-10",
        "end": "2012-01-10"
      }
    }]
  },
  "response": {
    "results": [{
        "geom": {
          "type": "geo",
          "coord": [-90, 41]
        },
        "md": {
          "type": "arstat",
          "mdl": "trstr",
          "vs": "v0",
          "cal": {
            "num": 4,
            "comment": "message"
          },
          "bean": ["blue", "green"],
          "result_time": 12342
        },
        "predictions": [{
            "date": "2004-05-19",
            "day": 0,
            "count": 0,
            "eating_stage": "trt"
          }, {
            "date": "2002-01-20",
            "day": 1,
            "count": 0,
            "eating_stage": "arstg"
          }, {
            "date": "2004-05-21",
            "day": 2,
            "count": 0,
            "eating_stage": "strg"
          }, {
            "date": "2004-05-22",
            "day": 3,
            "count": 0,
            "eating_stage": "rst"
          }
        }
      }
    }

The predictions part can be very deep. I want to convert this JSON to a CSV with the following structure:

requestId executionDate executionTime colour predictions_date predictions_day predictions_count predictions_eating_stage
test 2023-05-10 12:02:22 blue 2004-05-19 0 0 trt
test 2023-05-10 12:02:22 blue 2002-01-20 1 0 astrg
test 2023-05-10 12:02:22 blue 2004-05-21 2 0 strg
test 2023-05-10 12:02:22 blue 2004-05-22 3 0 rst

I tried the following code:

flat_json = pd.DataFrame(
    flatten(json_data), index=[0]
)

The code results in every data point becoming a column, and I am not sure how to pivot longer where at the ‘predictions’ key using JSON functions in Python. I recognise that at this stage I could pivot longer using column names, but I feel like there is a cleaner way to achieve this.

2

Answers


  1. I would suggest simply extracting what you need. It seems very specific for it to be solved using specific parsing. Therefore I would start by creating two dataframes:

    df_prediction = pd.DataFrame(example['response']['results'][0]['predictions'])
    df_data = pd.DataFrame({x:y for x,y in example.items() if type(y)==str},index=[0]) 
    

    Renaming columns in predictions:

    df_prediction.columns = ['prediction_'+x for x in df_prediction]
    

    Joining and adding the last piece of data (colour):

    output = df_data.assign(colour = example['request']['fields'][0]['colour']).join(df_prediction,how='right').ffill()
    

    Outputting:

      requestId executionDate  ... prediction_count prediction_eating_stage
    0      test    2023-05-10  ...                0                     trt
    1      test    2023-05-10  ...                0                   arstg
    2      test    2023-05-10  ...                0                    strg
    3      test    2023-05-10  ...                0                     rst
    
    Login or Signup to reply.
  2. You can also use json_normalize to extract the array of records that you want to normalize into a csv.

    >>> df_predictions = pd.json_normalize(json_data,record_path=['response', 'results','predictions'], record_prefix='predictions.', meta=['requestId', 'executionDate', 'executionTime']).assign(colour = json_data['request']['fields'][0]['colour'])
    >>> df_predictions
      predictions.date  predictions.day  predictions.count  ... executionDate executionTime colour
    0       2004-05-19                0                  0  ...    2023-05-10      12:02:22   blue
    1       2002-01-20                1                  0  ...    2023-05-10      12:02:22   blue
    2       2004-05-21                2                  0  ...    2023-05-10      12:02:22   blue
    3       2004-05-22                3                  0  ...    2023-05-10      12:02:22   blue
    
    [4 rows x 8 columns]
    

    It is just unfortunate that there is a limitation on the meta fields as it is throwing an exception for a path that includes an array / list so the "colour" column was added separately. If the order is important, then you can rearrange the columns as needed.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search