How can I convert a PySpark DataFrame like the one below into a JSON array structure?


OrderID   field         fieldValue   itemSeqNo
123       Date          01-01-23     1
123       Amount        10.00        1
123       description   Pencil       1
123       Date          01-02-23     2
123       Amount        11.00        2
123       description   Pen          2

I want to turn it into the following JSON structure:

    {
        "orderDetails": {
            "orderID": "123"
        },
        "itemizationDetails": [
            {
                "Date": "01-01-23",
                "Amount": "10.00",
                "description": "Pencil"
            },
            {
                "Date": "01-02-23 ",
                "Amount": "11.00",
                "description": "Pen"
            }
        ]
    }




This is my current code; the output is not what I expect.

    import json

    import pandas as pd

    test_dataframe = pd.DataFrame(
        {
            "OrderID": ['123', '123', '123', '123', '123', '123'],
            "field": ['Date', 'Amount', 'description', 'Date', 'Amount', 'description'],
            "fieldValue": ['01-01-23', '10.00', 'Pencil', '01-02-23 ', '11.00', 'Pen '],
            "itemSeqNo": ['1', '1', '1', '2', '2', '2'],
        }
    )

    # orient='records' emits one flat record per row, not the nested structure
    res = json.loads(test_dataframe.to_json(orient='records'))
    print(res)


[{'OrderID': '123', 'field': 'Date', 'fieldValue': '01-01-23', 'itemSeqNo': '1'}, {'OrderID': '123', 'field': 'Amount', 'fieldValue': '10.00', 'itemSeqNo': '1'}, {'OrderID': '123', 'field': 'description', 'fieldValue': 'Pencil', 'itemSeqNo': '1'}, {'OrderID': '123', 'field': 'Date', 'fieldValue': '01-02-23 ', 'itemSeqNo': '2'}, {'OrderID': '123', 'field': 'Amount', 'fieldValue': '11.00', 'itemSeqNo': '2'}, {'OrderID': '123', 'field': 'description', 'fieldValue': 'Pen ', 'itemSeqNo': '2'}]

2 Answers


  1. You can convert it easily with pandas.

  2. Pyspark Solution

    Pivot to reshape the dataframe

    from pyspark.sql import functions as F

    # df is the question's data loaded as a Spark DataFrame
    df1 = df.groupby('OrderID', 'itemSeqNo').pivot('field').agg(F.first('fieldValue'))
    
    # +-------+---------+------+---------+-----------+
    # |OrderID|itemSeqNo|Amount|     Date|description|
    # +-------+---------+------+---------+-----------+
    # |    123|        1| 10.00| 01-01-23|     Pencil|
    # |    123|        2| 11.00|01-02-23 |       Pen |
    # +-------+---------+------+---------+-----------+
    

    Pack the required columns into struct type

    df1 = df1.withColumn('itemizationDetails', F.struct('Amount', 'Date', 'description'))
    
    # +-------+---------+------+---------+-----------+-------------------------+
    # |OrderID|itemSeqNo|Amount|Date     |description|itemizationDetails       |
    # +-------+---------+------+---------+-----------+-------------------------+
    # |123    |1        |10.00 |01-01-23 |Pencil     |{10.00, 01-01-23, Pencil}|
    # |123    |2        |11.00 |01-02-23 |Pen        |{11.00, 01-02-23 , Pen } |
    # +-------+---------+------+---------+-----------+-------------------------+
    

    Group the dataframe by OrderID and collect list of structs

    df1 = df1.groupby('OrderID').agg(F.collect_list('itemizationDetails').alias('itemizationDetails'))
    
    # +-------+-----------------------------------------------------+
    # |OrderID|itemizationDetails                                   |
    # +-------+-----------------------------------------------------+
    # |123    |[{10.00, 01-01-23, Pencil}, {11.00, 01-02-23 , Pen }]|
    # +-------+-----------------------------------------------------+
    

    Pack OrderID into struct field

    df1 = df1.withColumn('OrderDetails', F.struct('OrderID'))
    
    # +-------+--------------------+------------+
    # |OrderID|  itemizationDetails|OrderDetails|
    # +-------+--------------------+------------+
    # |    123|[{10.00, 01-01-23...|       {123}|
    # +-------+--------------------+------------+
    

    Export the dataframe to JSON

    result = df1.select('OrderDetails', 'itemizationDetails').toJSON().collect()
    
    
    ['{"OrderDetails":{"OrderID":"123"},"itemizationDetails":[{"Amount":"10.00","Date":"01-01-23","description":"Pencil"},{"Amount":"11.00","Date":"01-02-23 ","description":"Pen "}]}']
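
The string above is close to the requested structure, but its top-level keys are capitalised (`OrderDetails`, `OrderID`) and the values keep the trailing spaces from the input data. A minimal sketch of tidying it up on the driver with the standard `json` module follows. The helper name `to_requested_shape` is hypothetical, and stripping whitespace assumes the trailing spaces are unintended. Renaming the columns in Spark before calling `toJSON()` (for example with `alias`) would likely work as well.

```python
import json

# The single JSON string returned by toJSON().collect() in the answer above.
result = [
    '{"OrderDetails":{"OrderID":"123"},"itemizationDetails":['
    '{"Amount":"10.00","Date":"01-01-23","description":"Pencil"},'
    '{"Amount":"11.00","Date":"01-02-23 ","description":"Pen "}]}'
]


def to_requested_shape(raw: str) -> dict:
    """Hypothetical helper: rename top-level keys and strip item values."""
    doc = json.loads(raw)
    return {
        "orderDetails": {"orderID": doc["OrderDetails"]["OrderID"]},
        "itemizationDetails": [
            # Assumption: surrounding whitespace in the values is not wanted.
            {key: value.strip() for key, value in item.items()}
            for item in doc["itemizationDetails"]
        ],
    }


# One dict per order, ready for json.dumps or further processing.
orders = [to_requested_shape(raw) for raw in result]
print(json.dumps(orders[0], indent=2))
```

Because `collect()` brings every row to the driver, this kind of post-processing only suits result sets that fit comfortably in memory.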
    
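
The first answer mentions pandas but its code did not survive, so here is one way the pandas route might look. This is a sketch under assumptions, not that answer's original code: the helper `build_order` is hypothetical, the whitespace stripping assumes the trailing spaces in the sample data are unintended, and `itemSeqNo` is sorted as a string (so `'10'` would sort before `'2'` unless it is cast to an integer first).

```python
import json

import pandas as pd

df = pd.DataFrame(
    {
        "OrderID": ["123"] * 6,
        "field": ["Date", "Amount", "description"] * 2,
        "fieldValue": ["01-01-23", "10.00", "Pencil", "01-02-23 ", "11.00", "Pen "],
        "itemSeqNo": ["1", "1", "1", "2", "2", "2"],
    }
)

# Assumption: stray whitespace around the values is not meaningful.
df["fieldValue"] = df["fieldValue"].str.strip()


def build_order(order_id, group):
    """Hypothetical helper: build one nested document for a single order."""
    items = [
        # Each item's field/fieldValue pairs become the keys/values of one dict.
        dict(zip(item["field"], item["fieldValue"]))
        for _, item in group.groupby("itemSeqNo", sort=True)
    ]
    return {"orderDetails": {"orderID": order_id}, "itemizationDetails": items}


# One nested document per OrderID.
result = [build_order(order_id, group) for order_id, group in df.groupby("OrderID")]
print(json.dumps(result[0], indent=2))
```

This runs entirely in pandas on a single machine, so for data that already lives in Spark the pivot-based answer is probably the better fit.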