
I have a JSON file with a field named "additionalFields", as below:
"additionalFields":

[
   {
      "fieldName":"customer_name",
      "fieldValue":"ABC"
   },
   {
      "fieldName":"deviceid",
      "fieldValue":"1234"
   },
   {
      "fieldName":"txn_id",
      "fieldValue":"2"
   },
   {
      "fieldName":"txn_date",
      "fieldValue":"2017-08-14T18:17:37"
   },
   {
      "fieldName":"orderid",
      "fieldValue":"I126101"
   }
]

How can I parse this into separate columns? For example, customer_name should become a column with ABC as its value.

I tried to parse it as an ArrayType, but I get multiple rows with the columns "fieldName" and "fieldValue". I want each item under fieldName to become a column, with the matching fieldValue as that column's value.
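
Roughly what I tried (a minimal sketch, not my exact code; the file path is illustrative):

from pyspark.sql import functions as F

df = spark.read.options(multiLine=True).json("path/to/input.json")

# Exploding the array gives one row per fieldName/fieldValue pair
# instead of one column per fieldName.
df.select(F.explode("additionalFields").alias("f")) \
  .select("f.fieldName", "f.fieldValue") \
  .show()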

2 Answers


  1. I assume this is what your sample data looks like:

    [
    {"fieldName":"customer_name","fieldValue":"ABC"},
    {"fieldName":"deviceid","fieldValue":"1234"},
    {"fieldName":"txn_id","fieldValue":"2"},
    {"fieldName":"txn_date","fieldValue":"2017-08-14T18:17:37"},
    {"fieldName":"orderid","fieldValue":"I126101"}
    ]
    
    1. Read the sample JSON file using the Spark read method:
    df = spark.read.options(multiLine=True).json("path/to/sample.json")
    
    2. This function collects all the values of one column into a Python list:
    def return_result(df, column):
        # Pull the given column back to the driver as a plain list of values
        return df.select(column).rdd.map(lambda row: row[column]).collect()
    
    3. Apply the function above to all the columns:
    records = [return_result(df, field) for field in df.columns]
    

    This is what the collected records look like:

    [['customer_name', 'deviceid', 'txn_id', 'txn_date', 'orderid'], ['ABC', '1234', '2', '2017-08-14T18:17:37', 'I126101']]
    
    4. The first record is the header; the rest are the data rows:
    columns, data = records[0], records[1:]
    
    5. Use the collected records to create a DataFrame:
    converted_df = sc.parallelize(data).toDF(columns)
    converted_df.show()
    

    Output:

    +-------------+--------+------+-------------------+-------+
    |customer_name|deviceid|txn_id|           txn_date|orderid|
    +-------------+--------+------+-------------------+-------+
    |          ABC|    1234|     2|2017-08-14T18:17:37|I126101|
    +-------------+--------+------+-------------------+-------+
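
    If you want to avoid collecting the data back to the driver, a pivot-based variant should give the same single-row result (a sketch, assuming the same fieldName/fieldValue columns as above):

    from pyspark.sql import functions as F

    # Sketch: pivot fieldName into columns; the dummy grouping column "g"
    # exists only so that the pivot produces a single row.
    pivoted = (df.groupBy(F.lit(1).alias("g"))
                 .pivot("fieldName")
                 .agg(F.first("fieldValue"))
                 .drop("g"))
    pivoted.show()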
    
  2. Depending on the size of your JSON, you can also load it with Python's json library and build the DataFrame from the resulting dictionaries:

    # Assuming this data has been loaded with Python's json library
    # (a loading sketch follows the code).
    data = [
        {"fieldName": "customer_name", "fieldValue": "ABC"},
        {"fieldName": "deviceid", "fieldValue": "1234"},
        {"fieldName": "txn_id", "fieldValue": "2"},
        {"fieldName": "txn_date", "fieldValue": "2017-08-14T18:17:37"},
        {"fieldName": "orderid", "fieldValue": "I126101"},
    ]
    
    # Merge the name/value pairs into a single dict so the result is one row,
    # not one sparse row per pair.
    df_data = [{d["fieldName"]: d["fieldValue"] for d in data}]
    
    df = spark.createDataFrame(df_data)
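
    For completeness, a sketch of getting that data list from the file with the standard json module (the path and the top-level "additionalFields" key are assumptions based on the question):

    import json

    # Load the whole JSON document and pull out the list of name/value pairs.
    # The path and key are illustrative; adjust them to the real file layout.
    with open("path/to/sample.json") as f:
        payload = json.load(f)

    data = payload["additionalFields"]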
    