skip to Main Content

I have below JSON string as a column in a pyspark dataframe.

{
   "result":{
      "version":"1.2",
      "timeStamp":"2023-08-14 14:00:12",
      "description":"",
      "data":{
         "DateTime_Received":"2023-08-14T14:01:10.4516457+01:00",
         "DateTime_Actual":"2023-08-14T14:00:12",
         "OtherInfo":null,
         "main":[
            {
               "Status":0,
               "ID":111,
               "details":null
            }
         ]
      },
      "tn":"aaa"
   }
}

I want to explode the above one into multiple columns without hardcoding the schema.

I tried using schema_of_json to generate schema from the json string.

df_decoded = df_decoded.withColumn("json_column", F.when(F.col("value").isNotNull(), F.col("value")).otherwise("{}"))

# Infer the schema using schema_of_json
json_schema = df_decoded.select(F.schema_of_json(F.col("json_column"))).collect()[0][0]

df_decoded is my dataframe and value is my json string column name.

But it is giving me the below error –

AnalysisException: cannot resolve 'schema_of_json(json_column)' due to data type mismatch: The input json should be a foldable string expression and not null; however, got json_column.;

My expected output –
enter image description here

2

Answers


  1. Does this start you on the way ?

    import json
    import pandas as pd
    
    j = '''{
       "result":{
          "version":"1.2",
          "timeStamp":"2023-08-14 14:00:12",
          "description":"",
          "data":{
             "DateTime_Received":"2023-08-14T14:01:10.4516457+01:00",
             "DateTime_Actual":"2023-08-14T14:00:12",
             "OtherInfo":null,
             "main":[
                {
                   "Status":0,
                   "ID":111,
                   "details":null
                }
             ]
          },
          "tn":"aaa"
       }
    }'''
    
    
    text_json = json.loads(j)
    result=text_json.get("result", "")
    print(result.get("version", ""))
    
    results = [result["version"], result["timeStamp"], result["description"], result["data"], result["tn"] ]
    df = pd.DataFrame(results).transpose()
    print(df)
    

    I don’t have a real app to play with
    .transpose() is the change.

    https://stackoverflow.com/a/77263073/22187484
    This person has a complex answer for grouping and filtering that might help too.

    Login or Signup to reply.
  2. Use sparks inference engine to get the schema of json column then cast the json column to struct then use select expression to explode the struct fields as columns

    schema = spark.read.json(df.rdd.map(lambda r: r['value'])).schema
    result = df.withColumn('value', F.from_json('value', schema)).select('*', 'value.result.*')
    

    +--------------------+--------------------+-----------+-------------------+---+-------+
    |               value|                data|description|          timeStamp| tn|version|
    +--------------------+--------------------+-----------+-------------------+---+-------+
    |{{{2023-08-14T14:...|{2023-08-14T14:00...|           |2023-08-14 14:00:12|aaa|    1.2|
    +--------------------+--------------------+-----------+-------------------+---+-------+
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search