PySpark – explode a JSON string column into multiple columns without specifying the schema

priyanka
October 10, 2023
2 Answers

I have the JSON string below as a column in a PySpark DataFrame.

{
   "result":{
      "version":"1.2",
      "timeStamp":"2023-08-14 14:00:12",
      "description":"",
      "data":{
         "DateTime_Received":"2023-08-14T14:01:10.4516457+01:00",
         "DateTime_Actual":"2023-08-14T14:00:12",
         "OtherInfo":null,
         "main":[
            {
               "Status":0,
               "ID":111,
               "details":null
            }
         ]
      },
      "tn":"aaa"
   }
}

I want to explode this JSON into multiple columns without hardcoding the schema.

I tried using schema_of_json to generate the schema from the JSON string.

df_decoded = df_decoded.withColumn("json_column", F.when(F.col("value").isNotNull(), F.col("value")).otherwise("{}"))

# Infer the schema using schema_of_json
json_schema = df_decoded.select(F.schema_of_json(F.col("json_column"))).collect()[0][0]

Here df_decoded is my DataFrame and value is the name of my JSON string column.

But it gives me the following error:

AnalysisException: cannot resolve 'schema_of_json(json_column)' due to data type mismatch: The input json should be a foldable string expression and not null; however, got json_column.;

My expected output –

Answers

Does this get you started?

import json
import pandas as pd

j = '''{
   "result":{
      "version":"1.2",
      "timeStamp":"2023-08-14 14:00:12",
      "description":"",
      "data":{
         "DateTime_Received":"2023-08-14T14:01:10.4516457+01:00",
         "DateTime_Actual":"2023-08-14T14:00:12",
         "OtherInfo":null,
         "main":[
            {
               "Status":0,
               "ID":111,
               "details":null
            }
         ]
      },
      "tn":"aaa"
   }
}'''


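# Parse the JSON string and take the top-level "result" object
# (default to an empty dict so the .get() calls below cannot fail on a str).
text_json = json.loads(j)
result = text_json.get("result", {})
print(result.get("version", ""))

# Collect the values, then transpose so they form one row instead of one column.
results = [result["version"], result["timeStamp"], result["description"], result["data"], result["tn"]]
df = pd.DataFrame(results).transpose()
print(df)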

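I don't have a real app to play with, so this uses plain pandas rather than Spark.
The key change is .transpose(), which turns the single column of values into a single row.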
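The snippet above still names each key by hand. As a rough sketch of how the keys could be discovered instead, the nested object can be flattened recursively into dotted column names. The helper name flatten and the shortened sample string are illustrative assumptions, not part of the original answer.

```python
import json


def flatten(obj, prefix=""):
    """Recursively flatten nested dicts into one dict with dotted keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Descend into nested objects, carrying the dotted prefix along.
            flat.update(flatten(value, name))
        else:
            # Scalars, nulls and lists are stored unchanged under the dotted key.
            flat[name] = value
    return flat


# Shortened version of the JSON from the question (assumption: same shape).
sample = (
    '{"result": {"version": "1.2", '
    '"data": {"OtherInfo": null, "main": [{"Status": 0, "ID": 111}]}, '
    '"tn": "aaa"}}'
)

row = flatten(json.loads(sample)["result"])
print(sorted(row))
```

The resulting dict can be wrapped as pd.DataFrame([row]) to get one row with one column per discovered key. pandas also ships pd.json_normalize, which performs a similar flattening. Lists such as data.main are left as list values here and would need a separate step to expand.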

https://stackoverflow.com/a/77263073/22187484
This person has a complex answer for grouping and filtering that might help too.

Use Spark's schema inference to get the schema of the JSON column, cast the column to a struct with from_json, then use a select expression to expand the struct fields into columns.

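from pyspark.sql import functions as F
# Infer the schema from the JSON strings, parse the column into a struct, then expand result.* into columns
schema = spark.read.json(df.rdd.map(lambda r: r['value'])).schema
result = df.withColumn('value', F.from_json('value', schema)).select('*', 'value.result.*')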

+--------------------+--------------------+-----------+-------------------+---+-------+
|               value|                data|description|          timeStamp| tn|version|
+--------------------+--------------------+-----------+-------------------+---+-------+
|{{{2023-08-14T14:...|{2023-08-14T14:00...|           |2023-08-14 14:00:12|aaa|    1.2|
+--------------------+--------------------+-----------+-------------------+---+-------+
