I have JSON data that I want to represent in tabular form and later write out to a different format (Parquet).
Schema
root
|-- : string (nullable = true)
sample data
+----------------------------------------------+
|                                              |
+----------------------------------------------+
|{"deviceTypeId":"A2A","deviceId":"123","geo...|
|{"deviceTypeId":"A2B","deviceId":"456","geo...|
+----------------------------------------------+
Expected Output
+------------+--------+---+
|deviceTypeId|deviceId|...|
+------------+--------+---+
|         A2A|     123|   |
|         A2B|     456|   |
+------------+--------+---+
I tried splitting the string, but this doesn't seem like an efficient approach:
split_col = split(df_explode[''], ',')
And then extracting the columns, but the result still carries pieces of the original string (keys, braces, quotes):
df_1 = df_explode.withColumn('deviceId', split_col.getItem(1))
# df_1 = df_explode.withColumn('deviceTypeId', split_col.getItem(0))
printOutput(df_1)
I'm looking for a better way to solve this problem.
2 Answers
I was able to do it using the from_json function.
We need to create a JSON schema that from_json will use to parse the data.
Note that the column name is the empty string in this data, so the column has to be referenced as ''.
The explode function only works on arrays.
In your case the column holds JSON, so you should use the from_json function.
Please refer to from_json in pyspark.sql.functions.