I have a JSON string substitutions
as a column in dataframe which has multiple array elements that I want to explode and create a new row for each element present in that array. There are other columns present in the df
My dataframe looks like this:
+--------------------+----------+--------------------+--------------------+----------+--------------------+---------+--------+--------------------+
| requestid|sourcepage| cartid| tm| dt| customerId| usItemId|prefType| substitutions|
+--------------------+----------+--------------------+--------------------+----------+--------------------+---------+--------+--------------------+
|00-efbedfe05b4482...| CHECKOUT|808b44cc-1a38-4dd...|2023-04-25 00:07:...|2023-04-25|f1a34e16-a6d0-6f5...|862776084| NO_PREF|{"id":{"productId...|
+--------------------+----------+--------------------+--------------------+----------+--------------------+---------+--------+--------------------+
json string column substitutions
[
{
"id": {
"productId": "2N3UYGUTROQK",
"usItemId": "32667929"
},
"usItemId": "32667929",
"itemRank": 1,
"customerChoice": false
},
{
"id": {
"productId": "2N3UYGUTRHQK",
"usItemId": "32667429"
},
"usItemId": "32667429",
"itemRank": 2,
"customerChoice": true
},
{
"id": {
"productId": "2N3UYGUTRYQK",
"usItemId": "32667529"
},
"usItemId": "32667529",
"itemRank": 3,
"customerChoice": false
},
{
"id": {
"productId": "2N3UYGUTIQK",
"usItemId": "32667329"
},
"usItemId": "32667329",
"itemRank": 4,
"customerChoice": false
},
{"id": {
"productId": "2N3UYGUTYOQK",
"usItemId": "32663929"
},
"usItemId": "32663929",
"itemRank": 5,
"customerChoice": false
}
]
I have tried the below but not getting desired results
df.select("*", f.explode(f.from_json("substitutions", MapType(StringType(),StringType()))))
+--------------------+----------+--------------------+--------------------+----------+--------------------+---------+--------+--------------------+-------+
| requestid|sourcepage| cartid| tm| dt| customerId| usItemId|prefType| substitutions|entries|
+--------------------+----------+--------------------+--------------------+----------+--------------------+---------+--------+--------------------+-------+
|00-efbedfe05b4482...| CHECKOUT|808b44cc-1a38-4dd...|2023-04-25 00:07:...|2023-04-25|f1a34e16-a6d0-6f5...|862776084| NO_PREF|[{"id":{"productI...| null|
+--------------------+----------+--------------------+--------------------+----------+--------------------+---------+--------+--------------------+-------+
What mistake am I doing here?
2
Answers
Your issue is that you’re not properly defining the schema of your json.
Let’s start by making a simplified version of your
df
(you don’t need to do this):Now, in order to be able to parse in the json (for example using the
from_json
function like you’re doing), you need to define the correct schema to be used. After having done that, you canexplode
your dataframe:Is the
substitution
-column a string, or is it recognised as JSON by Spark?If it’s recognised as JSON you can simply remove the
from_json
:Otherwise you have to provide the entire schema of your JSON. This worked for me with the JSON you posted above: