I have a JSON file that looks like this:
[
  {
    "event_date": "20221207",
    "user_properties": [
      {
        "key": "user_id",
        "value": {
          "set_timestamp_micros": "1670450329209558"
        }
      },
      {
        "key": "doc_id",
        "value": {
          "set_timestamp_micros": "1670450329209558"
        }
      }
    ]
  },
  {
    "event_date": "20221208",
    "user_properties": [
      {
        "key": "account_id",
        "value": {
          "int_value": "3176465",
          "set_timestamp_micros": "1670450323992556"
        }
      },
      {
        "key": "user_id",
        "value": {
          "string_value": "430fdfc579f55f9859173c1bea39713dc11c3ba62e83c24830e3d5936f43c26d",
          "set_timestamp_micros": "1670450323992556"
        }
      }
    ]
  }
]
When I read it using `spark.read.json(JSON_PATH)`, I get the following schema:
root
|-- event_date: string (nullable = true)
|-- user_properties: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: struct (nullable = true)
| | | |-- int_value: string (nullable = true)
| | | |-- set_timestamp_micros: string (nullable = true)
| | | |-- string_value: string (nullable = true)
I need to parse it using pyspark and the result dataframe should be like this:
event_date | up_account_id_int | up_account_id_set_timestamp_micros | up_doc_id_set_timestamp_micros | up_user_id_set_timestamp_micros | up_user_id_string |
---|---|---|---|---|---|
20221208 | 3176465 | 1670450323992556 | null | 1670450323992556 | 430fdfc579f55f9859173c1bea39713dc11c3ba62e83c24830e3d5936f43c26d |
20221207 | null | null | 1670450329209558 | 1670450329209558 | null |
Any ideas on how I can accomplish this?
2 Answers
You can use a small recursive helper that walks the DataFrame's schema and promotes every nested struct field to a top-level column. Before flattening, each `value` is a struct holding `int_value`, `string_value`, and `set_timestamp_micros`; after flattening, each of those leaf fields becomes its own column.
First you can `explode` the array, then flatten the struct with `select`. It also looks like you are pivoting the data on `key`. This won't give you exactly the dataframe you posted, but you should be able to transform the result into whatever shape you like.