I have a Spark dataset with the fields "identifier_id", "inner_blob", and "json_blob":
"inner_blob": {
"identifier_id": 2.0,
"name": "test1",
"age": 30.0
},
"identifier_id": 2.0,
"json_blob": {
"identifier_id": 2.0,
"order_id": 2.0,
"inner_blob": [
{
"item_id": 23.0,
"item_name": "airpods2",
"item_price": 300.0
},
{
"item_id": 23.0,
"item_name": "airpods1",
"item_price": 600.0
}
]
}
}
How can I merge the values of the two columns "inner_blob" and "json_blob" into a single column "json_blob", while the "identifier_id" column remains the same? The expected output looks like this:
"identifier_id": 2.0,
"json_blob": {
"identifier_id": 2.0,
"name": "test1",
"age": 30.0
"order_id": 2.0,
"inner_blob": [
{
"item_id": 23.0,
"item_name": "airpods2",
"item_price": 300.0
},
{
"item_id": 23.0,
"item_name": "airpods1",
"item_price": 600.0
}
]
}
}
2 Answers
To add a new field to a column of type struct from another column in Apache Spark, you can use the struct function:
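A minimal sketch of this approach, assuming a DataFrame named df with the three columns above and the field names taken from the question's sample data:

import org.apache.spark.sql.functions.{col, struct}

// Rebuild json_blob as a new struct holding the fields of inner_blob
// plus the fields of json_blob (df and the field names are assumptions
// based on the sample data in the question).
val merged = df.select(
  col("identifier_id"),
  struct(
    col("json_blob.identifier_id"),
    col("inner_blob.name"),
    col("inner_blob.age"),
    col("json_blob.order_id"),
    col("json_blob.inner_blob")
  ).alias("json_blob")
)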
UPDATE:
If you are using Spark >= 3.1, you can use dropFields to drop the unneeded fields from a struct:
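A minimal sketch, assuming Spark >= 3.1 and the same DataFrame df; withField copies the fields of inner_blob into json_blob, and the field name passed to dropFields is only an illustration:

import org.apache.spark.sql.functions.col

// withField adds/overwrites fields of a struct column in place; dropFields
// removes struct fields that are not needed in the result.
// "unwanted_field" below is a hypothetical name, shown for illustration.
val merged = df
  .withColumn(
    "json_blob",
    col("json_blob")
      .withField("name", col("inner_blob.name"))
      .withField("age", col("inner_blob.age"))
      .dropFields("unwanted_field")
  )
  .drop("inner_blob")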
If we want to be completely agnostic of the fields that are inside inner_blob and json_blob, we can use the schema to get the column names. Then we need to decide what to do if a name is present in both structs. Let's decide to take the one from inner_blob and drop the one from json_blob (the diff in the code below), but we can adjust the code for any other logic.
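A minimal sketch of that schema-driven approach, again assuming a DataFrame named df with the columns shown in the question:

import org.apache.spark.sql.functions.{col, struct}
import org.apache.spark.sql.types.StructType

// Read the field names of both structs from the schema, so nothing is
// hardcoded. Fields present in both structs are taken from inner_blob;
// the diff drops the duplicates coming from json_blob. Swap the roles
// of the two lists (or change the diff) for any other rule.
val innerFields = df.schema("inner_blob").dataType.asInstanceOf[StructType].fieldNames.toSeq
val jsonFields  = df.schema("json_blob").dataType.asInstanceOf[StructType].fieldNames.toSeq

val mergedCols =
  innerFields.map(f => col(s"inner_blob.$f")) ++
  jsonFields.diff(innerFields).map(f => col(s"json_blob.$f"))

val merged = df.select(
  col("identifier_id"),
  struct(mergedCols: _*).alias("json_blob")
)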