I have a array with JSON strings and each JSON has a different key. Also, the size of the array is not fixed and may increase or decrease in occurrences.
An example of this is below :
"[{"key1": 10},{"key2":"val2"}]"
I have this as a column in a delta table which is the data which has been ingested from the vendor.
My question is how do we convert this to a format so I can easily access the keys and values for each occurrence using only Databricks SQL?
If the json has a fixed key, it can easily be achieved by casting into a STRUCT. But since the keys are different, I cannot use that.
I tried using explode which gave me 2 rows, each with a json string. But again, How can I get the keys and values from each of those json strings using just Databricks SQL? Also, in this case, the Integer values are also converted to strings which means I lose the datatype.
What I’m looking for is a VARIANT version of this.
[{"key_name":"key1","key_value":10},{"key_name":"key2","key_value":"val2"}]
What I have done so far :
%sql
CREATE OR REPLACE TEMPORARY VIEW tmp_array1 AS
SELECT '[{"key1":10},{"key2":"val2"}]' as properties;
CREATE OR REPLACE TEMPORARY VIEW tmp_v1 AS
SELECT from_json(properties, 'ARRAY<STRING>') as col FROM tmp_array1;
SELECT * FROM tmp_v1;
Gives me
And then
%sql
SELECT EXPLODE(col) FROM tmp_v1;
Gives me
So, I can access each occurences but how I access each value when I do not know what the key name may be?
2
Answers
I was able to achieve this using Databricks functions using Python. To use this, Unity catalog is mandatory.
First I create my dummy test data
Create a function
Now I can parse the flattened string.
I’m not exactly sure what you’re looking for, but here is some examples of transforming the data in such a way that at the end I’m working with rows of key and value columns.
and another way, this time presenting the combination of all the key/value pairs in a JSON array: