skip to Main Content

I have a array with JSON strings and each JSON has a different key. Also, the size of the array is not fixed and may increase or decrease in occurrences.
An example of this is below :

"[{"key1": 10},{"key2":"val2"}]"

I have this as a column in a delta table which is the data which has been ingested from the vendor.

My question is how do we convert this to a format so I can easily access the keys and values for each occurrence using only Databricks SQL?
If the json has a fixed key, it can easily be achieved by casting into a STRUCT. But since the keys are different, I cannot use that.

I tried using explode which gave me 2 rows, each with a json string. But again, How can I get the keys and values from each of those json strings using just Databricks SQL? Also, in this case, the Integer values are also converted to strings which means I lose the datatype.
What I’m looking for is a VARIANT version of this.

[{"key_name":"key1","key_value":10},{"key_name":"key2","key_value":"val2"}]

What I have done so far :

%sql
CREATE OR REPLACE TEMPORARY VIEW tmp_array1 AS
SELECT '[{"key1":10},{"key2":"val2"}]' as properties;

CREATE OR REPLACE TEMPORARY VIEW tmp_v1 AS
SELECT from_json(properties, 'ARRAY<STRING>') as col FROM tmp_array1;
SELECT * FROM tmp_v1;

Gives me

enter image description here

And then

%sql
SELECT EXPLODE(col) FROM tmp_v1;

Gives me

enter image description here

So, I can access each occurences but how I access each value when I do not know what the key name may be?

2

Answers


  1. Chosen as BEST ANSWER

    I was able to achieve this using Databricks functions using Python. To use this, Unity catalog is mandatory.

    First I create my dummy test data

    %sql
    CREATE OR REPLACE TEMPORARY VIEW tmp_array1 AS
        SELECT '[{"key1":10},{"key2":"val2"}]' as properties;
    
    SELECT * FROM tmp_array1;
    

    enter image description here

    Create a function

    %sql
    CREATE OR REPLACE TEMPORARY FUNCTION test_func(s STRING)
      RETURNS STRING
      LANGUAGE PYTHON
      AS $$
        import json
        def flattenjson1(inputjson):
          dict = json.loads(inputjson)
          out= []
          for i in dict:
              for key, value in i.items():  
                  out.append({"key_name":key, "key_value":value})
          #print(out)
          return json.dumps(out)
        return flattenjson1(s) if s else None
      $$
    

    Now I can parse the flattened string.

    %sql
    CREATE OR REPLACE TEMPORARY VIEW view3 AS 
    SELECT *, test_func(properties) as flatten from tmp_array1;
    
    SELECT *,parse_json(flatten) as variant FROM view3;
    

    enter image description here


  2. I’m not exactly sure what you’re looking for, but here is some examples of transforming the data in such a way that at the end I’m working with rows of key and value columns.

    ;WITH json_parsed_as_array_of_maps as 
    (
      SELECT 
        parse_json('[{"key1":10},{"key2":"val2"}]')::ARRAY<MAP<STRING,STRING>> as arr
    )--select * from json_parsed_as_array_of_maps
    , unified_map AS
    (
      SELECT 
        map_from_arrays
        (
          flatten(transform(arr, x -> MAP_KEYS(x))),
          flatten(transform(arr, x -> MAP_VALUES(x)))
        ) as one_map
      FROM json_parsed_as_array_of_maps
    )--select * from unified_map
    , exploded as
    (
      SELECT
        explode(map_entries(one_map)) as pair 
      FROM unified_map
    )--select * from exploded
    SELECT
      pair.key, 
      pair.value 
    FROM exploded
    
    /*
    key value
    ----------
    key1    10
    key2    val2
    */
    

    and another way, this time presenting the combination of all the key/value pairs in a JSON array:

    ;WITH j as 
    (
      SELECT 
        parse_json('[{"key1":10},{"key2":"val2"},{"key3":null}]')::ARRAY<STRING> as arr
    )--select * from j
    , unified_array(array_string) as
    (
      SELECT 
        '['||
          ARRAY_JOIN
          (
            COLLECT_SET
            (
              --somewhat dirty string manipulation here (flawed if key or value string contains a colon)
              '{"key":'||REPLACE(SUBSTRING(item, 2, LEN(item)-2), ':', ',"value":')||'}'
            ), 
            ','
          )
        ||']' 
        FROM (SELECT EXPLODE(arr) as item FROM j) 
    )--SELECT * FROM unified_array
    SELECT 
      *
      --array_string:[2].key, array_string:[1].value
    FROM unified_array
    
    --[{"key":"key2","value":"val2"},{"key":"key3","value":null},{"key":"key1","value":10}]
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search