skip to Main Content

I have been able to write to console the json file I want to work on to console. Please, how do I separate the ‘value’ column into columns of data as in the json and write to delta lake for sql query and MLlib? Thanks.

{"coord": {"lon": -1.15, "lat": 52.95}, "list": [{"main": {"aqi": 2}, "components": {"co": 220.3, "no": 0.26, "no2": 5.14, "o3": 75.1, "so2": 1.54, "pm2_5": 1.8, "pm10": 2.71, "nh3": 2.79}, "dt": 1679418000}, {"main": {"aqi": 2}, "components": {"co": 220.3, "no": 0.07, "no2": 7.45, "o3": 72.24, "so2": 2.18, "pm2_5": 1.9, "pm10": 2.9, "nh3": 3.45}, "dt": 1679421600}}

Value result image here

2

Answers


  1. Chosen as BEST ANSWER

    I defined an arraytype struct schema for the json value I want to explode;

    schema = "array<struct<main:struct<aqi:int>, components:struct<co:double, no:double, no2:double ...>>"
    

    Then create a data frame with;

    df_new = df.select(get_json_object(col("value"), "$.coord").alias("coord"),explode(from_json(get_json_object(col("value"),"$.list"), schema)).alias("exploded_col"),"timestamp")
    

    I extracted value of the explode column into different columns with;

    df_stream = df_new.select(
      col("exploded_col.main.aqi").alias("aqi"), 
      col("exploded_col.components.co").alias("co"), 
      col("exploded_col.components.no").alias("no")
      ...
    )
    

  2. Use get_json_object for each field you want, ex.

    get_json_object(col("value"), "$.coord").alias("coord")
    

    For the list field, you need to explode

    explode(get_json_object(col("value"), "$.list"))) 
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search