skip to Main Content

I have a JSON file in the following structure:

{
    "10712": {
        "id": "10712",
        "age": 27,
        "gender": "male"
    },
    "217": {
        "id": "217",
        "age": 60,
        "gender": "female"
    }
}

This causes a problem when importing using spark.read.json, because of the inconsistency in the schema: (10712, 217, etc.).

I’m trying to always replace the first JSON level with the string "user", like so:

{
    "user": {
        "id": "10712",
        "age": 27,
        "gender": "male"
    },
    "user": {
        "id": "217",
        "age": 60,
        "gender": "female"
    }
}

Alternatively, it would be also be fine to simply remove that schema level, to look like this:

[
    {
        "id": "10712",
        "age": 27,
        "gender": "male"
    },
    {
        "id": "217",
        "age": 60,
        "gender": "female"
    }
]

Thanks!

2

Answers


  1. Try with stack() + groupBy() + collect_list() functions to unnest the top level struct and recreate the struct by changing the name of the top level field.

    Example:

    df.printSchema()
    df = spark.read.json(sc.parallelize([json_string]), multiLine=True)
    
    #get the column names and add backtick to the names
    cols = [ f"`{i}`" for i in df.columns]
    
    #get the columns length of the dataframe
    col_len = len(cols)
    stack_expr = ','.join(cols)
    df.select(expr(f"stack({col_len},{stack_expr})")).
      groupBy(lit(1)).
        agg(to_json(collect_list(col("col0"))).alias("user")).drop("1").
          show(10,False)
    
    #get columns data into rows and perform groupby with collect_list
    df.select(expr(f"stack({col_len},{stack_expr})")).
      groupBy(lit(1)).
        agg(collect_list(col("col0")).alias("user")).drop("1").
          printSchema()
    

    Output:

    #input
    root
     |-- 10712: struct (nullable = true)
     |    |-- age: long (nullable = true)
     |    |-- gender: string (nullable = true)
     |    |-- id: string (nullable = true)
     |-- 217: struct (nullable = true)
     |    |-- age: long (nullable = true)
     |    |-- gender: string (nullable = true)
     |    |-- id: string (nullable = true)
    
    #dataframe after groupBy
    +---------------------------------------------------------------------------------+
    |user                                                                             |
    +---------------------------------------------------------------------------------+
    |[{"age":27,"gender":"male","id":"10712"},{"age":60,"gender":"female","id":"217"}]|
    +---------------------------------------------------------------------------------+
    
    #schema of the dataframe
    root
     |-- user: array (nullable = false)
     |    |-- element: struct (containsNull = false)
     |    |    |-- age: long (nullable = true)
     |    |    |-- gender: string (nullable = true)
     |    |    |-- id: string (nullable = true)
    
    Login or Signup to reply.
  2. with open("file_path.json", "r) as f:

    json_string = f.read()

    json_as_dict = json.loads(json_string)

    list_of_dicts = list(json_as_dict.values())

    df = spark.createDataFrame(list_of_dicts)

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search