Question posted in Json
Our archive of expertly curated questions and answers provides insights and solutions to common problems related to this popular data interchange format. From parsing and manipulating JSON data to integrating it with various programming languages and web services, our archive has got you covered. Start exploring today and take your JSON skills to the next level

Spark handle json with dynamically named subschema

JonathanM
May 23, 2023
261 views
0 votes
2 Answers

I have a JSON file in the following structure:

{
    "10712": {
        "id": "10712",
        "age": 27,
        "gender": "male"
    },
    "217": {
        "id": "217",
        "age": 60,
        "gender": "female"
    }
}

This causes a problem when importing using spark.read.json, because of the inconsistency in the schema: (10712, 217, etc.).

I’m trying to always replace the first JSON level with the string "user", like so:

{
    "user": {
        "id": "10712",
        "age": 27,
        "gender": "male"
    },
    "user": {
        "id": "217",
        "age": 60,
        "gender": "female"
    }
}

Alternatively, it would be also be fine to simply remove that schema level, to look like this:

[
    {
        "id": "10712",
        "age": 27,
        "gender": "male"
    },
    {
        "id": "217",
        "age": 60,
        "gender": "female"
    }
]

Thanks!

Answers

Try with stack() + groupBy() + collect_list() functions to unnest the top level struct and recreate the struct by changing the name of the top level field.

Example:

df.printSchema()
df = spark.read.json(sc.parallelize([json_string]), multiLine=True)

#get the column names and add backtick to the names
cols = [ f"`{i}`" for i in df.columns]

#get the columns length of the dataframe
col_len = len(cols)
stack_expr = ','.join(cols)
df.select(expr(f"stack({col_len},{stack_expr})")).
  groupBy(lit(1)).
    agg(to_json(collect_list(col("col0"))).alias("user")).drop("1").
      show(10,False)

#get columns data into rows and perform groupby with collect_list
df.select(expr(f"stack({col_len},{stack_expr})")).
  groupBy(lit(1)).
    agg(collect_list(col("col0")).alias("user")).drop("1").
      printSchema()

Output:

#input
root
 |-- 10712: struct (nullable = true)
 |    |-- age: long (nullable = true)
 |    |-- gender: string (nullable = true)
 |    |-- id: string (nullable = true)
 |-- 217: struct (nullable = true)
 |    |-- age: long (nullable = true)
 |    |-- gender: string (nullable = true)
 |    |-- id: string (nullable = true)

#dataframe after groupBy
+---------------------------------------------------------------------------------+
|user                                                                             |
+---------------------------------------------------------------------------------+
|[{"age":27,"gender":"male","id":"10712"},{"age":60,"gender":"female","id":"217"}]|
+---------------------------------------------------------------------------------+

#schema of the dataframe
root
 |-- user: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- age: long (nullable = true)
 |    |    |-- gender: string (nullable = true)
 |    |    |-- id: string (nullable = true)

- LeelaSamyuktha
- May 23, 2023 at 10:34 pm
- 0 votes
0
with open("file_path.json", "r) as f:

json_string = f.read()

json_as_dict = json.loads(json_string)

list_of_dicts = list(json_as_dict.values())

df = spark.createDataFrame(list_of_dicts)

Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.