I have a JSON file in the following structure:
{
  "10712": {
    "id": "10712",
    "age": 27,
    "gender": "male"
  },
  "217": {
    "id": "217",
    "age": 60,
    "gender": "female"
  }
}
This causes a problem when importing with spark.read.json, because the top-level keys (10712, 217, etc.) make the schema inconsistent.
I’m trying to always replace the first JSON level with the string "user", like so:
{
  "user": {
    "id": "10712",
    "age": 27,
    "gender": "male"
  },
  "user": {
    "id": "217",
    "age": 60,
    "gender": "female"
  }
}
Alternatively, it would also be fine to simply remove that schema level, so that it looks like this:
[
  {
    "id": "10712",
    "age": 27,
    "gender": "male"
  },
  {
    "id": "217",
    "age": 60,
    "gender": "female"
  }
]
Thanks!
2 Answers
Try the stack(), groupBy(), and collect_list() functions to unnest the top-level struct, then recreate the struct under a new top-level field name.
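A rough sketch of how the stack() + groupBy() + collect_list() idea could look in PySpark. This is not the answerer's original example. The helper names build_stack_expr and regroup_under_user are made up for illustration, and it assumes the file was read with the multiLine option so that each top-level key became one struct column. It also assumes every inner object has the same fields, since stack() needs all stacked columns to share one type.

```python
def build_stack_expr(columns):
    """Build a Spark SQL stack() expression that turns one struct column
    per top-level key into rows of (key, user)."""
    # Backticks are needed because the column names are numeric strings.
    pairs = ", ".join(f"'{c}', `{c}`" for c in columns)
    return f"stack({len(columns)}, {pairs}) as (`key`, `user`)"


def regroup_under_user(df):
    """Hypothetical helper. `df` is assumed to come from something like
    spark.read.option("multiLine", True).json("file_path.json")."""
    from pyspark.sql import functions as F  # lazy import; requires PySpark

    # One row per former top-level key, with the inner struct in `user`.
    unnested = df.selectExpr(build_stack_expr(df.columns))
    # Collect all structs into a single array column named `user`.
    return unnested.groupBy().agg(F.collect_list("user").alias("user"))
```

If you only need the flat rows (the second form in the question), stopping after the stack() step and selecting "user.*" should be enough, so the groupBy() is only needed when you want everything back under one field.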
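import json

# Assumes `spark` is an existing SparkSession (e.g. in the pyspark shell or a notebook).
with open("file_path.json", "r") as f:
    json_string = f.read()
# Parse the file, then keep only the inner objects, dropping the top-level keys.
json_as_dict = json.loads(json_string)
list_of_dicts = list(json_as_dict.values())
df = spark.createDataFrame(list_of_dicts)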
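To see what the values() step does without needing Spark, here is a small standalone sketch using the sample data from the question (the inline string stands in for the file contents):

```python
import json

# Sample data copied from the question; stands in for the file contents.
json_string = """
{
  "10712": {"id": "10712", "age": 27, "gender": "male"},
  "217": {"id": "217", "age": 60, "gender": "female"}
}
"""

json_as_dict = json.loads(json_string)
# values() drops the top-level keys and keeps the inner objects in file order.
list_of_dicts = list(json_as_dict.values())

print(json.dumps(list_of_dicts, indent=2))
```

Nothing is lost by dropping the keys here, because each inner object already repeats its key in the "id" field.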