I am new to PySpark and I am struggling to flatten a nested JSON file into a PySpark DataFrame.
I need to define the schema for the JSON data. I know how to define a schema for regular JSON files, but I get stuck in this case: I need a generalised version of this example, not a hard-coded one.
In this example I have a verified manifest: the string AB1 should go under verifiedManifest_name and its quantity under verifiedManifest_quantity. I don't know how many entries verifiedManifest will have, nor what they will be named.
"verifiedManifest":{"AB1":{"quantity":1}, "DE5":{"quantity":5}, "AG1":{"quantity":10}}
Output for this case will look like this:
verifiedManifest_name    verifiedManifest_quantity
AB1                      1
DE5                      5
AG1                      10
DOL1                     100
BG1                      3
etc...
2 Answers
Suppose you have your data in a JSON file. Read it with the read.json method, then use the stack function on all the columns dynamically to unpivot them into rows.