Question posted in Json
Our archive of expertly curated questions and answers provides insights and solutions to common problems related to this popular data interchange format. From parsing and manipulating JSON data to integrating it with various programming languages and web services, our archive has got you covered. Start exploring today and take your JSON skills to the next level

How to Flatten JSON file using pyspark

Priya
August 8, 2023
250 views
0 votes
2 Answers

I need to flatten JSON file so that I can get output in table format.Ihavetried but not getting the output that I want

This is my JSON file :-

{
    "records": [
        {
            "name": "A",
            "last_name": "B",
            "special_values": [
                {
                    "name": "address",
                    "value": "some adress"
                },
                {
                    "name": "city",
                    "value": "Chd"
                },
                {
                    "name": "zip_code",
                    "value": "160036"
                }
            ]
        },
        {
            "name": "X",
            "last_name": "Y",
            "special_values": [
                {
                    "name": "adress",
                    "value": "some adress"
                },
                {
                    "name": "city",
                    "value": "Dallas"
                },
                {
                    "name": "zip_code",
                    "value": "02431"
                }
            ]
        }
    ]
}

I want this output:-

|name|last_name|address|city  |zip_code|FIELD7|
|----|---------|-------|------|--------|------|
|A   |B        |some   |adress|chd     |160036|
|X   |Y        |some   |adress|Dallas  |02431 |

I have tried this code but getting different output :-

df = spark.read.option("multiline","true").json(r"C:UsersLajoDownloadsspark_ex2_input.json")
from pyspark.sql.types import *
from pyspark.sql.functions import explode_outer,col


def flatten(df):
   # compute Complex Fields (Lists and Structs) in Schema   
   complex_fields = dict([(field.name, field.dataType)
                             for field in df.schema.fields
                             if type(field.dataType) == ArrayType or  type(field.dataType) == StructType])
   while len(complex_fields)!=0:
      col_name=list(complex_fields.keys())[0]
      print ("Processing :"+col_name+" Type : "+str(type(complex_fields[col_name])))
    
      # if StructType then convert all sub element to columns.
      # i.e. flatten structs
      if (type(complex_fields[col_name]) == StructType):
         expanded = [col(col_name+'.'+k).alias(col_name+'_'+k) for k in [ n.name for n in  complex_fields[col_name]]]
         df=df.select("*", *expanded).drop(col_name)
    
      # if ArrayType then add the Array Elements as Rows using the explode function
      # i.e. explode Arrays
      elif (type(complex_fields[col_name]) == ArrayType):    
         df=df.withColumn(col_name,explode_outer(col_name))
    
      # recompute remaining Complex Fields in Schema       
      complex_fields = dict([(field.name, field.dataType)
                             for field in df.schema.fields
                             if type(field.dataType) == ArrayType or  type(field.dataType) == StructType])
   return df

df_flatten = flatten(df)
df_flatten.show()

Answers

in my opinion it is much easier with the library ‘Pandas‘ and less code is needed.

df = pd.read_json("your_json_file.json")
records = df["records"]
flattened_data = []
for record in records:
    row = {
        "name": record["name"],
        "last_name": record["last_name"]
    }

    for special_value in record["special_values"]:
        row[special_value["name"]] = special_value["value"]
    flattened_data.append(row)
df = pd.DataFrame(flattened_data)
print(df.to_csv(sep="|", index=False))

the result looks like this

name|last_name|address|city|zip_code|adress
A|B|some adress|Chd|160036|
X|Y||Dallas|02431|some adress

- JoshuaShew
- August 8, 2023 at 9:50 am
- 0 votes
0
One option is to flatten the data before making it into a data frame. Consider reading the JSON file with the built-in json library. Then you can perform the following operation on the resulting data object.
```
data = data["records"] # It seems that the data you want is in "records"
for entry in data:
    for special_value in entry["special_values"]: # Add each special value to the entry
        entry[special_value["name"]] = special_value["value"]
    del entry["special_values"] # Remove the "special_values" list
```
Then you can create a data frame from the data object.
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.