I have a JSON file that I need to convert into tabular form using only PySpark.
My JSON file:
{
  "records": [
    {
      "name": "Priya",
      "last_name": "Munjal",
      "special_values": [
        {
          "name": "adress",
          "value": "some adress"
        },
        {
          "name": "city",
          "value": "Chd"
        },
        {
          "name": "zip_code",
          "value": "134112"
        }
      ]
    },
    {
      "name": "Neha",
      "last_name": "Miglani",
      "special_values": [
        {
          "name": "adress",
          "value": "some adress"
        },
        {
          "name": "city",
          "value": "kkr"
        },
        {
          "name": "zip_code",
          "value": "02221"
        }
      ]
    }
  ]
}
Result that I want:
name|last_name|address|city|zip_code
Priya|Munjal|some adress|Chd|134112
Neha|Miglani|some adress|kkr|02221
I have tried this code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

# Initialize SparkSession
spark = SparkSession.builder.appName("JSONTransformation").getOrCreate()

# Read the input JSON file
input_path = "path_to_your_spark_ex.json"
df = spark.read.json(input_path)

# Explode the 'records' array and select the required columns
flattened_df = df.select(
    explode(col('records')).alias('record')
).select(
    col('record.name').alias('name'),
    col('record.last_name').alias('last_name'),
    col('record.special_values').alias('special_values')
)

# Pull individual values out of the 'special_values' array by position
values_df = flattened_df.select(
    col('name'),
    col('last_name'),
    col('special_values')[0]['value'].alias('address'),
    col('special_values')[1]['value'].alias('city'),
    col('special_values')[2]['value'].alias('zip_code')
)

# Show the result
values_df.show()

# Stop SparkSession
spark.stop()
But I am not getting the expected result. I have already done this with pandas, but now it needs to be done only in PySpark.
2 Answers
Use special_values.value to get the array of values, the concat_ws function to concat the array of strings as comma-separated values, the from_csv function to apply a schema to the comma-separated values, and * to extract the attributes from the struct data type.
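A minimal sketch of that approach, reusing the file path and session name from the question (and assuming no value itself contains a comma, since concat_ws/from_csv round-trips through a CSV string):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, concat_ws, from_csv

spark = SparkSession.builder.appName("JSONTransformation").getOrCreate()

# multiLine is needed because each JSON object spans several lines
df = spark.read.option('multiLine', True).json("path_to_your_spark_ex.json")

records = df.select(explode(col('records')).alias('record'))

result = records.select(
    col('record.name').alias('name'),
    col('record.last_name').alias('last_name'),
    # record.special_values.value gathers the 'value' field of every
    # struct in the array into a single array<string>
    from_csv(
        concat_ws(',', col('record.special_values.value')),
        'address STRING, city STRING, zip_code STRING'
    ).alias('sv')
).select('name', 'last_name', 'sv.*')  # sv.* expands the struct into columns

result.show(truncate=False)

Like the positional indexing in your attempt, this assumes special_values always arrives in the order address, city, zip_code.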
Your code works for me after I add .option('multiLine', True) to the read method.
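For reference, the only change to the code in the question is the read line:

# Read the whole multi-line JSON document instead of expecting
# one JSON object per line (the default for spark.read.json)
df = spark.read.option('multiLine', True).json(input_path)

With the sample data above, values_df.show() should then print something like:

+-----+---------+-----------+----+--------+
| name|last_name|    address|city|zip_code|
+-----+---------+-----------+----+--------+
|Priya|   Munjal|some adress| Chd|  134112|
| Neha|  Miglani|some adress| kkr|   02221|
+-----+---------+-----------+----+--------+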