
I have a JSON file like the one below, and I need to read it and generate a table with the person's attributes.

{
  "person":[
      [
      "name",
      "Guy"
      ],
      [
      "age",
      "25"
      ],
      [
       "height",
       "2.00"
      ]
  ]
}
The output I need is a table like this:

name  age  height
Guy   25   2.00

What's the easiest and most performant way to read this JSON and output a table?

I'm thinking about converting the list to key-value pairs, but since I'm working with a lot of data, that would perform poorly.

And I’m having trouble exploding it because of other data in the dataframe.
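
The explode-based approach I have in mind is roughly this (just a sketch to show what I mean; the file path is made up, and it ignores the other columns that are giving me trouble):

import pyspark.sql.functions as f

df = spark.read.option("multiLine", "true").json("./test_json.json")

# explode the outer array so each [key, value] pair becomes its own row,
# then pivot the keys back into columns
pairs = (
    df.select(f.explode("person").alias("pair"))
      .select(f.col("pair")[0].alias("key"), f.col("pair")[1].alias("value"))
)
pairs.groupBy().pivot("key").agg(f.first("value")).show()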

2 Answers


  1. You can do this with the command below; specify multiLine=True:

    your_df = spark.read.option("multiLine", "true").json(
        "yourjsonpath.json"
    )
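
    For the sample JSON in the question, that read should produce a schema roughly like this (a sketch of the expected output; the person array still has to be reshaped into columns afterwards):

    your_df.printSchema()
    # root
    #  |-- person: array (nullable = true)
    #  |    |-- element: array (containsNull = true)
    #  |    |    |-- element: string (containsNull = true)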
    

    This question has also been answered before:
    How to create a spark DataFrame from Nested JSON structure

  2. Try this:

    import pyspark.sql.functions as f
    from pyspark.sql.types import StructType, StructField, StringType
    
    # get the field names (the first element of each [key, value] pair) for person
    # './test_json.json' is the path for the json file.
    fields = (
        spark.read.option('multiLine', True).json('./test_json.json')
        .select(f.expr('transform(person, element -> element[0])').alias('fields'))
        .take(1)[0]['fields']
    )
    print(fields)
    
    df = (
        spark.read.option('multiLine', True).json('./test_json.json')
        # rebuild each person array as a JSON object string,
        # e.g. {'name':'Guy','age':'25','height':'2.00'}
        .withColumn('json_string', f.concat(
                f.lit('{'),
                f.concat_ws(',', f.expr("""transform(person, element -> concat_ws(":", concat("'", element[0], "'"), concat("'", element[1], "'")))""")),
                f.lit('}')
            )
        )
        # parse that string with a schema built from the field names collected above
        .withColumn('json_content', f.from_json(f.col('json_string'), StructType([StructField(element, StringType(), True) for element in fields])))
        .select('json_content.*')
    )
    df.show(truncate=False)
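
    With the sample JSON from the question, df.show(truncate=False) should print something like:

    +----+---+------+
    |name|age|height|
    +----+---+------+
    |Guy |25 |2.00  |
    +----+---+------+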
    