Read JSON in PySpark - PhpOut

DouglasOliveira
December 20, 2022
2 Answers

I want to read a JSON file in PySpark, but the file is in this format (no commas between the objects and no enclosing square brackets):

{"id": 1, "name": "jhon"}
{"id": 2, "name": "bryan"}
{"id": 3, "name": "jane"}

Is there an easy way to read this JSON in PySpark?

I have already tried this code:

df= spark.read.option("multiline", "true").json("data.json")
df.write.parquet("data.parquet")

But it doesn’t work: only the first line appears in the Parquet file.

I just want to read this JSON file and save it as Parquet.

Tags: amazon-web-services aws-glue json pyspark python

Answers

- JayPeerachai
- December 20, 2022 at 5:15 am
- 0 votes
Try reading it as a text file first, then parse each line into a JSON object:
```
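import json
from pyspark.sql import Row

# Read each line of the file as a plain string (single column "value");
# `spark` is assumed to be an existing SparkSession
lines = spark.read.text("data.json")
# Parse every line into a Row so toDF() has named fields to infer the schema from
parsed_lines = lines.rdd.map(lambda row: Row(**json.loads(row[0])))

# Convert the parsed rows into a DataFrame and save it as Parquet
df = parsed_lines.toDF()
df.write.parquet("data.parquet")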
```
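The line-by-line idea can also be checked locally without a Spark session. The following is a minimal sketch in plain Python, assuming each non-empty line holds one complete JSON object. The helper name `parse_json_lines` is hypothetical, and skipping blank lines is an added assumption rather than something the answer above does.

```python
import json


def parse_json_lines(text):
    """Parse newline-delimited JSON: one object per non-empty line."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # Assumption: blank lines are skipped instead of raising an error
            continue
        records.append(json.loads(line))
    return records


# Same three records as in the question
sample = (
    '{"id": 1, "name": "jhon"}\n'
    '{"id": 2, "name": "bryan"}\n'
    '{"id": 3, "name": "jane"}\n'
)
records = parse_json_lines(sample)
print(len(records))
```

Every line is parsed on its own, so all three records come back, which is the behavior the Spark snippet above relies on.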

- Frosty
- December 20, 2022 at 11:08 am
- 0 votes
Only the first line appears because the multiline option is set to true. With that option Spark treats the whole file as a single JSON document, but in your file each line is a separate JSON object. Set multiline to false (which is also the default, so the option can simply be omitted) and it will work as expected.
```
df = spark.read.option("multiline", "false").json("data.json")
df.show()
```
If your file instead contained a JSON array, like
```
[
{"id": 1, "name": "jhon"},
{"id": 2, "name": "bryan"},
{"id": 3, "name": "jane"}
]
```
or
```
[
    {
        "id": 1, 
        "name": "jhon"
    },
    {
        "id": 2, 
        "name": "bryan"
    }
]
```
setting the multiline option to true will work, because the array is a single JSON document that spans several lines.
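Why only one record shows up can be illustrated outside Spark. The sketch below is an analogy in plain Python, not Spark's actual parser: it assumes a whole-document parser that decodes one JSON value from the start of the text and then stops, which is how `json.JSONDecoder.raw_decode` behaves. The variable names are illustrative.

```python
import json

# Line-delimited input, as in the question
json_lines = (
    '{"id": 1, "name": "jhon"}\n'
    '{"id": 2, "name": "bryan"}\n'
    '{"id": 3, "name": "jane"}\n'
)
# The same records wrapped in a single JSON array
json_array = (
    '[{"id": 1, "name": "jhon"}, '
    '{"id": 2, "name": "bryan"}, '
    '{"id": 3, "name": "jane"}]'
)

decoder = json.JSONDecoder()

# Whole-document parse of line-delimited input:
# one value is decoded and the remaining lines are left unread
first_value, end = decoder.raw_decode(json_lines)
unread = json_lines[end:].strip()

# Whole-document parse of an array:
# the single top-level value already contains every record
array_value, _ = decoder.raw_decode(json_array)

print(first_value)
print(len(array_value))
```

In the first case only the first object is returned and the other two lines stay unread, which mirrors the single row seen in the Parquet file. In the second case the array is one document, so a whole-document parse sees all three records.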