
I can’t manage to read a JSON file in Python with PySpark because it has multiple records, with each variable on a different line.

Example:

{
  "id" : "id001",
  "name" : "NAME001",
  "firstname" : "FIRSTNAME001"
}
{
  "id" : "NNI002",
  "name" : "NAME002",
  "firstname" : "FIRSTNAME002"
}
{
  "id" : "NNI003",
  "name" : "NAME003",
  "firstname" : "FIRSTNAME003"
}

I want to load it as follows:

+------------+------+-------+
|   firstname|    id|   name|
+------------+------+-------+
|FIRSTNAME001| id001|NAME001|
|FIRSTNAME002|NNI002|NAME002|
|FIRSTNAME003|NNI003|NAME003|
+------------+------+-------+

I get errors if I try spark.read.json("file.json").

And when I use spark.read.option("multiline","true").json("file.json") I get only the first record:

+------------+-----+-------+
|   firstname|   id|   name|
+------------+-----+-------+
|FIRSTNAME001|id001|NAME001|
+------------+-----+-------+

I can read it with spark.read.json("file.json") if I put each record on its own line:

{    "id" : "id001",    "name" : "NAME001",    "firstname" : "FIRSTNAME001"  }
{    "id" : "NNI002",    "name" : "NAME002",    "firstname" : "FIRSTNAME002"  }
{    "id" : "NNI003",    "name" : "NAME003",    "firstname" : "FIRSTNAME003"  }

But as the file has 10M lines, that’s not really an option.

If anyone has ideas to help me, I would really appreciate it.

Thanks a lot.

2 Answers


  1. I hope this solution helps you.
    
    To read a JSON file in Python with PySpark when it contains multiple records with each variable on a different line, you can use a custom approach to handle the file format. Here is a potential solution:
    
    Read the file using the textFile() method to load it as an RDD (Resilient Distributed Dataset). This will allow you to process each line individually.
    
    lines = spark.sparkContext.textFile("file.json")

    Use the map() function to transform the lines so that, once concatenated, the objects form a valid JSON array: append a comma after each object's closing brace, join everything into one string, strip the trailing comma, and wrap the result in brackets.

    # Add a comma after each closing brace to separate consecutive objects.
    processed_lines = lines.map(lambda line: line + "," if line.strip() == "}" else line)
    # Concatenate all lines into a single string (this runs on the driver).
    joined = processed_lines.reduce(lambda a, b: a + b)
    # Strip the trailing comma and wrap the objects in brackets.
    json_array = "[" + joined.rstrip(",") + "]"

    Convert the JSON array string back to an RDD using parallelize().
    
    json_rdd = spark.sparkContext.parallelize([json_array])

    Use spark.read.json() to read the JSON data as a DataFrame (the json() reader also accepts an RDD of JSON strings).
    
    df = spark.read.json(json_rdd)

    If necessary, select the desired columns from the DataFrame to obtain the desired output.
    
    output_df = df.select("firstname", "id", "name")
    output_df.show()

    This approach processes each line individually and concatenates the records into a single JSON array, which spark.read.json() can parse. By applying these steps, you should be able to read the JSON file with multiple records spread across lines and obtain the expected output.
    
    Note: Keep in mind that processing large JSON files can be memory-intensive. Ensure that your Spark cluster has enough memory to handle the size of the file. Additionally, consider partitioning the data or using other optimization techniques if needed.
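
    If collecting the whole file on the driver is too memory-hungry for 10M lines, one alternative worth sketching (my own addition, not guaranteed for every layout) is to let Hadoop's TextInputFormat split the file on the "}\n{" boundary between consecutive records, so that each object becomes its own RDD element and is repaired and parsed in parallel. This assumes Unix line endings and records separated exactly as in the question.

    # Sketch: use "}\n{" as the record delimiter so each top-level object
    # arrives as one RDD element. Assumes the layout shown in the question.
    delim_conf = {"textinputformat.record.delimiter": "}\n{"}

    raw = spark.sparkContext.newAPIHadoopFile(
        "file.json",
        "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
        conf=delim_conf,
    ).map(lambda kv: kv[1].strip())

    # Splitting consumed the braces at each boundary; put them back.
    records = raw.map(lambda r: r if r.startswith("{") else "{" + r) \
                 .map(lambda r: r if r.endswith("}") else r + "}")

    df = spark.read.json(records)
    df.show()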
    
  2. Maybe because the input is not valid JSON: several top-level objects in a row do not form a single JSON document, which is what multiline mode expects.
    Your input should have looked like this for your code to work:

    [{
      "id" : "id001",
      "name" : "NAME001",
      "firstname" : "FIRSTNAME001"
    },
    {
      "id" : "NNI002",
      "name" : "NAME002",
      "firstname" : "FIRSTNAME002"
    },
    {
      "id" : "NNI003",
      "name" : "NAME003",
      "firstname" : "FIRSTNAME003"
    }]
    

    Now this will work:

    spark.read.option("multiline","true").json("file.json")
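
    If you cannot change how the file is produced, a small one-off rewrite can make it a valid array. This is a sketch of my own, assuming the objects are separated by "}\n{" exactly as in the question; it reads the whole file into memory, so for very large files a streaming rewrite would be safer.

    # Hypothetical fix-up script: insert commas between objects and wrap
    # the whole file in brackets so it becomes one valid JSON array.
    with open("file.json") as src, open("file_fixed.json", "w") as dst:
        dst.write("[" + src.read().replace("}\n{", "},\n{") + "]")

    After that, spark.read.option("multiline", "true").json("file_fixed.json") should load all three records.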
    