I can’t manage to read a JSON file in Python with PySpark because it has multiple records, with each field on a different line.
Example:
{
"id" : "id001",
"name" : "NAME001",
"firstname" : "FIRSTNAME001"
}
{
"id" : "NNI002",
"name" : "NAME002",
"firstname" : "FIRSTNAME002"
}
{
"id" : "NNI003",
"name" : "NAME003",
"firstname" : "FIRSTNAME003"
}
I want to load it like this:
+------------+------+-------+
| firstname| id| name|
+------------+------+-------+
|FIRSTNAME001| id001|NAME001|
|FIRSTNAME002|NNI002|NAME002|
|FIRSTNAME003|NNI003|NAME003|
+------------+------+-------+
I get errors if I try spark.read.json("file.json").
And when I use spark.read.option("multiline","true").json("file.json")
I get only the first record:
+------------+-----+-------+
| firstname| id| name|
+------------+-----+-------+
|FIRSTNAME001|id001|NAME001|
+------------+-----+-------+
I can read it with spark.read.json("file.json")
if I put every record on its own line:
{ "id" : "id001", "name" : "NAME001", "firstname" : "FIRSTNAME001" }
{ "id" : "NNI002", "name" : "NAME002", "firstname" : "FIRSTNAME002" }
{ "id" : "NNI003", "name" : "NAME003", "firstname" : "FIRSTNAME003" }
But since the file has 10M lines, that’s not really an option.
If anyone has ideas to help me, I would really appreciate it.
Thanks a lot.
2 Answers
Maybe it’s because the input is not valid JSON: a file containing several top-level objects back to back is neither JSON Lines nor a single JSON document. By default, spark.read.json expects JSON Lines, i.e. one complete JSON object per line, which is why your one-record-per-line version works. With option("multiline","true"), Spark instead expects each file to contain one valid JSON value, so it parses the first object and stops, which explains why you only get the first record. If the input were a single JSON array of objects, the multiline read would work; otherwise you need to convert the file to JSON Lines first.
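One possible workaround, assuming the file really does look like the example (pretty-printed objects concatenated back to back): convert it to JSON Lines once with the standard `json` module, using `json.JSONDecoder.raw_decode` to pull one object at a time out of the text. This is a one-off preprocessing sketch, not Spark code; the function name `to_json_lines` is just an illustration.

```python
import json

def to_json_lines(text):
    """Split a stream of concatenated JSON objects and re-emit them
    as JSON Lines (one compact object per line), the format that
    spark.read.json expects by default."""
    decoder = json.JSONDecoder()
    text = text.strip()
    pos, lines = 0, []
    while pos < len(text):
        # raw_decode parses one JSON value starting at pos and
        # returns the value plus the index where it ended
        obj, end = decoder.raw_decode(text, pos)
        lines.append(json.dumps(obj))
        # skip any whitespace between consecutive objects
        while end < len(text) and text[end].isspace():
            end += 1
        pos = end
    return "\n".join(lines)

raw = """{
"id" : "id001",
"name" : "NAME001",
"firstname" : "FIRSTNAME001"
}
{
"id" : "NNI002",
"name" : "NAME002",
"firstname" : "FIRSTNAME002"
}"""

print(to_json_lines(raw))
```

After writing the result to a new file (say, fixed.json), a plain spark.read.json("fixed.json") should parse every record. For a file with 10M lines this runs once as a preprocessing step; the records themselves stay untouched, only the line layout changes.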