
I can’t manage to read a JSON file in Python with PySpark because it has multiple records, with each variable on a different line.

Example:

{
  "id" : "id001",
  "name" : "NAME001",
  "firstname" : "FIRSTNAME001"
}
{
  "id" : "NNI002",
  "name" : "NAME002",
  "firstname" : "FIRSTNAME002"
}
{
  "id" : "NNI003",
  "name" : "NAME003",
  "firstname" : "FIRSTNAME003"
}

I want to load it as follows:

+------------+------+-------+
|   firstname|    id|   name|
+------------+------+-------+
|FIRSTNAME001| id001|NAME001|
|FIRSTNAME002|NNI002|NAME002|
|FIRSTNAME003|NNI003|NAME003|
+------------+------+-------+

I get errors if I try spark.read.json("file.json").

And when I use spark.read.option("multiline","true").json("file.json") I get only the first record:

+------------+-----+-------+
|   firstname|   id|   name|
+------------+-----+-------+
|FIRSTNAME001|id001|NAME001|
+------------+-----+-------+

I can read it with spark.read.json("file.json") if I put each record on its own line:

{    "id" : "id001",    "name" : "NAME001",    "firstname" : "FIRSTNAME001"  }
{    "id" : "NNI002",    "name" : "NAME002",    "firstname" : "FIRSTNAME002"  }
{    "id" : "NNI003",    "name" : "NAME003",    "firstname" : "FIRSTNAME003"  }

But as the file has 10M lines, that’s not really an option.

If anyone has ideas to help me, I would really appreciate it.

Thanks a lot.

2 Answers


  1. I hope this solution helps you.
    
    To read a JSON file in Python with PySpark when it contains multiple records with each variable on a different line, you can use a custom approach to handle the file format. Here is a potential solution:
    
    Read the file using the textFile() method to load it as an RDD (Resilient Distributed Dataset). This will allow you to process each line individually.
    
    lines = spark.sparkContext.textFile("file.json")

    Use the map() function to transform the lines so that, once concatenated, the objects form a valid JSON array: append a comma after each object's closing brace, join everything into one string, strip the trailing comma, and wrap the result in brackets.

    # Add a comma after each closing brace to separate consecutive objects.
    processed_lines = lines.map(lambda line: line + "," if line.strip() == "}" else line)
    # Concatenate all lines into a single string (this runs on the driver).
    joined = processed_lines.reduce(lambda a, b: a + b)
    # Strip the trailing comma and wrap the objects in brackets.
    json_array = "[" + joined.rstrip(",") + "]"

    Convert the JSON array string back to an RDD using parallelize().
    
    json_rdd = spark.sparkContext.parallelize([json_array])

    Use spark.read.json() to read the JSON data as a DataFrame (the json() reader also accepts an RDD of JSON strings).
    
    df = spark.read.json(json_rdd)

    If necessary, select the desired columns from the DataFrame to obtain the desired output.
    
    output_df = df.select("firstname", "id", "name")
    output_df.show()

    This approach processes each line individually and concatenates the records into a single JSON array, which spark.read.json() can parse. By applying these steps, you should be able to read the JSON file with multiple records spread across lines and obtain the expected output.
    
    Note: Keep in mind that processing large JSON files can be memory-intensive. Ensure that your Spark cluster has enough memory to handle the size of the file. Additionally, consider partitioning the data or using other optimization techniques if needed.
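
    If collecting the whole file on the driver is too memory-hungry for 10M lines, one alternative worth sketching (my own addition, not guaranteed for every layout) is to let Hadoop's TextInputFormat split the file on the "}\n{" boundary between consecutive records, so that each object becomes its own RDD element and is repaired and parsed in parallel. This assumes Unix line endings and records separated exactly as in the question.

    # Sketch: use "}\n{" as the record delimiter so each top-level object
    # arrives as one RDD element. Assumes the layout shown in the question.
    delim_conf = {"textinputformat.record.delimiter": "}\n{"}

    raw = spark.sparkContext.newAPIHadoopFile(
        "file.json",
        "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
        conf=delim_conf,
    ).map(lambda kv: kv[1].strip())

    # Splitting consumed the braces at each boundary; put them back.
    records = raw.map(lambda r: r if r.startswith("{") else "{" + r) \
                 .map(lambda r: r if r.endswith("}") else r + "}")

    df = spark.read.json(records)
    df.show()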
    
  2. Maybe because the input is not valid JSON: several top-level objects in a row do not form a single JSON document, which is what multiline mode expects.
    Your input should have looked like this for your code to work:

    [{
      "id" : "id001",
      "name" : "NAME001",
      "firstname" : "FIRSTNAME001"
    },
    {
      "id" : "NNI002",
      "name" : "NAME002",
      "firstname" : "FIRSTNAME002"
    },
    {
      "id" : "NNI003",
      "name" : "NAME003",
      "firstname" : "FIRSTNAME003"
    }]
    

    Now this will work:

    spark.read.option("multiline","true").json("file.json")
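
    If you cannot change how the file is produced, a small one-off rewrite can make it a valid array. This is a sketch of my own, assuming the objects are separated by "}\n{" exactly as in the question; it reads the whole file into memory, so for very large files a streaming rewrite would be safer.

    # Hypothetical fix-up script: insert commas between objects and wrap
    # the whole file in brackets so it becomes one valid JSON array.
    with open("file.json") as src, open("file_fixed.json", "w") as dst:
        dst.write("[" + src.read().replace("}\n{", "},\n{") + "]")

    After that, spark.read.option("multiline", "true").json("file_fixed.json") should load all three records.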
    