
I want to read a JSON file in PySpark, but the file is in this format (newline-delimited, without commas or square brackets):

{"id": 1, "name": "jhon"}
{"id": 2, "name": "bryan"}
{"id": 3, "name": "jane"}

Is there an easy way to read this JSON in PySpark?

I have already tried this code:

df = spark.read.option("multiline", "true").json("data.json")
df.write.parquet("data.parquet")

But it doesn't work: only the first line appears in the Parquet file.

I just want to read this JSON file and save it as Parquet.

2 Answers


  1. Try reading it as a text file first, then parse each line into a JSON object:

    import json
    
    # Read the file as plain text: one DataFrame row per line
    lines = spark.read.text("data.json")
    
    # Parse each line into a Python dict
    parsed_lines = lines.rdd.map(lambda row: json.loads(row[0]))
    
    # Convert the parsed records to a DataFrame and write Parquet
    df = parsed_lines.toDF()
    df.write.parquet("data.parquet")
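    Why parsing line by line works: each line of the file is a complete JSON document on its own (the JSON Lines format), so Python's standard `json` module can decode every line independently. A minimal, Spark-free sketch of the idea, using the sample records from the question:

    ```python
    import json

    # Each line is an independent JSON document (JSON Lines / NDJSON),
    # so it can be decoded one line at a time.
    lines = [
        '{"id": 1, "name": "jhon"}',
        '{"id": 2, "name": "bryan"}',
        '{"id": 3, "name": "jane"}',
    ]
    records = [json.loads(line) for line in lines]
    print(records[0]["name"])  # jhon
    ```
    
    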
    
  2. Only the first line appears when reading your file because the multiline option is set to true, but in your case each line is a separate JSON object. If you set multiline to false, it will work as expected:

    df = spark.read.option("multiline", "false").json("data.json")
    df.show()
    df.write.parquet("data.parquet")
    

    If your JSON file had instead contained a JSON array, like

    [
    {"id": 1, "name": "jhon"},
    {"id": 2, "name": "bryan"},
    {"id": 3, "name": "jane"}
    ]
    

    or

    [
        {
            "id": 1, 
            "name": "jhon"
        },
        {
            "id": 2, 
            "name": "bryan"
        }
    ]
    

    then setting the multiline option to true would work.
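    The distinction can be checked without Spark: a JSON array is one document that a standard parser reads whole (which is what multiline mode expects), while the newline-delimited format from the question is not a single valid JSON document. A small sketch:

    ```python
    import json

    # A JSON array is a single document: one parse returns all records.
    array_text = '[{"id": 1, "name": "jhon"}, {"id": 2, "name": "bryan"}]'
    print(len(json.loads(array_text)))  # 2

    # Newline-delimited objects are NOT one valid document, which is
    # why multiline mode only picks up the first record.
    jsonl_text = '{"id": 1, "name": "jhon"}\n{"id": 2, "name": "bryan"}'
    try:
        json.loads(jsonl_text)
    except json.JSONDecodeError:
        print("not a single JSON document")
    ```
    
    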
