
I have a JSON file of the form:

{
  "42": {"name": "MeowBark", "id": 42, "category": "pet store"},
  "67": {"name": "Chef's Kiss", "id": 67, "category": "restaurant"}
}

I have to parse this in PySpark and am using the following code:

stores = spark.read.json(stores, multiLine=True).cache()

This is not returning the desired dataframe, and is instead returning:

| key1                       | key2                           |
|----------------------------|--------------------------------|
|{MeowBark, true, pet store} |{Chef's Kiss, false, restaurant}|

I have tried using pd.read_json and it parses the file correctly once I transpose the resulting dataframe, but I can't use pd.read_json here and need to stick to Spark's own transformations.
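
For context, the pandas version that does parse it looks roughly like this (a minimal sketch; "stores.json" stands in for my actual file path):

import pandas as pd

# The outer keys ("42", "67") become columns, so transposing gives one row per store.
pdf = pd.read_json("stores.json").T
print(pdf)  # columns: name, id, category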

I tried defining a StructType, but the challenge is that the top-level key isn't consistent: it's effectively a row number, so the "column" names change with the data.

Does anyone have an idea of what I’m doing wrong and what I should be doing differently? I’m totally at a loss here. Any help would be appreciated!

2 Answers


  1. You can also try the code below:

    import json
    
    from pyspark.sql import functions as F
    
    # Load the file with the standard library and collect the inner
    # objects into a list under a single "results" key.
    with open("test_2.json") as file:
        data = json.load(file)
    
    simple_json = {"results": list(data.values())}
    
    # Parallelize the JSON as a string so spark.read.json can parse it.
    rddjson = sc.parallelize([json.dumps(simple_json)])
    df = spark.read.json(rddjson)
    df.show()
    
    # Explode the "results" array to get one row per store, then flatten the struct.
    df.select(F.explode(df.results).alias('results')).select('results.*').show(truncate=False)
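
    A related sketch, in case you want to skip the intermediate "results" wrapper: serialize each inner record to its own JSON string and let Spark infer the schema directly (this assumes the same `data` dict loaded above; the variable name is just for illustration).

    store_lines = sc.parallelize([json.dumps(v) for v in data.values()])
    spark.read.json(store_lines).show(truncate=False)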
    
  2. I tried using a parallelized collection. See if this code works for you.

    from pyspark.sql.functions import from_json, explode, col
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType, MapType
    
    # Define the schema for each store record
    schema = StructType([
        StructField("name", StringType(), True),
        StructField("id", IntegerType(), True),
        StructField("category", StringType(), True)
    ])
    
    # The whole document is one JSON object mapping row numbers to store records
    json_string = '{"42": {"name": "MeowBark", "id": 42, "category": "pet store"}, "67": {"name": "Chef\'s Kiss", "id": 67, "category": "restaurant"}}'
    
    # Put the raw JSON string into a one-row DataFrame with a single "value" column
    raw = spark.createDataFrame(sc.parallelize([(json_string,)]), ["value"])
    
    # Parse the string as a map of row number -> struct, explode the map into
    # one row per entry, then flatten each struct into columns
    stores_df = (raw
        .select(from_json(col("value"), MapType(StringType(), schema)).alias("data"))
        .select(explode(col("data")))
        .select("value.*"))
    
    # Show the resulting DataFrame
    stores_df.show()
    
    
    +-----------+---+----------+
    |       name| id|  category|
    +-----------+---+----------+
    |   MeowBark| 42| pet store|
    |Chef's Kiss| 67|restaurant|
    +-----------+---+----------+
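
    If the JSON lives in a file rather than an in-line string, the same pattern should carry over by reading the whole file as a single text record first (a sketch, assuming a Spark version that supports the text source's wholetext option, and using "stores.json" as a stand-in for the real path):

    # Read the entire file into one row; the text source exposes it as column "value"
    raw = spark.read.option("wholetext", True).text("stores.json")
    stores_df = (raw
        .select(from_json(col("value"), MapType(StringType(), schema)).alias("data"))
        .select(explode(col("data")))
        .select("value.*"))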
    