Question posted in Json
Our archive of expertly curated questions and answers provides insights and solutions to common problems related to this popular data interchange format. From parsing and manipulating JSON data to integrating it with various programming languages and web services, our archive has got you covered. Start exploring today and take your JSON skills to the next level

Parsing Json with Pyspark Transformations

jmoore00
April 29, 2023
146 views
0 votes
2 Answers

I have a JSON file of the form:

{
42 : {"name": "MeowBark", "id": 42, "category": "pet store"},
67 : {"name": "Chef's Kiss", "id": 67, "category": "restaurant"},
}

I have to parse this in Pyspark and am using the following code:

stores = spark.read.json(stores, multiLine = True).cache()

This is not returning the desired dataframe, and is instead returning:

| key1                       | key2                           |
|----------------------------|--------------------------------|
|{MeowBark, true, pet store} |{Chef's Kiss, false, restuarant}|

I have tried using pd.read_json and parses it correctly once I transpose the dataframe, but I can’t use pd.read_json and need to only use spark’s transformations.

I tried defining the StructType but the challenge in this case is that ‘key1’ isn’t consistent because it refers to the row number.

Does anyone have an idea of what I’m doing wrong and what I should be doing differently? I’m totally at a loss here. Any help would be appreciated!

Answers

You can also try below code –

import json

simple_json = {}

with open("test_2.json") as file:
    data = json.load(file)

lst = []
for k, v in data.items():
    lst.append(v)
simple_json["results"] = lst

rddjson = sc.parallelize([simple_json])
df = sqlContext.read.json(rddjson, multiLine=True)
df.show()

from pyspark.sql import functions as F
df.select(F.explode(df.results).alias('results')).select('results.*').show(truncate=False)

I tried using a parallelize collection. See if this code works for you.

from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Define the schema for the JSON data
schema = StructType([
    StructField("name", StringType(), True),
    StructField("id", IntegerType(), True),
    StructField("category", StringType(), True)
])

# Parse the JSON data
json_string = '{ "42": { "name": "MeowBark", "id": 42, "category": "pet store" }, "67": { "name": "Chef's Kiss", "id": 67, "category": "restaurant" } }'
stores = spark.read.json(sc.parallelize([json_string]), multiLine=True)

# Convert the parsed data to a DataFrame with the desired schema
stores_df = stores.select(from_json(col("value"), schema).alias("data")).select("data.*")

# Show the resulting DataFrame
stores_df.show()


+---------+---+-----------+
|     name| id|   category|
+---------+---+-----------+
|  MeowBark| 42|   pet store|
|Chef's Kiss| 67|restaurant|
+---------+---+-----------+

Please signup or login to give your own answer.

Click here to cancel reply.