
I have two JSON files. One looks like:

{
  "a":{
    "a1":"xxx"
  },
  "b":"xxx"
}

The other looks like:

{
  "a":{
    "a1":"xxx",
    "a2":"xxx"
  },
  "b":"xxx"
}

I want to read these two JSON files into one DataFrame in Spark. I tried union and unionByName, but they didn’t work. How can I achieve this?

2 Answers


  1. If you’ve got a couple of JSON files with different columns and wanna smoosh them into one DataFrame in Spark, you’re in luck. When Spark infers the schema for JSON, it takes the union of the fields across every file it reads, so the merge happens on the fly (the mergeSchema option you may have seen is for Parquet/ORC sources; the JSON reader does this by default). One thing your sample files do need is the multiLine option, since each one is a single pretty-printed object spanning several lines. Here’s a quick way to do it in PySpark:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("Cool JSON Merge").getOrCreate()

    # multiLine is needed because each file holds one pretty-printed
    # JSON object spanning several lines
    df = spark.read.option("multiLine", True).json("path/to/your/json/files/*")

    df.show()
    

    Just swap in your actual file path. This tells Spark to combine those columns together, even if some JSONs have extra fields; rows from files that lack a field simply get null there.

    A heads-up: schema inference means Spark reads through the files an extra time to figure out the combined schema, which can slow things down on large datasets. But hey, if you need everything in one place, it’s totally worth it.

    Hope that helps you out!

  2. Spark can take care of merging the schema. See the following code:

    >>> spark.read.option("multiLine", True).json("test-jsons/*").printSchema()
    root
     |-- a: struct (nullable = true)
     |    |-- a1: string (nullable = true)
     |    |-- a2: string (nullable = true)
     |-- b: string (nullable = true)
    
    >>> spark.read.option("multiLine", True).json("test-jsons/*").show()
    +-----------+---+
    |          a|  b|
    +-----------+---+
    | {xxx, xxx}|xxx|
    |{xxx, NULL}|xxx|
    +-----------+---+
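    Under the hood, the inference step is essentially taking the union of the fields it sees across files. A tiny plain-Python sketch of that idea, using the two sample documents from the question (the merge_keys helper is just illustrative, not a Spark API):

```python
import json

doc1 = json.loads('{"a": {"a1": "xxx"}, "b": "xxx"}')
doc2 = json.loads('{"a": {"a1": "xxx", "a2": "xxx"}, "b": "xxx"}')

def merge_keys(d1, d2):
    """Recursively union the key sets of two JSON objects,
    roughly mirroring how Spark widens the inferred schema."""
    merged = {}
    for key in sorted(d1.keys() | d2.keys()):
        v1, v2 = d1.get(key), d2.get(key)
        if isinstance(v1, dict) or isinstance(v2, dict):
            # Nested objects merge field-by-field, like the `a` struct above.
            merged[key] = merge_keys(v1 or {}, v2 or {})
        else:
            merged[key] = "string"
    return merged

print(merge_keys(doc1, doc2))
# {'a': {'a1': 'string', 'a2': 'string'}, 'b': 'string'}
```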
    