skip to Main Content

Suppose I have two columns (StringType):

JSON 1 JSON 2
{"key1":"value1"} {"key2":"value2"}
{"key3":"value3"} {"key4":"value4"}

I need to concat these columns to have a LIST of JSON in the third column (also string), something like this:

JSON 1 JSON 2 JSON 3
{"key1":"value1"} {"key2":"value2"} [{"key1":"value1"},{"key2":"value2"}]
{"key3":"value3"} {"key4":"value4"} [{"key3":"value3"},{"key4":"value4"}]

Actually I solved the problem, but I don’t think it’s the best approach:

from pyspark.sql import DataFrame
from pyspark.sql.functions import col, concat, concat_ws, lit

def concat_jsons(df: DataFrame, columns: list):
    df = df.withColumn(
        'JSON 3', concat_ws(',', *columns)
    )

    return df.withColumn(
        'JSON 3',
        concat(lit('['), col('JSON 3'), lit(']'))
    )

Anyone has a better idea?

2

Answers


  1. Your solution looks fine, though I would combine the two statements in your function:

    def concat_jsons(df: DataFrame, columns: list):
        return df.withColumn(
            'JSON 3', concat(lit('['), concat_ws(',', *columns), lit(']'))
        )
    
    Login or Signup to reply.
  2. data =[["""{"key1":"value1"}""", """{"key2":"value2"}"""], ["""{"key3":"value3"}""","""{"key4":"value4"}"""] ]
    
    df = spark.createDataFrame(data).toDF('JSON 1','JSON 2')
    
    df.show()
    +-----------------+-----------------+
    |           JSON 1|           JSON 2|
    +-----------------------------------+
    |{"key1":"value1"}|{"key2":"value2"}|
    |{"key3":"value3"}|{"key4":"value4"}|
    +-----------------+-----------------+
    
    def concat_jsons(df: DataFrame, columns: list):
    ...   return df.withColumn('JSON 3',array(*columns))
    
    concat_jsons(df, ['JSON 1','JSON 2']).show()
    
    +-----------------+-----------------+--------------------+
    |           JSON 1|           JSON 2|              JSON 3|
    +-----------------+-----------------+--------------------+
    |{"key1":"value1"}|{"key2":"value2"}|[{"key1":"value1"...|
    |{"key3":"value3"}|{"key4":"value4"}|[{"key3":"value3"...|
    +-----------------+-----------------+--------------------+
    

    if you want "JSON 3" to be for StringType, cast it to string

    def concat_jsons(df: DataFrame, columns: list):
    ...   return df.withColumn('JSON 3',array(*columns).cast('string'))
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search