Concat two JSON string columns with PySpark

joaopedroimamura
September 26, 2023
227 views
0 votes
2 Answers

Suppose I have two columns (StringType):

JSON 1	JSON 2
{"key1":"value1"}	{"key2":"value2"}
{"key3":"value3"}	{"key4":"value4"}

I need to concat these columns to have a LIST of JSON in the third column (also string), something like this:

JSON 1	JSON 2	JSON 3
{"key1":"value1"}	{"key2":"value2"}	[{"key1":"value1"},{"key2":"value2"}]
{"key3":"value3"}	{"key4":"value4"}	[{"key3":"value3"},{"key4":"value4"}]

Actually I solved the problem, but I don’t think it’s the best approach:

from pyspark.sql import DataFrame
from pyspark.sql.functions import col, concat, concat_ws, lit

def concat_jsons(df: DataFrame, columns: list):
    df = df.withColumn(
        'JSON 3', concat_ws(',', *columns)
    )

    return df.withColumn(
        'JSON 3',
        concat(lit('['), col('JSON 3'), lit(']'))
    )

Anyone has a better idea?

Answers

- DerekO
- September 25, 2023 at 7:56 pm
- 0 votes
0
Your solution looks fine, though I would combine the two statements in your function:
```
def concat_jsons(df: DataFrame, columns: list):
    return df.withColumn(
        'JSON 3', concat(lit('['), concat_ws(',', *columns), lit(']'))
    )
```
Login or Signup to reply.

data =[["""{"key1":"value1"}""", """{"key2":"value2"}"""], ["""{"key3":"value3"}""","""{"key4":"value4"}"""] ]

df = spark.createDataFrame(data).toDF('JSON 1','JSON 2')

df.show()
+-----------------+-----------------+
|           JSON 1|           JSON 2|
+-----------------------------------+
|{"key1":"value1"}|{"key2":"value2"}|
|{"key3":"value3"}|{"key4":"value4"}|
+-----------------+-----------------+

def concat_jsons(df: DataFrame, columns: list):
...   return df.withColumn('JSON 3',array(*columns))

concat_jsons(df, ['JSON 1','JSON 2']).show()

+-----------------+-----------------+--------------------+
|           JSON 1|           JSON 2|              JSON 3|
+-----------------+-----------------+--------------------+
|{"key1":"value1"}|{"key2":"value2"}|[{"key1":"value1"...|
|{"key3":"value3"}|{"key4":"value4"}|[{"key3":"value3"...|
+-----------------+-----------------+--------------------+

if you want "JSON 3" to be for StringType, cast it to string

def concat_jsons(df: DataFrame, columns: list):
...   return df.withColumn('JSON 3',array(*columns).cast('string'))

Please signup or login to give your own answer.

Click here to cancel reply.