Azure - how to convert an string representation of an array into an actual array type in pyspark

younus
March 4, 2024
132 views
3 votes
3 Answers

I have a column with data coming in as an string representation of an array

I tried to type cast it to an array type but the data is getting modified.

I tried to use regex as well to remove the extra brackets but its not working.

attaching the code below
This code is to convert the string representation of array to actual array

df = df.withColumn("columns", split(df["columns"], ", "))

This is the regex code i tried

df = df.withColumn(
'columns',
expr("transform(split(columns, ','), x -> trim('"[]', x))")

)

would really appreciate any help

Answers

Following is one possible way to do it.

import sys
from pyspark import SparkContext
from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark import SQLContext

sc = SparkContext('local')
sqlContext = SQLContext(sc)

data = [
              (1, '["First", "Second", "Third"]'),
              (2, '["First"]'),
              (3, '["Second", "Third"]'),
              (4, '["First", "Fourth"]')
            ]

df1 = sqlContext.createDataFrame(data, ['id', 'val'])

df1.show(n=100, truncate=False)
print("Collect columns into list")

my_schema = ArrayType(StringType())


intermediate_df = df1.withColumn("array_string", F.from_json("val", schema=my_schema))

print("intermediate_df dataframe")
intermediate_df.show(n=20, truncate=False)

Output :

+---+----------------------------+
|id |val                         |
+---+----------------------------+
|1  |["First", "Second", "Third"]|
|2  |["First"]                   |
|3  |["Second", "Third"]         |
|4  |["First", "Fourth"]         |
+---+----------------------------+

intermediate_df dataframe
+---+----------------------------+----------------------+
|id |val                         |array_string          |
+---+----------------------------+----------------------+
|1  |["First", "Second", "Third"]|[First, Second, Third]|
|2  |["First"]                   |[First]               |
|3  |["Second", "Third"]         |[Second, Third]       |
|4  |["First", "Fourth"]         |[First, Fourth]       |
+---+----------------------------+----------------------+

- spolam
- March 4, 2024 at 1:02 pm
- 0 votes
0
Simple, You can use from_json with array<string> schema
```
df.selectExpr("*", "from_json(columns, 'array<string>') AS columns")
```
Login or Signup to reply.

- DileeprajnarayanThumula
- March 4, 2024 at 1:03 pm
- 0 votes
0
I have tried the following approach:
```
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import ArrayType, StringType
data = [("1", "["a", "b", "c"]"), ("2", "["d", "e", "f"]")]
df = spark.createDataFrame(data, ["id", "columns"])
array_schema = ArrayType(StringType())
df = df.withColumn("columns", from_json(col("columns"), array_schema))
df.show()
```
In the above code, I defined the schema for the array using ArrayType(StringType()). Next, I specified that the array contains strings.

Using the withColumn method, combined with from_json, I transformed the "columns" column into an array of strings based on the specified schema.

Results:
```
+---+---------+
| id|  columns|
+---+---------+
|  1|[a, b, c]|
|  2|[d, e, f]|
+---+---------+
```
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.

Azure – how to convert an string representation of an array into an actual array type in pyspark

Answers