
I have a column with data coming in as a string representation of an array, e.g. '["First", "Second", "Third"]'.


I tried to cast it to an array type, but the data is getting modified.


I tried to use regex as well to remove the extra brackets, but it's not working.

Attaching the code below. This code is meant to convert the string representation of an array into an actual array:

from pyspark.sql.functions import split

df = df.withColumn("columns", split(df["columns"], ", "))

This is the regex code I tried:

df = df.withColumn(
    'columns',
    expr("""transform(split(columns, ','), x -> trim(BOTH '"[]' FROM x))""")
)
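
Even with the quoting fixed, split(columns, ',') leaves a leading space on every element after the first, so the trim set would also need a space character. The closest I could get with this approach is the sketch below (it needs Spark 2.4+ for transform), but it still feels brittle:

from pyspark.sql.functions import expr

# Same idea, with a space added to the trim set so the leading blanks
# left by split(columns, ',') are stripped as well.
df = df.withColumn(
    'columns',
    expr("""transform(split(columns, ','), x -> trim(BOTH ' "[]' FROM x))""")
)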

Would really appreciate any help.

3 Answers


  1. Following is one possible way to do it.

    from pyspark import SparkContext
    from pyspark.sql.types import *
    from pyspark.sql import functions as F
    from pyspark.sql import SQLContext

    # Local Spark context and SQLContext (on Spark 2+, a SparkSession works too)
    sc = SparkContext('local')
    sqlContext = SQLContext(sc)
    
    data = [
        (1, '["First", "Second", "Third"]'),
        (2, '["First"]'),
        (3, '["Second", "Third"]'),
        (4, '["First", "Fourth"]'),
    ]
    
    df1 = sqlContext.createDataFrame(data, ['id', 'val'])
    
    df1.show(n=100, truncate=False)
    print("Parse the JSON array strings into real arrays")

    # Each value is a JSON array of strings
    my_schema = ArrayType(StringType())
    
    
    intermediate_df = df1.withColumn("array_string", F.from_json("val", schema=my_schema))
    
    print("intermediate_df dataframe")
    intermediate_df.show(n=20, truncate=False)
    

    Output:

    +---+----------------------------+
    |id |val                         |
    +---+----------------------------+
    |1  |["First", "Second", "Third"]|
    |2  |["First"]                   |
    |3  |["Second", "Third"]         |
    |4  |["First", "Fourth"]         |
    +---+----------------------------+
    
    intermediate_df dataframe
    +---+----------------------------+----------------------+
    |id |val                         |array_string          |
    +---+----------------------------+----------------------+
    |1  |["First", "Second", "Third"]|[First, Second, Third]|
    |2  |["First"]                   |[First]               |
    |3  |["Second", "Third"]         |[Second, Third]       |
    |4  |["First", "Fourth"]         |[First, Fourth]       |
    +---+----------------------------+----------------------+
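
    One caveat with from_json: when a row's string is not valid JSON, the result is NULL rather than an error. A minimal sketch for surfacing such rows, reusing the names above:

    # from_json yields NULL for strings it cannot parse, so malformed
    # rows can be isolated with a simple null filter.
    bad_rows = intermediate_df.filter(F.col("array_string").isNull())
    bad_rows.show(truncate=False)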
    
  2. Simple: you can use from_json with an array<string> schema:

    df = df.withColumn("columns", expr("from_json(columns, 'array<string>')"))
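
    A self-contained sketch of this approach (the sample data and the local session below are illustrative assumptions):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import expr

    spark = SparkSession.builder.master("local").getOrCreate()

    # Toy data mirroring the question's string-encoded arrays.
    df = spark.createDataFrame([(1, '["First", "Second"]')], ["id", "columns"])
    # Parse the JSON string directly into an array<string> column.
    df = df.withColumn("columns", expr("from_json(columns, 'array<string>')"))
    df.show(truncate=False)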
    
  3. I have tried the following approach:

    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import ArrayType, StringType
    data = [("1", '["a", "b", "c"]'), ("2", '["d", "e", "f"]')]
    df = spark.createDataFrame(data, ["id", "columns"])
    array_schema = ArrayType(StringType())
    df = df.withColumn("columns", from_json(col("columns"), array_schema))
    df.show()
    

    In the above code, I defined the schema for the array as ArrayType(StringType()), which specifies that the array contains strings.

    Using the withColumn method, combined with from_json, I transformed the "columns" column into an array of strings based on the specified schema.

    Results:

    +---+---------+
    | id|  columns|
    +---+---------+
    |  1|[a, b, c]|
    |  2|[d, e, f]|
    +---+---------+
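
    A quick way to confirm the conversion, sketched below, is to inspect the schema and index into the array:

    # "columns" should now be a real array<string> rather than a plain string.
    df.printSchema()                       # columns: array (element: string)
    df.select(col("columns")[0]).show()    # first element of each array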
    