
What is the best way to explode a comma separated column based on specific conditions?
I have some data in the following format:

ID col1
1 a100,a101,b100,c100
2 a105,b100
3 b101, c104

what I want to achieve is to:

  1. Grab all the elements in the col1 column that start with a and explode them.
  2. If a record has no a### elements in col1, I would still like to include it in the end result with a default value.
  3. All the a### elements are at the beginning of the list, which means that once I read an element that starts with anything other than a, I don't have to parse the rest of the list.

Basically, the desired outcome should look like:

ID element
1 a100
1 a101
2 a105
3 default
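
For reference, the sample data above can be set up as a PySpark DataFrame roughly like this (a minimal sketch; spark is assumed to be an existing SparkSession, and df_1 is the name the answers below use):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Sample data as a PySpark DataFrame (named df_1, as in the answers below)
    df_1 = spark.createDataFrame(
        [(1, "a100,a101,b100,c100"), (2, "a105,b100"), (3, "b101,c104")],
        ["ID", "col1"],
    )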

2 Answers


  1. Your DataFrame (df_1):

    +---+-------------------+
    |ID |col1               |
    +---+-------------------+
    |1  |a100,a101,b100,c100|
    |2  |a105,b100          |
    |3  |b101,c104          |
    +---+-------------------+
    
    1. Use RDD functions to iterate through the col1 elements and filter accordingly
    rdd = df_1.rdd.map(lambda row: (row[0], [value for value in row[1].split(",") if value.startswith("a")]))
    
    2. Convert the RDD back to a DataFrame
    df_2 = rdd.toDF(df_1.columns)
    
    3. Use explode_outer(), which returns a new row for each element in the given array and returns null for empty or null arrays
    from pyspark.sql.functions import explode_outer

    df_3 = df_2.withColumn("col1", explode_outer("col1"))
    
    4. Replace the null values in the col1 column with the string 'default'
    df_3.fillna({"col1":"default"}).show()
    

    Output

    +---+-------+
    | ID|   col1|
    +---+-------+
    |  1|   a100|
    |  1|   a101|
    |  2|   a105|
    |  3|default|
    +---+-------+
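
    Since the question notes that all the a### elements sit at the front of the list, the map step could also stop at the first value that does not start with "a" instead of scanning the whole list, for example with itertools.takewhile (an untested variation of the same idea, not required for correctness):

    from itertools import takewhile

    # Stop collecting values at the first one that does not start with "a"
    rdd = df_1.rdd.map(
        lambda row: (row[0], list(takewhile(lambda v: v.startswith("a"), row[1].split(","))))
    )

    The remaining steps (toDF, explode_outer, fillna) stay the same.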
    
  2. Your DataFrame (df_1): (warning: I didn't run this code, but I think you get the idea of what I'm suggesting from the code itself)

    I would explode and filter:

    from pyspark.sql import functions as F

    df_result = (
        df_1.select(
            "ID",
            F.explode(F.split(F.col("col1"), ",")).alias("col1_exploded"),
        )
        .filter(F.col("col1_exploded").startswith("a"))
        # right join so IDs with no a### element are kept and fall back to "default"
        .join(df_1, ["ID"], "right")
        .select(F.col("ID"), F.coalesce(F.col("col1_exploded"), F.lit("default")).alias("col1"))
    )
    

    I suggest using this over an RDD map function, as it can be optimized by Spark's optimizer, unlike a 'udf', which is effectively what calling 'map' on an RDD creates. (The map function has to be shipped to the Python runtime for execution and the results passed back to the Spark runtime, whereas the version above, without map, is computed entirely in the Spark runtime without that back-and-forth to the Python runtime.)
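
    For illustration, here is a rough, untested sketch of the same filter written as an explicit Python UDF, just to show the round-trip to the Python runtime (you don't need this if you use the built-in functions above):

    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    # Every row is serialized to a Python worker and back, unlike the built-in functions
    keep_a = F.udf(
        lambda s: [v for v in s.split(",") if v.startswith("a")],
        ArrayType(StringType()),
    )

    df_udf = (
        df_1.withColumn("col1", F.explode_outer(keep_a("col1")))
        .fillna({"col1": "default"})
    )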

    However, even though it runs faster, if you feel the 'map' version is more readable and maintainable and you don't need the speed, then by all means use the approach suggested by @arudsekaberne. Just be aware of your choice, and know that there is another approach should you need more speed.
