
What is the best way to explode a comma separated column based on specific conditions?
I have some data in the following format:

ID col1
1 a100,a101,b100,c100
2 a105,b100
3 b101, c104

what I want to achieve is to:

  1. Grab all the elements in the col1 column that start with a and explode them.
  2. If a record has no a### elements in col1, I would still like to include it in the end result with a default value.
  3. All the a### elements are at the beginning of the list, which means that once I read an element that starts with anything other than a, I don't have to parse the rest of the list.

Basically, the desired outcome should look like:

ID element
1 a100
1 a101
2 a105
3 default
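
For reference, the sample data above can be set up as a PySpark DataFrame roughly like this (a minimal sketch; spark is assumed to be an existing SparkSession, and df_1 is the name the answers below use):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Sample data as a PySpark DataFrame (named df_1, as in the answers below)
    df_1 = spark.createDataFrame(
        [(1, "a100,a101,b100,c100"), (2, "a105,b100"), (3, "b101,c104")],
        ["ID", "col1"],
    )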

2 Answers


  1. Your DataFrame (df_1):

    +---+-------------------+
    |ID |col1               |
    +---+-------------------+
    |1  |a100,a101,b100,c100|
    |2  |a105,b100          |
    |3  |b101,c104          |
    +---+-------------------+
    
    1. Use RDD functions to iterate through the col1 elements and filter accordingly
    rdd = df_1.rdd.map(lambda row: (row[0], [value for value in row[1].split(",") if value.startswith("a")]))
    
    2. Convert the RDD back to a DataFrame
    df_2 = rdd.toDF(df_1.columns)
    
    3. Use explode_outer(), which returns a new row for each element in the given array and returns null for empty or null arrays
    from pyspark.sql.functions import explode_outer

    df_3 = df_2.withColumn("col1", explode_outer("col1"))
    
    4. Replace the null values in the col1 column with the string 'default'
    df_3.fillna({"col1":"default"}).show()
    

    Output

    +---+-------+
    | ID|   col1|
    +---+-------+
    |  1|   a100|
    |  1|   a101|
    |  2|   a105|
    |  3|default|
    +---+-------+
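
    Since the question notes that all the a### elements sit at the front of the list, the map step could also stop at the first value that does not start with "a" instead of scanning the whole list, for example with itertools.takewhile (an untested variation of the same idea, not required for correctness):

    from itertools import takewhile

    # Stop collecting values at the first one that does not start with "a"
    rdd = df_1.rdd.map(
        lambda row: (row[0], list(takewhile(lambda v: v.startswith("a"), row[1].split(","))))
    )

    The remaining steps (toDF, explode_outer, fillna) stay the same.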
    
  2. Your DataFrame (df_1): (warning: I didn't run this code, but I think you get the idea of what I'm suggesting from the code itself)

    I would explode and filter:

    from pyspark.sql import functions as F

    df_result = (
        df_1.select(
            "ID",
            F.explode(F.split(F.col("col1"), ",")).alias("col1_exploded"),
        )
        .filter(F.col("col1_exploded").startswith("a"))
        # right join so IDs with no a### element are kept and fall back to "default"
        .join(df_1, ["ID"], "right")
        .select(F.col("ID"), F.coalesce(F.col("col1_exploded"), F.lit("default")).alias("col1"))
    )
    

    I suggest using this over an RDD map function, as it can be optimized by Spark's optimizer, unlike a 'udf', which is effectively what calling 'map' on an RDD creates. (The map function has to be shipped to the Python runtime for execution and the results passed back to the Spark runtime, whereas the version above, without map, is computed entirely in the Spark runtime without that back-and-forth to the Python runtime.)
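
    For illustration, here is a rough, untested sketch of the same filter written as an explicit Python UDF, just to show the round-trip to the Python runtime (you don't need this if you use the built-in functions above):

    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    # Every row is serialized to a Python worker and back, unlike the built-in functions
    keep_a = F.udf(
        lambda s: [v for v in s.split(",") if v.startswith("a")],
        ArrayType(StringType()),
    )

    df_udf = (
        df_1.withColumn("col1", F.explode_outer(keep_a("col1")))
        .fillna({"col1": "default"})
    )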

    However, even though it runs faster, if you feel the 'map' version is more readable and maintainable and you don't need the speed, then by all means use the approach suggested by @arudsekaberne. Just be aware of your choice, and know that there is another approach should you need more speed.
