skip to Main Content

I have a DataFrame with a single column which is a struct type and contains an array.

users_tp_df.printSchema()
root
 |-- x: struct (nullable = true)
 |    |-- ActiveDirectoryName: string (nullable = true)
 |    |-- AvailableFrom: string (nullable = true)
 |    |-- AvailableFutureAllocation: long (nullable = true)
 |    |-- AvailableFutureHours: double (nullable = true)
 |    |-- CreateDate: string (nullable = true)
 |    |-- CurrentAllocation: long (nullable = true)
 |    |-- CurrentAvailableHours: double (nullable = true)
 |    |-- CustomFields: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- Name: string (nullable = true)
 |    |    |    |-- Type: string (nullable = true)
 |    |    |    |-- Value: string (nullable = true)

I’m trying to convert the CustomFields array column in 3 three columns:

  1. Country;
  2. isExternal;
  3. Service.

So for example, I’ve these values:

enter image description here

and the final dataframe output excepted for that row will be:

enter image description here

Can anyone please help me in achieving this?

Thank you!

2

Answers


  1. Considering the mockup structure below, similar with the one from your example,
    you can do it the sql way by using the inline function:

    with alpha as (
    select named_struct("alpha", "abc", "beta", 2.5, "gamma", 3, "delta"
                        , array(  named_struct("a", "x", "b", "y", "c", "z")
                                , named_struct("a", "xx", "b", "yy", "c","zz"))
                       ) root
    )
    select root.alpha, root.beta, root.gamma, inline(root.delta) as (a, b, c) 
    from alpha
    

    The result:
    enter image description here

    Mockup structure:

    mockup structure

    Login or Signup to reply.
  2. This would work:

    initial_expansion= df.withColumn("id", F.monotonically_increasing_id()).select("id","x.*");
    
    final_df = initial_expansion
         .join(initial_expansion.withColumn("CustomFields", F.explode("CustomFields"))
               .select("*", "CustomFields.*")
               .groupBy("id").pivot("Name").agg(F.first("Value")), 
            "id").drop("CustomFields")
    

    Sample Input:
    Json – {'x': {'CurrentAvailableHours': 2, 'CustomFields': [{'Name': 'Country', 'Value': 'Italy'}, {'Name': 'Service', 'Value':'Dev'}]}}

    Input

    Input Structure:

    root
     |-- x: struct (nullable = true)
     |    |-- CurrentAvailableHours: integer (nullable = true)
     |    |-- CustomFields: array (nullable = true)
     |    |    |-- element: struct (containsNull = true)
     |    |    |    |-- Name: string (nullable = true)
     |    |    |    |-- Value: string (nullable = true)
    

    Output:
    Output

    Output Structure (Id can be dropped):

    root
     |-- id: long (nullable = false)
     |-- CurrentAvailableHours: integer (nullable = true)
     |-- Country: string (nullable = true)
     |-- Service: string (nullable = true)
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search