skip to Main Content

I have a DF column which is a long strings with comma separated values, like:

  • 2000,2001,2002:a,2003=b,2004,100,101,500,20
  • 101,102,20

What I want to do is to create a new Array<Int> column out of it where:

  1. only values starting with 2 are included
  2. when a value has additional delimiter then only the first part will be returned (e.g. 2002)
  3. some specific values will be excluded (let’s say value = 20)
  4. if the array is empty it should be filled with a default value (let’s say [199])

So basically the 2 test strings should be returned as:

stringColumn arrayColumn
2000,2001,2002:a,2003=b,2004,100,101,500,20 [2000,2001,2002,2003,2004]
101,102,20 [199]

2

Answers


  1. Just divide the problem into subproblems:
    a) split string to substrings
    b) try to find the number in every substring
    c) if number is in substring then check if it meets the conditions

    e.g.:

    import re
    
    pattern = r'2[0-9]*'
    excluded = (20, )
    
    long_str = "2000,2001,2002:a,2003=b,2004,100,101,500,20"
    
    
    def foo(l_str: str, pattern=pattern, excluded=excluded):
        result = []
        for num in l_str.split(','):
            search_res = re.search(pattern, num)
            if search_res and
                    (x := int(search_res.group())) not in excluded:
                result.append(x)
        return result
        
    foo(long_str)
    
    Login or Signup to reply.
  2. I would prefer a custom user defined function to extract and filter values. Here is an approach:

    import re
    
    @F.udf(returnType='array<int>')
    def find_all(s):
        exclude = ['20']
        items = re.findall(r'b(2d+)b', s)
        return [int(x) for x in items if x not in exclude] or [199]
    
    
    df = df.withColumn('arrayColumn', find_all('stringColumn'))
    

    Result

    df.show()
    
    +--------------------+--------------------+
    |        stringColumn|         arrayColumn|
    +--------------------+--------------------+
    |2000,2001,2002:a,...|[2000, 2001, 2002...|
    |          101,102,20|               [199]|
    +--------------------+--------------------+
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search