skip to Main Content

I have below dataset,

1. '{SHEET,2730377,SHEET,5708283,DATA,3015937}'
2. '{SHEET,2730377,SHEET,5708283,DATA,3015937,DATA,0010965}'
3. '{SHEET,5708283,DATA,3015937,DATA,0010965}'
3. '{SHEET,5708283,DATA,3015937}'

I need the result as,

1. {DATA,3015937}
2. {DATA,3015937,DATA,0010965}
3. {DATA,3015937,DATA,0010965}
4. {DATA,3015937}

Basically I wanted to replace SHEET followed by 7 digit number and extract only DATA followed by 7 digit number.

I tried using regexp_replace('{SHEET,2730377,SHEET,5708283,DATA,3015937}', 'SHEET,(ddddddd),', '' ) the result of this is {SHEET,5708283,DATA,3015937}, as well as using similar to [0-9][0-9][0-9][0-9][0-9][0-9][0-9]

Finding difficulties to get the accurate result, any suggestion would be appreciated

2

Answers


  1. If, as in the examples,

    1. there are one or "DATA" pairs that are contiguous and are followed by "}' at the end of the string; and

    2. there is at least one "STREET" pair that precedes the first "DATA" pair,

    then you may convert matches of the following regular expression to empty strings:

    ^[^']*?'|{ '|(?:SHEET,d{7},)+(?=DATA,d{7}b)|'$
    

    Demo


    I understand Postgres supports lookaheads.

    The expression can be broken down as follows.

    ^                   # match the beginning of the string
      [^']*?            # match zero or more characters other than single 
                        # quotes, as few as possible        
      '                 # match a single quote
    |                   # or
      { '               # match '{' followed by a space then a single quote 
    |                   # or
      (?:               # begin a non-capture group 
        SHEET,d{7},    # match 'SHEET' followed by a comma then 7 digits
                        # then a comma
      )+                # end non-capture group and execute it one or more times,
                        # as many as possible
      (?=               # begin a positive lookahead
        bDATA,d{7}b  # match a word boundary then 'DATA,' then a comma
                        # then 7 digits then a word boundary
      )                 # end the positive lookahead
    |                   # or
      '$                # match a single quote at the end of the string
    
    Login or Signup to reply.
  2. Instead of REGEXP_REPLACEing what you want to remove, just SELECT what you want to extract as follows (all of the code below is available on the fiddle here):

    CREATE TABLE t
    (
      str TEXT NOT NULL
    );
    

    populate:

    INSERT INTO t (str) VALUES
    ('{SHEET,2730377,SHEET,5708283,DATA,3015937}'),
    ('{SHEET,2730377,SHEET,5708283,DATA,3015937,DATA,0010965}'),
    ('{SHEET,5708283,DATA,3015937,DATA,0010965}'),
    ('{SHEET,5708283,DATA,3015937}');
    

    and then we run the following:

    SELECT
      '{' || ARRAY_TO_STRING(REGEXP_MATCHES(str, '(DATA,.*})+$', 'g'), ',') AS result
    FROM
      t;
    

    Result:

    result
    {DATA,3015937}
    {DATA,3015937,DATA,0010965}
    {DATA,3015937,DATA,0010965}
    {DATA,3015937}
    

    Et voilĂ !

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search