skip to Main Content

I would like to trim() a column and to replace any multiple white spaces and Unicode space separators to single space. The idea behind is to sanitize usernames, preventing 2 users having deceptive names foo bar (SPACE u+20) vs foo bar(NO-BREAK SPACE u+A0).

Until now I’ve used SELECT regexp_replace(TRIM('some string'), '[sv]+', ' ', 'g'); it removes spaces, tab and carriage return, but it lack support for Unicode space separators.

I would have added to the regexp h, but PostgreSQL doesn’t support it (neither p{Zs}):

SELECT regexp_replace(TRIM('some string'), '[svh]+', ' ', 'g');
Error in query (7): ERROR: invalid regular expression: invalid escape  sequence

We are running PostgreSQL 12 (12.2-2.pgdg100+1) in a Debian 10 docker container, using UTF-8 encoding, and support emojis in usernames.

I there a way to achieve something similar?

3

Answers


  1. You may construct a bracket expression including the whitespace characters from p{Zs} Unicode category + a tab:

    REGEXP_REPLACE(col, '[u0009u0020u00A0u1680u2000-u200Au202Fu205Fu3000]+', ' ', 'g')
    

    It will replace all occurrences of one or more horizontal whitespaces (match by h in other regex flavors supporting it) with a regular space char.

    Login or Signup to reply.
  2. Based on the Posix "space" character-class (class shorthand s in Postgres regular expressions), UNICODE "Spaces", some space-like "Format characters", and some additional non-printing characters (finally added two more from Wiktor’s post), I condensed this custom character class:

    '[su00a0u180eu2007u200b-u200fu202fu2060ufeff]'
    

    So use:

    SELECT trim(regexp_replace('some string', '[su00a0u180eu2007u200b-u200fu202fu2060ufeff]+', ' ', 'g'));
    

    Note: trim() comes after regexp_replace(), so it covers converted spaces.

    It’s important to include the basic space class s (short for [[:space:]] to cover all current (and future) basic space characters.

    We might include more characters. Or start by stripping all characters encoded with 4 bytes. Because UNICODE is dark and full of terrors.

    Consider this demo:

    SELECT d AS decimal, to_hex(d) AS hex, chr(d) AS glyph
         , 'u' || lpad(to_hex(d), 4, '0') AS unicode
         , chr(d) ~ 's' AS in_posix_space_class
         , chr(d) ~ '[su00a0u180eu2007u200b-u200fu202fu2060ufeff]' AS in_custom_class
    FROM  (
       -- TAB, SPACE, NO-BREAK SPACE, OGHAM SPACE MARK, MONGOLIAN VOWEL, NARROW NO-BREAK SPACE
       -- MEDIUM MATHEMATICAL SPACE, WORD JOINER, IDEOGRAPHIC SPACE, ZERO WIDTH NON-BREAKING SPACE
       SELECT unnest('{9,32,160,5760,6158,8239,8287,8288,12288,65279}'::int[])
       UNION ALL
       SELECT generate_series (8192, 8202) AS dec  -- UNICODE "Spaces"
       UNION ALL
       SELECT generate_series (8203, 8207) AS dec  -- First 5 space-like UNICODE "Format characters"
       ) t(d)
    ORDER  BY d;
    
     decimal | hex  |  glyph   | unicode | in_posix_space_class | in_custom_class 
    ---------+------+----------+---------+----------------------+-----------------
           9 | 9    |          | u0009  | t                    | t
          32 | 20   |          | u0020  | t                    | t
         160 | a0   |          | u00a0  | f                    | t
        5760 | 1680 |          | u1680  | t                    | t
        6158 | 180e | ᠎        | u180e  | f                    | t
        8192 | 2000 |          | u2000  | t                    | t
        8193 | 2001 |          | u2001  | t                    | t
        8194 | 2002 |          | u2002  | t                    | t
        8195 | 2003 |          | u2003  | t                    | t
        8196 | 2004 |          | u2004  | t                    | t
        8197 | 2005 |          | u2005  | t                    | t
        8198 | 2006 |          | u2006  | t                    | t
        8199 | 2007 |          | u2007  | f                    | t
        8200 | 2008 |          | u2008  | t                    | t
        8201 | 2009 |          | u2009  | t                    | t
        8202 | 200a |          | u200a  | t                    | t
        8203 | 200b | ​        | u200b  | f                    | t
        8204 | 200c | ‌        | u200c  | f                    | t
        8205 | 200d | ‍        | u200d  | f                    | t
        8206 | 200e | ‎        | u200e  | f                    | t
        8207 | 200f | ‏        | u200f  | f                    | t
        8239 | 202f |          | u202f  | f                    | t
        8287 | 205f |          | u205f  | t                    | t
        8288 | 2060 | ⁠        | u2060  | f                    | t
       12288 | 3000 |         | u3000  | t                    | t
       65279 | feff |         | ufeff  | f                    | t
    (26 rows)
    

    Tool to generate the character class:

    SELECT '[s' || string_agg('u' || lpad(to_hex(d), 4, '0'), '' ORDER BY d) || ']'
    FROM  (
       SELECT unnest('{9,32,160,5760,6158,8239,8287,8288,12288,65279}'::int[])
       UNION ALL
       SELECT generate_series (8192, 8202)
       UNION ALL
       SELECT generate_series (8203, 8207)
       ) t(d)
    WHERE  chr(d) !~ 's'; -- not covered by s
    
    [su00a0u180eu2007u200bu200cu200du200eu200fu202fu2060ufeff]
    

    db<>fiddle here

    Related, with more explanation:

    Login or Signup to reply.
  3. Compiling blank characters from several sources, I’ve ended up with the following pattern which includes tabulations (U+0009 / U+000B / U+0088-008A / U+2409-240A), word joiner (U+2060), space symbol (U+2420 / U+2423), braille blank (U+2800), tag space (U+E0020) and more:

    [x0009x000Bx0088-x008Ax00A0x1680x180Ex2000-x200Fx202Fx205Fx2060x2409x240Ax2420x2423x2800x3000xFEFFxE0020]
    

    And in order to effectively transform blanks including multiple consecutive spaces and those at the beginning/end of a column, here are the 3 queries to be executed in sequence (assuming column "text" from "mytable")

    -- transform all Unicode blanks/spaces into a "regular" one (U+20) only on lines where "text" matches the pattern
    UPDATE
        mytable
    SET
        text = regexp_replace(text, '[x0009x000Bx0088-x008Ax00A0x1680x180Ex2000-x200Fx202Fx205Fx2060x2409x240Ax2420x2423x2800x3000xFEFFxE0020]', ' ', 'g')
    WHERE
        text ~ '[x0009x000Bx0088-x008Ax00A0x1680x180Ex2000-x200Fx202Fx205Fx2060x2409x240Ax2420x2423x2800x3000xFEFFxE0020]';
    
    -- then squeeze multiple spaces into one
    UPDATE mytable SET text=regexp_replace(text, '[ ]+ ',' ','g') WHERE text LIKE '%  %';
    
    -- and finally, trim leading/ending spaces
    UPDATE mytable SET text=trim(both ' ' FROM text) WHERE text LIKE ' %' OR text LIKE '% ';
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search