I would like to trim() a column and replace any run of multiple whitespace characters and Unicode space separators with a single space. The idea is to sanitize usernames, preventing two users from having deceptively similar names: foo bar (SPACE U+0020) vs foo bar (NO-BREAK SPACE U+00A0).
Until now I’ve used:
SELECT regexp_replace(TRIM('some string'), '[\s\v]+', ' ', 'g');
It removes spaces, tabs and carriage returns, but it lacks support for Unicode space separators.
I would have added \h to the regexp, but PostgreSQL doesn’t support it (nor \p{Zs}):
SELECT regexp_replace(TRIM('some string'), '[\s\v\h]+', ' ', 'g');
Error in query (7): ERROR: invalid regular expression: invalid escape sequence
We are running PostgreSQL 12 (12.2-2.pgdg100+1) in a Debian 10 Docker container, using UTF-8 encoding, and we support emojis in usernames.
Is there a way to achieve something similar?
3 Answers
You can construct a bracket expression that includes the whitespace characters from the \p{Zs} Unicode category plus a tab:
It will replace every occurrence of one or more horizontal whitespace characters (what \h matches in other regex flavors that support it) with a regular space character.

Based on the POSIX "space" character class (class shorthand \s in Postgres regular expressions), Unicode "Spaces", some space-like "Format characters", and some additional non-printing characters (finally adding two more from Wiktor’s post), I condensed this custom character class:
So use:
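A minimal sketch of what such a query could look like. The extra code points listed after \s are an illustrative assumption, not necessarily the exact class condensed above, and the inline sample value stands in for a real username column. It relies on the \uXXXX escape of PostgreSQL regular expressions (exactly four hexadecimal digits) and on standard_conforming_strings being on, which is the default, so the backslashes reach the regex engine unchanged.

```sql
-- Assumes a UTF-8 database, where chr(160) is NO-BREAK SPACE (U+00A0).
-- The sample value and the list of extra code points are assumptions for illustration.
SELECT trim(                      -- trim last, so converted spaces at the ends are removed too
         regexp_replace(
           username,
           '[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]+',  -- \s plus assumed space-like extras
           ' ',                   -- each run becomes one plain space
           'g'                    -- replace every run, not only the first
         )
       ) AS clean_username
FROM  (VALUES ('  foo' || chr(160) || chr(160) || ' bar  ')) AS t(username);
-- clean_username: foo bar
```

Against a real table, the same expression can be used in an UPDATE or behind a unique index on the sanitized value.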
Note:
trim() comes after regexp_replace(), so it also covers converted spaces.
It’s important to include the basic space class \s (short for [[:space:]]) to cover all current (and future) basic space characters.
We might include more characters. Or start by stripping all characters encoded with 4 bytes. Because UNICODE is dark and full of terrors.
Consider this demo:
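One possible demo, sketched under assumptions: it uses a deliberately short class, [\s\u00a0], and a made-up sample value to show why the order of trim() and regexp_replace() matters. By default trim() only strips plain spaces (U+0020), so a no-break space at either end survives if trim() runs first.

```sql
-- Assumes a UTF-8 database, where chr(160) is NO-BREAK SPACE (U+00A0).
-- The sample value and the shortened character class are assumptions for illustration.
WITH sample(username) AS (
   VALUES (chr(160) || 'foo' || chr(160) || ' bar' || chr(160))
)
SELECT regexp_replace(trim(username), '[\s\u00a0]+', ' ', 'g') AS trim_first  -- ' foo bar ' (edge spaces remain)
     , trim(regexp_replace(username, '[\s\u00a0]+', ' ', 'g')) AS trim_last   -- 'foo bar'
FROM   sample;
```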
Tool to generate the character class:
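A rough sketch of how such a generator might be written; it is an assumption about the approach, not the original tool. It turns a list of code points into \uXXXX escapes and wraps them, together with \s, in a bracket expression. It only handles code points up to U+FFFF (anything higher needs the eight-digit \UXXXXXXXX form) and does not merge neighbours into ranges.

```sql
-- Example input is assumed: U+00A0, U+2007 and U+202F, given as decimal code points.
-- Relies on standard_conforming_strings = on (the default), so '\u' is a literal backslash plus "u".
SELECT '[\s'
       || string_agg('\u' || lpad(to_hex(cp), 4, '0'), '' ORDER BY cp)  -- e.g. 160 -> \u00a0
       || ']+' AS space_class
FROM   unnest(ARRAY[160, 8199, 8239]) AS cp;
-- space_class: [\s\u00a0\u2007\u202f]+
```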
Compiling blank characters from several sources, I ended up with the following pattern, which includes tabulations (U+0009/U+000B/U+0088-008A/U+2409-240A), word joiner (U+2060), space symbol (U+2420/U+2423), braille blank (U+2800), tag space (U+E0020) and more:
To effectively transform blanks, including multiple consecutive spaces and those at the beginning or end of a column, here are the 3 queries to be executed in sequence (assuming column "text" from "mytable"):
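The three steps could be sketched as follows. The pattern is a shortened, assumed stand-in for the full list described above, and the temporary table exists only so the sketch runs on its own. Note that code points above U+FFFF, such as the tag space U+E0020, need the eight-digit \UXXXXXXXX escape.

```sql
-- Throwaway temp table so the sketch is runnable; skip these two statements to work on a real "mytable".
-- Assumes a UTF-8 database: chr(160) is U+00A0, chr(8288) is U+2060 (word joiner).
CREATE TEMP TABLE mytable ("text" text);
INSERT INTO mytable VALUES (chr(160) || 'foo' || chr(8288) || '  bar ');

-- 1. Turn every blank character into a plain space (assumed, shortened pattern).
UPDATE mytable
SET    "text" = regexp_replace("text", '[\s\u00a0\u2060\u2420\u2423\u2800\U000E0020]', ' ', 'g');

-- 2. Collapse runs of two or more spaces into a single space.
UPDATE mytable
SET    "text" = regexp_replace("text", ' {2,}', ' ', 'g');

-- 3. Strip leading and trailing spaces.
UPDATE mytable
SET    "text" = trim("text");

SELECT "text" FROM mytable;  -- foo bar
```

Steps 1 and 2 can be merged by adding a + quantifier to the bracket expression; keeping them separate mirrors the sequence described above.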