My requirement is to strip surrogate code point characters from a Postgres column, because they are invalid XML characters and the data needs to be correct in XML. For example, these type of characters:
ubuntu@instenv-226703-n3fe:/Attachments$ awk '{FS="&"} {print $2 $3}' code.txt | sort
#55357;#56443; Order approved"" view=""fieldscreen"">
#55357;#56443; Request Approval Level 3"">
#55357;#56443; Request Board Approval Level"">
Now I have a regex which matches these:
[x{D800}-x{DBFF}][x{DC00}-x{DFFF}]
but it doesn’t work directly in regex_replace SQL:
regexp_replace(descriptor, '[x{D800}-x{DBFF}][x{DC00}-x{DFFF}]', '', 'g');
ERROR: invalid regular expression: invalid escape sequence
SQL state: 2201B
Is there some way to use this regex in the postgres function?
Thanks
2
Answers
Thanks, I thought this should work, but it doesn't replace them :( I verified also that
doesn't match the data either.
Try this,
postgreSQL
uses a different syntax for Unicode characters in regular expressions