skip to Main Content

My requirement is to strip surrogate code point characters from a Postgres column, because they are invalid XML characters and the data needs to be correct in XML. For example, these type of characters:

ubuntu@instenv-226703-n3fe:/Attachments$ awk '{FS="&"} {print $2 $3}' code.txt | sort
#55357;#56443; Order approved"" view=""fieldscreen"">
#55357;#56443; Request Approval Level 3"">
#55357;#56443; Request Board Approval Level"">

Now I have a regex which matches these:

[x{D800}-x{DBFF}][x{DC00}-x{DFFF}]

but it doesn’t work directly in regex_replace SQL:

regexp_replace(descriptor, '[x{D800}-x{DBFF}][x{DC00}-x{DFFF}]', '', 'g');

ERROR:  invalid regular expression: invalid escape  sequence 
SQL state: 2201B

Is there some way to use this regex in the postgres function?
Thanks

2

Answers


  1. Chosen as BEST ANSWER

    Thanks, I thought this should work, but it doesn't replace them :( I verified also that

    SELECT...where fieldname  ~* '\ud800-\udbff\udc00-\udfff';
    

    doesn't match the data either.


  2. Try this, postgreSQL uses a different syntax for Unicode characters in regular expressions

    regexp_replace(descriptor, '\ud800-\udbff\udc00-\udfff', '', 'g');
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search