Ubuntu - How to replace surrogate code point characters with regex_replace in postgres

user169573
January 18, 2024
275 views
0 votes
2 Answers

My requirement is to strip surrogate code point characters from a Postgres column, because they are invalid XML characters and the data needs to be correct in XML. For example, these type of characters:

ubuntu@instenv-226703-n3fe:/Attachments$ awk '{FS="&"} {print $2 $3}' code.txt | sort
#55357;#56443; Order approved"" view=""fieldscreen"">
#55357;#56443; Request Approval Level 3"">
#55357;#56443; Request Board Approval Level"">

Now I have a regex which matches these:

[x{D800}-x{DBFF}][x{DC00}-x{DFFF}]

but it doesn’t work directly in regex_replace SQL:

regexp_replace(descriptor, '[x{D800}-x{DBFF}][x{DC00}-x{DFFF}]', '', 'g');

ERROR:  invalid regular expression: invalid escape  sequence 
SQL state: 2201B

Is there some way to use this regex in the postgres function?
Thanks

Tags: postgresql regex

Answers

Chosen as BEST ANSWER
- user169573
- January 18, 2024 at 2:12 pm
- 0 votes
0
Thanks, I thought this should work, but it doesn't replace them :( I verified also that
```
SELECT...where fieldname  ~* '\ud800-\udbff\udc00-\udfff';
```
doesn't match the data either.

(Edit)

- TSCAmericacom
- January 17, 2024 at 1:07 pm
- 0 votes
0
Try this, postgreSQL uses a different syntax for Unicode characters in regular expressions
```
regexp_replace(descriptor, '\ud800-\udbff\udc00-\udfff', '', 'g');
```
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.

Ubuntu – How to replace surrogate code point characters with regex_replace in postgres

Answers