PostgreSQL is icu collation 'de-u-kn-true-ks-level2' deterministic?

Tom
February 28, 2023
242 views
0 votes
2 Answers

For my database I want natural (‘A2′<‘A12’), case-insensitive sorting, so for testing I created a new collation with

CREATE COLLATION tomscollation (provider = icu, locale = 'de-u-kn-true-ks-level2');

My end goal is to use such a collation as the default collation for the whole database so that every text column or index uses it by default.

This source notes that for PostgreSQL 15 non-deterministic collations are not yet supported at database level.
This is where my confusion starts. Looking at my collation the database says its deterministic (see screenshot). However the documentation says that only byte-order collations can be deterministic but mine is clearly not byte-order.

So what am I missing here?
Why am I able to create the following database and what are the side effects I might be unaware of?

CREATE DATABASE tomsdb TEMPLATE template0 LOCALE_PROVIDER icu ICU_LOCALE 'de-u-kn-true-ks-level2'

Answers

- jian
- February 28, 2023 at 6:57 am
- 0 votes
0
kn=true means that Unicode Locale Extension numeric ordering will be default.

"non-deterministic collations are not yet supported at database level." means that by default, select 'a' = 'A'; will return false.
```
CREATE DATABASE tomsdb TEMPLATE template0 
    LOCALE_PROVIDER icu
    ICU_LOCALE 'de-u-kn-true-ks-level2'
    LOCALE 'de_DE.utf8'
    ;
```
```
select 'a1' = 'A1'
union all
select 'a2' > 'A12'
union all
select 'A2' > 'A12';
```
it return:
```
 ?column?
----------
 f
 f
 f
(3 rows)
```
so on a database level, kn=true did work as expected. but ks-level2 should by default make select 'a1' = 'A1' return true.

so you need create collation to do case insensitive sorting, like following way:
```
CREATE COLLATION case_insensitive (
    provider = icu, locale = 'und-u-ks-level2', deterministic = false);

select 'a1' = 'A1' COLLATE case_insensitive;
```
Login or Signup to reply.

- LaurenzAlbe
- February 28, 2023 at 8:39 am
- 0 votes
0
de-u-kn-true-ks-level2 is by definition not deterministic. If you use a non-deterministic collation in a context where you need a deterministic one (like as a database collation, or if you defined the collation as DETERMINISTIC = TRUE), PostgreSQL will break the tie by using memcmp if the two strings compare equal, but are not identical.

That means for example that if you use that collation for a database, you will get 'A1' < 'a1', because memcmp ranks the lower ASCII value of A before the higher one of a.

See the code of varstr_cmp() in src/backend/utils/adt/varlena.c:
```
/* Break tie if necessary. */
if (result == 0 &&
    (!mylocale || mylocale->deterministic))
    result = strcmp(a1p, a2p);
```
Since you asked in your comment: if you mark a collation as DETERMINISTIC, strings that are byte-wise different will never compare as equal.
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.