skip to Main Content

I want to sort test last name and test2 last name so that the former comes before the latter. My understanding is that each character is compared from left to right until they differ and therefore the characters after those first differing characters do not matter anymore. However, as shown below, test comes before test2 but as soon as I append another character, the order changes. Why does this happen? What collation should I use to get the desired order? Note that converting them to bytea would yield the desired order.

test=# SELECT 'test last name' < 'test2 last name' COLLATE "en_US";
 ?column?
----------
 f
(1 row)

test=# SELECT 'test last' < 'test2 last' COLLATE "en_US";
 ?column?
----------
 f
(1 row)

test=# SELECT 'test ' < 'test2 ' COLLATE "en_US";
 ?column?
----------
 t
(1 row)

test=# SELECT 'test ' < 'test2' COLLATE "en_US";
 ?column?
----------
 t
(1 row)

test=# SELECT 'test' < 'test2' COLLATE "en_US";
 ?column?
----------
 t
(1 row)

test=# SELECT 'test  ' < 'test2' COLLATE "en_US";
 ?column?
----------
 t
(1 row)

test=# SELECT 'test  ' < 'test2 ' COLLATE "en_US";
 ?column?
----------
 t
(1 row)

test=# SELECT 'test  ' < 'test2  ' COLLATE "en_US";
 ?column?
----------
 t
(1 row)

test=# SELECT 'test  ' < 'test2 ' COLLATE "en_US";
 ?column?
----------
 t
(1 row)

test=# SELECT 'test  ' < 'test2 l' COLLATE "en_US";
 ?column?
----------
 t
(1 row)

test=# SELECT 'test ' < 'test2 l' COLLATE "en_US";
 ?column?
----------
 t
(1 row)

test=# SELECT 'test l' < 'test2 l' COLLATE "en_US";
 ?column?
----------
 f
(1 row)

test=# SELECT 'test l' < 'test2l' COLLATE "en_US";
 ?column?
----------
 f
(1 row)

test=# SELECT 'test ' < 'test2l' COLLATE "en_US";
 ?column?
----------
 t
(1 row)

test=# SELECT 'test last name'::bytea < 'test2 last name'::bytea;
 ?column? 
----------
 t
(1 row)

2

Answers


  1. That’s how natural language collations work. If you want to compare character by character and have the space character be like other characters, use the C collation:

    SELECT 'test last name' < 'test2 last name' COLLATE "C";
    
     ?column? 
    ══════════
     t
    (1 row)
    

    But don’t complain if 'Z' < 'a'

    Login or Signup to reply.
  2. white space is special character in ICU collation.

    see demo: https://www.unicode.org/reports/tr10/#Variable_Weighting_Examples
    also here: http://www.unicode.org/reports/tr35/tr35-collation.html#table-collation-settings
    simple explanation: https://unicode-org.github.io/icu/userguide/collation/customization/ignorepunct.html#shift-trimmed

    You can following test:

    CREATE COLLATION coll_shifted(provider = icu, locale = 'en-u-ka-shifted');
    CREATE COLLATION coll_noignore(provider = icu, locale = 'en-u-ka-noignore');
    
    SELECT 'test last name' < 'test2 last name' COLLATE coll_shifted
    union all
    SELECT 'test last name' < 'test2 last name' COLLATE coll_noignore
    union all
    SELECT 'test last name' < 'test2 last name' COLLATE "en_US";
    

    if you just want compare by code pointer, you can use COLLATE "C" or COLLATE "POSIX".

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search