I have not been able to get locale-dependent functions such as strcoll() to work in C.
I am wondering whether I am doing something wrong and/or how to get this to work.
Here is a sample program from this book:
Prinz, Peter, and Tony Crawford. 2016. C in a Nutshell, 2nd edn., p. 574. Beijing-Boston-Farnham-Sebastopol-Tokyo: O’Reilly. ISBN-13: 978-1-491-90475-6.
#include <stdio.h>
#include <string.h>
#include <locale.h>

int main(void) {
    char *samples[] = { "curso", "churro" };
    setlocale(LC_COLLATE, "es_ES.UTF-8");
    int result = strcoll(samples[0], samples[1]);
    if (result == 0) {
        printf("The strings \"%s\" and \"%s\" are "
               "alphabetically equivalent.\n",
               samples[0], samples[1]);
    } else if (result < 0) {
        printf("The string \"%s\" comes before \"%s\" "
               "alphabetically.\n",
               samples[0], samples[1]);
    } else if (result > 0) {
        printf("The string \"%s\" comes after \"%s\" "
               "alphabetically.\n",
               samples[0], samples[1]);
    }
    return(0);
}
The book says that "curso" should come BEFORE "churro", because in Spanish "ch" is considered a separate letter for purposes of alphabetization. However, when I run this program it prints that "curso" comes AFTER "churro". I do not know Spanish, but I have tested this program with several other languages that I do know, and the result is always that of strcmp(), a strictly numerical comparison.
$ gcc --version
gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
$ locale -a | grep es_ES.utf8
es_ES.utf8
I am aware of this question:
Getting locale functions to work in glibc
The author says that locale-dependent functions such as strcoll perform poorly in glibc, and that he was writing his own modifications of it.
Am I missing something? Does this simply not work?
2 Answers
I think it comes down to whether or not your environment recognizes "es_ES.UTF-8".
Note, I do not have access to a Linux environment, which waters down the ability to compare apples with apples. But I hope the following highlights a few things that might help…
On Windows, using a standard LabWindows/CVI compiler (my version is based on Clang 3.3), the program's output appears to be incorrect according to your stated expectations for Spanish alphabetization rules.
I suspect that the implementation and version of the libraries contribute to what we are seeing.
Note that when I later checked the return value of setlocale(), it came back NULL, indicating that "es_ES.UTF-8" was not honored, leaving the locale unchanged.
This article has some interesting and related insights into using UTF-8 in C. (…and how it relates to the locale problems seen here.)
Your book has outdated information. The Spanish digraph "ch" has not been considered a single letter since 1994. See https://rae.es/dpd/abecedario (I hope no translation is needed).
You can also look at the Unicode collation data here. This is the source glibc derives its collation data from. As you can see, there are several collation orders. The standard one does not treat "ch" and "ll" as special, while the traditional one does. Glibc implements the standard collation.
You can check that your Spanish locale collation is working by trying strings with accented characters. Those should come in the order described by the collation order (i.e. right after the corresponding non-accented character) if the system is working, and after all non-accented letters if it is not (i.e. if you forget to call setlocale() or the locale is not supported). Demo. Note that on godbolt, GCC does not support locales, while MSVC does (and with the Unix-like locale names to boot).
If you want to test multi-character collation, use the Czech locale (cs_CZ.UTF-8); it does recognise "ch" as a single letter, and it comes after "h" in the collation order. Demo.