First, this C project has some coding restrictions: I can't declare a variable and assign a value to it on the same line, and we are only allowed to use while loops. Also, for reference, I’m using Ubuntu.
I want to print the decimal ASCII value, character by character, of a string passed to the program. For example, if the input is "rose", the program correctly prints 114 111 115 101. But when I try to print the decimal value of a char like ‘Ç’, the first char of the extended ASCII table, the program weirdly prints -61 -121. Here is the code:
#include <stdio.h>

int main (int argc, char **argv)
{
    int i;
    i = 0;
    if (argc == 2)
    {
        /* walk the argument one byte at a time until the terminating NUL */
        while (argv[1][i] != '\0')
        {
            printf ("%i ", argv[1][i]);
            i++;
        }
    }
}
I did some research and found that I should try unsigned char for argv instead of char, like this:
#include <stdio.h>

int main (int argc, unsigned char **argv)
{
    int i;
    i = 0;
    if (argc == 2)
    {
        /* same loop as before, but each byte is now read as unsigned */
        while (argv[1][i] != '\0')
        {
            printf("%i ", argv[1][i]);
            i++;
        }
    }
}
In this case, I run the program with a ‘Ç’ and the output is 195 135 (still wrong).
How can I make this program print the right ASCII decimal value of a char from the extended ASCII table? In this case, "Ç" should be 128.
Thank you!!
2 Answers
Your platform is using UTF-8 Encoding.
Unicode Latin Capital Letter C with Cedilla (U+00C7) "Ç" encodes to 0xC3 0x87 in UTF-8.
In turn those bytes in decimal are 195 and 135 which you see in output.
Remember UTF-8 is a multi-byte encoding for characters outside basic ASCII (0 thru 127).
That character is code point 128 in some extended ASCII code pages (CP437, for example), but UTF-8 diverges from extended ASCII in that range.
You may find there are tools on your platform to convert that to extended ASCII, but I suspect you don’t want to do that and should work with the encoding supported by your platform (which I am sure is UTF-8).
It’s Unicode code point 199, so unless you have a specific application for extended ASCII you’ll probably just make things worse by converting to it. That’s not least because it’s a much smaller set of characters than Unicode.
Here’s some information for Unicode Latin Capital Letter C with Cedilla including the UTF-8 Encoding: https://www.fileformat.info/info/unicode/char/00C7/index.htm
There are various ways of representing non-ASCII characters, such as Ç. Your question suggests you’re familiar with 8-bit character sets such as ISO-8859, where in several of its variants Ç does indeed have code 199. (That is, if your computer were set up to use ISO-8859, your program probably would have worked, although it might have printed -57 instead of 199.)
But these days, more and more systems use Unicode, which they typically encode using a particular multibyte encoding, UTF-8.
In C, one way to extract wide characters from a multibyte character string is the function mbtowc. Here is a modification of your program, using this function:
You give mbtowc a pointer to the multibyte encoding of one or more multibyte characters, and it converts one of them, returning it via its first argument (here, into the variable wc). It returns the number of bytes it used, or 0 if it encountered the end of the string.
When I run this program on the string abÇd, it prints
This shows that in Unicode (just like 8859-1), Ç has the code 199, but it takes two bytes to encode it.
Under Linux, at least, the C library supports potentially multiple multibyte encodings, not just UTF-8. It decides which encoding to use based on the current "locale", which is usually part of the environment, literally governed by an environment variable such as $LANG. That’s what the call setlocale(LC_CTYPE, "") is for: it tells the C library to pay attention to the environment to select a locale for the program’s functions, like mbtowc, to use.
Unicode is of course huge, encoding thousands and thousands of characters. Here’s the output of the modified version of your program on the string "abΣ∫😊":
Emoji like 😊 typically take four bytes to encode in UTF-8.