I am writing a program that builds a word list for all words in a document, book or song. The book I started with was Charles Dickens's "A Tale of Two Cities". This was downloaded from the Project Gutenberg web site. It is formatted as a UTF-8 file.
I am running Ubuntu 22.04.
I have come across a number of seemingly innocuous characters in the text that clearly are not standard ASCII.
In the code below the offending characters are the left and right quotation marks around "Wo-ho" and the single quotation mark (apostrophe) in you’re.
If I go through the line character by character I can see that these are non-ASCII and that each one produces three strange character codes: -30, -128, -100 for the left quote; -30, -128, -99 for the right quote; and -30, -128, -103 for the single quote.
If someone could help me understand why this is so it would be appreciated.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

int main(int argc, char *argv[])
{
    const char *lineIn = "“Wo-ho!” said the coachman. “So, then! One more pull and you’re at the";
    const char delim = ' ';
    int strLen = strlen(lineIn);
    int i = 0;

    printf("Start - %s\n", lineIn);
    printf("\n");
    /* i runs up to strLen inclusive, so the terminating '\0' is classified too */
    for (i = 0; i < strLen + 1; i++)
    {
        /* The <ctype.h> functions require a value representable as unsigned char */
        unsigned char uc = (unsigned char)lineIn[i];

        if (isalpha(uc)) {
            printf("Alpha - (%c) ", lineIn[i]);
        } else if (iscntrl(uc)) {
            printf("Cntrl - %c\n", lineIn[i]);
        } else if (isxdigit(uc)) {
            printf("Hex - %c\n", lineIn[i]);
        } else if (isascii(uc)) {
            printf("Asc - %c\n", lineIn[i]);
        } else if (lineIn[i] == delim) {
            /* Never reached: a space is ASCII and is caught by the branch above */
            printf("\n");
        } else {
            /* Printing a plain char with %d shows it as a signed value */
            printf("Unk - %d\n", lineIn[i]);
        }
    }
    return 0;
}
Output from the above:
Start - “Wo-ho!” said the coachman. “So, then! One more pull and you’re at the
Unk - -30
Unk - -128
Unk - -100
Alpha - (W) Alpha - (o) Asc - -
Alpha - (h) Alpha - (o) Asc - !
Unk - -30
Unk - -128
Unk - -99
Asc -
Alpha - (s) Alpha - (a) Alpha - (i) Alpha - (d) Asc -
Alpha - (t) Alpha - (h) Alpha - (e) Asc -
Alpha - (c) Alpha - (o) Alpha - (a) Alpha - (c) Alpha - (h) Alpha - (m) Alpha - (a) Alpha - (n) Asc - .
Asc -
Unk - -30
Unk - -128
Unk - -100
Alpha - (S) Alpha - (o) Asc - ,
Asc -
Alpha - (t) Alpha - (h) Alpha - (e) Alpha - (n) Asc - !
Asc -
Alpha - (O) Alpha - (n) Alpha - (e) Asc -
Alpha - (m) Alpha - (o) Alpha - (r) Alpha - (e) Asc -
Alpha - (p) Alpha - (u) Alpha - (l) Alpha - (l) Asc -
Alpha - (a) Alpha - (n) Alpha - (d) Asc -
Alpha - (y) Alpha - (o) Alpha - (u) Unk - -30
Unk - -128
Unk - -103
Alpha - (r) Alpha - (e) Asc -
Alpha - (a) Alpha - (t) Asc -
Alpha - (t) Alpha - (h) Alpha - (e) Cntrl -
2 Answers
You do not have ASCII. You have UTF-8. “ and ” are not found in the ASCII character set.

As you mentioned, you have a UTF-8 encoded text file. This means a single character can take up to 4 bytes, though the first 128 Unicode code points are encoded in UTF-8 as the same single bytes as ASCII. So if the text is English, almost all of the letters will read exactly as if the file were ASCII.
You can read Wikipedia’s description of how UTF-8 is encoded.
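From the table in its Encoding section, you can see that if the first byte of a character is in one of the following ranges
0x00 to 0x7F
0xC0 to 0xDF
0xE0 to 0xEF
0xF0 to 0xF7
the character will be 1, 2, 3, or 4 bytes long respectively. Bytes in the range 0x80 to 0xBF are continuation bytes and never start a character. This will come in handy if you decide to "translate" a character to some ASCII equivalent, or if you want to filter such characters out (remove them) completely.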