
I am writing a program that builds a word list for all words in a document, book or song. The book I started with was Charles Dickens's "A Tale of Two Cities". This was downloaded from the Project Gutenberg web site. It is formatted as a UTF-8 file.

I am running Ubuntu 22.04.

I have come across a number of seemingly innocuous characters in the text that clearly are not standard ASCII.

In the code below the offending characters are the left and right quotation marks around "Wo-ho" and the single quotation mark in you’re.

If I go through the line character by character, I can see that these are non-ASCII and produce strange codes of -30, -128, -100 for the left quote, -30, -128, -99 for the right quote, and -30, -128, -103 for the single quote.

If someone could help me understand why this is so it would be appreciated.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

int main(int argc, char *argv[])
{
    char            *lineIn     = "“Wo-ho!” said the coachman. “So, then! One more pull and you’re at the";
    const   char    delim       = ' ';
    int             strLen      = strlen(lineIn);
    int             i           = 0;

    printf("Start - %s\n", lineIn);
    printf("\n");

    for (i = 0; i < strLen + 1; i++)
    {   
         
        if (isalpha(lineIn[i])) {
            printf("Alpha - (%c) ", lineIn[i]);
        } else if (iscntrl(lineIn[i])) {
            printf("Cntrl - %c\n", lineIn[i]);
        } else if (isxdigit(lineIn[i])) {
            printf("Hex - %c\n", lineIn[i]);
        } else if (isascii(lineIn[i])) {
            printf("Asc - %c\n", lineIn[i]);
        } else if (lineIn[i] == delim) {
            printf("\n");
        } else { 
            printf("Unk - %d\n", lineIn[i]);
        }
    }   

    return 0;
}

Output from above :

Start - “Wo-ho!” said the coachman. “So, then! One more pull and you’re at the

Unk - -30
Unk - -128
Unk - -100
Alpha - (W) Alpha - (o) Asc - -
Alpha - (h) Alpha - (o) Asc - !
Unk - -30
Unk - -128
Unk - -99
Asc -  
Alpha - (s) Alpha - (a) Alpha - (i) Alpha - (d) Asc -  
Alpha - (t) Alpha - (h) Alpha - (e) Asc -  
Alpha - (c) Alpha - (o) Alpha - (a) Alpha - (c) Alpha - (h) Alpha - (m) Alpha - (a) Alpha - (n) Asc - .
Asc -  
Unk - -30
Unk - -128
Unk - -100
Alpha - (S) Alpha - (o) Asc - ,
Asc -  
Alpha - (t) Alpha - (h) Alpha - (e) Alpha - (n) Asc - !
Asc -  
Alpha - (O) Alpha - (n) Alpha - (e) Asc -  
Alpha - (m) Alpha - (o) Alpha - (r) Alpha - (e) Asc -  
Alpha - (p) Alpha - (u) Alpha - (l) Alpha - (l) Asc -  
Alpha - (a) Alpha - (n) Alpha - (d) Asc -  
Alpha - (y) Alpha - (o) Alpha - (u) Unk - -30
Unk - -128
Unk - -103
Alpha - (r) Alpha - (e) Asc -  
Alpha - (a) Alpha - (t) Asc -  
Alpha - (t) Alpha - (h) Alpha - (e) Cntrl -

2 Answers


  1. You do not have ASCII. You have UTF-8.

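    • -30 -128 -100 (E2.80.9C as hex) is the UTF-8 encoding of U+201C LEFT DOUBLE QUOTATION MARK (“)
    • -30 -128 -99 (E2.80.9D as hex) is the UTF-8 encoding of U+201D RIGHT DOUBLE QUOTATION MARK (”)
    • -30 -128 -103 (E2.80.99 as hex) is the UTF-8 encoding of U+2019 RIGHT SINGLE QUOTATION MARK (’)

    None of these is in the ASCII character set. The values print as negative because char is signed on your platform, so any byte of 0x80 or above is shown as its value minus 256 (0xE2 = 226 becomes -30).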
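To check the hex values in this answer yourself, here is a minimal sketch. The helper names `byte_value` and `decode3` are made up for illustration, and `decode3` assumes it is handed a well-formed three-byte sequence (it does no validation).

```c
/* Hypothetical helper: reinterpret a (possibly signed) char as a byte 0..255.
   On a platform where char is signed, -30 becomes 226 (0xE2). */
unsigned byte_value(char c)
{
    return (unsigned char)c;
}

/* Hypothetical helper: decode a three-byte UTF-8 sequence
   (1110xxxx 10xxxxxx 10xxxxxx) into its Unicode code point.
   Assumes the input is well formed; nothing is validated. */
unsigned long decode3(char b0, char b1, char b2)
{
    return ((unsigned long)(byte_value(b0) & 0x0F) << 12)  /* low 4 bits of lead byte */
         | ((unsigned long)(byte_value(b1) & 0x3F) << 6)   /* 6 payload bits */
         |  (unsigned long)(byte_value(b2) & 0x3F);        /* 6 payload bits */
}
```

For example, `decode3(-30, -128, -100)` yields 0x201C, the left double quotation mark from the question.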

  2. As you mentioned, you have a UTF-8 encoded text file. In UTF-8 a character can take up to 4 bytes, but the first 128 Unicode code points are encoded as single bytes identical to ASCII. So if the text is English, almost every letter will look exactly like ASCII, and only characters such as the typographic quotes take more than one byte.

    You can read Wikipedia’s description of how UTF-8 is encoded.

    From the table on Encoding, you can see that the bit pattern of the first byte of a character tells you how many bytes the character occupies:

    UTF-8        character
    first byte   size

    0xxxxxxx      1 byte
    110xxxxx      2 bytes
    1110xxxx      3 bytes
    11110xxx      4 bytes

    Every byte after the first in a multi-byte character has the form 10xxxxxx. This will come in handy if you decide to "translate" the character to some ASCII equivalent, or if you want to filter such characters out (remove them) completely.
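The table above can be turned into code along these lines. This is only a sketch: `utf8_seq_len` and `to_ascii` are hypothetical names, the mapping covers just the curly quotes (U+2018, U+2019, U+201C, U+201D), and every other multi-byte character is simply dropped, which may not be what you want for a real word list.

```c
#include <stddef.h>

/* Number of bytes in the UTF-8 sequence that starts with `lead`,
   or 0 if `lead` is a continuation byte (10xxxxxx) or not a valid lead byte. */
size_t utf8_seq_len(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1;  /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;
}

/* Hypothetical helper: copy src to dst, translating curly quotes to their
   ASCII equivalents and dropping every other non-ASCII character.
   dst must have room for strlen(src) + 1 bytes (output never grows). */
void to_ascii(const char *src, char *dst)
{
    const unsigned char *p = (const unsigned char *)src;

    while (*p) {
        size_t n = utf8_seq_len(*p);

        if (n == 1) {
            *dst++ = (char)*p++;                    /* plain ASCII: copy as is */
        } else if (n == 3 && p[0] == 0xE2 && p[1] == 0x80 &&
                   (p[2] == 0x9C || p[2] == 0x9D)) {
            *dst++ = '"';                           /* U+201C / U+201D */
            p += 3;
        } else if (n == 3 && p[0] == 0xE2 && p[1] == 0x80 &&
                   (p[2] == 0x98 || p[2] == 0x99)) {
            *dst++ = '\'';                          /* U+2018 / U+2019 */
            p += 3;
        } else {
            /* Anything else: skip the whole sequence (or one stray byte),
               without running past the terminating NUL. */
            size_t skip = n ? n : 1;
            while (skip-- && *p)
                p++;
        }
    }
    *dst = '\0';
}
```

Applied to the line from the question, this would turn you’re into you're, so an apostrophe-aware word splitter can treat it as a single word.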
