
I have created a text file with the following characters to test UTF-8 encoding:

%gÁüijȐʨΘЋЮѦҗԘՔהڳضणணษ༒Ⴃᎃᡧᬐ⁜₪≸☺⛜⺟むヸ㒦㢒

I have also written this C program to open the file and read it:

#pragma warning(disable:4996)

#include <stdio.h>
#include <stdlib.h>

int main() {
    FILE *ptr;
    ptr = fopen("inputtest.txt", "r, ccs=UTF-8");
    char input[50];
    if (ptr == NULL)
        perror("Error opening file");
    else {
        if (fgets(input, 50, ptr) != NULL) {
            puts(input);
        }
        printf(input);
        fclose(ptr);
    }
}

If I don’t use ccs=UTF-8, I get some unreadable characters. With it, the program crashes with exit code -1073740791. When I switched to wchar_t and fgetws, the program’s output was just %.
Note: I am using Windows 11 and Visual Studio 2022, and I need to read multi-language characters.

2 Answers


  1. You call printf(input) in all cases:

    • the contents of input are indeterminate when fgets() fails. Calling printf with such a format string has undefined behavior.
    • if fgets() succeeded, calling printf with a string read from a file as the format is risky. If the string contains a % sign not immediately followed by another %, the behavior is undefined because printf will look for variable arguments you did not pass. Your test file starts with %g, so printf tries to fetch a double that was never supplied.

    It is unclear why you have this call; you should just remove it.

    Regarding the program’s output, whether it will be readable or not depends on the selected terminal encoding.


    The extra , ccs=UTF-8 attribute in the fopen mode string is a Microsoft extension to try and help programmers deal with text file encodings. It is sad to see how much energy has been wasted over the last 30 years to deal with 8-bit to 16-bit conversions for API calls and the like when the rest of the world standardized on UTF-8.
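A minimal sketch of the fix described above, assuming the file is opened in plain "r" mode (without the ccs=UTF-8 flag) and the terminal is already configured to display UTF-8. The helper name copy_lines is hypothetical and does not come from the question. It reads with fgets() in a loop and writes with fputs(), which never interprets % in the data:

```c
#include <stdio.h>

/* Copy a text stream line by line without interpreting its bytes.
 * Returns 0 on success, -1 on a read or write error.
 * copy_lines is an illustrative name, not part of any library. */
static int copy_lines(FILE *in, FILE *out)
{
    /* Sized in bytes, not characters: one UTF-8 character can take
     * up to 4 bytes, so a 50-byte buffer holds far fewer than 50. */
    char line[256];

    while (fgets(line, sizeof line, in) != NULL) {
        /* fputs writes the bytes as-is; unlike printf(line), a '%'
         * in the data is never treated as a conversion specifier. */
        if (fputs(line, out) == EOF)
            return -1;
    }
    /* fgets returns NULL on both EOF and error; tell them apart. */
    return ferror(in) ? -1 : 0;
}
```

Called as copy_lines(ptr, stdout) in place of the fgets/puts/printf sequence, this passes the file's bytes through unchanged. Whether the glyphs render correctly still depends on the terminal encoding.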

  2. Consider using fgetws(3) instead, calling setlocale(3) before it, and storing the text in wchar_t rather than char. If you treat each char as one character, you are limited to ASCII or, at most, a single-byte character set.

    However, if the file is UTF-8 encoded, its bytes can simply be read as bytes and printed as bytes. You can read and write them without interpreting them (unless, of course, you need to interpret them):

    #include <stdio.h>
    int main()
    {
        int c;
        while ((c = fgetc(stdin)) != EOF)
            putchar(c);
    }
    

    should work:

    $ a.out <<EOF
    > %gÁüijȐʨΘЋЮѦҗԘՔהڳضणணษ༒Ⴃᎃᡧᬐ⁜₪≸☺⛜⺟むヸ㒦㢒
    > EOF
    %gÁüijȐʨΘЋЮѦҗԘՔהڳضणணษ༒Ⴃᎃᡧᬐ⁜₪≸☺⛜⺟むヸ㒦㢒
    $ _
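
If you do need to interpret the bytes, for example to count characters rather than bytes, a rough sketch follows. It assumes the input is valid UTF-8 and relies only on the fact that continuation bytes have the bit pattern 10xxxxxx. The function name utf8_code_points is hypothetical:

```c
#include <stddef.h>

/* Count the code points in a NUL-terminated UTF-8 string.
 * Every byte that is NOT a continuation byte (10xxxxxx) starts a
 * new code point. Assumes valid UTF-8; no validation is done.
 * utf8_code_points is an illustrative name, not a library call. */
static size_t utf8_code_points(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        /* Cast to unsigned char so the mask works even where
         * plain char is signed. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

This also shows why a char buffer must be sized in bytes: a line can need up to four bytes for each code point this function counts.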
    