Visual Studio Code - Reading utf-8 encoded files with fopen C

Vovchisk
January 24, 2024
146 views
0 votes
2 Answers

I have created a text file with the following characters to test UTF-8 encoding:

%gÁüĳȐʨΘЋЮѦҗԘՔהڳضणணษ༒Ⴃᎃᡧᬐ⁜₪≸☺⛜⺟むヸ㒦㢒

I have also written this C program to open the file and read it:

#pragma warning(disable:4996)

#include <stdio.h>
#include <stdlib.h>

int main() {
    FILE *ptr;
    ptr = fopen("inputtest.txt", "r, ccs=UTF-8");
    char input[50];
    if (ptr == NULL)
        perror("Error opening file");
    else {
        if (fgets(input, 50, ptr) != NULL) {
            puts(input);
        }
        printf(input);
        fclose(ptr);
    }
}

If I don’t use ccs=UTF-8, I get unreadable characters. With it, the program crashes with code -1073740791. After switching to wchar_t and fgetws, the program’s output was just %.
Note: I am using Windows 11 and Visual Studio 2022, and I need to read multi-language characters.

Tags: c, fopen, utf-8

Answers

- chqrlie
- January 21, 2024 at 8:56 pm
- 0 votes
You call printf(input) whenever the file opens, whether or not fgets() succeeded:
- If fgets() fails, the contents of input are indeterminate. Calling printf with such a format string has undefined behavior.
- If fgets() succeeds, calling printf with a string read from a file as the format string is still risky. If the string contains a % sign not immediately followed by another %, the behavior is undefined because printf will look for variable arguments you did not pass. Your test file begins with %g, so printf tries to fetch a double argument that was never supplied.
It is unclear why you have this call, since puts(input) has already printed the line; you should just remove it.

Regarding the program’s output, whether it will be readable or not depends on the selected terminal encoding.

The extra , ccs=UTF-8 attribute in the fopen mode string is a Microsoft extension to try and help programmers deal with text file encodings. It is sad to see how much energy has been wasted over the last 30 years to deal with 8-bit to 16-bit conversions for API calls and the like when the rest of the world standardized on UTF-8.

- LuisColorado
- January 24, 2024 at 9:08 am
- 0 votes
Consider using fgetws(3) instead, and call setlocale(3) before that. With one-byte characters you are limited to ASCII or, at most, a single-byte character set. And of course, use wchar_t instead of char.

But if the file is UTF-8 encoded, the bytes can simply be read as bytes and printed as bytes. You can read and write them without interpreting them (unless, of course, you want to interpret them):
```
#include <stdio.h>
int main()
{
    int c;
    while ((c = fgetc(stdin)) != EOF)
        putchar(c);
}
```
should work:
```
$ a.out <<EOF
> %gÁüĳȐʨΘЋЮѦҗԘՔהڳضणணษ༒Ⴃᎃᡧᬐ⁜₪≸☺⛜⺟むヸ㒦㢒
> EOF
%gÁüĳȐʨΘЋЮѦҗԘՔהڳضणணษ༒Ⴃᎃᡧᬐ⁜₪≸☺⛜⺟むヸ㒦㢒
$ _
```