I have created a text file with the following characters to test UTF-8 encoding:
%gÁüijȐʨΘЋЮѦҗԘՔהڳضणணษ༒Ⴃᎃᡧᬐ⁜₪≸☺⛜⺟むヸ㒦㢒
I have also written this C program to open the file and read it:
#pragma warning(disable:4996)
#include <stdio.h>
#include <stdlib.h>
int main() {
    FILE *ptr;
    ptr = fopen("inputtest.txt", "r, ccs=UTF-8");
    char input[50];
    if (ptr == NULL)
        perror("Error opening file");
    else {
        if (fgets(input, 50, ptr) != NULL) {
            puts(input);
        }
        printf(input);
        fclose(ptr);
    }
}
If I don’t use ccs=UTF-8, I get some unreadable characters. But with it, the program crashes with code -1073740791. Also, after switching to wchar_t and fgetws, the program’s output was just %.
Note: I am using Windows 11 and Visual Studio 2022, and I need to input multi-language characters.
2 Answers
You call printf(input) in all cases:
If fgets() fails, the contents of input are indeterminate. Calling printf with such a format string has undefined behavior.
If fgets() succeeds, calling printf with a string read from a file is still risky. If the string contains a % sign not immediately followed by another %, the behavior is undefined because printf will look for variable arguments you did not pass. Your test file begins with %g, which is exactly such a case.
It is unclear why you have this call; you should just remove it.
Regarding the program’s output, whether it will be readable or not depends on the selected terminal encoding.
The extra , ccs=UTF-8 attribute in the fopen mode string is a Microsoft extension intended to help programmers deal with text file encodings. It is sad to see how much energy has been wasted over the last 30 years on 8-bit to 16-bit conversions for API calls and the like while the rest of the world standardized on UTF-8.

Consider using fgetws(3) instead, and calling setlocale(3) before that. With one-byte characters, you are limited to ASCII or, at most, a single-byte character set. And of course, use wchar_t characters instead of char.
But if you use UTF-8 encoding, all bytes are read as bytes and can be printed as bytes. You can read and write them without interpreting them (unless, of course, you want to interpret them):
should work: