I'm having trouble reading a file that contains Chinese characters. I know that the file is encoded in Big5.
Here is my example file (test.txt); I can't include it here because of the Chinese characters: https://gist.github.com/haruka98/974ca2c034ebd8fe7eeac4124739fc41
This is my minimal code example (main.cpp); the code I'm actually using breaks each line down and does things with the different fields.
#include <string>
#include <fstream>
#include <iostream>

int main(int argc, char* argv[]) {
    setlocale(LC_ALL, "Chinese-traditional");
    std::wstring wstr;
    std::wifstream input_file("test.txt");
    std::wofstream output_file("test_output.txt");
    int counter = 0;
    while(std::getline(input_file, wstr)) {
        for(int i = 0; i < wstr.size(); i++) {
            if(wstr[i] == L'|') {
                counter++;
            }
        }
        output_file << wstr << std::endl;
    }
    input_file.close();
    output_file.close();
    std::cout << counter << std::endl;
    return 0;
}
To compile my program:
g++ -o test main.cpp -std=c++17
On Windows 10 I get the expected output: the entire file is copied to "test_output.txt" and 129 is printed in the terminal.
On Linux (Debian 9) the terminal output is 4, and "test_output.txt" contains only the first line and the "1|" from the second.
Here is what I tried:
My first guess was the CR LF and LF issue when using both Windows and Linux. But testing both CR LF and LF with the file did not help.
Then I thought that the "Chinese-traditional" might not work on Linux. I replaced it with "zh_TW.BIG5" but did not get the expected result either.
3 Answers
setlocale affects the locale of your program. It has no effect on the default encoding of the text displayed by the terminal window. The terminal window is an independent application, with its own locale.
Pretty much all modern Linux distributions default to UTF-8 as the encoding for the system console and the terminal windows (gnome-terminal, Konsole, xfce4-terminal, etc…).
Changing your program’s locale only affects how your application interprets text, but the terminal still expects your application to produce UTF-8 output. The terminal window has no knowledge of the internal locale of the application running in the terminal window. Terminal windows expect applications to produce output using the system locale’s character encoding.
It is theoretically possible for the C library to know the default system encoding and silently transcode all the output, however it does not work this way.
On Linux, you will have to do all the work of transcoding Big5 to UTF-8 yourself, using the iconv library.
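As a rough illustration of the transcoding step, here is a minimal sketch built on the POSIX iconv(3) interface. The helper name big5_to_utf8 is made up for this example, and it assumes the platform's iconv knows the encoding names "BIG5" and "UTF-8" (glibc does).

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from the question): converts a Big5-encoded
// byte string to UTF-8 using iconv(3).
std::string big5_to_utf8(const std::string& input) {
    iconv_t cd = iconv_open("UTF-8", "BIG5");  // arguments are (to, from)
    if (cd == reinterpret_cast<iconv_t>(-1)) {
        throw std::runtime_error("iconv_open failed: BIG5 -> UTF-8 unavailable");
    }

    std::string output;
    char buffer[256];
    // iconv takes a non-const char**, but it does not modify the input bytes.
    char* in_ptr = const_cast<char*>(input.data());
    std::size_t in_left = input.size();

    while (in_left > 0) {
        char* out_ptr = buffer;
        std::size_t out_left = sizeof buffer;
        const std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
        const int err = errno;  // save before any other call can change it
        output.append(buffer, sizeof buffer - out_left);
        // E2BIG only means the output buffer filled up; loop and continue.
        if (rc == static_cast<std::size_t>(-1) && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("invalid or truncated Big5 input");
        }
    }
    iconv_close(cd);
    return output;
}
```

With something like this, the program could read the file as plain bytes through std::ifstream, convert each line, and write the UTF-8 result to std::cout.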
A low-cost shortcut would be for your program to fork and run the iconv command-line tool as a child process, pipe its output to it, and let iconv do the transcoding on the fly.

First check that you have the locale for "Chinese-traditional" installed. On Linux this is zh_TW.UTF-8. You can check using locale -a. If it's not listed, install it. (There's a list of locales here with their names on Linux and Windows.)
Then use imbue with the input and output streams to set the locale of the streams.
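A minimal sketch of the imbue step for a file stream follows. The helper name count_delimiters and its parameters are made up for this example. The locale name is passed in by the caller and must be one that locale -a lists. The codecvt facet of the imbued locale decides how the file's bytes are decoded, so for the Big5 file in the question a Big5 locale such as zh_TW.BIG5 is presumably the one to pass, while zh_TW.UTF-8 would suit a UTF-8 file.

```cpp
#include <cstddef>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: counts occurrences of `delimiter` in a text file that
// is decoded with the named locale. std::locale throws std::runtime_error if
// the locale is not installed. Returns 0 if the file cannot be opened.
std::size_t count_delimiters(const std::string& path,
                             const std::string& locale_name,
                             wchar_t delimiter = L'|') {
    std::wifstream input;
    // imbue() before open(), so the locale's codecvt facet is in place
    // before any bytes are read and converted to wchar_t.
    input.imbue(std::locale(locale_name));
    input.open(path);

    std::size_t count = 0;
    std::wstring line;
    while (std::getline(input, line)) {
        for (wchar_t ch : line) {
            if (ch == delimiter) {
                ++count;
            }
        }
    }
    return count;
}
```

A call might look like count_delimiters("test.txt", "zh_TW.BIG5"). The same imbue call works on a std::wofstream for the output file.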
By default, std::wcout is synchronized to the underlying stdout C stream, which uses an ASCII mapping and displays ? in place of Unicode characters it cannot handle. If you want to print Unicode characters to the terminal, you have to turn that synchronization off. You can do that with one line and set the locale of the terminal:

Amended version of your code:
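The two console steps described above can be sketched as follows. This is an illustration only, not the answerer's own listing. The helper name enable_wide_console is made up, and the locale name is whatever the terminal expects (zh_TW.UTF-8 in this answer's setup).

```cpp
#include <iostream>
#include <locale>
#include <stdexcept>

// Hypothetical helper: prepares std::wcout for non-ASCII output.
// Returns false if the named locale is not installed.
bool enable_wide_console(const char* locale_name) {
    try {
        const std::locale loc(locale_name);
        // Call this before any other I/O on the standard streams.
        std::ios_base::sync_with_stdio(false);
        // From here on std::wcout converts wchar_t using this locale's encoding.
        std::wcout.imbue(loc);
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

After a successful call such as enable_wide_console("zh_TW.UTF-8"), the program would print with std::wcout << wstr << L'\n' and avoid mixing in std::cout or printf on the same stream.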
Use std::wcout to print a std::wstring instead of std::cout 🙂