
I'm having trouble reading a file that contains Chinese characters. I know that the encoding of the file is Big5.

Here is my example file (test.txt); I can’t include it inline because of the Chinese characters: https://gist.github.com/haruka98/974ca2c034ebd8fe7eeac4124739fc41

This is my minimal code example (main.cpp); the code I’m actually using breaks each line down and processes the individual fields.

#include <clocale>
#include <string>
#include <fstream>
#include <iostream>

int main(int argc, char* argv[]) {
    setlocale(LC_ALL, "Chinese-traditional");
    std::wstring wstr;
    std::wifstream input_file("test.txt");
    std::wofstream output_file("test_output.txt");
    int counter = 0;
    while(std::getline(input_file, wstr)) {
        for(int i = 0; i < wstr.size(); i++) {
            if(wstr[i] == L'|') {
                counter++;
            }
        }
        output_file << wstr << std::endl;
    }
    input_file.close();
    output_file.close();
    std::cout << counter << std::endl;
    return 0;
}

To compile my program:

g++ -o test main.cpp -std=c++17

On Windows 10 I got the expected output: the entire file was copied to "test_output.txt" and 129 was printed in the terminal.

On Linux (Debian 9) the terminal output was 4, and "test_output.txt" contained only the first line and the "1|" from the second.

Here is what I tried:

My first guess was a CR LF versus LF issue, since I use both Windows and Linux. But testing the file with both CR LF and LF line endings did not help.

Then I thought that the "Chinese-traditional" might not work on Linux. I replaced it with "zh_TW.BIG5" but did not get the expected result either.

3 Answers


  1. setlocale affects the locale of your program.

    It has no effect on the default encoding of the text displayed by the terminal window. The terminal window is an independent application, with its own locale.

    Pretty much all modern Linux distributions default to UTF-8 as the encoding for the system console and the terminal windows (gnome-terminal, Konsole, xfce4-terminal, etc…).

    Changing your program’s locale only affects how your application interprets text. The terminal window has no knowledge of the internal locale of the application running in it, and it still expects that application to produce output in the system locale’s character encoding, which here is UTF-8.

    It would be theoretically possible for the C library to detect the default system encoding and silently transcode all the output, but it does not work this way.

    On Linux, you will have to do all the work of transcoding Big5 to UTF-8 yourself, using the iconv library.

    A cheap shortcut would be for your program to fork and run the iconv command-line tool as a child process, pipe its output to it, and let iconv do the transcoding on the fly.
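
    The library route mentioned above can be sketched roughly as follows. This is only a minimal illustration, assuming a system iconv that knows the encoding names "BIG5" and "UTF-8" (glibc does); big5_to_utf8 is a hypothetical helper name, not something from the question.

    ```cpp
    #include <iconv.h>

    #include <cerrno>
    #include <cstddef>
    #include <stdexcept>
    #include <string>

    // Hypothetical helper: converts a Big5-encoded byte string to UTF-8 with iconv(3).
    std::string big5_to_utf8(const std::string& input) {
        // iconv_open takes the target encoding first, then the source encoding.
        iconv_t cd = iconv_open("UTF-8", "BIG5");
        if (cd == (iconv_t)-1) {
            throw std::runtime_error("iconv_open failed: conversion not supported");
        }

        std::string output;
        char buffer[4096];
        // iconv wants a non-const pointer but does not modify the input bytes.
        char* in_ptr = const_cast<char*>(input.data());
        std::size_t in_left = input.size();

        while (in_left > 0) {
            char* out_ptr = buffer;
            std::size_t out_left = sizeof buffer;
            const std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
            const int err = errno;  // save before any other call can change it
            output.append(buffer, sizeof buffer - out_left);
            // E2BIG only means the output buffer filled up; loop again for the rest.
            if (rc == (std::size_t)-1 && err != E2BIG) {
                iconv_close(cd);
                throw std::runtime_error("invalid or truncated Big5 input");
            }
        }
        iconv_close(cd);
        return output;
    }
    ```

    Each line read as raw bytes from a plain std::ifstream could be passed through such a helper before it is written to the terminal. On platforms where iconv is not part of the C library, linking with -liconv may be needed.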

  2. First check you have the locale for "Chinese-traditional" installed. On Linux this is zh_TW.UTF-8. You can check using locale -a. If it’s not listed, install it:

    sudo locale-gen zh_TW.UTF-8
    sudo update-locale
    

    (There’s a list of locales here with their names on Linux and Windows.)
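
    The same check can also be done from inside the program, since constructing a std::locale from a name that the system does not provide throws std::runtime_error. A small sketch, where locale_available is a hypothetical helper name:

    ```cpp
    #include <locale>
    #include <stdexcept>
    #include <string>

    // Hypothetical helper: true if the C++ runtime can construct the named locale.
    bool locale_available(const std::string& name) {
        try {
            std::locale loc(name);  // throws std::runtime_error for unknown names
            return true;
        } catch (const std::runtime_error&) {
            return false;
        }
    }
    ```

    Calling locale_available("zh_TW.UTF-8") before imbuing the streams would let the program print a clear error instead of terminating on an uncaught exception.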

    Then use imbue with the input and output streams to set the locale of the streams.

    By default, std::wcout is synchronized to the underlying stdout C stream, which uses an ASCII mapping and displays ? in place of Unicode characters it cannot handle. If you want to print Unicode characters to the terminal, you have to turn that synchronization off. You can do that with one line, and then set the locale of the std::wcout stream:

    std::ios_base::sync_with_stdio(false);
    std::wcout.imbue(loc);
    

    Amended version of your code:

    #include <string>
    #include <locale>
    #include <fstream>
    #include <iostream>
    
    int main(int argc, char* argv[])
    {
        auto loc = std::locale("zh_TW.utf8");
    
        //Disable synchronisation with stdio & set locale
        std::ios::sync_with_stdio(false);
        std::wcout.imbue(loc);
    
        //Set locale of input stream
        std::wstring wstr;
        std::wifstream input_file("test.txt");
        input_file.imbue(loc);
    
        //Set locale of output stream
        std::wofstream output_file("test_output.txt");
        output_file.imbue(loc);
    
        int counter = 0;
        while(std::getline(input_file, wstr)) {
            for(int i = 0; i < wstr.size(); i++) {
                if(wstr[i] == L'|') {
                    counter++;
                }
            }
            std::wcout << wstr << std::endl;
            output_file << wstr << std::endl;
        }
        input_file.close();
        output_file.close();
        std::wcout << counter << std::endl;
        return 0;
    }
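
    Since the file in the question is Big5 while the terminal expects UTF-8, one possible variation is to give the input and output streams different locales. The sketch below is only an illustration of that idea: copy_and_count is a hypothetical helper, and which locale names work depends on what locale -a lists on the machine.

    ```cpp
    #include <cstddef>
    #include <fstream>
    #include <locale>
    #include <string>

    // Hypothetical helper: copies in_path to out_path line by line, decoding with
    // in_loc and encoding with out_loc, and returns the number of '|' characters.
    std::size_t copy_and_count(const std::string& in_path, const std::string& out_path,
                               const std::locale& in_loc, const std::locale& out_loc) {
        std::wifstream input_file;
        input_file.imbue(in_loc);    // set the locale before any I/O happens
        input_file.open(in_path);

        std::wofstream output_file;
        output_file.imbue(out_loc);
        output_file.open(out_path);

        std::size_t counter = 0;
        std::wstring wstr;
        while (std::getline(input_file, wstr)) {
            for (wchar_t ch : wstr) {
                if (ch == L'|') {
                    ++counter;
                }
            }
            output_file << wstr << L'\n';
        }
        return counter;  // both streams are closed by their destructors
    }
    ```

    For example, copy_and_count("test.txt", "test_output.txt", std::locale("zh_TW.BIG5"), std::locale("zh_TW.UTF-8")) would decode Big5 and write UTF-8, assuming both locales are installed; otherwise the std::locale constructor throws std::runtime_error.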
    
  3. Use std::wcout instead of std::cout to print a std::wstring 🙂
