I have text that was converted between two non-UTF-8 encodings and then saved as UTF-8. How can I restore the original text using PHP?

Everything works fine when the text is UTF-8 that was misread as a single other encoding:

$text = 'РєСѓСЂСЃ';
$text = mb_convert_encoding($text, "WINDOWS-1251", "UTF-8");
echo($text);

// OUTPUT: курс
// Works!

Things get more complicated if the text has been converted between two non-UTF-8 encodings, for example from IBM866 to WINDOWS-1251.

Direct conversion does not work at all:

$text = '╬яЁхфхыхэшх л╚эЇюЁьрЎшюээюую яЁюфєъЄр╗';
$text = mb_convert_encoding($text, "IBM866", "WINDOWS-1251");
echo($text);

// OUTPUT: �??�?�?�?�?�?�?�?�?�?�? �?�??�?�?�?�?�?�?�?�?�?�?�?�?�?�? �?�?�?�?�?�?�?�?�??
// Does not work
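
A likely reason for this failure, sketched in Python because it makes the byte level explicit: a string literal in a UTF-8 source file holds UTF-8 bytes, while the third argument of mb_convert_encoding tells PHP to read those bytes as WINDOWS-1251. The variable names are illustrative, and the sketch assumes Python's cp866 and cp1251 codecs match PHP's IBM866 and WINDOWS-1251 tables.

```python
garbled = '╬яЁхфхыхэшх'  # first word of the mojibake, as stored in a UTF-8 file

utf8_bytes = garbled.encode('utf-8')    # what the UTF-8 string literal actually holds
cp866_bytes = garbled.encode('cp866')   # the single-byte data that needs reinterpreting

# Reading the UTF-8 bytes as WINDOWS-1251 only produces more mojibake.
misread = utf8_bytes.decode('cp1251', errors='replace')

# Reading the single-byte data as WINDOWS-1251 recovers the word.
restored = cp866_bytes.decode('cp1251')

print(len(utf8_bytes), len(cp866_bytes))
print(misread)
print(restored)
```

Only the second path recovers the word, which suggests that the UTF-8 string has to be turned back into single-byte IBM866 data before WINDOWS-1251 is applied.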

Things got better when I added conversion from and to UTF-8:

$text = '╬яЁхфхыхэшх л╚эЇюЁьрЎшюээюую яЁюфєъЄр╗';
$text = mb_convert_encoding($text, "IBM866", "utf-8");
$text = mb_convert_encoding($text, "IBM866", "WINDOWS-1251");
$text = mb_convert_encoding($text, "utf-8", "IBM866");
echo($text);

// OUTPUT: Определение ?Информационного продукта?
// Almost works, but the "?" characters should be "«" and "»"
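
A plausible explanation for the question marks: the second call asks for IBM866 output, and IBM866 has no « or » characters, so the converter substitutes "?". The check below reproduces the effect in Python, assuming Python's cp866 codec matches PHP's IBM866 table for these characters.

```python
restored = 'Определение «Информационного продукта»'

# The Cyrillic letters exist in cp866, but the guillemets do not,
# so encoding with errors='replace' turns each of them into b'?'.
as_cp866 = restored.encode('cp866', errors='replace')
lossy = as_cp866.decode('cp866')

print(lossy)
```

Making UTF-8 the target of the final conversion avoids this lossy intermediate step, because UTF-8 can represent every character involved.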

For some combinations of encodings, neither approach works, for example from ISO-8859-1 to IBM866:

// \u{8F} and \u{AD} are non-printing characters, written as escapes here
$text = "\u{8F}à¨«®¦¥\u{AD}¨¥ 4";
$text = mb_convert_encoding($text, "ISO-8859-1", "IBM866");
echo($text);

// OUTPUT: ???????????????????? 4
// Does not work


// \u{8F} and \u{AD} are non-printing characters, written as escapes here
$text = "\u{8F}à¨«®¦¥\u{AD}¨¥ 4";
$text = mb_convert_encoding($text, "ISO-8859-1", "utf-8");
$text = mb_convert_encoding($text, "ISO-8859-1", "IBM866");
$text = mb_convert_encoding($text, "utf-8", "ISO-8859-1");
echo($text);

// OUTPUT: ?????????? 4
// Does not work

To confirm that the original strings are recoverable, I ran the same transformations in Python:

text = '╬яЁхфхыхэшх л╚эЇюЁьрЎшюээюую яЁюфєъЄр╗'
text = text.encode('cp866').decode('windows-1251')
print(text)
# OUTPUT: Определение «Информационного продукта»
# Works

text = '\x8fà¨«®¦¥\xad¨¥ 4'  # \x8f and \xad are non-printing characters
text = text.encode('ISO-8859-1').decode('cp866')
print(text)
# OUTPUT: Приложение 4
# Works
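
The two Python snippets follow the same pattern: encode with the codec the bytes were wrongly read as, then decode with the codec they were really written in. A minimal sketch of that pattern as a reusable helper is shown below. The function name fix_mojibake and its parameter names are hypothetical, chosen only for illustration.

```python
def fix_mojibake(text: str, misread_as: str, actual: str) -> str:
    """Undo a wrong single-byte decode.

    misread_as: codec the bytes were wrongly decoded with
    actual:     codec the bytes were really written in
    """
    raw = text.encode(misread_as)  # step 1: recover the original bytes
    return raw.decode(actual)      # step 2: decode them with the right codec


print(fix_mojibake('╬яЁхфхыхэшх л╚эЇюЁьрЎшюээюую яЁюфєъЄр╗', 'cp866', 'cp1251'))

# The second sample contains two non-printing characters (U+008F and the
# soft hyphen U+00AD), so they are written as escapes here.
print(fix_mojibake('\x8fà¨«®¦¥\xad¨¥ 4', 'iso-8859-1', 'cp866'))
```

In PHP terms, step 1 would correspond to converting from UTF-8 to the misread encoding and step 2 to converting from the actual encoding to UTF-8, on the assumption that the PHP string is stored as UTF-8.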

Is it possible to get the same result in PHP as in Python?

Answers


  1. Chosen as BEST ANSWER

    Thanks to JosefZ for the answer! I adapted it so that it does not need additional libraries:

    $text = '╬яЁхфхыхэшх л╚эЇюЁьрЎшюээюую яЁюфєъЄр╗';
    $text = mb_convert_encoding($text, "IBM866", "utf-8");
    $text = mb_convert_encoding($text, "utf-8", "WINDOWS-1251");
    echo($text);
    // OUTPUT: Определение «Информационного продукта»
    
    // \u{8F} and \u{AD} are non-printing characters, written as escapes here
    $text = "\u{8F}à¨«®¦¥\u{AD}¨¥ 4";
    $text = mb_convert_encoding($text, "ISO-8859-1", "utf-8");
    $text = mb_convert_encoding($text, "utf-8", "IBM866");
    echo($text);
    // OUTPUT: Приложение 4
    

  2. Apply UConverter::transcode – Convert a string from one character encoding to another.

    Description

    public static UConverter::transcode(
        string $str,
        string $toEncoding,
        string $fromEncoding,
        ?array $options = null
    ): string|false
    

    Converts str from fromEncoding to toEncoding.

    The following script applies exactly the same encode-then-decode mechanism as the Python snippet .encode('cp866').decode('cp1251').

    <?php
    $text0 = '╬яЁхфхыхэшх л╚эЇюЁьрЎшюээюую яЁюфєъЄр╗';
    $text1 = UConverter::transcode($text0, 'IBM866', 'UTF-8');
    $text2 = UConverter::transcode($text1, 'UTF-8', "CP1251");
    var_dump($text2);
    ?>
    

    Output of SO76752650.php:

    string(74) "Определение «Информационного продукта»"
    

    A more compact code snippet (the same result):

    <?php
    $text0 = '╬яЁхфхыхэшх л╚эЇюЁьрЎшюээюую яЁюфєъЄр╗';
    var_dump(
        UConverter::transcode(
            UConverter::transcode($text0, 'IBM866', 'UTF-8'),
            'UTF-8',
            'CP1251'
        )
    );
    ?>
    