I’m translating user-submitted strings from UTF-8 to ASCII-Printable:

$str = 'Thê qúïck 😈 brõwn fõx júmps?😈 Óvér thé lázy dõg?😈';

$out = iconv('UTF-8', 'ASCII//TRANSLIT', $str);

var_dump($out);

$out = 'The quick ? brown fox jumps?? Over the lazy dog??';

I want the extra ? question marks from $out removed.

if ($out !== $str && strpos($out, '?') !== false) {
    // The input string was modified and contains at least one question mark
    //
    // Not even really sure where to begin
    //
    // Do we need to compare the position of every character from the
    // original string to every position of the new string and replace
    // where the original string did not contain a question mark?
    //
    // That's all I can think of, but there has to be a better way.
}

I want to keep all //TRANSLIT characters, including the few in the example above, e.g. áéïõú = aeiou. There is no other nuance to this question. I think it boils down to a string comparison and replace question.

I’m not necessarily looking for someone to write the entire code, just a pointer in the right direction of how you’d tackle this.

2 Answers


  1. Chosen as BEST ANSWER

    This works for me, although I'm sure there are better solutions that people can come up with.

    $str = 'Thê qúïck 😈 brõwn fõx júmps?😈 Óvér thé lázy dõg?😈';
    $out = 'The quick ? brown fox jumps?? Over the lazy dog??';
    

    Output

    var_dump(strip_iconv_question_marks($str, $out));
    
    // string(49) "The quick   brown fox jumps?  Over the lazy dog? "
    

    Function

    /**
     * strip_iconv_question_marks - Remove question marks left behind by iconv()
     * after translating UTF-8 strings to ASCII strings
     *
     * @param string $str_utf8
     * @param string $str_ascii
     *
     * @return string
     */
    
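    function strip_iconv_question_marks($str_utf8, $str_ascii) {
        // Split both strings into arrays of characters (code points)
        $arr_utf8 = mb_str_split($str_utf8);
        $arr_ascii = mb_str_split($str_ascii);
    
        // Assumes a one-to-one character mapping. //TRANSLIT can expand one
        // character into several (e.g. "ß" to "ss"), which shifts positions.
        $count = min(count($arr_utf8), count($arr_ascii));
    
        for ($i = 0; $i < $count; $i++) {
            // A "?" that was not in the original was inserted by iconv()
            if ($arr_ascii[$i] === '?' && $arr_utf8[$i] !== '?') {
                $arr_ascii[$i] = ' '; // Prefer blank space over removal
            }
        }
        return implode($arr_ascii);
    }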
    

    For PHP < 7.4.0

    if (!function_exists('mb_str_split')) {
        function mb_str_split($str, $len = 1) {
            $arr = [];
            $cnt = mb_strlen($str, 'UTF-8');
    
            // Advance by $len so chunks do not overlap
            for ($i = 0; $i < $cnt; $i += $len) {
                $arr[] = mb_substr($str, $i, $len, 'UTF-8');
            }
            return $arr;
        }
    }
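
    The positional comparison above relies on the UTF-8 and ASCII strings having the same number of characters. As a rough sketch of one way around that, each character can be converted on its own and dropped when the result is only the "?" placeholder. The name translit_without_placeholders() and the optional $convert parameter are hypothetical, introduced here for illustration; the default converter assumes an iconv build that either returns "?" or fails for characters it cannot map.

    ```php
    /**
     * Hypothetical helper: transliterate one character at a time and drop
     * anything the converter cannot map, so positions never have to line up.
     *
     * @param string        $str_utf8 Valid UTF-8 input
     * @param callable|null $convert  Per-character converter; defaults to iconv()
     *
     * @return string
     */
    function translit_without_placeholders($str_utf8, $convert = null) {
        if ($convert === null) {
            $convert = function ($char) {
                // @ silences the notice some iconv builds raise for unmappable input
                return @iconv('UTF-8', 'ASCII//TRANSLIT', $char);
            };
        }

        $out = '';

        // Split into code points without requiring mb_str_split()
        foreach (preg_split('//u', $str_utf8, -1, PREG_SPLIT_NO_EMPTY) as $char) {
            if (strlen($char) === 1) {
                $out .= $char; // Already ASCII, so genuine question marks are kept
                continue;
            }

            $ascii = $convert($char);

            // Skip failures and the "?" placeholder; keep expansions such as "ss"
            if ($ascii !== false && $ascii !== '' && $ascii !== '?') {
                $out .= $ascii;
            }
        }
        return $out;
    }
    ```

    Unlike the function above, this drops unmappable characters instead of replacing them with a space. What counts as mappable still depends on the iconv implementation and the current locale, so results may differ between systems.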
    

  2. Here is a solution based on transliterator_transliterate(), which requires the intl extension:

    $str = transliterator_transliterate('Latin-ASCII', 'Thê qúïck 😈 brõwn fõx júmps?😈 Óvér thé lázy dõg?😈');
    $str = preg_replace('/[\x80-\xFF]/', '', $str);
    echo $str;
    

    Output:

    The quick  brown fox jumps? Over the lazy dog?
    

    Note that the emoji are kept by transliterator_transliterate(), so I used a regex to remove all the remaining non-ASCII characters.
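
    Since the question asks for printable ASCII specifically, the cleanup step could also be written as a whitelist rather than a blacklist. The following is only a sketch: keep_printable_ascii() is a hypothetical name, and it assumes that control characters such as newlines should be removed as well, which may not be what you want.

    ```php
    // Hypothetical helper: keep only printable ASCII (space through tilde).
    // Without the /u modifier the pattern works on bytes, so every byte of a
    // multibyte character such as an emoji is removed.
    function keep_printable_ascii($str) {
        return preg_replace('/[^\x20-\x7E]/', '', $str);
    }

    var_dump(keep_printable_ascii("lazy dog?\u{1F608}\n"));
    // string(9) "lazy dog?"
    ```

    The "\u{1F608}" escape needs PHP 7.0 or later; on older versions the emoji can be pasted into the string literal directly.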
