
I am trying to remove some strange characters that appear in several files; the contents of these files have been read into a PHP string.

I have tried using preg_replace to remove the strange printing characters, but haven’t had much success.

The strange part is that the regex I used with preg_replace does work when I test it in a web-based regex tester, so I am confused as to why it doesn’t work when I use the same regex in my PHP file.

The input data is just over 2000 lines. Below is a snippet of it showing the þ, which is what I want to remove along with the $NoCode line.

$800C5304 0063
$800C5306 0063
$800C5308 0063
$800C530A 0063
$800C530C 0063
$800C530E 0063
$800C5310 0063
$800C5312 0063
$800C5314 0063
$800C5316 0063
$800C5318 0063
$800C531A 0063
$800C531C 0063
þ
$NoCode

This is the regex I have tried with preg_replace

$fileData = preg_replace("/\$([A-F0-9]+) ([A-F0-9]+)\n(.+)\n\$NoCode/", "'$$1 $2'", $fileData);

According to the link below, the þ seems to be, or at least to be part of, a UTF-16 byte order mark.

Remove ÿþ from string

When I run iconv(mb_detect_encoding($fileData), 'UTF-8', $fileData); I get:

iconv(): Detected an illegal character in input string.

If I do iconv('UTF-16', 'UTF-8', $fileData) instead I get:

iconv(): Detected an incomplete multibyte character in input
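
A quick way to check what the þ really is, before choosing an encoding to convert from, is to look at the raw bytes. The sketch below is only an illustration: the sample string, and the assumption that the stray character is a single 0xFE byte (which a Latin-1 viewer displays as þ), are hypothetical and not taken from the real files.

```php
<?php
// Hypothetical sample: a valid line, then a lone 0xFE byte, then $NoCode.
// The real file contents may differ; this is an assumption for illustration.
$fileData = "\$800C531C 0063\n\xFE\n\$NoCode\n";

// mb_check_encoding() reports whether the whole string is valid UTF-8.
$isValidUtf8 = mb_check_encoding($fileData, 'UTF-8');

// Collect the hex value of every byte outside the 7-bit ASCII range.
$nonAscii = [];
foreach (str_split($fileData) as $byte) {
    if (ord($byte) > 0x7F) {
        $nonAscii[] = bin2hex($byte);
    }
}

var_dump($isValidUtf8); // bool(false): 0xFE can never appear in valid UTF-8
print_r($nonAscii);     // a single entry, "fe"
```

If the dump shows `ff fe` at the very start of the file, the data is probably UTF-16 with a byte order mark; a lone `fe` in the middle of otherwise ASCII text is more likely a stray byte.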

2 Answers


  1. Chosen as BEST ANSWER

    So it seems the þ was an incomplete multibyte character. I fixed this using the command below, which cleans up the incomplete multibyte sequences.

    $fileData = mb_convert_encoding($fileData, 'UTF-8', 'UTF-8');

    This left a ? where the þ originally was, which I then removed using the following.

    $fileData = str_replace("\n?\n\$NoCode", '', $fileData);
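
The two steps above can be tried end to end on a small sample. This is a minimal sketch, assuming the stray character is a single invalid 0xFE byte and the lines end in `\n`; the sample data is hypothetical. The substitute character is set explicitly so that the result does not depend on the mbstring settings in php.ini.

```php
<?php
// Hypothetical sample data; the lone 0xFE byte stands in for the "þ".
$fileData = "\$800C531A 0063\n\$800C531C 0063\n\xFE\n\$NoCode\n";

// Use '?' (0x3F) as the replacement for invalid bytes. This is normally
// the default, but setting it here keeps the sketch deterministic.
mb_substitute_character(0x3F);

// Step 1: converting UTF-8 to UTF-8 replaces each invalid byte sequence
// with the substitute character.
$fileData = mb_convert_encoding($fileData, 'UTF-8', 'UTF-8');

// Step 2: drop the placeholder line together with the $NoCode line.
// The backslash before $ stops PHP from interpolating a variable.
$fileData = str_replace("\n?\n\$NoCode", '', $fileData);

echo $fileData;
```

If the files use `\r\n` line endings, the search string in step 2 would need `\r\n` in place of each `\n`.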


  2. str_replace should be faster than preg_replace
    Here is an example:

    $input = file_get_contents('input.txt');
    $output = str_replace(['þ','$NoCode'], '', $input);
    file_put_contents('output.txt', $output);
    

    Or if you want to get rid of the empty lines too:

    $input = file_get_contents('input.txt');
    $output = str_replace(["þ\r\n", "\$NoCode\r\n", "þ\n", "\$NoCode\n", "þ\r", "\$NoCode\r"], '', $input);
    file_put_contents('output.txt', $output);
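
One caveat with matching a literal þ: if the PHP source file is saved as UTF-8, the þ in the script is the two bytes C3 BE, which will not match a lone FE byte in the data. A minimal sketch of a byte-level variant follows, assuming the stray character really is a single 0xFE byte; the sample input and the variable names are hypothetical.

```php
<?php
// Hypothetical sample with Windows line endings and a raw 0xFE byte.
$input = "\$800C531C 0063\r\n\xFE\r\n\$NoCode\r\n";

// A "þ" typed into a UTF-8 source file is "\xC3\xBE", which is not in the data,
// so this replacement leaves the input unchanged.
$unchanged = str_replace("\xC3\xBE", '', $input);

// Build the search list from the raw byte and the $NoCode marker, each
// followed by a line ending. "\r\n" comes before "\r" so it is removed whole.
$search = [];
foreach (["\xFE", '$NoCode'] as $token) {
    foreach (["\r\n", "\n", "\r"] as $eol) {
        $search[] = $token . $eol;
    }
}

$output = str_replace($search, '', $input);

echo bin2hex($output), PHP_EOL;
```

Writing the byte as `"\xFE"` keeps the script independent of the encoding the source file happens to be saved in.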
    