I am trying to remove some strange printing characters that appear in several files. The contents of these files have been pulled into a PHP string.
I have tried using preg_replace to remove the strange printing characters, but haven't had much success.
The strange part is that the regex I used with preg_replace does seem to work when I test it in a web-based regex tester, so I am confused as to why it doesn't work when I use the same regex in my PHP file.
The input data is just over 2000 lines. Below is a snippet of the input data showing the þ, which is what I want to remove along with $NoCode:
$800C5304 0063
$800C5306 0063
$800C5308 0063
$800C530A 0063
$800C530C 0063
$800C530E 0063
$800C5310 0063
$800C5312 0063
$800C5314 0063
$800C5316 0063
$800C5318 0063
$800C531A 0063
$800C531C 0063
þ
$NoCode
This is the regex I have tried with preg_replace:
$fileData = preg_replace("/\$([A-F0-9]+) ([A-F0-9]+)\n(.+)\n\$NoCode/", "'$$1 $2'", $fileData);
From the link below, the þ seems to be a byte order mark in UTF-16, or at least part of one.
When I run iconv(mb_detect_encoding($fileData), 'UTF-8', $fileData); I get:
iconv(): Detected an illegal character in input string
If I do iconv('UTF-16', 'UTF-8', $fileData) instead, I get:
iconv(): Detected an incomplete multibyte character in input
2 Answers
So it seems the þ was an incomplete multibyte character. I fixed this using the command below, which replaces incomplete multibyte sequences:
$fileData = mb_convert_encoding($fileData, 'UTF-8', 'UTF-8');
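As a rough illustration of what that call does, the sketch below checks a string before and after the conversion. The sample data is made up, and it assumes the þ is stored in the file as a single 0xFE byte, which may not match the real files.

```php
<?php
// Hypothetical sample: one data line followed by a stray 0xFE byte
// (assumed here to be what shows up as "þ").
$fileData = "\$800C531C 0063\n\xFE\n\$NoCode\n";

// A lone 0xFE byte is never valid in UTF-8, so this check fails.
$validBefore = mb_check_encoding($fileData, 'UTF-8');

// Converting UTF-8 to UTF-8 swaps invalid sequences for the mbstring
// substitute character, which is "?" unless it has been changed.
$fixed = mb_convert_encoding($fileData, 'UTF-8', 'UTF-8');
$validAfter = mb_check_encoding($fixed, 'UTF-8');

var_dump($validBefore, $validAfter);
```

The substitute character can be inspected or changed with mb_substitute_character() if a "?" placeholder is not wanted.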
This left a ? where the þ originally was. I then removed it using the following:
$fileData = str_replace("\n?\n\$NoCode", '', $fileData);
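One thing to watch with the search string: inside double quotes PHP would try to interpolate $NoCode as a variable, so the dollar sign needs escaping or the marker needs single quotes. A minimal sketch on made-up sample data, assuming the lines end in "\n":

```php
<?php
// Made-up sample of the string after the conversion has left a "?" behind.
$fileData = "\$800C531C 0063\n?\n\$NoCode";

// "\n" needs double quotes to become a newline; '$NoCode' is single-quoted
// so PHP does not treat it as a variable.
$needle = "\n?\n" . '$NoCode';

$fileData = str_replace($needle, '', $fileData);

echo $fileData, PHP_EOL;
```

If the files use Windows line endings, the needle would need "\r\n" instead.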
str_replace should be faster than preg_replace
Here is an example:
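A minimal sketch of the str_replace approach, not the original snippet. It assumes the stray character is the single byte 0xFE and uses made-up sample data.

```php
<?php
// Hypothetical sample: a data line, the stray byte, then the marker.
$fileData = "\$800C531C 0063\n\xFE\n\$NoCode\n";

// str_replace accepts an array of needles, so the stray byte and the
// marker can be removed in one call. The newlines around them stay.
$fileData = str_replace(["\xFE", '$NoCode'], '', $fileData);

echo $fileData;
```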
Or if you want to get rid of empty lines too:
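One possible way to do that, sketched under the assumptions that the stray character is the single byte 0xFE, that lines end in "\n", and that blank lines carry no meaning in the data:

```php
<?php
// Hypothetical sample: two data lines, the stray byte, then the marker.
$fileData = "\$800C531A 0063\n\$800C531C 0063\n\xFE\n\$NoCode\n";

// Remove the stray byte and the marker in one call.
$fileData = str_replace(["\xFE", '$NoCode'], '', $fileData);

// Split into lines, keep only the non-empty ones, and join them again.
$lines = array_filter(explode("\n", $fileData), function ($line) {
    return trim($line) !== '';
});
$fileData = implode("\n", $lines);

echo $fileData, PHP_EOL;
```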