Removing printing characters from a string using PHP

AeroMaxx
October 15, 2022
277 views
0 votes
2 Answers

I am trying to remove some strange printing characters that appear in several files. The contents of these files have been pulled into a PHP string.

I have tried using preg_replace to remove the strange printing characters, but haven’t had much success.

The strange part is that the regex I used with preg_replace does seem to work when I test it in a web-based regex tester, so I am confused as to why it doesn’t work when I use the same regex in my PHP file.

The input data is just over 2000 lines. Below is a snippet of it showing the þ, which is what I want to remove along with the $NoCode line:

$800C5304 0063
$800C5306 0063
$800C5308 0063
$800C530A 0063
$800C530C 0063
$800C530E 0063
$800C5310 0063
$800C5312 0063
$800C5314 0063
$800C5316 0063
$800C5318 0063
$800C531A 0063
$800C531C 0063
þ
$NoCode

This is the regex I have tried with preg_replace

$fileData = preg_replace("/\$([A-F0-9]+) ([A-F0-9]+)\n(.+)\n\$NoCode/", "'\$$1 $2'", $fileData);

From the link below, the þ seems to be, or at least to be part of, a byte order mark in UTF-16.

Remove ÿþ from string

When I run iconv(mb_detect_encoding($fileData), 'UTF-8', $fileData); I get:

iconv(): Detected an illegal character in input string.

If I do iconv('UTF-16', 'UTF-8', $fileData) instead I get:

iconv(): Detected an incomplete multibyte character in input

Answers

Chosen as BEST ANSWER
- AeroMaxx
- October 15, 2022 at 9:51 am
- 0 votes
So it seems the þ was an incomplete multibyte string. I fixed this using the command below to remove the incomplete multibyte strings.

$fileData = mb_convert_encoding($fileData, 'UTF-8', 'UTF-8');

This left a ? where the þ originally was, which I then removed using the following.

$fileData = str_replace("\n?\n\$NoCode", '', $fileData);
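The two steps above can be combined into one small helper. This is only a minimal sketch, not a definitive fix: it assumes the mbstring extension is loaded, that the stray character is the single byte 0xFE (þ in ISO-8859-1), and that the default mbstring substitute character "?" is in effect. The function name `stripNoCodeBlock` and the sample data are hypothetical.

```php
<?php
// Hypothetical helper: scrub invalid UTF-8, then drop the "?" + $NoCode trailer.
function stripNoCodeBlock(string $fileData): string
{
    // Re-encoding UTF-8 to UTF-8 replaces invalid byte sequences with the
    // mbstring substitute character, which is "?" unless it has been changed.
    $clean = mb_convert_encoding($fileData, 'UTF-8', 'UTF-8');

    // Single quotes keep $NoCode literal; \R matches \n as well as \r\n.
    // Fall back to the scrubbed string if preg_replace reports an error (null).
    return preg_replace('/\R\?\R\$NoCode/', '', $clean) ?? $clean;
}

// Assumed sample: 0xFE is "þ" in ISO-8859-1 and is not valid UTF-8 on its own.
$sample = "\$800C531C 0063\n\xFE\n\$NoCode\n";
$result = stripNoCodeBlock($sample);
```

Using a single-quoted pattern avoids PHP treating `$NoCode` as a variable, and `\R` means the same call should work whether the files use Unix or Windows line endings.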


- KazimierzNiewielki
- October 15, 2022 at 9:31 am
- 0 votes
str_replace should be faster than preg_replace.
Here is an example:
```
$input = file_get_contents('input.txt');
$output = str_replace(['þ','$NoCode'], '', $input);
file_put_contents('output.txt', $output);
```
Or, if you want to get rid of the empty lines too:
```
$input = file_get_contents('input.txt');
$output = str_replace(["þ\r\n", "\$NoCode\r\n", "þ\n", "\$NoCode\n", "þ\r", "\$NoCode\r"], '', $input);
file_put_contents('output.txt', $output);
```
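One caveat with the snippets above: a literal þ typed into a UTF-8 PHP source file is the two bytes C3 BE, so it will not match a lone 0xFE byte in the data. The following is a minimal sketch that targets the raw byte instead. It assumes the file holds a single 0xFE byte on its own line, which the question does not confirm; `removeMarkerLines` and the sample input are hypothetical.

```php
<?php
// Hypothetical helper: delete lines consisting only of byte 0xFE or "$NoCode".
function removeMarkerLines(string $input): string
{
    // No /u flag, so the pattern works on raw bytes and tolerates invalid UTF-8.
    // /m lets ^ match at the start of every line; (?:\R|\z) consumes the line
    // ending (\n or \r\n) or stops at the end of the string.
    $pattern = '/^(?:\xFE|\$NoCode)(?:\R|\z)/m';

    // Fall back to the untouched input if preg_replace reports an error (null).
    return preg_replace($pattern, '', $input) ?? $input;
}

// Assumed sample with Windows line endings.
$input  = "\$800C531A 0063\r\n\$800C531C 0063\r\n\xFE\r\n\$NoCode\r\n";
$output = removeMarkerLines($input);
```

Checking the actual bytes first, for example with `bin2hex()` on a short slice of the string, would confirm whether 0xFE is really what the file contains before relying on this pattern.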