we are using Laravel and a package called maatwebsite/excel for exporting data via XLS for our clients
In a recent issue we faced, the XLS download was broken, most of the data was dissapearing. After debugging closely, we found that 1 one of the data points had this value in it – "England 🏴"
Now we already have a piece of code which stips of any emojis and replaces them with ??
instead so we don’t face this problem of exporting broken XLS docs.
But in this case we found something weird. Even when our replaceEmojis
method ran on this string, it left invisible characters behind. This is our replaceEmojis
code:
if (!function_exists('replaceEmojis')) {
function replaceEmojis($string, $replaceWith = '??') {
// Define a regular expression pattern to match emojis
$pattern = '/[x{1F600}-x{1F64F}x{1F300}-x{1F5FF}x{1F680}-x{1F6FF}x{1F700}-x{1F77F}x{1F780}-x{1F7FF}x{1F800}-x{1F8FF}x{1F900}-x{1F9FF}x{1FA00}-x{1FA6F}x{1FA70}-x{1FAFF}x{1FAB0}-x{1FABF}x{1FAC0}-x{1FAFF}x{1FAD0}-x{1FAD9}x{1FAD0}-x{1FAD9}x{1F300}-x{1F5FF}x{1F004}-x{1F0CF}x{1F170}-x{1F251}x{200D}]+/u';
// Use preg_replace_callback to replace emojis with "??"
$result = preg_replace_callback($pattern, function ($match) use ($replaceWith) {
return $replaceWith;
}, $string);
return $result;
}
}
And after running that string, we get this back
England ??
But when we paste this in https://www.soscisurvey.de/tools/view-chars.php and view the result, this is what we see:
We have even tried with a different method we found here -> https://stackoverflow.com/a/68155491/2730064
But still we face a similar problem. And as soon as we remove that 🏴 from the data, XLS works fine. So we are sure this is what is causing the issue after all. Any ideas on how to fix this? How can we remove those invisible characters from the string? So it doesn’t happen for other emojis in the future?
2
Answers
Those characters are part of the character sequence used to represent the flag of England emoji. Most flag emoji are represented using pairs of characters from the range U+1F1E6..U+1F1FF, but some flag emoji use tag sequences with characters from the ranges U+E0030..U+E0039, U+E0061..U+E007A and ending with U+E007F. See https://www.unicode.org/reports/tr51/#def_emoji_flag_sequence and https://www.unicode.org/reports/tr51/#flag-emoji-tag-sequences for info about the syntax, and see the RGI_Emoji_Flag_Sequence and RGI_Emoji_Tag_Sequence sections in https://www.unicode.org/Public/emoji/latest/emoji-sequences.txt for lists of character sequences for flag emoji that are commonly used.
Your regex should simply strip (or replace with ?) the ranges U+E0030..U+E0039, U+E0061..U+E007A and U+E007F.
The problem is that you are thinking in code points and not in glyphs. A glyph can be composed with several code points, for example:
A chance that pcre has a feature to match a glyph:
X
So you can rewrite your pattern like that:
For each position that matches your character class,
X
will consume the code points until the end of the glyph.(Note that I used the character class from your question, I only joined ranges when it was possible and I do not pretend that this character class is the good one to remove all the emojis of the universe.)