I want to create a function that labels the location of certain HTML tags (e.g., italics tags) in a string with respect to the locations of characters in a tagless version of the string.
(I intend to use this label data to train a neural network for tag recovery from data that has had the tags stripped out.)
The magic function I want to create is label_italics() in the code below.
$string = 'Disney movies: <i>Aladdin</i>, <i>Beauty and the Beast</i>.';
$string_all_tags_stripped_but_italics = strip_tags($string, '<i>'); // same as $string in this example
$string_all_tags_stripped = strip_tags($string); // 'Disney movies: Aladdin, Beauty and the Beast.'
$featr_string = $string_all_tags_stripped.' '; // Add a single space at the end
$label_string = label_italics($string_all_tags_stripped_but_italics);
echo $featr_string; // 'Disney movies: Aladdin, Beauty and the Beast. '
echo $label_string; // '0000000000000001000000101000000000000000000010'
If a character is supposed to have an <i> or </i> tag immediately preceding it, it is labeled with a 1 in $label_string; otherwise, it is labeled with a 0. (I’m thinking I don’t need to worry about the difference between <i> and </i> because the recoverer will simply alternate between <i> and </i> so as to maintain well-formed markup, but I’m open to reasons as to why I’m wrong about this.)
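To make the alternation idea concrete, here is a minimal sketch of what the recovery step might look like. recover_italics() is a hypothetical name, not part of any library, and the sketch assumes valid UTF-8 input and a label string aligned character-for-character with the text.

```php
<?php
// Hypothetical helper: re-insert tags by alternating <i> and </i> at every 1.
function recover_italics(string $text, string $labels): string
{
    $out = '';
    $open = false; // whether an <i> is currently open
    // Split into code points so multibyte characters count as one position.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($chars as $k => $ch) {
        if (($labels[$k] ?? '0') === '1') {
            $out .= $open ? '</i>' : '<i>';
            $open = !$open;
        }
        $out .= $ch;
    }
    return $out;
}

echo recover_italics(
    'Disney movies: Aladdin, Beauty and the Beast. ',
    '0000000000000001000000101000000000000000000010'
);
// Disney movies: <i>Aladdin</i>, <i>Beauty and the Beast</i>. (plus the trailing space)
```

One case the single bit cannot represent is an empty pair such as <i></i>: both tags would fall on the same character, so the alternation would emit only an opening tag there.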
I’m just not sure what the best way to create label_italics() is.
I wrote this function that seems to work in most cases, but it seems a little clunky, and I’m posting here in hopes that there is a better way. (If this turns out to be the best way, the function below would be easily generalizable to any HTML tag passed in as a second argument, and could be renamed label_tag().)
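function label_italics($stripped) {
    $output = ''; // initialize so the .= below never appends to an undefined variable
    // Compare each stripos() result with false explicitly: a tag at offset 0 is a valid match.
    while (stripos($stripped, '<i>') !== false || stripos($stripped, '</i>') !== false) {
        $position = stripos($stripped, '<i>');
        if (is_numeric($position)) {
            // One 0 per character before the tag, then a 1 for the character after it.
            for ($c = 0; $c < $position; $c++) {
                $output .= '0';
            }
            $output .= '1';
        }
        // Skip the 3-character tag plus the character that was just labeled.
        $stripped = substr($stripped, $position + 4);
        $position = stripos($stripped, '</i>');
        if (is_numeric($position)) {
            for ($c = 0; $c < $position; $c++) {
                $output .= '0';
            }
            $output .= '1';
        }
        // Skip the 4-character tag plus the character that was just labeled.
        $stripped = substr($stripped, $position + 5);
    }
    // <= rather than < emits one extra 0, matching the space appended to $featr_string.
    for ($c = 0; $c <= strlen($stripped); $c++) {
        $output .= '0';
    }
    return $output;
}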
The function produces bad output if the tags are surplus or the markup is badly formed in the input. For example, for the following input:
$string = 'Disney movies: <i><i>Aladdin</i>, <i>Beauty and the Beast</i>.';
The following misaligned output is given.
Disney movies: Aladdin, Beauty and the Beast.
0000000000000001000000000101000000000000000000010
(I’m also open to reasons why I’m going about the creation of the label data all wrong.)
2 Answers
After some additional experimentation, this is what I arrived at:
$label_string = mb_ereg_replace('#0', '1', mb_ereg_replace('(#)1+0', '1', mb_ereg_replace('/', '0', mb_ereg_replace('i', '0', mb_ereg_replace('</i>', '#', mb_ereg_replace('<i>', '#', mb_ereg_replace('[^</i>]', '0', mb_strtolower($featr_string))))))));
I couldn't get @KIKO Software's preg_replace()-based solution to work with multibyte strings, so I switched to this slightly ungainly but working mb_ereg_replace()-based solution instead.
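For comparison, a single pass over the tagged string avoids the marker character altogether and also tolerates repeated tags. The following is only a sketch under some assumptions: label_tag_single_pass() is a made-up name, the input is assumed to be valid UTF-8 (the u modifier makes PCRE treat it as such), and the caller is assumed to have appended the trailing space to the tagged string already.

```php
<?php
// Hypothetical single-pass labeler; names are illustrative only.
function label_tag_single_pass(string $tagged, string $tag = 'i'): string
{
    $tagPattern = '</?' . preg_quote($tag, '~') . '>';
    // Split into text runs and tags, keeping the tags as separate pieces.
    $pieces = preg_split(
        '~(' . $tagPattern . ')~iu',
        $tagged,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    $labels = '';
    $pending = false; // true once a tag has been seen but not yet attached
    foreach ($pieces as $piece) {
        if (preg_match('~^' . $tagPattern . '$~iu', $piece)) {
            $pending = true; // repeated tags collapse into a single flag
            continue;
        }
        $length = preg_match_all('~.~su', $piece); // number of code points
        $labels .= ($pending ? '1' : '0') . str_repeat('0', $length - 1);
        $pending = false;
    }
    return $labels;
}
```

With this approach the surplus <i><i> from the question yields the same labels as a single <i>, so the output stays aligned with the stripped text. A tag at the very end of the input with no character after it is silently dropped, which is one reason to append the space first.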
I think I’ve got something. How about this:
see: https://3v4l.org/cKG46
Note that you need to supply the string with the tags in it.
How does it work?
I use preg_replace() because it can use regular expressions, which I need once. This function goes through the two arrays and executes each replacement in order. First it replaces all occurrences of <i> and </i> by # and anything else by 0. Then it replaces ##0 by 2 and #0 by 1. The 2 is extra, to be able to replace <i></i>. You can remove it, and simplify the function, if you don’t need it.
The use of the # is arbitrary. You should use anything that doesn’t clash with the content of your string.
Here’s an updated version. It copes with tags at the end of the line and it ignores any # characters in the line.
See: https://3v4l.org/BTnLc
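The steps described above can be sketched roughly as follows. This is reconstructed from the prose only and is not the code behind the 3v4l links, so the function name label_tags_sketch() and the exact patterns are assumptions. The u modifier is also an addition here, on the assumption that the input is valid UTF-8. Unlike the updated version mentioned above, this sketch does not handle # characters in the content.

```php
<?php
// Sketch of the described replacement chain; not the linked implementation.
function label_tags_sketch(string $tagged): string
{
    $patterns = [
        '~</?i>~iu', // 1. each <i> or </i> becomes the marker #
        '~[^#]~u',   // 2. every other character becomes 0
        '~##0~',     // 3. two adjacent tags before a character become 2
        '~#0~',      // 4. a single tag before a character becomes 1
    ];
    $replacements = ['#', '0', '2', '1'];
    // With array arguments, preg_replace() applies each pair in order
    // to the result of the previous one.
    return preg_replace($patterns, $replacements, $tagged);
}

echo label_tags_sketch('Disney movies: <i>Aladdin</i>, <i>Beauty and the Beast</i>. ');
// 0000000000000001000000101000000000000000000010
```

The trailing space in the input gives a closing tag at the end of the text a character to attach to.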