PHP: Splitting a string into an array around words wrapped in tildes & keeping those words

indextwo
February 28, 2023
191 views
2 votes
3 Answers

It’s very late and I think I’ve been staring at this too long to figure it out, but: I have been provided a bunch of raw text where anything within tildes (~) is a title, and everything else is just plain text. However, the text may or may not include newlines; for example:

Title & text on the same line:
~THE BURGER MINI~A tiny little burger patty in a tiny little bun.

Title & text on different lines:

~THE BURGER MAX~
A gigantic hunk of steak in between two toasted baguettes, each stuffed with beef & cheese.

A combination of both:

~THE BURGER ZERO~
No burger, no bun, just air.

~THE BURGER ITALIANO~
A soft mix of ground beef & mozzarella stuffed between two pillowy pieces of pasta.~NOTE~This is basically giant ravioli.

Ultimately the kind of output I’m trying to achieve would be something like:

Array
(
    [0] => Array
        (
            [title] => THE BURGER ZERO
        )

    [1] => Array
        (
            [text] => No burger, no bun, just air.
        )

    [2] => Array
        (
            [title] => THE BURGER ITALIANO
        )

    [3] => Array
        (
            [text] => A soft mix of ground beef & mozzarella stuffed between two pillowy pieces of pasta.
        )

    [4] => Array
        (
            [title] => NOTE
        )

    [5] => Array
        (
            [text] => This is basically giant ravioli.
        )

)

…so I can then differentiate between titles & text, but crucially in the order they appear.

I can split the string on newlines into an array with the following:

$tempArray = preg_split('/\s*\R\s*/', trim($str), -1, PREG_SPLIT_NO_EMPTY);

But after that, I get stuck. Using preg_split on any group within tildes (preg_split('/~(.*?)~/uim', $line);) will give me all of the paragraph text, but loses the titles (as they’re being used for the split). I’ve been banging my head against various forms of preg_match & preg_match_all but all I’m getting is a headache.

Is there a straightforward way to get what I’m after that would work with all of the above examples?

Answers

preg_match_all('/~([^~]+)~\n*([^~\n]+)/', $str, $match);

So, match a tilde, followed by one or more of anything but a tilde, followed by another tilde. Capture what’s between the tildes:

~([^~]+)~

Followed by zero or more newlines:

\n*

Followed by one or more of anything but tildes and newlines. And capture that.

([^~\n]+)

This will give you the titles in $match[1] and the descriptions in $match[2]:

print_r($match[1]);

Array
(
    [0] => THE BURGER ZERO
    [1] => THE BURGER ITALIANO
    [2] => NOTE
)

print_r($match[2]);

Array
(
    [0] => No burger, no bun, just air.
    [1] => A soft mix of ground beef & mozzarella stuffed between two pillowy pieces of pasta.
    [2] => This is basically giant ravioli.
)

Which you might then combine into a single array:

$items = array_combine($match[1], $match[2]);
print_r($items);

Array
(
    [THE BURGER ZERO] => No burger, no bun, just air.
    [THE BURGER ITALIANO] => A soft mix of ground beef & mozzarella stuffed between two pillowy pieces of pasta.
    [NOTE] => This is basically giant ravioli.
)
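
Note that array_combine keys the result by title, so a repeated title (a second NOTE, say) would overwrite the first, and the shape differs from the one asked for in the question. A minimal sketch of getting the interleaved title/text list from the same pattern, assuming each text block sits on a single line as in the examples (the $items name and the trim call are additions for illustration, not part of the answer above):

```php
<?php
// Sample input taken from the question.
$str = "~THE BURGER ZERO~\nNo burger, no bun, just air.\n\n~THE BURGER ITALIANO~\nA soft mix of ground beef & mozzarella stuffed between two pillowy pieces of pasta.~NOTE~This is basically giant ravioli.";

// PREG_SET_ORDER returns one entry per match: [full match, title, text].
preg_match_all('/~([^~]+)~\n*([^~\n]+)/', $str, $matches, PREG_SET_ORDER);

$items = [];
foreach ($matches as [, $title, $text]) {
    // Keep titles and text as separate entries, in the order they appear.
    $items[] = ['title' => $title];
    $items[] = ['text' => trim($text)];
}

print_r($items);
```

Because the second group stops at the first newline, text that spans several lines would be cut short with this pattern.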

<?php
$input = '~THE BURGER ZERO~
No burger, no bun, just air.

~THE BURGER ITALIANO~
A soft mix of ground beef & mozzarella stuffed between two pillowy pieces of pasta.~NOTE~This is basically giant ravioli.';

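// Split on tildes, drop empty pieces, and reindex the array from 0.
$splittedText = array_values(array_filter(explode("~", $input)));

foreach ($splittedText as $key => $value) {
    // A piece made up only of uppercase letters (ignoring spaces) is a title.
    if (ctype_upper(str_replace(' ', '', $value))) {
        $splittedText[$key] = ['title' => $value];
    } else {
        $splittedText[$key] = ['text' => $value];
    }
}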

print_r($splittedText);

This solution does not use any regex.

How it works:

First, explode the whole string on the tilde.
Then remove the empty entries from the array, reindex the keys, and iterate over the array.
Check whether the current value is all capitals (ignoring spaces); if it is, the key is set to "title", otherwise to "text", as in the expected output.

The output is:

Array
(
    [0] => Array
        (
            [title] => THE BURGER ZERO
        )

    [1] => Array
        (
            [text] => 
No burger, no bun, just air.


        )

    [2] => Array
        (
            [title] => THE BURGER ITALIANO
        )

    [3] => Array
        (
            [text] => 
A soft mix of ground beef & mozzarella stuffed between two pillowy pieces of pasta.
        )

    [4] => Array
        (
            [title] => NOTE
        )

    [5] => Array
        (
            [text] => This is basically giant ravioli.
        )

)
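
The text entries above still carry the newlines that surrounded them in the input. A possible variation on the same steps that trims each piece before classifying it is sketched below. It keeps the same assumption that a title contains only uppercase ASCII letters and spaces, so a title with digits or punctuation would be treated as text, and a stray tilde inside the text would still split it. The $pieces and $result names are only illustrative.

```php
<?php
// Sample input taken from the question.
$input = "~THE BURGER ZERO~\nNo burger, no bun, just air.\n\n~THE BURGER ITALIANO~\nA soft mix of ground beef & mozzarella stuffed between two pillowy pieces of pasta.~NOTE~This is basically giant ravioli.";

// Split on tildes, trim every piece, then drop the pieces that are empty.
$pieces = array_filter(array_map('trim', explode('~', $input)), 'strlen');

$result = [];
foreach ($pieces as $piece) {
    // Assumption: a title is uppercase ASCII letters and spaces only.
    $key = ctype_upper(str_replace(' ', '', $piece)) ? 'title' : 'text';
    $result[] = [$key => $piece];
}

print_r($result);
```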

One way is preg_split with the PREG_SPLIT_DELIM_CAPTURE flag, which also returns the captured parts of the delimiter:

$str = <<<TEXT
~THE BURGER ZERO~
No burger, no bun, just air.

~TRICKY TEST~
Meet me ~5pm.

~THE BURGER ITALIANO~
A soft mix of ground beef & mozzarella stuffed between two pillowy pieces of pasta.~NOTE~This is basically giant ravioli.

~THE BURGER MINI~A tiny little burger patty in a tiny little bun.
TEXT;

$pattern = '/ \s* ~ ( [\p{Lu} ]+ ) ~ \s* /ux';

$arr = preg_split($pattern, $str, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

print_r(array_chunk($arr, 2));
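
array_chunk pairs each title with the text that follows it. To get the flat title/text list from the question instead, something like the following sketch might do, assuming the input starts with a title and every title is followed by some text (otherwise the pairs fall out of step). The shortened sample string and the $parts and $items names are only for illustration.

```php
<?php
// Shortened sample, including a lone tilde that must not count as a title.
$str = "~THE BURGER ZERO~\nNo burger, no bun, just air.\n\n~TRICKY TEST~\nMeet me ~5pm.\n\n~NOTE~This is basically giant ravioli.";

// The x flag ignores the spaces in the pattern, except inside [ ].
$pattern = '/ \s* ~ ( [\p{Lu} ]+ ) ~ \s* /ux';
$parts = preg_split($pattern, $str, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

$items = [];
foreach (array_chunk($parts, 2) as $pair) {
    // Assumption: element 0 of each pair is a title, element 1 its text.
    $items[] = ['title' => $pair[0]];
    if (isset($pair[1])) {
        $items[] = ['text' => $pair[1]];
    }
}

print_r($items);
```

Since the delimiter only matches uppercase letters and spaces between two tildes, the lone tilde in "Meet me ~5pm." stays inside the text.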

