
I’m building a CLI app in PHP that has a method to output text:

$out->line('Morbi leo risus, porta ac consectetur ac, vestibulum at eros. Aenean lacinia bibendum nulla sed consectetur. Nullam id dolor id nibh ultricies vehicula ut id elit. Aenean lacinia bibendum nulla sed consectetur. Curabitur blandit tempus porttitor.');

I’m limiting the line output to 80 characters within line() via:

public function line(string $text): void
{
  $this->rawLine(wordwrap($text, 80, PHP_EOL));
}

This prints the output across multiple lines:

Morbi leo risus, porta ac consectetur ac, vestibulum at eros. Aenean lacinia
bibendum nulla sed consectetur. Nullam id dolor id nibh ultricies vehicula ut id
elit. Aenean lacinia bibendum nulla sed consectetur. Curabitur blandit tempus
porttitor.

Now, I can also style parts of the text using ANSI escape codes:

$out->line('Morbi leo risus, ' . Style::inline('porta ac consectetur', ['color' => 'blue', 'attribute' => 'bold']) . ' ac, vestibulum at eros. Aenean lacinia bibendum nulla sed consectetur. Nullam id dolor id nibh ultricies vehicula ut id elit. Aenean lacinia bibendum nulla sed consectetur. Curabitur blandit tempus porttitor.');

Which gets converted to this:

Morbi leo risus, \x1b[34;1mporta ac consectetur\x1b[39;22m ac, vestibulum at
eros. Aenean lacinia bibendum nulla sed consectetur. Nullam id dolor id nibh
ultricies vehicula ut id elit. Aenean lacinia bibendum nulla sed consectetur.
Curabitur blandit tempus porttitor.

And when passed to line(), printed out like this:

Morbi leo risus, porta ac consectetur ac, vestibulum at eros.
Aenean lacinia bibendum nulla sed consectetur. Nullam id dolor id nibh ultricies
vehicula ut id elit. Aenean lacinia bibendum nulla sed consectetur. Curabitur
blandit tempus porttitor.

Here "porta ac consectetur" is blue and bold, but notice that the first line is shorter than before and breaks at a different place.

Even though the escape codes are non-printing characters, wordwrap() (and strlen()) still counts them toward the length, so the calculated width is larger than the displayed width.

The first line is originally 76 characters without ANSI escape codes:

Morbi leo risus, porta ac consectetur ac, vestibulum at eros. Aenean lacinia

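But after adding styles, it comes back as 97 characters, counting the escape sequences as they are written here (91 bytes if each \x1b is a single ESC byte):

Morbi leo risus, \x1b[34;1mporta ac consectetur\x1b[39;22m ac, vestibulum at eros. Aenean lacinia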
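
A quick way to see the mismatch is to compare the byte length with the length after the escape sequences are removed. This is only a minimal sketch: visibleLength() is a hypothetical helper name, it assumes the string holds real ESC bytes, and it only handles SGR sequences (ESC [ ... m).

```php
<?php
// Hypothetical helper: length of the text as the terminal displays it.
// Uses strlen(), so it counts bytes; multibyte text would need mb_strlen().
function visibleLength(string $text): int
{
    // Remove SGR sequences (ESC [ ... m), the kind used for colors and attributes.
    return strlen(preg_replace('/\x1b\[[0-9;]*m/', '', $text) ?? $text);
}

$plain  = 'Morbi leo risus, porta ac consectetur ac, vestibulum at eros. Aenean lacinia';
$styled = "Morbi leo risus, \x1b[34;1mporta ac consectetur\x1b[39;22m ac, vestibulum at eros. Aenean lacinia";

echo strlen($plain), PHP_EOL;         // 76
echo strlen($styled), PHP_EOL;        // 91: 7 bytes for ESC[34;1m plus 8 for ESC[39;22m
echo visibleLength($styled), PHP_EOL; // 76
```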

In other parts of the app, like a table, I "solved" this by having a method to set the column value and then a separate method to style said column. That way, I can reliably get the length, but also output the text in the defined style.

I could pass both an unstyled version and a styled version of the text, but that doesn’t feel right. Nor does it solve the problem of then splitting the styled version accurately.

To solve the issue with line(), I thought about stripping out the ANSI escape codes to get the actual length, adding the PHP_EOL breaks where needed, and then injecting the styles back in. That doesn’t feel like the right solution, though, and it seems complicated. How would I even go about doing that?

So my question is: How can I reliably split text containing ANSI escape codes based on text length?

2 Answers


  1. Chosen as BEST ANSWER

    This is the input:

    $styledText = "Morbi leo risus, \x1b[34;1mporta ac consectetur\x1b[39;22m ac, vestibulum at eros. Aenean lacinia bibendum nulla sed consectetur. Nullam id dolor id nibh ultricies vehicula ut id elit. Aenean lacinia bibendum nulla sed consectetur. Curabitur blandit tempus porttitor.";
    

    The following method strips out escape codes from styled text and saves a copy as clean text.

    The clean text is used to add line breaks using wordwrap based on desired column width.

    It loops over styled text and injects a line break after every word in which PHP added a line break in clean text.

    function wrap(string $styledText): string {
    
      // Strip ANSI escape codes from $styledText
      $cleanText = preg_replace('/\x1b\[[0-9;]+m/', '', $styledText);
    
      // Add PHP_EOL to ensure $cleanText does not exceed line width.
      // The space appended to the break string keeps the word count unchanged
      // when the wrapped text is split on spaces below.
      $cleanWrappedText = wordwrap($cleanText, 80, PHP_EOL . ' ');
    
      // Split $styledText and $cleanWrappedText on each space
      $styledTextArray = explode(' ', $styledText);
      $cleanTextArray = explode(' ', $cleanWrappedText);
    
      // $fusedText will comprise $styledText w/ line breaks from $cleanWrappedText
      $fusedText = '';
    
      // Loop over each segment (likely a word)
      foreach ($styledTextArray as $index => $segment) {
    
        // Append word (with ANSI escape codes)
        $fusedText .= $segment;
    
        // If word has line break in clean version then add line break
        // (str_ends_with() requires PHP 8.0 or later)
        if (str_ends_with($cleanTextArray[$index] ?? '', PHP_EOL)) {
            $fusedText .= PHP_EOL;
            continue;
        }
    
        // If word does not have line break in clean version,
        // but there is another word coming, then add space between words
        if (isset($cleanTextArray[$index + 1])) {
            $fusedText .= ' ';
        }
      }
    
      return $fusedText;
    }
    

    Note that this can't easily be tested on the web, since the escape codes only style text appropriately when used via a CLI.
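
    A different way to get the same breaks, shown here only as a minimal sketch and not as part of the answer above, is to build the lines word by word and measure each word with its escape codes removed. The name wrapStyled() is hypothetical; the sketch assumes real ESC bytes, single-byte text, and words separated by single spaces.

    ```php
    <?php
    // Hypothetical alternative: wrap on spaces using the visible width of each word.
    function wrapStyled(string $styledText, int $width = 80): string
    {
        $lines = [];
        $current = null;     // styled text of the line being built
        $currentWidth = 0;   // visible width of that line

        foreach (explode(' ', $styledText) as $word) {
            // Width of the word as displayed, i.e. without SGR sequences.
            $wordWidth = strlen(preg_replace('/\x1b\[[0-9;]*m/', '', $word) ?? $word);

            if ($current === null) {
                // First word of the text always starts the first line.
                $current = $word;
                $currentWidth = $wordWidth;
            } elseif ($currentWidth + 1 + $wordWidth > $width) {
                // Word does not fit: close the line and start a new one.
                $lines[] = $current;
                $current = $word;
                $currentWidth = $wordWidth;
            } else {
                // Word fits: append it after a single space.
                $current .= ' ' . $word;
                $currentWidth += 1 + $wordWidth;
            }
        }
        $lines[] = $current ?? '';

        return implode(PHP_EOL, $lines);
    }

    $styledText = "Morbi leo risus, \x1b[34;1mporta ac consectetur\x1b[39;22m ac, vestibulum at eros. Aenean lacinia bibendum nulla sed consectetur. Nullam id dolor id nibh ultricies vehicula ut id elit. Aenean lacinia bibendum nulla sed consectetur. Curabitur blandit tempus porttitor.";

    echo wrapStyled($styledText), PHP_EOL;
    ```

    Like wordwrap() without its fourth argument, this never splits a word, so a single word longer than the width still overflows. A styled span that straddles a break also stays active across the newline.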


  2. Based on an approach I’ve used to truncate text in another answer (Truncate a multibyte String to n chars), counting the length of segments just needs to ignore the ANSI sequences while counting characters.

    To have clean breaks in the text, the snippet below will only replace spaces with newlines (it is not designed to break on hyphens).

    Code: (Demo) (Regex101 Demo)

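    function ansiSafeWrapper(string $string, int $max = 80): string {
        // "\\\\x1b" reaches the regex engine as \\x1b, which matches the literal
        // text \x1b used in the nowdoc demo below. To match a real ESC byte
        // instead, write "\\x1b" in this double-quoted string.
        return preg_replace(
            "~(?=(?:(?:\\\\x1b\[[0-9;]+m)?.){{$max}})(?:(?:\\\\x1b\[[0-9;]+m)?.){0,$max}\K ~u",
            PHP_EOL,
            str_replace(PHP_EOL, ' ', $string)
        );
    }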
    
    $test = <<<'ANSI'
    Morbi leo risus, \x1b[34;1mporta ac consectetur\x1b[39;22m ac, vestibulum at
    eros. Aenean lacinia bibendum nulla sed consectetur. Nullam id dolor id nibh
    ultricies vehicula ut id elit. Aenean lacinia bibendum nulla sed consectetur.
    Curabitur blandit tempus porttitor.
    ANSI;
    
    echo ansiSafeWrapper($test);
    

    Effectively, the script replaces all newlines with spaces, then injects new newlines where appropriate. It returns the following (I’ve added the character counts at the end of each line for clarity):

    Morbi leo risus, \x1b[34;1mporta ac consectetur\x1b[39;22m ac, vestibulum at eros. Aenean lacinia  (97 char)
    bibendum nulla sed consectetur. Nullam id dolor id nibh ultricies vehicula ut id  (80 char)
    elit. Aenean lacinia bibendum nulla sed consectetur. Curabitur blandit tempus  (77 char)
    porttitor. (10 char)
    

    Which will be visually presented without ANSI sequences as:

    Morbi leo risus, porta ac consectetur ac, vestibulum at eros. Aenean lacinia  (76 char)
    bibendum nulla sed consectetur. Nullam id dolor id nibh ultricies vehicula ut id  (80 char)
    elit. Aenean lacinia bibendum nulla sed consectetur. Curabitur blandit tempus  (77 char)
    porttitor. (10 char)
    

    Pattern Breakdown:

    ~                                   #starting pattern delimiter
    (?=                                 #start of lookahead
       (?:(?:\\\\x1b\[[0-9;]+m)?.){80}  #consume a potential whole ANSI code before each single character; match 80 (non-ANSI) characters
    )                                   #end of lookahead
    (?:(?:\\\\x1b\[[0-9;]+m)?.){0,80}   #consume a potential whole ANSI code before each single character; match up to 80 (non-ANSI) characters
    \K                                  #forget any characters matched up to this point, then match a literal space
    ~                                   #ending pattern delimiter
    u                                   #unicode pattern flag for multibyte safety
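
    For input that holds real ESC bytes rather than the literal text \x1b, the same pattern can be written in a single-quoted string so that every backslash reaches PCRE unchanged. The following is a minimal sketch under that assumption; wrapRealEscapes() is a hypothetical name, and sprintf() stands in for the string interpolation used above.

    ```php
    <?php
    // Hypothetical variant of ansiSafeWrapper() for strings containing real ESC bytes.
    function wrapRealEscapes(string $string, int $max = 80): string
    {
        // Single-quoted: \x1b, \[ and \K are passed to PCRE as written;
        // %1$d is replaced with $max by sprintf().
        $pattern = sprintf(
            '~(?=(?:(?:\x1b\[[0-9;]+m)?.){%1$d})(?:(?:\x1b\[[0-9;]+m)?.){0,%1$d}\K ~u',
            $max
        );

        // Flatten existing line breaks, then turn selected spaces into newlines.
        return preg_replace($pattern, PHP_EOL, str_replace(PHP_EOL, ' ', $string));
    }

    $styled = "Morbi leo risus, \x1b[34;1mporta ac consectetur\x1b[39;22m ac, vestibulum at eros. Aenean lacinia bibendum nulla sed consectetur. Nullam id dolor id nibh ultricies vehicula ut id elit. Aenean lacinia bibendum nulla sed consectetur. Curabitur blandit tempus porttitor.";

    // Print the visible width of each wrapped line: 76, 80, 77, 10
    foreach (explode(PHP_EOL, wrapRealEscapes($styled)) as $line) {
        echo strlen(preg_replace('/\x1b\[[0-9;]+m/', '', $line)), PHP_EOL;
    }
    ```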
    