
I want to analyze texts for words and/or terms that don't include predefined syllables/letters, but I don't get the result I expect.

I tried many things; my closest attempt is shown below.
In the sentence "He was eating the cake se in hell." (some of the tokens are not real words; they are only there to check whether letter combinations are handled as expected), I want to get all words except those that contain "he", whether at the beginning, in the middle, or at the end of the word.

Example:

$pattern = '#\b(\w*[^(*he*|\s)])\b#i';
$text = 'He was eating the cake se in hell.';
if(preg_match_all($pattern, $text,$match)){
    var_dump($match);
} else{
    echo "Match not found.";
}

I would expect to get

[was, eating, cake, se, in]

but I got

[was, eating, in, hell].

Why not "cake"? Why "hell"?

In fact, my use case is in German, but because most users here are not German-speaking, I am using the example above. Another problem is that \w doesn't treat üÜöÖäÄß as letters, which I also need.

3 Answers


  1. It would be more flexible to use string functions.

    <?php
    
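    $string = 'He was eating the cake se in hell.';
    $filter = 'he';

    // Return the words of $string that do not contain $filter (case-insensitive).
    function wordFilter($string, $filter)
    {
        $filtered = [];
        // Format 1 returns the words as an array. Note that str_word_count()
        // is locale dependent and does not support multibyte locales.
        $words = str_word_count($string, 1);
        foreach ($words as $word) {
            // str_contains() requires PHP 8.0 or later.
            if (!str_contains(strtolower($word), strtolower($filter))) {
                $filtered[] = $word;
            }
        }
        return $filtered;
    }

    $result = wordFilter($string, $filter);
    print_r($result);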
    ?>
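
    Since the question also needs üÜöÖäÄß to count as letters, a multibyte-aware variant of the same idea might look like the sketch below. The helper name wordFilterUnicode is hypothetical, and the sketch assumes the mbstring extension is available and PHP 7.4 or later (for the arrow function).

    ```php
    <?php
    // Hypothetical helper: split the text on runs of non-letters, then drop
    // every word that contains $needle (case-insensitive, multibyte-aware).
    function wordFilterUnicode(string $text, string $needle): array
    {
        // \P{L}+ matches one or more characters that are NOT Unicode letters;
        // the u flag makes PCRE treat the subject as UTF-8.
        $words = preg_split('/\P{L}+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

        // mb_stripos() returns false when $needle does not occur in $word.
        return array_values(array_filter(
            $words,
            fn(string $word): bool => mb_stripos($word, $needle) === false
        ));
    }

    print_r(wordFilterUnicode('He was eating the cake se in hell üÜöÖäÄß.', 'he'));
    // keeps: was, eating, cake, se, in, üÜöÖäÄß
    ```

    Splitting on \P{L}+ avoids str_word_count() entirely, so the result does not depend on the current locale.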
    
  2. I want to get all words but not those which have a "he" in the word
    whether it is the beginning, inside, or the ending.

    and

    Also a problem is that \w wouldn’t consider üÜöÖäÄß letters which I
    also need.

    To satisfy both conditions, you can change the pattern to:

    $pattern = '/\b(?!.*\bhe\b)[\p{L}üÜöÖäÄß]+\b/u';
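
    As written, the lookahead (?!.*\bhe\b) appears to reject only positions that are followed somewhere by the standalone word "he", so words such as "the" and "hell" would still match. A minimal sketch of a variant that scopes the lookahead to the current word is shown below; the variable names are illustrative only, and \p{L} already covers the umlauts, so they do not need to be listed separately.

    ```php
    <?php
    // Sketch: (?!\p{L}*he) scans only the letters of the current word, so a
    // word is rejected when "he" occurs anywhere inside it (i = ignore case).
    $scopedPattern = '/\b(?!\p{L}*he)\p{L}+\b/iu';
    $sample = 'He was eating the cake se in hell.';

    preg_match_all($scopedPattern, $sample, $found);
    print_r($found[0]); // matches: was, eating, cake, se, in
    ```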
    
  3. You can extract matches of the regular expression

    /\b(?:(?!he)\p{L})+\b/iu
    

    Demo

    You can see at the link that the following words were matched.

    He was eating the cake se in hell üÜöÖäÄß.
       ^^^ ^^^^^^     ^^^^ ^^ ^^      ^^^^^^^
    

    The flags set for the regular expression are

    • i: case insensitive match
    • u: match with full Unicode

    The regular expression has the following elements.

    \b        match a word boundary
    (?:       begin a non-capture group
      (?!he)  negative lookahead asserts the next two characters are not 'he'
      \p{L}   match a letter
    )+        end non-capture group and execute one or more times
    \b        match a word boundary
    

    Here I’ve used the tempered greedy token solution.
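
    Because the question mentions several predefined syllables, the same tempered pattern could be built for a list of fragments. The following is only a sketch: the helper name wordsWithoutFragments is hypothetical, and it assumes the list contains at least one fragment.

    ```php
    <?php
    // Hypothetical helper: return the words of $text that contain none of
    // the given fragments (case-insensitive, Unicode letters).
    // Assumes $fragments is non-empty; an empty lookahead would never match.
    function wordsWithoutFragments(string $text, array $fragments): array
    {
        // Escape each fragment so it is matched literally, then join the
        // fragments into one alternation inside the negative lookahead.
        $alternation = implode('|', array_map(
            fn(string $f): string => preg_quote($f, '/'),
            $fragments
        ));
        $pattern = '/\b(?:(?!' . $alternation . ')\p{L})+\b/iu';

        preg_match_all($pattern, $text, $matches);
        return $matches[0];
    }

    print_r(wordsWithoutFragments('He was eating the cake se in hell üÜöÖäÄß.', ['he']));
    // matches: was, eating, cake, se, in, üÜöÖäÄß
    ```

    With ['he'] this builds exactly the pattern shown above; adding more fragments only extends the alternation inside the lookahead.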
