skip to Main Content

I need to convert a text query from standard "verbal" boolean to the FULLLTEXT MATCH AGAINST + and – system.

For example; this value:

((word1 AND word2) OR (word3 AND word4 AND word5)) OR word6 NOT word7

Needs to be converted to:

((+word1 +word2) (+word3 +word4 +word5) word6 -word7

The issue I’m having is when I try to use REGEXP_REPLACE with a pattern like this:

\b([[:alnum:]_]+)\b

I can’t figure out how to have the function use the word that matched to append/amend to it. I’ve seen other examples trying to use 1 or something like that to represent the matching value, but it simply doesn’t work.

So how can I command REGEXP_REPLACE to dynamically scan a string for "(word +word" and replace it with "(+word +word" ? Also needs to do a leading simple one like:

It’s also a bit more complex, because if we have:

(word1 AND word2) AND (word3 OR word4)

That would need to be:

+(+word1 +word2) +(word3 word4)

Any guidance on the best way to do this without writing an ultra slow parser would be appreciated.

2

Answers


  1. Chosen as BEST ANSWER

    I figured out how to convert a AND/OR/NOT boolean statement like the below to a compatible MATCH AGAINST solution, and hope this helps other people.

    This converts:

    (word1 OR word2) OR (word3 AND word4 AND word5) OR word6 NOT word7
    

    TO

    (WORD1 WORD2) (+WORD3 +WORD4 +WORD5) WORD6 -WORD7
    

    CODE:

    SET in_text_search = '(word1 OR word2) OR (word3 AND word4 AND word5) OR word6 NOT word7';
    SET text_reformatted = UCASE(TRIM(in_text_search));
    SET text_reformatted = REPLACE(text_reformatted,'  ',' '); -- Strip extra spaces
    
    -- Make sure the count of parentheses (if exist) is equal.
    IF (INSTR(text_reformatted,'(') > 0 OR INSTR(text_reformatted,')') > 0) THEN
            SET count_character = LENGTH(text_reformatted) - LENGTH(REPLACE(text_reformatted,'(',''));
            IF (count_character <> (LENGTH(text_reformatted) - LENGTH(REPLACE(text_reformatted,')','')))) THEN
                    -- TRIGGER ERROR 'imbalanced parentheses';
                    LEAVE SP;
            END IF;
    END IF;
    
    SET text_reformatted = REPLACE(text_reformatted,' AND ',' +');
    SET text_reformatted = REPLACE(text_reformatted,' OR ',' ');
    SET text_reformatted = REPLACE(text_reformatted,' NOT ',' -');
    
    -- Primary replace
    SET text_reformatted = REGEXP_REPLACE(text_reformatted, '(?<![-+])\b(\w+)\b(?=(?:\s*\+|\s*$))', '+$1', 1, 0, 'c');
    
    -- Cleanup the cart before the horse, for words like .NET etc.
    SET text_reformatted = REPLACE(text_reformatted,'.+','+.');
    SET text_reformatted = REPLACE(text_reformatted,'.-','-.');
    

  2. In a way, this SQL-style solution was made for fun, but maybe it will help someone.

    with RECURSIVE
      t1 as (
        select
          '(word1 OR word2) and .net AND (word3 AND word4 AND word5) OR word6 NOT word7' in_text_search
      ),
      r as (
        select
          0 as occ,
          cast('' as char(127)) as tok,
          in_text_search,
          0 as parentheses,
          cast('process' as char(30)) as status
        from t1
        union all
        select
          r.occ+1,
          case
            when t.tok = 'AND' then '+'
            when t.tok = 'NOT' then '-'
            when t.tok = 'OR' then ' '
            else coalesce(t.tok, '')
          end,
          r.in_text_search,
          p.parentheses,
          case
            when p.parentheses < 0 then 'imbalanced parentheses'
            when t.tok is null then
              case
                when p.parentheses != 0 then 'imbalanced parentheses'
                else 'done'
              end
            else 'process'
          end
        from r,
        LATERAL(
          select
            ucase(regexp_substr(
              r.in_text_search,
              '[()]|[^()[:space:]]+',
              1,
              r.occ+1
            )) as tok
        ) as t,
        LATERAL(
          select
            r.parentheses +
              case
                when t.tok like '(' then 1
                when t.tok like ')' then -1
                else 0
              end as parentheses
        ) as p
        where status = 'process'
      ),
      t2 as (
        select
          occ,
          tok,
          lead(tok, 2) over(order by occ) as tok_2nd,
          status
        from r
      ),
      t3 as (
        select
          occ,
          case
            when tok = '(' and tok_2nd in ('+', '-') then concat(tok, tok_2nd)
            else tok
          end as tok,
          status
        from t2
      )
    select
      GROUP_CONCAT(
        case
          when tok in ('+', '-') then concat(' ', tok)
          else tok
        end
        order by occ separator ''
      ) as text_reformatted
    from t3
    ;
    
    +----------------------------------------------------------+
    |                     text_reformatted                     |
    +----------------------------------------------------------+
    | (WORD1 WORD2) +.NET +(+WORD3 +WORD4 +WORD5) WORD6 -WORD7 |
    +----------------------------------------------------------+
    

    db<>fiddle

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search