skip to Main Content

I want to detect invalid boolean queries that do not use parentheses. Specifically, I want to detect cases where multiple operators are used in a boolean query without parentheses, which can lead to ambiguous or incorrect results.

For example, consider the following boolean query:

love AND family OR trees

The user could mean (love AND family) OR trees or love AND (family OR trees), and the results would be different. To avoid such ambiguity, I want to give the user an error message when they try to do searches like this.

More Examples:

These queries do not need parentheses and are valid:

love AND family AND trees ,

love OR family OR trees OR peace ,

love NOT family NOT peace NOT peace ,

love OR family OR (ash AND peace) ,

(love OR family OR ash) AND peace

The below queries need parentheses and are invalid:

love AND family OR trees ,

love NOT family OR trees ,

love AND family NOT trees ,

love AND family OR war AND peace ,

(love OR family OR ash) AND peace OR ash

How can I detect invalid boolean queries without parentheses?

3

Answers


  1. A simple way to solve this would be to define a formal grammar for your language, feed it to a parser generator like peg.js and include the generated parser in your project. Eventually, you’re going to need a parser anyways, to evaluate your expressions.

    One possible grammar for your language could be like this ((you can try it on https://pegjs.org/online)

    Expression = Term Tail?
    
    Tail 
        = (_ 'AND' _ Term)+ 
        / (_ 'OR'  _ Term)+
        / (_ 'NOT' _ Term)+
    
    Term = Word / '(' _ Expression _ ')'
    
    Word = [a-z]+
    
    _ = [ tnr]*
    

    This grammar accepts your valid examples and rejects invalid ones.

    Login or Signup to reply.
  2. Assuming that the names only consist of letters, numbers and _, here’s a regex-based solution:

    function validate(query) {
        while (query.includes('(')) {
            query = query.replace(/(([^()]+))/g, (_, m) => validate(m) ? '_' : '');
        }
        return /^w+(?:s+(AND|OR|NOT)s+w+)?(?:s+1s+w+)*$/.test(s.trim());
    }
    

    Explanation:

    First, from observation, we can see that valid queries follow a pattern:

    name <AND/OR/NOT> name <the same operator> name ...
    

    This sounds like a thing for regex:

    ^                    # At the beginning of line,
    w+                  # match a name
    (?:                  # followed by
      s+(AND|OR|NOT)s+ # AND, OR, or NOT surrounded by one or more spaces
      w+                # then another name,
    )?                   # optionally,
    (?:                  # then
      s+1s+           #                which each consists of the same operator
      w+                #                                                         and a name
    )*                   #      0+ groups
    $                    # right before the end of line.
    

    Since JS regex doesn’t support recursion, we have to polyfill that on our own. For a query to be valid, all parenthesized expressions in it also have to be valid. We replace them with a valid unused name like _ if they are valid; otherwise, we omit it entirely so that the regex test will returns false.

    while (query.includes('(')) {
        query = query.replace(
            /(([^()]+))/g, // Opening, one or more non-parenthesis chars, then closing.
            (_, m) => validate(m) ? '_' : '' // m is group 1, the expression we need to validate.
        );
    }
    

    If at least one expression is invalid, the query will be something like this, which won’t match the final regex test:

    Original:   '(foo AND bar OR baz) AND lorem AND ipsum'
    After loop: ' AND lorem AND ipsum'
    

    Try it:

    const valid = `love AND family AND trees
    love OR family OR trees OR peace
    love NOT family NOT peace NOT peace
    love OR family OR (ash AND peace)
    (love OR family OR ash) AND peace
    `.trim().split('n');
    
    const invalid = `
    love AND family OR trees
    love NOT family OR trees
    love AND family NOT trees
    love AND family OR war AND peace
    (love OR family OR ash) AND peace OR ash
    `.trim().split('n');
    
    function validate(query) {
        while (query.includes('(')) {
            query = query.replace(/(([^()]+))/g, (_, m) => validate(m) ? '_' : '');
        }
        return /^w+(?:s+(AND|OR|NOT)s+w+)?(?:s+1s+w+)*$/.test(query.trim());
    }
    
    valid.forEach(e => console.log(e, '|', validate(e)));
    invalid.forEach(e => console.log(e, '|', validate(e)));
    .as-console.wrapper {
      max-height: 100% !important;
    }
    Login or Signup to reply.
  3. If your question only aims at identifying the cases of AND and OR at same level, you may specify a recursive procedure,i.e checkAndOr(expression) :

    • parse the expression searching for all substrings within parenthesis
      (each term starts at first opening parenthesis and ends at corresponding
      closure.
    • for each of these terms, call checkAndOr(term),
    • if non ok, return false
    • if ok, replace the term by some variable e.g. "X",
    • after all replacements, check if the expression both contains "AND" and "OR".
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search