I want to ignore special characters at query time in Solr.
For example:
Let's assume we have a document in Solr with content:My name is A-B-C.

content:A-B-C returns the document,
but content:ABC doesn't return any documents.

My requirement is that content:ABC should return that one document.
So basically I want to ignore the - at query time.

2 Answers


  1. Your content field must have a field type defined in the schema.

    A field type can have two separate analyzers: one for indexing and one for querying.

    At index time you can have the content "A-B-C" indexed as both ABC and A-B-C by using the Word Delimiter Graph Filter.

    Set catenateWords="1".
    It works as follows:
    “hot-spot-sensor’s” → “hotspotsensor”. In your case, “A-B-C” also generates “ABC”.

    See the Solr documentation for the Word Delimiter Filter for an example.

    Usage:

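    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- Emits A-B-C (preserveOriginal) and ABC (catenateWords) alongside the split parts -->
      <filter class="solr.WordDelimiterGraphFilterFactory" preserveOriginal="true" catenateWords="1"/>
      <!-- The Solr reference guide calls for flattening the token graph at index time -->
      <filter class="solr.FlattenGraphFilterFactory"/>
    </analyzer>
    
    <!-- Query side: whitespace tokenization only, so ABC and A-B-C are each looked up as typed -->
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    </analyzer>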
    

    This indexes multiple tokens for the same text, so you will be able to search with both ABC and A-B-C.
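
    For context, here is a rough sketch of where those two analyzers would sit in the schema. The field type name text_wd is made up for illustration, and the attributes on the field are assumptions about your setup rather than anything taken from the question. Documents indexed before the change need to be reindexed for it to take effect.

    ```xml
    <!-- Hypothetical field type name; pick whatever fits your schema -->
    <fieldType name="text_wd" class="solr.TextField" positionIncrementGap="100">
      <!-- the <analyzer type="index"> and <analyzer type="query"> elements from this answer go here -->
    </fieldType>

    <!-- Point the existing content field at the new type (attributes are assumed) -->
    <field name="content" type="text_wd" indexed="true" stored="true"/>
    ```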

  2. To get the tokens concatenated when they have a special character between them (i.e. A-B-C should match ABC and not just A), you can use a PatternReplaceCharFilter. This will allow you to replace all those characters with an empty string, effectively giving ABC to the next step of the analysis process instead.

    <analyzer>
      <charFilter class="solr.PatternReplaceCharFilterFactory"
                 pattern="[^a-zA-Z0-9 ]" replacement=""/>
      <tokenizer ...>
      [...]
    </analyzer>
    

    This will keep all regular ASCII letters, numbers and spaces, while replacing any other character with the empty string. You’ll probably have to tweak that character class to keep more characters, but that will depend on your raw content and how it should be processed.

    This should be done both when indexing and when querying (as long as you want the user to be able to query for A-B-C as well). If you want to score these matches differently, use multiple fields with different analysis chains – for example keeping one field to only tokenize on whitespace, and then boosting it higher (with qf=text_ws^5 other_field) if you have a match on A-B-C.

    This does not change what content is actually stored for the field, so the data returned will still be the same – only how a match is performed.
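
    As a rough sketch of the multi-field idea, something like the following could work. All names here (text_stripped, text_ws, content_ws) are hypothetical, and the whitespace tokenizer is only an assumption, since the right tokenizer depends on your content.

    ```xml
    <!-- Hypothetical type: strips special characters, so A-B-C is analyzed as ABC -->
    <fieldType name="text_stripped" class="solr.TextField" positionIncrementGap="100">
      <!-- An analyzer without a type attribute is used for both indexing and querying -->
      <analyzer>
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="[^a-zA-Z0-9 ]" replacement=""/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>

    <!-- Hypothetical type: whitespace tokenization only, so A-B-C stays intact -->
    <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>

    <field name="content" type="text_stripped" indexed="true" stored="true"/>
    <!-- Search-only copy of the same text, analyzed without stripping -->
    <field name="content_ws" type="text_ws" indexed="true" stored="false"/>
    <copyField source="content" dest="content_ws"/>
    ```

    With the edismax query parser, a request along the lines of defType=edismax&q=A-B-C&qf=content_ws^5 content would then score an exact A-B-C match higher, while a query for ABC would still match through the content field.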
