Data Types
- Title: string
- Keywords: string
- Description: string
Search Requirements:
- ALL search terms MUST be found somewhere in the title or keywords or description
- Don’t search the whole DOCUMENT for terms… but search SEVERAL FIELDS to find a match for the terms.
- All the fields don’t need to contain all the search terms, but all the search terms must be found
- EG: searching for “bake cake” should match:
- title contains “cake” and description contains “bake” so that BOTH terms are found
- Because the Atlas Search grammar did not allow us to do that, we concatenated title + keywords into a NEW field
- The biggest problem we have seems to follow from the fact that we cannot get a good AND clause on our search terms
- “bake cake” should match “bake” AND “cake”. Instead, using Atlas Search without our hacks, we get
- “cake cake cake cake” ranking higher than “cake bake”!
- Field boosting is important
- Matching in the “title” should rank higher than matching in the “description”
- Word order matters
- “bake cake” should boost results where “bake” precedes “cake”
- A perfect match is preferred
- Keyword stuffing should be penalised.
- After computing a score, the score should be un-boosted based on frequency of matching the search terms
- “bake cake” should rank “bake a cake” HIGHER than “bake a cake bake a cake cake cake cake”
- Shorter data should rank higher
- “cake” as a search term should rank
- “cake fun” higher than
- “cake and fun and baking and frosting”
- Because all else equal, the first string is SHORTER than the second string
- Stemming matters (plurals in particular)
- This does NOT work in our current implementation because we had to hack around Atlas to get proper AND matching for search terms
I have tried to achieve the above using MongoDB Atlas, but have not been successful. Hence I am looking at alternatives such as ElasticSearch or Solr, and I am open to evaluating any other similar platform.
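To make the first requirement concrete, here is a minimal Python sketch of the matching rule only (no ranking): every search term must appear in at least one of the three fields, but no single field has to contain them all. The function name and the simple lowercase word tokenizer are assumptions made for illustration; a real search engine would apply its own analysis (stemming, stop words and so on).

```python
import re

def matches_all_terms(query, doc, fields=("title", "keywords", "description")):
    """Return True when every query term occurs in at least one of the fields.

    Hypothetical helper for illustration: tokenization is a plain lowercase
    word split, with no stemming.
    """
    def tokenize(text):
        return set(re.findall(r"\w+", text.lower()))

    terms = tokenize(query)
    found = set()
    for field in fields:
        # Pool the tokens of all searched fields; a term may come from any of them.
        found |= tokenize(doc.get(field, ""))
    return terms <= found


doc_ok = {"title": "Chocolate cake", "keywords": "dessert", "description": "How to bake it"}
doc_stuffed = {"title": "cake cake cake cake", "keywords": "", "description": ""}

print(matches_all_terms("bake cake", doc_ok))       # "cake" in title, "bake" in description
print(matches_all_terms("bake cake", doc_stuffed))  # "bake" is missing everywhere
```

Note that this only decides whether a document qualifies; field boosting, word order and length preferences would still have to come from the scoring side.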
2 Answers
Ok, I came here because Meteor was tagged but I see this doesn’t seem to be Meteor related. Anyway, I am not sure you mentioned the type of search you are trying to do.
This is about the type of searches: https://www.youtube.com/watch?v=-sRcpGpd-0s
I think you need search indexes (as opposed to DB indexes) and the Atlas search function. This might be what you are using, I am not sure from your post. Some good details are here: https://www.youtube.com/watch?v=-sRcpGpd-0s and there is also a more recent video which I can’t find right now (Karen is the host of it). In this newer video released a month ago or so, Karen introduces this website which might help you in tuning your search indexes properly https://www.atlassearchsoccer.com/
Recommendations about a specific technology aren’t well-suited for Stack Overflow, since they’re usually limited to your specific use case. However, everything you’re asking for can be solved in Elasticsearch and Solr. I’ll try to point you in the correct direction for each of your wishes in Solr.
1. You can do this by either using edismax with qf set to the fields you want to search (and you can then score hits in each field differently by using the field1^10 field2 syntax). You can also implement it by copying all the content from the fields you’re interested in into a common field (usually named _text_ by default). This won’t allow you to dynamically weigh hits from each of the fields differently.
2. Implemented through the weights given for qf.
3. Implemented by using the pf/pf2/pf3 and ps/ps2/ps3 parameters, or by generating shingles when indexing as a separate field and then using 1. or 2. to score that field differently.
4. You say penalized, but your description seems to indicate that you just shouldn’t consider the number of terms. Be aware that this will reduce the effectiveness of scoring larger texts, where a single mention of a term will count the same as the term being mentioned in multiple paragraphs. Solr won’t let you disable only termfreq when considering scoring, and since you want positions to be relevant you can’t use the omitTermFreqAndPositions setting for a field. You can apply a negative boost (which usually is just a larger boost to those documents that don’t match) to anything which has the query terms in quick succession after each other, but this will require manually tuning your queries and weights to get the results you’re looking for. There is no automagic solution for this that is "smart".
5. Scoring will consider the length of the content in the field, and shorter content will be scored higher by default.
6. Stemming is configured on a per-field basis, and you can use this to stem one field and not another, then index the same content into both fields (using a copy field instruction), and use 1/2 to apply different weights to the different fields.
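As a rough illustration of the edismax approach described above, the following Python sketch builds the request parameters for a query such as “bake cake”. The field names come from the question; the boost values, the helper name and the use of mm=100% to require every term are assumptions that would need tuning and testing against your own schema. The resulting query string would be appended to the select handler of whatever core or collection you use.

```python
from urllib.parse import urlencode

def build_edismax_params(user_query):
    """Build Solr request parameters for an edismax search (illustrative values)."""
    return {
        "defType": "edismax",                       # use the Extended DisMax query parser
        "q": user_query,
        "qf": "title^10 keywords^5 description",    # searched fields; boosts are assumed values
        "mm": "100%",                               # assumption: require all terms to match
        "pf": "title^20 description^5",             # extra boost when the terms appear as a phrase
        "ps": "0",                                  # phrase slop for pf: terms adjacent and in order
    }


params = build_edismax_params("bake cake")
query_string = urlencode(params)
print(query_string)
```

The qf weights cover field boosting, and pf with a small ps rewards documents where “bake” directly precedes “cake”. Stemming is not visible here because it is configured in the field types of the schema rather than in the query.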