I have a large MySQL table with 20mm records of sales and 10 or so columns. Two of the columns are POSTCODE and STREET. I have configured the table so that i have a FULLTEXT index defined on these 2 columns. My problem is when I search this the table using the following query, I get a bunch of random results where I expect to get much closer matches… for example:
SELECT * FROM table1 WHERE MATCH(postcode, street) AGAINST ('W4 RAVEN' IN NATURAL LANGUAGE MODE)
Yields a bunch of results (see top 5 below)
postcode | street |
---|---|
WS5 3PZ | RAVEN ROAD |
WD18 7DA | RAVEN CLOSE |
DE55 5NR | RAVEN AVENUE |
IP28 7LF | RAVEN CLOSE |
WD18 7DB | RAVEN CLOSE |
When in fact, if I run this query, I get the below result, which looks like a much more relevant result than the above to me… ?
SELECT postcode, street from table1 where postcode LIKE "W4 5EQ%" LIMIT 5
postcode | street |
---|---|
W4 5EQ | RAVENSCROFT ROAD |
W4 5EQ | RAVENSCROFT ROAD |
W4 5EQ | RAVENSCROFT ROAD |
W4 5EQ | RAVENSCROFT ROAD |
W4 5EQ | RAVENSCROFT ROAD |
Am I misunderstanding how the search works?
I would expect both of the queries above to return very similar results. The reason I want to use the first one though is because sometimes the user may just search RAVEN and sometimes just W4, so I dont want to limit them to the column type.
Edit to clarify: Given my table has records similar to what the second query is giving, I would have expected the first query to give something more similar to that than what it returned (a ‘closer’ match)
2
Answers
I’m sorry to write that comparing the two queries’ results is like comparing apples with pears!
In the first query you search for a part of a postcode and a part of the street name. Not sure why you are surprised that it returns rows where the searched string appears in the street name column! Not to mention the fact that W4 is probably ignored as it is too short for a word. The default minimum word length is 3 for innodb tables. If you reduced this limit to 2 and rebuilt the fulltext index, then your search results coud also be more accurate.
In the second query you search for a specific postcode only, therefore your results will be more specific as well.
Please remember, fulltext search does not behave like google search! It does not use machine learning or similar data scince techniques to guess the relationship between different parts of the searched expression!
IN NATURAL MODE
is inappropriate for addresses. SuggestIN BOOLEAN MODE
, perhaps with plus (+
) in front of each term.innodb_ft_min_token_size
value is 3. Hence, "W4" will be ignored. You could change that, however, it would mean rebuilding all FT indexes.+W4*
.