Implementing a simple fulltext search I encountered a problem with the combination of boolean mode and phrases. Also worth noting is that the column has a binary collation (utf8_bin) whilst the table does not have this.
Given the following setup:
CREATE TABLE `test` (
`test_id` int(11) NOT NULL AUTO_INCREMENT,
`text_bin` text CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
PRIMARY KEY (`test_id`),
FULLTEXT KEY `text_bin` (`text_bin`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT INTO `test` (`test_id`, `text_bin`) VALUES
(1, 'Lorem Ipsum Dolor Sit Amet.'),
(2, 'Consectetuer Adipiscing Elit.'),
(3, 'Amet Sit Dolor Ipsum Lorem.')
;
Then running this query:
SELECT t.test_id, t.text_bin,
MATCH(t.text_bin) AGAINST ('Lorem Ipsum' IN BOOLEAN MODE) as m_Words,
MATCH(t.text_bin) AGAINST ('"Lorem Ipsum"' IN BOOLEAN MODE) as m_Phrase,
MATCH(t.text_bin) AGAINST ('Lorem' IN BOOLEAN MODE) as m_Lorem,
MATCH(t.text_bin) AGAINST ('Ipsum' IN BOOLEAN MODE) as m_Ipsum
FROM test t
;
This yields the following results:
test_id | text_bin | m_Words | m_Phrase | m_Lorem | m_Ipsum |
---|---|---|---|---|---|
1 | Lorem Ipsum Dolor Sit Amet. | 0.0620 | 0 | 0.0310 | 0.0310 |
2 | Consectetuer Adipiscing Elit. | 0 | 0 | 0 | 0 |
3 | Amet Sit Dolor Ipsum Lorem. | 0.0620 | 0 | 0.0310 | 0.0310 |
(Note: I shortened the numbers to 4 decimal places for better readability.)
For the column m_Phrase
I would expect a value greater then 0 on the first row. Is this a bug or can someone explain why the result is 0?
DB Fiddle: https://www.db-fiddle.com/f/8qxR3SiPVtESU3saebhgBG/0
2
Answers
My gut instinct is that this is to do with
space
delimiter.The same happens with
utf8_unicode_ci
Eg in this fiddle different collations give different results with both case sensitivity and matching of spaces as part of literals – https://www.db-fiddle.com/f/pi78uuA1RCFeToaRH9skcK/2
Update – Fiddle above updated with special chars (
.#*
), which match phrases as expected.I don’t know the solution to your problem other than use a different collation – just use
utf8_general_ci
It would be interesting to know your use case or even the performance considerations that you face that (assumably) prompt you to use
utf8_bin
.Update 2:
This looks like a reported bug. Although, from 2018… One fix appears to be to downgrade to version
5.7.19
.It doesn’t appear as though there is a working latest version approach for both a FULLTEXT based query with literal phrases and case sensitivity. Very very strange. The simplest (not perfect) approach would be the prepend each word with a plus,
+Lorem +Ipsum
, but would obviously lead to some false positives. (See above fiddle withm_all
andm_AllLower
fields.Still failing in 8.0.31.
This might be a workaround:
That is, avoid the phrase test in FT and use LIKE as a secondary filter to check for the phrase.
Alas, it won’t let you display the relevance.