skip to Main Content

Implementing a simple fulltext search I encountered a problem with the combination of boolean mode and phrases. Also worth noting is that the column has a binary collation (utf8_bin) whilst the table does not have this.

Given the following setup:

CREATE TABLE `test` (
  `test_id` int(11) NOT NULL AUTO_INCREMENT,
  `text_bin` text CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  PRIMARY KEY (`test_id`),
  FULLTEXT KEY `text_bin` (`text_bin`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

INSERT INTO `test` (`test_id`, `text_bin`) VALUES
(1, 'Lorem Ipsum Dolor Sit Amet.'),
(2, 'Consectetuer Adipiscing Elit.'),
(3, 'Amet Sit Dolor Ipsum Lorem.')
;

Then running this query:

SELECT t.test_id, t.text_bin,
  MATCH(t.text_bin) AGAINST ('Lorem Ipsum' IN BOOLEAN MODE) as m_Words,
  MATCH(t.text_bin) AGAINST ('"Lorem Ipsum"' IN BOOLEAN MODE) as m_Phrase,
  MATCH(t.text_bin) AGAINST ('Lorem' IN BOOLEAN MODE) as m_Lorem,
  MATCH(t.text_bin) AGAINST ('Ipsum' IN BOOLEAN MODE) as m_Ipsum
FROM test t
;

This yields the following results:

test_id text_bin m_Words m_Phrase m_Lorem m_Ipsum
1 Lorem Ipsum Dolor Sit Amet. 0.0620 0 0.0310 0.0310
2 Consectetuer Adipiscing Elit. 0 0 0 0
3 Amet Sit Dolor Ipsum Lorem. 0.0620 0 0.0310 0.0310

(Note: I shortened the numbers to 4 decimal places for better readability.)

For the column m_Phrase I would expect a value greater then 0 on the first row. Is this a bug or can someone explain why the result is 0?

DB Fiddle: https://www.db-fiddle.com/f/8qxR3SiPVtESU3saebhgBG/0

2

Answers


  1. My gut instinct is that this is to do with space delimiter.
    The same happens with utf8_unicode_ci

    Eg in this fiddle different collations give different results with both case sensitivity and matching of spaces as part of literals – https://www.db-fiddle.com/f/pi78uuA1RCFeToaRH9skcK/2

    Update – Fiddle above updated with special chars (.#*), which match phrases as expected.

    I don’t know the solution to your problem other than use a different collation – just use utf8_general_ci

    It would be interesting to know your use case or even the performance considerations that you face that (assumably) prompt you to use utf8_bin.


    Update 2:

    This looks like a reported bug. Although, from 2018… One fix appears to be to downgrade to version 5.7.19.

    It doesn’t appear as though there is a working latest version approach for both a FULLTEXT based query with literal phrases and case sensitivity. Very very strange. The simplest (not perfect) approach would be the prepend each word with a plus, +Lorem +Ipsum, but would obviously lead to some false positives. (See above fiddle with m_all and m_AllLower fields.

    Login or Signup to reply.
  2. Still failing in 8.0.31.

    This might be a workaround:

    WHERE MATCH(t.text_bin) AGAINST ('+Lorem +Ipsum' IN BOOLEAN MODE)
      AND t.text_bin LIKE '%Lorem Ipsum%'
    

    That is, avoid the phrase test in FT and use LIKE as a secondary filter to check for the phrase.

    Alas, it won’t let you display the relevance.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search