skip to Main Content

I have a "trademarks" table like this:

create table
  public.trademarks (
    id bigint generated always as identity,
    owner_id text null,
    trademark text null,
    constraint trademarks_pkey primary key (id)
  ) tablespace pg_default;

My goal is to pass a text to a query and get all trademarks that the text contains. For example passing:

Walt Disney was an American entrepreneur, animator, and film producer
who became one of the most influential figures in the entertainment
industry. He co-founded the Disney Brothers Studio, which later became
The Walt Disney Company, and created beloved characters such as Mickey
Mouse and Donald Duck. Through his innovative and enduring works,
Disney left an indelible mark on the world of animation, theme parks,
and television, capturing the imagination of audiences for
generations.

Should return the rows with "Walt Disney", "Mickey Mouse", "Donald Duck". etc.

I came up with this solution which works, but I wanted to hear some expert opinions on this problem:

SELECT
  *
FROM
  trademarks
WHERE
'Walt Disney was an American entrepreneur, animator, and film producer who became one of the most influential figures in the entertainment industry. He co-founded the Disney Brothers Studio, which later became The Walt Disney Company, and created beloved characters such as Mickey Mouse and Donald Duck. Through his innovative and enduring works, Disney left an indelible mark on the world of animation, theme parks, and television, capturing the imagination of audiences for generations.' ILIKE '%' || trademark || '%';

2

Answers


  1. My understanding:

    1. You have a table with millions of records.
    2. This table has a column named ‘trademarks’.
    3. This column contains values like ‘Donald Dock’, ‘Mickey Mouse’, ‘Walt Disney’, and millions of other values.
    4. Your input is a chunk of text that may have multiple trademark occurrences within the text, which are present in 4 above.
    5. You want to find all the trademarks that are in 4 as well as in 5.
    6. The problem is that if the input text, for example, is "Walt Disney was an American entrepreneur", then you will have to search the DB for "Walt Disney", Disney was", "was an", "Was an American" …. lots of combinations. For example, if the longest trademark has 5 words, then you will have to take all 5 continuous 5 word occurrences in the input text and match all of them against the millions of existing trademarks in the table

    This does not sound advisable to do it directly in the DB.

    • You could probably create a read-only view that contains the
      trademarks in sorted order.
    • Then write a stored procedure / function
      to
    • take in the input text
    • break it into chunks
    • take combinations from each chunk
    • Match each of the combinations against the view
    • Round out the search by taking each chunk and taking combinations that cross a chunk’s boundary

    But for a real life maintainable solution,I think you will / should do it in the application layer – by using a tree-based model and parallelization of the searches involved

    Login or Signup to reply.
  2. Create a Full-Text Search Index:

    First, you need to create a full-text search index on the column you want to search within. Let’s assume you have a table named documents with a content column that contains the long text data. Here’s how you can create the index:

    CREATE INDEX trademark_search_idx ON trademarks USING gin(to_tsvector('english', trademark));
    

    (This index uses the to_tsvector function to convert the text data into a format optimized for full-text searching.)

    Perform Full-Text Searches:

        SELECT * FROM documents WHERE to_tsvector('english', content) @@ to_tsquery('english', 
    'Walt Disney was an American entrepreneur, animator, and film producer who became one of the most influential figures in the entertainment industry. He co-founded the Disney Brothers Studio, which later became The Walt Disney Company, and created beloved characters such as Mickey Mouse and Donald Duck. Through his innovative and enduring works, Disney left an indelible mark on the world of animation, theme parks, and television, capturing the imagination of audiences for generations.'
    );
    

    This query uses the @@ operator to match the text in the content column against the tsquery created from the search term.

    To Optimize Performance further:

    When dealing with a large number of records, you can further optimize performance by using features like pagination (LIMIT and OFFSET) to retrieve results in smaller chunks like:

        SELECT * FROM documents
        WHERE to_tsvector('english', content) @@ to_tsquery('english', 
    'Walt Disney was an American entrepreneur, animator, and film producer who became one of the most influential figures in the entertainment industry. He co-founded the Disney Brothers Studio, which later became The Walt Disney Company, and created beloved characters such as Mickey Mouse and Donald Duck. Through his innovative and enduring works, Disney left an indelible mark on the world of animation, theme parks, and television, capturing the imagination of audiences for generations.'
    )
        LIMIT 20 OFFSET 0;
    

    Additionally, consider using tools like caching and indexing strategies to speed up queries even further.

    Remember that full-text search is a powerful tool, but its performance depends on various factors, including the complexity of the search, the amount of data, and the server’s resources. It’s also a good practice to regularly analyze and monitor the performance of your queries to ensure they remain efficient as your data grows.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search