
I have a Django model representing a website, and I use Postgres Full Text Search to search the Website objects. However, a search for only part of a website’s URL doesn’t return the matching Websites.

e.g. if a Website has a url of "https://www.example.com/foo/" and you search for "example" or "foo", then that Website isn’t returned in results.

I could create a new field, populated on save, that contains the URL with its parts separated by spaces (e.g. "www example com foo"), and then index that. But this wouldn’t work if the user searched for "example.com".

Is there some way I can use Full Text Search to search for partial matches on the url field, as well as still searching title and description?

My model:

from django.contrib.postgres.indexes import GinIndex
from django.contrib.postgres.search import SearchVectorField
from django.db import models

class Website(models.Model):
    title = models.CharField(blank=True, null=False, max_length=255)
    url = models.URLField(blank=False, null=False, max_length=255)
    description = models.CharField(blank=True, null=False, max_length=1000)

    search_document = SearchVectorField(null=True)

    class Meta:
        indexes = [GinIndex(fields=["search_document"])]

    def index_components(self):
        return (
            (self.title, "A"),
            (self.url, "B"),
            (self.description, "C"),
        )

Not sure this is relevant, but for completeness, there’s a post_save signal that updates the index, something like this:

from django.dispatch import receiver
from django.db.models.signals import post_save, m2m_changed
from django.db.models import Value, TextField
from django.contrib.postgres.search import SearchVector
from django.db import transaction
import operator
from functools import reduce

@receiver(post_save)
def on_save(sender, **kwargs):
    transaction.on_commit(make_updater(kwargs["instance"]))

def make_updater(instance):
    components = instance.index_components()
    pk = instance.pk

    def on_commit():
        search_vectors = []
        # index_components() returns a tuple of (text, weight) pairs, not a dict
        for text, weight in components:
            search_vectors.append(
                SearchVector(Value(text, output_field=TextField()), weight=weight)
            )
        instance.__class__.objects.filter(pk=pk).update(
            search_document=reduce(operator.add, search_vectors)
        )

    return on_commit

2 Answers


  1. To do that natively, you would have to write a custom parser for full text search and create a text search configuration that uses it. Besides being unreasonably difficult, that would also interfere with your ability to use it alongside other (non-URL) fields, which I think is what your Django code is trying to do.

    With the default parser, you can use ts_debug to see that ‘foo’ and ‘example’ are not returned as stand-alone tokens:

     select * from ts_debug('https://www.example.com/foo/');
      alias   |  description  |        token         | dictionaries | dictionary |        lexemes
    ----------+---------------+----------------------+--------------+------------+------------------------
     protocol | Protocol head | https://             | {}           | (null)     | (null)
     url      | URL           | www.example.com/foo/ | {simple}     | simple     | {www.example.com/foo/}
     host     | Host          | www.example.com      | {simple}     | simple     | {www.example.com}
     url_path | URL path      | /foo/                | {simple}     | simple     | {/foo/}
    

    But ‘foo’ is returned as the token ‘/foo/’, so maybe some kind of custom stemming rule could be applied to this.

    You could pass the URL column through a function that replaces all punctuation with spaces before handing the result to to_tsvector. But I don’t know how you would interface this with Django.
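
One way the punctuation-to-spaces idea might be wired in on the Django side is to do the splitting in Python before the text reaches SearchVector. The sketch below is only an illustration: `url_search_text` is a hypothetical helper name, not something from Django or Postgres. It also appends the dotted host suffixes ("www.example.com", "example.com") on the assumption that the default parser treats "example.com" in a query as a single host token, much as ts_debug shows for "www.example.com" above. Query strings and fragments are ignored here.

```python
import re
from urllib.parse import urlsplit


def url_search_text(url: str) -> str:
    """Hypothetical helper: turn a URL into space-separated search words."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()

    # Treat every run of letters/digits in the host and path as its own word,
    # which has the same effect as replacing punctuation with spaces.
    words = re.findall(r"[^\W_]+", f"{host} {parts.path}".lower())

    # Assumption: also keep dotted host suffixes so that a search for
    # "example.com" has a matching token to hit.
    labels = host.split(".") if host else []
    suffixes = [".".join(labels[i:]) for i in range(len(labels) - 1)]

    # dict.fromkeys removes duplicates while preserving order.
    return " ".join(dict.fromkeys(words + suffixes))
```

With the model from the question, this could presumably be used by returning `(url_search_text(self.url), "B")` instead of `(self.url, "B")` from `index_components()`, so the existing signal handler indexes the split words without needing a custom parser. Whether the host suffixes actually match depends on the text search configuration in use, so it is worth checking with ts_debug.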

  2. There has been some discussion on the Django forums about supporting pg_bm25. That would allow you to do this easily without needing to define a custom parser.
