
I am currently using Redis as a vector database and was able to get a similarity search going with 3 dimensions (latitude, longitude, and timestamp). The similarity search is working, but I would like to weight certain dimensions differently when conducting the search; namely, I would like it to prioritize the timestamp dimension.

How would I go about this? Redis does not seem to have any built-in feature that does this.

I turn each set of lat, long, and time coordinates into bytes that can be put into the vector database with the following code. Note that vector_dict stores all the sets of lat, long, and timestamp:

import numpy as np

p = client.pipeline(transaction=False)
for index in data:
    # create hash key
    key = keys[index]

    # create hash values
    item_metadata = data[index]  # copy all metadata
    item_key_vector = np.array(vector_dict[index]).astype(np.float32).tobytes()  # convert vector to bytes
    item_metadata[vector_field_name] = item_key_vector  # store the vector under the same field name the index is built on
    p.hset(key, mapping=item_metadata)  # add item to redis using hash key, metadata, and vector
p.execute()

I then conduct the similarity search using an HNSW index, which I create with the following function:

from redis.commands.search.field import VectorField

def create_hnsw_index(redis_conn, vector_field_name, number_of_vectors, vector_dimensions=3, distance_metric='L2', M=100, EF=100):
    redis_conn.ft().create_index([
        VectorField(vector_field_name, "HNSW",
                    {"TYPE": "FLOAT32", "DIM": vector_dimensions, "DISTANCE_METRIC": distance_metric,
                     "INITIAL_CAP": number_of_vectors, "M": M, "EF_CONSTRUCTION": EF})
    ])
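
For reference, the search itself is a KNN query against that index, along these lines (a rough sketch rather than my exact code; the "score" alias and the knn_search helper are just placeholders):

import numpy as np
from redis.commands.search.query import Query

def knn_search(redis_conn, vector_field_name, query_vector, k=10):
    # query_vector is an iterable of [lat, long, timestamp], matching the stored vectors
    query_bytes = np.array(query_vector).astype(np.float32).tobytes()
    # KNN query: return the k nearest vectors, exposing the distance as "score"
    q = Query(f"*=>[KNN {k} @{vector_field_name} $vec AS score]").sort_by("score").dialect(2)
    return redis_conn.ft().search(q, query_params={"vec": query_bytes})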

Others I have talked to said this is a math problem involving vector normalization, but I'm unsure how to get started with it in code and would like some guidance.

2 Answers


  1. You can re-weight the vectors so that certain dimensions contribute more to the distance than others. You’re using an L2 distance metric, which uses the standard Pythagorean theorem to calculate distance:

    dist = sqrt((x1-x2)**2 + (y1-y2)**2 + (z1-z2)**2)
    

    Imagine you multiplied every Y value, in both your query and your database, by 10. That would also multiply the difference between Y values by a factor of 10.

    The new distance function would effectively be this:

    dist = sqrt((x1-x2)**2 + (10*(y1-y2))**2 + (z1-z2)**2)
    
    dist = sqrt((x1-x2)**2 + 100*(y1-y2)**2 + (z1-z2)**2)
    

    …which makes the Y dimension matter 100 times more than the other dimensions.

    So if you want the dimension in position 2 (the timestamp, since NumPy indexing is zero-based) to matter more, you could do this:

    item_key_vector = np.array(vector_dict[index])
    item_key_vector[2] *= 10
    item_key_vector_bytes = item_key_vector.astype(np.float32).tobytes()
    

    The specific amount to multiply by depends on how much you want the timestamp to matter. Remember that you need to multiply your query vector by the same amount.
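
    One way to keep the two sides consistent is to apply the same per-dimension weights in both places, for example (a sketch; the weight value, the to_weighted_bytes helper, and the query_lat/query_long/query_time variables are made up for illustration):

    import numpy as np

    # per-dimension weights: leave lat/long alone, scale the timestamp (index 2) by 10
    WEIGHTS = np.array([1.0, 1.0, 10.0], dtype=np.float32)

    def to_weighted_bytes(vec):
        # apply the same weights to stored vectors and to query vectors
        return (np.asarray(vec, dtype=np.float32) * WEIGHTS).tobytes()

    # when indexing:  item_metadata[vector_field_name] = to_weighted_bytes(vector_dict[index])
    # when querying:  query_params = {"vec": to_weighted_bytes([query_lat, query_long, query_time])}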

  2. Not an answer to your question, but note that using {latitude, longitude} as vector elements is not a good method if you are looking for vector similarity.

    The distance between pairs of points such as {−179.99, 0} and {+179.99, 0}, or {0, 89.99} and {180, 89.99}, is small, but the vectors wouldn’t be similar.

    It is usually better to convert each {latitude, longitude} pair into Cartesian {x, y, z}. See the excellent answer here.

    You cannot map an ellipsoid surface to a vector space such that distances are preserved, but using Cartesian coordinates is a simple method to tackle the wrap-around property of longitudes.
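
    If you go that route, a minimal sketch of the conversion (assuming a unit sphere rather than the WGS84 ellipsoid, which is usually fine for similarity search) could look like this:

    import numpy as np

    def latlong_to_cartesian(lat_deg, long_deg):
        # map a point on the unit sphere to Cartesian coordinates so that
        # nearby points on the globe are also nearby as vectors
        lat, long = np.radians(lat_deg), np.radians(long_deg)
        x = np.cos(lat) * np.cos(long)
        y = np.cos(lat) * np.sin(long)
        z = np.sin(lat)
        return x, y, z

    Note that your vectors would then have four dimensions (x, y, z, timestamp), so the DIM in the index definition changes accordingly.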

    As for your question, Nick ODell’s answer is the standard way to solve such problems: scale the dimensions of your vectors (and queries).
