
Hello, I have a Redis database that contains facial embeddings of 100k+ people, all stored as key-value pairs. For example:

{
   "embedding:angelina" : [128-D vector of angelina],
   "embedding:emma" : [128-D vector of emma],
   "embedding:dicaprio" : [128-D vector of dicaprio]
}

Now I am trying to compare a target embedding against all of the embeddings in my dataset to find the best match. My current approach is to first retrieve all keys matching the embedding* pattern, then iterate over those embeddings and compute each one's distance to the target embedding. If a distance is below the threshold, I append it to a list, and finally I pick the match with the shortest distance from that list.
I have a feeling that this is not best practice. I would be glad if someone could help me find a better approach.
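
Roughly, my current approach looks like this (just a sketch; the threshold value and the assumption that each vector is stored as a JSON-encoded list of floats are simply how my setup happens to look):

import json

import numpy as np
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def find_best_match(target, threshold=10.0):
    matches = []
    # KEYS fetches every matching key name in one blocking call
    for key in r.keys("embedding:*"):
        vec = np.array(json.loads(r.get(key)))
        dist = np.linalg.norm(vec - target)  # Euclidean distance
        if dist < threshold:
            matches.append((dist, key))
    return min(matches) if matches else None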

Note: I know ElasticSearch is a great candidate for such tasks, but I need to stick with Redis for now.

2 Answers


  1. Iterating over Redis keys by pattern is possible, but it’s not a best practice. The Redis docs warn the following:

    Warning: consider KEYS as a command that should only be used in production environments with extreme care. It may ruin performance when it is executed against large databases. This command is intended for debugging and special operations, such as changing your keyspace layout. Don’t use KEYS in your regular application code. If you’re looking for a way to find keys in a subset of your keyspace, consider using SCAN or sets.

    Using SCAN will protect the Redis instance’s resources, but walking every key in a large dataset with SCAN still takes many round trips and a long time.
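
    For what it’s worth, redis-py wraps SCAN in a cursor-based iterator, so walking the keyspace at least doesn’t block the server the way KEYS does. A sketch, assuming the embedding:* layout from the question (handle_key is a hypothetical per-key handler):

        import redis

        r = redis.Redis(host="localhost", port=6379, decode_responses=True)

        # scan_iter issues repeated SCAN calls under the hood, each returning
        # only a small batch of keys, so no single command blocks the server.
        for key in r.scan_iter(match="embedding:*", count=1000):
            handle_key(key)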

    Some workarounds come to mind, depending on your situation:

    1. Know which keys you’re going after. You could do this by:
    • Maintaining a list of keys elsewhere so you can do key-by-key lookups, which might not be worth the trouble if you’re going to inspect a high percentage of the keys anyway.
    • Numbering the keys sequentially: if you know you’re going to have 100k-200k keys and you don’t expect to delete many, you could just make the keys "embedding:{1-100k}" and store any other metadata like angelina in the value if it’s important.
    • Maintaining a list of the keys in a Redis Set: this seems to be the suggestion in the quoted warning above. Whenever you add or remove an embedding:* key in the dataset, also use SADD or SREM to add or remove the key name in a Set (which could be named e.g. embedding_sample_keys). I haven’t tried this, but it sounds pretty viable; see the sketch after this list.
    2. Use a hash: this would store all your data under a single Redis hash structure (the key could be e.g. embedding_data). This has downsides, like not being able to set a distinct TTL for each entry. You can use HKEYS and HSCAN to access all the fields in the hash, which might be an improvement over scanning the entire keyspace; this option is also sketched below.
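
    Here is an untested redis-py sketch of those last two ideas, assuming each vector is stored as a JSON-encoded list of floats as in the question; the Set name embedding_sample_keys and hash key embedding_data are just the placeholder names from above:

        import json
        import redis

        r = redis.Redis(host="localhost", port=6379, decode_responses=True)

        # Idea 1, third bullet: track every key name in a Set next to the data.
        def add_embedding(name, vector):
            key = f"embedding:{name}"
            r.set(key, json.dumps(vector))
            r.sadd("embedding_sample_keys", key)

        def remove_embedding(name):
            key = f"embedding:{name}"
            r.delete(key)
            r.srem("embedding_sample_keys", key)

        # At query time, read the tracked key names, then fetch the values
        # in batches over a pipeline instead of one GET per round trip.
        keys = list(r.sscan_iter("embedding_sample_keys"))
        pipe = r.pipeline()
        for k in keys:
            pipe.get(k)
        values = pipe.execute()

        # Idea 2: keep every embedding as a field of one hash instead.
        r.hset("embedding_data", "angelina", json.dumps([0.0] * 128))
        for name, raw in r.hscan_iter("embedding_data"):
            vector = json.loads(raw)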
  2. This sounds like a good candidate for Redis VSS (vector similarity search): index the embeddings once and let Redis run the nearest-neighbour query for you.
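
    A rough sketch with redis-py, assuming Redis Stack (or the RediSearch module) is available and the embeddings are moved into hashes as raw float32 bytes; the index name idx:embeddings and field name vector are made up for the example:

        import numpy as np
        import redis
        from redis.commands.search.field import VectorField
        from redis.commands.search.indexDefinition import IndexDefinition, IndexType
        from redis.commands.search.query import Query

        r = redis.Redis(host="localhost", port=6379)

        # Index every hash whose key starts with "embedding:".
        r.ft("idx:embeddings").create_index(
            fields=[
                VectorField(
                    "vector",
                    "FLAT",  # exact search; "HNSW" trades accuracy for speed at scale
                    {"TYPE": "FLOAT32", "DIM": 128, "DISTANCE_METRIC": "L2"},
                )
            ],
            definition=IndexDefinition(prefix=["embedding:"], index_type=IndexType.HASH),
        )

        # Each embedding is stored as raw float32 bytes in a hash field.
        vec = np.random.rand(128).astype(np.float32)
        r.hset("embedding:angelina", mapping={"vector": vec.tobytes()})

        # Ask Redis for the single nearest neighbour of a target embedding.
        target = np.random.rand(128).astype(np.float32)
        query = (
            Query("*=>[KNN 1 @vector $target AS dist]")
            .sort_by("dist")
            .return_fields("dist")
            .dialect(2)
        )
        result = r.ft("idx:embeddings").search(query, query_params={"target": target.tobytes()})
        print(result.docs[0].id, result.docs[0].dist)

    With FLAT the comparison is still brute force, but it runs server-side in a single query instead of 100k round trips; switching the algorithm to HNSW gives approximate nearest-neighbour search that scales much better.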
