I have around 10 million users, and the number keeps growing. Each user has an email address and a phone number, and both point to a user ID. I created two hashes, one for emails and one for phone numbers, like this:
// A single user whose email and phone number both point to the same user ID
$redis->hSet('email-users', '[email protected]', 1);
$redis->hSet('phone-users', '+192938384849', 1);
Since there are millions of users, the hashes are growing very large, and I also want to search through them. For example, I want to get the user ID for an email from the email-users hash.
I found in "Redis: best way to store a large map (dictionary)" that hashes should stay small enough to be kept in the ziplist encoding, so a large map should be divided into smaller buckets of a fixed size, say at most 10000 keys per hash.
So, if I divide my 10 million users into buckets of 10000 keys, there would be around 1000 hashes for emails and 1000 for phone numbers.
My question is: should I divide my users into these 1000 buckets? If yes, how can I search through these 1000 buckets? Or is there a better alternative?
P.S. I am using PHP, and fetching all 1000 hashes and looping through them could be quite resource intensive. I am afraid that a wrong approach would kill the performance that Redis is supposed to deliver.
Just as a side note, I think we could use an algorithm like libketama for consistent hashing to distribute keys across servers.
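To make the side note concrete, here is a minimal sketch of a consistent-hashing ring in PHP. It is not libketama itself: the class name `HashRing`, the node names, the replica count and the use of `crc32` are all assumptions chosen for illustration.

```php
<?php
// Minimal hash ring sketch. Node names and the replica count are made up for
// illustration; libketama's own point placement differs in detail.
final class HashRing
{
    /** @var array<int, string> ring position => node name, sorted by position */
    private array $ring = [];

    public function __construct(array $nodes, int $replicas = 100)
    {
        if ($nodes === []) {
            throw new InvalidArgumentException('At least one node is required.');
        }
        foreach ($nodes as $node) {
            // Several points per node spread its share of the ring more evenly.
            for ($i = 0; $i < $replicas; $i++) {
                $this->ring[crc32($node . '#' . $i)] = $node;
            }
        }
        ksort($this->ring);
    }

    /** Returns the node owning the first ring point at or after the key's hash. */
    public function nodeFor(string $key): string
    {
        $hash = crc32($key);
        foreach ($this->ring as $position => $node) {
            if ($position >= $hash) {
                return $node;
            }
        }
        // Past the last point: wrap around to the first one.
        return $this->ring[array_key_first($this->ring)];
    }
}

$ring = new HashRing(['redis-a', 'redis-b', 'redis-c']);
echo $ring->nodeFor('+192938384849'), PHP_EOL; // always the same node for this key
```

The point of the ring over a plain modulo is that removing a server only moves the keys that lived on that server; keys owned by the remaining servers keep their placement. The linear scan in `nodeFor` is kept for readability and would likely be replaced by a binary search in real use.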
Also, if it is hard to work with letters, we can convert each email to numbers first, like a=1, b=2, c=3 … z=26, with 0 (zero) appended to make it unique and a + for each @ and . character. For example:
[email protected] -> 10203040+901301090+3015013
So, now we have numbers which make it easier to apply any calculations.
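A rough sketch of this conversion in PHP, under two assumptions that the post does not spell out: a 0 follows every letter (including the last one), and any character other than a-z, @ or . is rejected because no encoding is defined for it. The function name `encodeAddress` is made up for illustration.

```php
<?php
// Sketch of the letter-to-number idea above. Assumption: every letter is
// followed by a 0, and characters outside a-z, "@" and "." are rejected
// because the scheme does not say how to encode them.
function encodeAddress(string $address): string
{
    $address = strtolower($address);
    $encoded = '';
    for ($i = 0, $length = strlen($address); $i < $length; $i++) {
        $char = $address[$i];
        $code = ord($char);
        if ($char === '@' || $char === '.') {
            $encoded .= '+';
        } elseif ($code >= ord('a') && $code <= ord('z')) {
            // a=1, b=2, ..., z=26, then the 0 separator.
            $encoded .= ($code - ord('a') + 1) . '0';
        } else {
            throw new InvalidArgumentException("No encoding defined for '$char'.");
        }
    }
    return $encoded;
}

echo encodeAddress('abcd'), PHP_EOL; // 10203040
```

Note that the result is a digit string rather than a native integer: for realistic addresses it quickly exceeds the range of a 64-bit integer, so any arithmetic on it would need string-based or arbitrary-precision handling.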
2 Answers
What you can do is distribute the keys according to the first letter or digit, or the first couple of letters/digits.
You can create your hashes like this: by first letter for emails, and by the first one or two digits for phone numbers.
When you do hset/hget, you pick the right hash at the code level.
Edit:
Let’s say we use the first two digits for phone numbers and the first two letters for emails; then we will have keys like the following.
When we have an email like [email protected], we go to the er email hash group, which is email-users-er, and execute:
hget email-users-er [email protected]
When we have a phone number like 123456789, we go to the 12 phone hash group, which is phone-users-12, and execute:
hget phone-users-12 123456789
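On the PHP side, arranging this "at the code level" could look roughly like the sketch below. The function names and the `$type` argument are illustrative and not part of the answer; `$redis` is assumed to be a connected phpredis client, and only its `hSet` and `hGet` methods are used.

```php
<?php
// Builds the hash name from the first two letters (email) or digits (phone).
// The function names and the "type" argument are illustrative assumptions.
function prefixBucket(string $type, string $value): string
{
    if ($type === 'phone') {
        // Skip "+" and other non-digits so the prefix is made of digits only.
        $value = preg_replace('/\D/', '', $value);
    }
    return $type . '-users-' . strtolower(substr($value, 0, 2));
}

// $redis is expected to be a connected phpredis client (\Redis).
function saveUserId($redis, string $type, string $value, int $userId): void
{
    $redis->hSet(prefixBucket($type, $value), $value, $userId);
}

function findUserId($redis, string $type, string $value)
{
    // hGet returns false when the field does not exist.
    return $redis->hGet(prefixBucket($type, $value), $value);
}

echo prefixBucket('phone', '123456789'), PHP_EOL; // phone-users-12
```

One thing to keep in mind with prefix buckets is that first letters and leading digits are not evenly distributed (many phone numbers share a country code, for instance), so some hashes may end up far larger than others.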
Yes. The approach could work in the following way.
For this example, let’s treat both the phone numbers and email IDs as strings.
Let’s say you have the following buckets (Redis hashes):
1. Given an email ID, determine the bucket (out of at most 1000) by hashing the email ID. You can use consistent hashing for this purpose. Then add the key and value to the appropriate bucket.
2. Repeat step 1 for phone numbers. This spares you the bookkeeping of which key goes into which bucket: given the same key, the hash always gives the same value, and therefore the same bucket number.
3. To retrieve a value, first find the bucket by hashing the email/phone, then look up the value in that bucket.
The above example is in Java; I’m assuming PHP has similar libraries for hashing.
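In PHP, the bucket lookup described in this answer might be sketched as follows. This uses a plain `crc32` modulo rather than a full consistent-hashing ring, which is a simplification on my part; the function names, the `BUCKET_COUNT` constant and the `email-users-N` / `phone-users-N` naming are assumptions, and `$redis` is assumed to be a connected phpredis client.

```php
<?php
const BUCKET_COUNT = 1000; // 10 million users / 10000 keys per hash

// Plain hash-modulo bucketing (a simplification, not a consistent-hashing ring).
function hashedBucket(string $type, string $value, int $buckets = BUCKET_COUNT): string
{
    // Mask the sign bit so the result is non-negative on 32-bit builds too.
    $index = (crc32($value) & 0x7fffffff) % $buckets;
    return $type . '-users-' . $index;
}

// $redis is expected to be a connected phpredis client (\Redis).
function storeHashed($redis, string $type, string $value, int $userId): void
{
    $redis->hSet(hashedBucket($type, $value), $value, $userId);
}

function lookupHashed($redis, string $type, string $value)
{
    // Same input, same hash, same bucket: one HGET instead of scanning 1000 hashes.
    return $redis->hGet(hashedBucket($type, $value), $value);
}

echo hashedBucket('phone', '+192938384849'), PHP_EOL; // phone-users-<some number from 0 to 999>
```

With this layout a lookup costs a single HGET against one hash, so there is no need to fetch or loop over all the buckets. The trade-off of the plain modulo is that changing `BUCKET_COUNT` later moves most keys to a different bucket.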