skip to Main Content

Had a question, i’m not sure if there is a REDIS structure (or group of structures?) that can handle this.
I already have a solution, but it doesnt scale at all…

So, I want to track the top 100 trending hashtags.
✅ This first part is easy using the TOPK REDIS structure.

❌ The problem is, that for each of those top trending hashtags, i want to track the 50 top liked posts.
So a result can look something like this


1.  #Cars       -> [p1, p10, p8.  ....etc till count of 50]
2.  #Politics   -> [p100, p6, p77 ....etc till count of 50]
.
.
.
100 #RandomVids -> [p22, p25, p0  ....etc till count of 50]

I guess there is no REDIS structure out of the box that can help with this in a scalable manner?

2

Answers


  1. There is no sketch data structure that directly supports your use case. Not in Redis, and most probably not in other libraries / data stores.

    One thing I can suggest is to hold a top-k sketch for each hashtag in the hashtags top-k sketch (you can retrieve this list using TOPK.LIST key).

    Two possible heuristic optimizations you can explore:

    • Call TOPK.LIST key after each m inserts – and update your set of ‘per-hashtag’ top-k’s according to the items in the hashtags list.
    • Use TOPK.LIST key WITHCOUNT and keep a ‘per-hashtag’ top-k only for hashtags with count > n.

    You should probably set the topk value in TOPK.RESERVE key topk for the hashtags list to p times the number you actually need.

    Note that a theoretical expected error analysis would be quite complex.

    Login or Signup to reply.
    1. Do aTOPK.RESERVE for top_hashes. This will store the top hashes Cars, Politics… RandomVids

    2. When a new post comes in with a hashtag, add the hashtag to top_hashes. Example, say a post with #cars comes in, add #cars to top_hashes.

    3. Check to see if #cars was added. If it wasn’t, then ignore. But if it was, then do another TOPK.RESERVE for #cars and then add the post_id to #cars.

    So now we tracking top_hashes and also the posts within each hash.

    When adding a hash to top_hashes using the TOPK.ADD command, if an item was popped off the top list, it will be returned. Check this return value to know if a hash tag popped off the list. If one was popped off the list, then you can go ahead and delete its corresponding hash from the other reserve. So example, if #dogs popped #cars off the top_hashes list, then also go ahead and delete its #cars that was reserved earlier because its no longer necessary to track it.

    This setup lets you track both top_hashes and also the corresponding posts connected to each hash. It also cleans up the corresponding posts connected to each hash when its no longer on the top_hashes list.

    The only issue is at the ends. When example #cars gets poped of the top_hashes, all its tracked posts will deleted. if it is re-added, then all the posts will have to be tracked all over again.
    A way to solve this is if you are keeping track of say top 10 hashes publicly, then internally keep track of say top 15 (or more). That way, you will not lose posts and have bad user experience.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search