skip to Main Content

In my application I run a cron job to loop over all users (2500 user) to choose an item for every user out of 4k items, considering that:
– choosing the item is based on some user info,
– I need to make sure that each user take a unique item that wasn’t taken by any one else, so relation is one-to-one

To achieve this I have to run this cron job and loop over the users one by one sequentially and pick up the item for each then remove it from the list (not to be chosen by next user(s)) then move to the next user

actually in my system the number of users/items is getting bigger and bigger every single day, this cron job now takes 2 hours to set items to all users.

I need to improve this, one of the things I’ve thought about is using Threads but I cant do that since Im using automatic scaling, so I start thinking about push Queues, so when the cron jobs run, will make a loop like this:

for(User user : users){
    getMyItem(user.getId());
}

where getMyItem will push the task to a servlet to handle it and choose the best item for this person based on his data.

Let’s say I’ll start doing that so what will be the best/robust solution to avoid setting an item to more than one user ?

Since Im using basic scaling and 8 instances, can’t rely on static variables.
one of the things that came across my mind is to create a table in the DB that accept only unique items then I insert into it the taken items so if the insertion is done successfully it means no body else took this item so i can just assign it to that person, but this will make the performance a bit lower cause I need to make write DB operation with every call (I want to avoid that)

Also I thought about MemCach, its really fast but not robust enough, if I save a Set of items into it which will accept only unique items, then if more than one thread was trying to access this Set at the same time to update it, only one thread will be able to save its data and all other threads data might be overwritten and lost.

I hope you guys can help to find a solution for this problem, thanks in advance 🙂

2

Answers


  1. If you are planning for push jobs inside a cron and you want the jobs to be updating key-value pairs as an addon to improvise the speed and performance, we can split the number of users and number of items into multiple key-(list of values) pairs so that our push jobs will pick the key random ( logic to write to pick a key out of 4 or 5 keys) and then remove an item from the list of items and update the key again, try to have a locking before working on the above part. Example of key value paris.
    Userlist1: [“vijay”,…]
    Userlist2: [“ramana”,…]

    Login or Signup to reply.
  2. First – I would advice against using solely memcache for such algorithm – the key thing to remember about memcache is that it is volatile and might dissapear at any time, breaking the algorithm.

    From Service levels:

    Note: Whether shared or dedicated, memcache is not durable storage. Keys can be evicted when the cache fills up, according to the
    cache’s LRU policy. Changes in the cache configuration or datacenter
    maintenance events can also flush some or all of the cache.

    And from How cached data expires:

    Under rare circumstances, values can also disappear from the cache
    prior to expiration for reasons other than memory pressure. While
    memcache is resilient to server failures, memcache values are not
    saved to disk, so a service failure can cause values to become
    unavailable.

    I’d suggest adding a property, let’s say called assigned, to the item entities, by default unset (or set to null/None) and, when it’s assigned to a user, set to the user’s key or key ID. This allows you:

    • to query for unassigned items when you want to make assignments
    • to skip items recently assigned but still showing up in the query results due to eventual consistency, so no need to struggle for consistency
    • to be certain that an item can uniquely be assigned to only a single user
    • to easily find items assigned to a certain user if/when you’re doing per-user processing of items, eventually setting the assigned property to a known value signifying done when its processing completes

    Note: you may need a one-time migration task to update this assigned property for any existing entities when you first deploy the solution, to have these entities included in the query index, otherwise they would not show up in the query results.

    As for the growing execution time of the cron jobs: just split the work into multiple fixed-size batches (as many as needed) to be performed in separate requests, typically push tasks. The usual approach for splitting is using query cursors. The cron job would only trigger enqueueing the initial batch processing task, which would then enqueue an additional such task if there are remaining batches for processing.

    To get a general idea of such a solution works take a peek at Google appengine: Task queue performance (it’s python, but the general idea is the same).

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search