I’m currently looking to replace a flawed implementation of the fork-join-queue described in the following video:
https://youtu.be/zSDC_TU7rtc?t=33m37s
I realize that this video is nearly eight years old now, and I would be very happy to learn of any potential new and better ways to do such things, but for now I’m focusing on trying to make this work as described by Brett. As of right now, what’s in front of me is a bit of a mess.
One of the things the original developer did differently from Brett is that he puts all work items for a particular sum_name into a single entity group.
I’m still relatively new to Datastore, but to me it seems like that defeats the entire purpose, since adding a new entity to the entity group several times a second will cause contention, which is the very thing we’re trying to avoid by batching changes.
As for why someone would try to put all the work in a single entity group, the original developer’s comments are clear: he’s trying to prevent work items from getting skipped due to eventual consistency. This led me to really dig into Brett’s implementation, and I’m puzzled, because it does seem like a problem Brett isn’t considering.
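To make the tradeoff concrete, here is a tiny sketch of the two keying approaches as I understand them (the kind and property names are mine, not from the actual codebase):

```python
from google.appengine.ext import ndb


class WorkItem(ndb.Model):  # hypothetical work-item kind
    sum_name = ndb.StringProperty()
    delta = ndb.IntegerProperty()


def put_single_group(sum_name, delta):
    # The original developer's approach: every item shares one parent, so
    # every put() lands in the same entity group. An ancestor query over that
    # group is strongly consistent (nothing gets skipped), but sustained
    # writes several times a second will contend on the group.
    parent = ndb.Key('Sum', sum_name)
    WorkItem(parent=parent, sum_name=sum_name, delta=delta).put()


def put_root_entities(sum_name, delta):
    # The approach I understand Brett to describe: each item is its own
    # entity group, so writes scale, but finding "all items for sum_name"
    # means a non-ancestor query against an eventually consistent index.
    WorkItem(sum_name=sum_name, delta=delta).put()
```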
Put simply, when Brett’s task queries for work items, the index it is using may not be fully up to date. Sure, the memcache lock should make this unlikely, since once the task starts it prevents more work items from being added to that batch. However, what if the index update takes long enough that something is written before the lock decrements but still doesn’t come back in the query results? Won’t such a work item wind up just sitting in the datastore, never to be consumed?
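For reference, here is roughly how I read the fan-in flow, stripped down to the parts that matter here (my own sketch, not Brett’s code; the names and the /apply_batch URL are invented, and I’ve left out the writer-count lock for brevity):

```python
from google.appengine.api import memcache, taskqueue
from google.appengine.ext import ndb


class WorkItem(ndb.Model):  # hypothetical work-item kind
    sum_name = ndb.StringProperty()
    index = ndb.IntegerProperty()   # which batch this item belongs to
    delta = ndb.IntegerProperty()


def add_work(sum_name, delta):
    # Writer: tag the item with the current batch index and make sure a
    # worker task exists for that batch (the named task acts as the join).
    index = memcache.get('index-' + sum_name) or 0
    WorkItem(sum_name=sum_name, index=index, delta=delta).put()
    try:
        taskqueue.add(url='/apply_batch',
                      name='%s-%d' % (sum_name, index),
                      params={'sum_name': sum_name, 'index': index},
                      countdown=1)
    except (taskqueue.TaskAlreadyExistsError, taskqueue.TombstonedTaskError):
        pass  # another writer already scheduled this batch


def apply_batch(sum_name, index):
    # Worker: bump the batch index so new writers start a new batch...
    memcache.incr('index-' + sum_name, initial_value=0)
    # ...then query for this batch. This is the query I'm worried about: a
    # put() that committed just before the bump may not be in the index yet,
    # and since no later task ever queries batch `index` again, that item
    # would be stranded.
    items = WorkItem.query(WorkItem.sum_name == sum_name,
                           WorkItem.index == index).fetch(1000)
    total = sum(item.delta for item in items)
    ndb.delete_multi([item.key for item in items])
    return total
```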
Is there some facet of Brett’s implementation that deals with this that I’m not seeing? Obviously Brett knows what he is doing and was very confident in this, so I feel like I must be missing something.
If not, though, how might one go about handling this?
2 Answers
I think that bit about the hashing is how he avoids that, to ‘Distribute the load across Bigtable’ and keep all the work entries from ‘writing to the beginning of the same row’: https://youtu.be/zSDC_TU7rtc?t=48m40s
Even if that’s not it, his tribal knowledge of the inner workings of Datastore does seem to be coming into play here.
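If it helps, the hashing trick could look something like this (purely illustrative; it just assumes the batch name or index ends up as a key prefix somewhere):

```python
import hashlib


def scattered_name(sum_name, index):
    # Prefix with a few hex characters of a hash so consecutive batches of
    # the same sum don't sort next to each other, i.e. don't all write to the
    # beginning of the same Bigtable row range.
    digest = hashlib.md5(('%s-%d' % (sum_name, index)).encode('utf-8')).hexdigest()
    return '%s-%s-%d' % (digest[:4], sum_name, index)
```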
You could have your task launch another task, which performs an integrity check 10 seconds later.
He actually does mention having an offline job pick up any data that got dropped if you have a ‘lossy data model’: https://youtu.be/zSDC_TU7rtc?t=48m40s
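A sketch of that delayed check (the URL, handler, and WorkItem kind here are all invented):

```python
from google.appengine.api import taskqueue
from google.appengine.ext import ndb


class WorkItem(ndb.Model):  # hypothetical work-item kind
    sum_name = ndb.StringProperty()
    index = ndb.IntegerProperty()
    delta = ndb.IntegerProperty()


def schedule_sweep(sum_name, index):
    # Enqueued by the batch task just before it finishes.
    taskqueue.add(url='/sweep_batch',
                  params={'sum_name': sum_name, 'index': index},
                  countdown=10)  # give the index time to catch up


def sweep_batch(sum_name, index):
    # By now the index should have caught up; anything still tagged with this
    # batch was missed by the original query and can be applied now.
    return WorkItem.query(WorkItem.sum_name == sum_name,
                          WorkItem.index == index).fetch(1000)
```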
Given its date, the talk assumed the master/slave Datastore: it’s from 2010, and the High Replication Datastore (https://googleappengine.blogspot.com/2011/01/announcing-high-replication-datastore.html) wasn’t released until about six months later.
One way around the entity group contention would be to manually create the work item keys with names like task-name-INDEX, and then in the task do a get on all the keys from task-name-0 to task-name-TOP_INDEX; the top index can probably be stored in memcache.
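Roughly, that could look like the following (names invented; the point is that a get by key is strongly consistent, so no query index is involved):

```python
from google.appengine.api import memcache
from google.appengine.ext import ndb


class WorkItem(ndb.Model):  # hypothetical work-item kind
    delta = ndb.IntegerProperty()


def add_work(task_name, delta):
    # Atomically claim the next slot number for this batch.
    top = memcache.incr('top-index-' + task_name, initial_value=0)
    if top is None:
        raise RuntimeError('memcache unavailable; need a fallback')
    WorkItem(id='%s-%d' % (task_name, top), delta=delta).put()


def consume(task_name):
    top = int(memcache.get('top-index-' + task_name) or 0)
    keys = [ndb.Key(WorkItem, '%s-%d' % (task_name, i))
            for i in range(0, top + 1)]
    # get_multi fetches by key, bypassing the eventually consistent query
    # index; slots whose put() hasn't landed yet come back as None and can
    # be retried on a later pass.
    items = [item for item in ndb.get_multi(keys) if item is not None]
    return sum(item.delta for item in items)
```

The caveat is that memcache can drop the counter at any time, so the top index would need a durable backup (or a conservative upper bound) to be safe.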