I have a VM (or an auto-scaling MIG, if that makes more sense), a Pub/Sub pull subscription as the input, and a Pub/Sub topic as the output. I want this VM to dedup messages.
The input items are often repeated within a long (years) time frame. In other words: if a message with a key = "abcde123456789" was processed any time before, then it should be acked and nothing should be published to the output topic. Otherwise, publish the input message to the output topic as is.
The number of unique keys is currently under ~10M, and it can grow to approximately 1B over time.
The input is spiky and varies from 1 to 500 messages per second.
Low latency is preferable, but it’s not critical as long as it doesn’t exceed 30 minutes.
Cost efficiency is critical.
The unique message keys are persistently stored in BigQuery.
Possible solutions
- Pull the unique keys from GBQ into a Memorystore instance once, then check each incoming message key against it.
- A persistent key-value store on a persistent SSD. This solution is, presumably, worse than Memorystore in every respect.
- Run GBQ aggregations once every 30 minutes. This works, but I wonder if I can achieve better cost efficiency and latency with another solution.
- Anything else?
2 Answers
If you look at your option 3, there is room to expand it to be more cost efficient with lower latency.
In this scenario you would end up with three objects: two tables and one view.
Table_A serves as your raw data: no deduping, just straight inserts off the Pub/Sub subscription. This table is partitioned by either ingestion time or another timestamp relevant to the message.
Table_B is aggregated nightly and handles the deduplication of the previous day's data with a MERGE. View_C unions Table_B with the latest day's partition of Table_A, deduplicated on the fly. Using a set operator like UNION DISTINCT, you get a deduplicated dataset with low latency and a lower cost footprint than running the aggregation every 30 minutes. Loosely, your view would look something like the sketch below.
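Assuming placeholder names (dataset my_dataset, raw table table_a partitioned by ingestion time, deduplicated table table_b) and a key column message_key, a minimal version of the view could be:

```sql
-- View_C: the already-deduplicated history plus the newest raw partition(s),
-- deduplicated on the fly. All names are assumed placeholders.
CREATE OR REPLACE VIEW my_dataset.view_c AS
SELECT message_key
FROM my_dataset.table_b
UNION DISTINCT
SELECT message_key
FROM my_dataset.table_a
-- Only the partitions the nightly MERGE has not folded in yet; including
-- yesterday covers the window before the nightly job finishes.
WHERE _PARTITIONDATE >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
```

The nightly job that maintains Table_B could then be a MERGE along these lines (same assumed names):

```sql
-- Fold yesterday's partition of the raw table into the deduplicated table.
MERGE my_dataset.table_b AS b
USING (
  SELECT DISTINCT message_key
  FROM my_dataset.table_a
  WHERE _PARTITIONDATE = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
) AS a
ON b.message_key = a.message_key
WHEN NOT MATCHED THEN
  INSERT (message_key) VALUES (a.message_key);
```

The partition filters are what keep the cost down: the on-the-fly branch scans only the newest day or two of raw data, the bulk of the history is read from the already-deduplicated Table_B, and UNION DISTINCT removes any overlap between the two branches. If the tables also carry a payload, deduplicate by key instead, for example with ROW_NUMBER() OVER (PARTITION BY message_key).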
This would be very low latency, as data is available in the view as soon as it is written. From a cost perspective, you minimize the logic and processing required, and you can also get rid of the VM that publishes the data and of the second Pub/Sub topic.
EDIT:
Adding a link from the comments to a similar use case, CDC streaming into BigQuery, which takes the same approach and explains it:
https://cloud.google.com/architecture/database-replication-to-bigquery-using-change-data-capture#query_the_data
Your Possible Solution 1:
If you are interested in deduplication using Redis, a good choice might be a Bloom filter, a probabilistic data structure that is well suited to this kind of membership check.
A Bloom filter is not available in GCP Memorystore, but it is offered in Redis Enterprise, which can also be deployed on GCP as a managed service. I have provided a link to a tutorial below:
https://redis.com/solutions/use-cases/deduplication/#:~:text=The%20RedisBloom%20module%20provides%20Bloom,the%20memory%20a%20set%20requires.
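To give a flavour of how that works, here is a minimal sketch using the RedisBloom commands; the filter name seen_keys, the capacity and the error rate below are assumptions, not values taken from the tutorial:

```
# Reserve a filter sized for up to 1B keys with a 0.1% false-positive rate
# (on the order of 1-2 GB of memory at that setting).
BF.RESERVE seen_keys 0.001 1000000000

# BF.ADD returns 1 if the key was not present before (publish the message),
# 0 if it has probably been seen already (ack it and drop it).
BF.ADD seen_keys abcde123456789
```

One caveat to weigh against the cost saving: a Bloom filter can return false positives, so a small fraction of genuinely new keys would be treated as duplicates and dropped; the error rate is tunable at the price of more memory.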
You mention cost as a factor. You could deploy this for as little as $6 per month.
https://redis.com/redis-enterprise-cloud/pricing/