I have a VM (or an auto-scaling MIG, if that makes more sense), a Pub/Sub pull subscription as the input, and a Pub/Sub topic as the output. I want this VM to dedup messages.
The input items are often repeated within a long (years) time frame. In other words: if a message with a key = "abcde123456789" was processed any time before, then it should be acked and nothing should be published to the output topic. Otherwise, publish the input message to the output topic as is.
The number of unique keys is currently under ~10M, and it can grow to approximately 1B over time.
The input is spiky and varies from 1 to 500 messages per second.
Low latency is preferable, but it’s not critical as long as it doesn’t exceed 30 minutes.
Cost efficiency is critical.
The unique message keys are persistently stored in BigQuery.
Possible solutions
- Pull the unique keys from GBQ into a Memorystore instance once, then check each incoming message key against it.
- A persistent key-value store on a persistent SSD. This solution is, presumably, worse than Memorystore in every respect.
- Run GBQ aggregations once every 30 minutes. This works, but I wonder if I can achieve better cost efficiency and latency with another solution.
- Anything else?
2 Answers
If you look at your option 3, there is room to expand it to be more cost efficient with lower latency.
In this scenario you would end up with three objects: two tables and one view.
Table_A serves as your raw data: no deduping, just straight inserts off the Pub/Sub subscription. This table is partitioned by either ingestion time or another timestamp relevant to the message.
Table_B is aggregated nightly and handles the deduplication of the previous day's data with a MERGE. View_C unions Table_B with the latest day's partition of Table_A, deduplicated on the fly. Using a set operator like UNION DISTINCT, you get a deduplicated dataset with low latency and a lower cost footprint than running the aggregation every 30 minutes. Loosely, your view would look something like the sketch below.
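Assuming placeholder names (dataset my_dataset, raw table table_a partitioned by ingestion time, deduplicated table table_b) and a key column message_key, a minimal version of the view could be:

```sql
-- View_C: the already-deduplicated history plus the newest raw partition(s),
-- deduplicated on the fly. All names are assumed placeholders.
CREATE OR REPLACE VIEW my_dataset.view_c AS
SELECT message_key
FROM my_dataset.table_b
UNION DISTINCT
SELECT message_key
FROM my_dataset.table_a
-- Only the partitions the nightly MERGE has not folded in yet; including
-- yesterday covers the window before the nightly job finishes.
WHERE _PARTITIONDATE >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
```

The nightly job that maintains Table_B could then be a MERGE along these lines (same assumed names):

```sql
-- Fold yesterday's partition of the raw table into the deduplicated table.
MERGE my_dataset.table_b AS b
USING (
  SELECT DISTINCT message_key
  FROM my_dataset.table_a
  WHERE _PARTITIONDATE = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
) AS a
ON b.message_key = a.message_key
WHEN NOT MATCHED THEN
  INSERT (message_key) VALUES (a.message_key);
```

The partition filters are what keep the cost down: the on-the-fly branch scans only the newest day or two of raw data, the bulk of the history is read from the already-deduplicated Table_B, and UNION DISTINCT removes any overlap between the two branches. If the tables also carry a payload, deduplicate by key instead, for example with ROW_NUMBER() OVER (PARTITION BY message_key).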
This would be very low latency, as data is available in the view as soon as it is written. From a cost perspective, you minimize the logic and processing required, and you can also get rid of the VM that publishes the data and of the second Pub/Sub topic.
EDIT:
Adding a link from the comments to a similar use case, CDC streaming into BigQuery, which takes the same approach and explains it:
https://cloud.google.com/architecture/database-replication-to-bigquery-using-change-data-capture#query_the_data
Your Possible Solution 1:
If you are interested in deduplication using Redis, a good choice might be a Bloom filter, a probabilistic data structure that is well suited to this kind of membership check.
A Bloom filter is not available in GCP Memorystore, but it is offered in Redis Enterprise, which can also be deployed on GCP as a managed service. I have provided a link to a tutorial below:
https://redis.com/solutions/use-cases/deduplication/#:~:text=The%20RedisBloom%20module%20provides%20Bloom,the%20memory%20a%20set%20requires.
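To give a flavour of how that works, here is a minimal sketch using the RedisBloom commands; the filter name seen_keys, the capacity and the error rate below are assumptions, not values taken from the tutorial:

```
# Reserve a filter sized for up to 1B keys with a 0.1% false-positive rate
# (on the order of 1-2 GB of memory at that setting).
BF.RESERVE seen_keys 0.001 1000000000

# BF.ADD returns 1 if the key was not present before (publish the message),
# 0 if it has probably been seen already (ack it and drop it).
BF.ADD seen_keys abcde123456789
```

One caveat to weigh against the cost saving: a Bloom filter can return false positives, so a small fraction of genuinely new keys would be treated as duplicates and dropped; the error rate is tunable at the price of more memory.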
You mention cost as a factor. You could deploy this for as little as $6 per month.
https://redis.com/redis-enterprise-cloud/pricing/