For example, let's say I want to enrich a collection with more values from a lookup table, by joining on a key.
In the Spark runner, I would prefer to do a broadcast join for this operator, whereas in the Flink runner, I would like to make RPC calls (say, to Redis) to load the values based on the key.
So is there a way to achieve this? Same logical semantics but different execution based on the runner.
2 Answers
In any case, the internal Beam implementation of a given transform depends on the runner, but if you need this for your own transforms, then you can use
PipelineOptions
to get the runner name and decide which code path to take.

This is deliberately not part of Beam. The purpose of Beam is to provide a portable programming model, so producing transforms that are not portable works against the goals of the project.
From within a
DoFn
you can run arbitrary code, so integrating with the storage system of your choice is easy. But be careful when you do this, since getting correct exactly-once behavior requires a design that takes parallelism, retries, checkpointing, etc. into account.
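A minimal sketch of such an enrichment, written as a plain Python class that mirrors the Beam DoFn lifecycle (`setup`/`process`/`teardown`) so it can run standalone. The class and parameter names are illustrative; in a real pipeline this would subclass `beam.DoFn` and the client would be, say, a Redis connection:

```python
class EnrichFn:
    """DoFn-style sketch: enrich each (key, value) element via an external lookup.

    `make_client` is a hypothetical factory for the lookup client; here a
    plain dict stands in for a real store such as Redis.
    """

    def __init__(self, make_client):
        self._make_client = make_client
        self._client = None

    def setup(self):
        # In Beam, setup() runs once per DoFn instance, so the connection
        # is reused across bundles instead of being opened per element.
        self._client = self._make_client()

    def process(self, element):
        key, value = element
        # An idempotent read: safe if the runner retries this bundle.
        extra = self._client.get(key)
        yield (key, value, extra)

    def teardown(self):
        self._client = None
```

Keeping the external call a pure, idempotent read is what makes this retry-safe; writes to the external system from inside `process` would need much more care, for the exactly-once reasons the answer mentions.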