
For example, let's say I want to enrich a collection with more values from a lookup table, by joining on a key.
In the Spark runner, I would prefer to do a broadcast join for this operator, whereas in the Flink runner I would like to make RPC calls (say, to Redis) to load the values based on the key.

So is there a way to achieve this: the same logical semantics, but different execution depending on the runner?

2 Answers


  1. In general, the internal Beam implementation of a given transform already depends on the runner, but if you need runner-specific behavior in your own transforms, you can use PipelineOptions to get the runner name and decide which code path to take (a minimal sketch follows).

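    A minimal sketch of that dispatch, assuming the Java SDK. EnrichWithLookup and its two constructor arguments are hypothetical: you would supply your own broadcast-style join (e.g. via a side input) and your own RPC-lookup transform.

    ```java
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.transforms.PTransform;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    // Hypothetical composite transform whose expansion depends on the runner.
    public class EnrichWithLookup
        extends PTransform<PCollection<KV<String, String>>, PCollection<KV<String, String>>> {

      private final PTransform<PCollection<KV<String, String>>, PCollection<KV<String, String>>>
          broadcastJoin; // e.g. a side-input join, suitable for Spark
      private final PTransform<PCollection<KV<String, String>>, PCollection<KV<String, String>>>
          rpcLookup;     // e.g. a per-element Redis lookup, suitable for Flink

      public EnrichWithLookup(
          PTransform<PCollection<KV<String, String>>, PCollection<KV<String, String>>> broadcastJoin,
          PTransform<PCollection<KV<String, String>>, PCollection<KV<String, String>>> rpcLookup) {
        this.broadcastJoin = broadcastJoin;
        this.rpcLookup = rpcLookup;
      }

      @Override
      public PCollection<KV<String, String>> expand(PCollection<KV<String, String>> input) {
        PipelineOptions options = input.getPipeline().getOptions();
        // getRunner() returns the runner class, e.g. SparkRunner or FlinkRunner.
        String runner = options.getRunner().getSimpleName();
        if (runner.contains("Flink")) {
          return input.apply("RpcLookup", rpcLookup);
        }
        // Default path (including Spark): broadcast-style side-input join.
        return input.apply("BroadcastJoin", broadcastJoin);
      }
    }
    ```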
  2. This is deliberately not part of Beam. The purpose of Beam is to provide a portable programming model, so producing transforms that are not portable works against the goals of the project.

    From within a DoFn you can run arbitrary code, so integrating with the storage system of your choice is easy. But be careful when you do this, since getting correct exactly-once behavior requires a design that takes parallelism, retries, checkpointing, etc. into account (see the sketch after this answer).

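    As a concrete illustration of the DoFn approach, here is a minimal sketch of a per-element Redis lookup using the Jedis client. The host and port are placeholders, and the code assumes the lookup is idempotent, so a retried bundle simply re-reads the same keys.

    ```java
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;
    import redis.clients.jedis.Jedis;

    // Sketch: enrich (key, value) pairs with a value fetched from Redis.
    public class RedisLookupFn extends DoFn<KV<String, String>, KV<String, String>> {

      private final String host;
      private final int port;
      private transient Jedis jedis; // not serialized; created per DoFn instance

      public RedisLookupFn(String host, int port) {
        this.host = host;
        this.port = port;
      }

      @Setup
      public void setup() {
        // One connection per instance, reused across bundles to limit setup cost.
        jedis = new Jedis(host, port);
      }

      @ProcessElement
      public void processElement(ProcessContext c) {
        KV<String, String> element = c.element();
        String enrichment = jedis.get(element.getKey());
        if (enrichment != null) {
          c.output(KV.of(element.getKey(), element.getValue() + "|" + enrichment));
        }
      }

      @Teardown
      public void teardown() {
        if (jedis != null) {
          jedis.close();
        }
      }
    }
    ```

    Note that this sketch drops elements whose key is missing from Redis and does no batching or caching; a production version would likely batch lookups per bundle and handle connection failures explicitly.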