I’ve hit a problem with dockerized Apache Beam. When trying to run the container I am getting a "No id provided." message and nothing more. Here’s the code and files:
Dockerfile
FROM apache/beam_python3.8_sdk:latest
RUN apt update
RUN apt install -y wget curl unzip git
COPY ./ /root/data_analysis/
WORKDIR /root/data_analysis
RUN python3 -m pip install -r data_analysis/beam/requirements.txt
ENV PYTHONPATH=/root/data_analysis
ENV WORKER_ID=1
CMD python3 data_analysis/analysis.py
Code analysis.py:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    options = PipelineOptions(["--runner=DirectRunner"])
    with beam.Pipeline(options=options) as p:
        p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x-1) | beam.Map(print)

if __name__ == "__main__":
    run()
Commands:
% docker build -f Dockerfile_beam -t beam .
[+] Building 242.2s (12/12) FINISHED
...
% docker run --name=beam beam
2021/09/15 13:44:07 No id provided.
I found that this error message is most likely generated by this line: https://github.com/apache/beam/blob/410ad7699621e28433d81809f6b9c42fe7bd6a60/sdks/python/container/boot.go#L98
But what does it mean? Which id is this? What am I missing?
2 Answers
This error is most likely happening because your Docker image is based on the SDK harness image (apache/beam_python3.8_sdk). SDK harness images are used in portable pipelines: when a portable runner needs to execute stages of a pipeline in their original language, it starts a container with the SDK harness and delegates execution of that stage of the pipeline to it. When the SDK harness boots up, it therefore expects various configuration details to be provided by the runner that started it, one of which is the ID. When you start this container directly, those configuration details are not provided, so it crashes. For context on your specific use case, it helps to keep in mind the different processes involved in running a portable pipeline: the pipeline construction code, the job server / runner, and the SDK harness workers it launches.
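To make the error concrete: judging from the boot.go line you linked, the missing id is the worker id that the runner passes to the container's boot entrypoint as a command-line flag, together with several gRPC endpoints. Roughly, a runner launches the harness like this (the addresses are placeholders and the exact flag set varies between Beam versions; this is not something you would normally run by hand):
docker run apache/beam_python3.8_sdk:latest \
  --id=1-1 \
  --provision_endpoint=<runner-host>:<port> \
  --logging_endpoint=<runner-host>:<port> \
  --artifact_endpoint=<runner-host>:<port> \
  --control_endpoint=<runner-host>:<port>
Because the check appears to be on the --id flag, the ENV WORKER_ID=1 in your Dockerfile does not satisfy it.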
Currently, the Docker image you created is based on an SDK harness image, which does not sound like what you want. You seem to have been trying to containerize your pipeline construction code and accidentally containerized the SDK harness instead. But since you described that you want the ReadFromKafka consumer to be containerized, it sounds like what you need is for the Job Server to be containerized, in addition to any SDK harnesses it uses.
Containerizing the Job Server is possible, and may already have been done for you; for example, Beam provides a containerized Flink Job Server. Containerized job servers may give you a bit of trouble with artifacts, as the container won’t have access to artifact staging directories on your local machine, but there may be ways around that.
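As a sketch, one of Beam's prebuilt Flink job server images can be run locally like this (the image tag is an assumption; pick one matching your Beam and Flink versions):
docker run --net=host apache/beam_flink1.12_job_server:latest
If you don't point it at an external Flink cluster, it runs jobs on an embedded one, and it exposes a job endpoint (by default on port 8099) that your pipeline can target via --job_endpoint.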
Additionally, you mentioned that you want to avoid having SDK harnesses start in a nested docker container. If you start up a worker pool docker container for the SDK harness and set it as an external environment, the runner, assuming it supports external environments, will attempt to connect to the URL you supply instead of creating a new docker container. You will need to configure this for the Java cross-language environment as well, if that is possible in the Python SDK. This configuration should be done via Python’s pipeline options; --environment_type and --environment_options are good starting points (see the sketch below).
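A minimal sketch of what that could look like when pointing the pipeline at a containerized job server and an already-running worker pool (the addresses and ports here are placeholders; depending on your Beam version the worker address is passed via --environment_config, as below, or via --environment_options):
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",         # containerized job server
    "--environment_type=EXTERNAL",           # don't spawn a nested SDK harness container
    "--environment_config=localhost:50000",  # address of the external worker pool
])
The worker pool itself can be started from the same SDK image with its --worker_pool flag, e.g. docker run -p 50000:50000 apache/beam_python3.8_sdk:latest --worker_pool, though check the documentation for your Beam version.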
I encountered the same problem, and what you could do is override the ENTRYPOINT; that way you’ll be able to get /bin/sh and play around to your heart’s content (see the sketch below).
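For example, a sketch using the image built above (the image name beam matches the docker build command from the question; any shell present in the image would work):
docker run --rm -it --entrypoint=/bin/sh beam
This drops you into a shell inside the container instead of running the harness boot binary, so you can inspect the filesystem and run your script by hand.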