I’m trying to use a local Docker image to run a Beam pipeline, but the image does not seem to be recognized, even though I followed the steps suggested in the Beam documentation (https://beam.apache.org/documentation/runtime/environments/).

I performed the following steps:

  1. Created a Dockerfile with my custom dependencies (pypostal and fuzzywuzzy):

Dockerfile

FROM apache/beam_python3.7_sdk:2.25.0

## System Dependencies
RUN apt-get update && apt-get upgrade -y && apt-get clean

ENV TZ=America
RUN DEBIAN_FRONTEND="noninteractive" apt-get -y install tzdata

# Python package management and basic dependencies
RUN apt-get install -y curl python3.7 python3.7-dev python3.7-distutils build-essential graphviz git-all
# Register the version in alternatives
RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.7 1
## Create User Directory
RUN mkdir -p /home/user


# LIBPOSTAL
# Install Libpostal dependencies
RUN apt-get update && \
    apt-get install -y \
        git \
        make \
        curl \
        autoconf \
        automake \
        libtool \
        pkg-config

# Download libpostal source to /usr/local/libpostal
RUN cd /usr/local && \
    git clone https://github.com/openvenues/libpostal

# Create Libpostal data directory at /var/libpostal/data
RUN cd /var && \
    mkdir libpostal && \
    cd libpostal && \
    mkdir data

# Install Libpostal from source
RUN cd /usr/local/libpostal && \
    ./bootstrap.sh && \
    ./configure --datadir=/var/libpostal/data && \
    make -j4 && \
    make install && \
    ldconfig

# Python Packages
COPY requirements.txt /requirements.txt 

# Install Pip Requirements
RUN pip install -r requirements.txt


ENV PYTHONPATH "${PYTHONPATH}:/home/user"

WORKDIR /home/user

requirements.txt

fuzzywuzzy
postal
  2. Created a pipeline.py file with the following Beam pipeline code:
import apache_beam as beam
import argparse
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions

from postal.expand import expand_address
from postal.parser import parse_address
from fuzzywuzzy import fuzz

def run(argv=None, save_main_session=True):
    parser = argparse.ArgumentParser()
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = save_main_session


    addresses_examples = [
        "465 windward pkwy, alpharetta, georgia, u.s.a.",
        "2018 colby taylor drive, 40475, richmond, usa",
        "19-21 city road, chester ,chester ch1 3ae",
        "no.12 lishi hutong, chaoyangmen ; nei nanxiaoj",
        "building b, 25 yuan da road haidian district,"
    ]

    class ParseAddress(beam.DoFn):

        def process(self, text):
            yield parse_address(text)


    with beam.Pipeline(options=pipeline_options) as p:
        plants = (
          p
          | 'Addresses' >> beam.Create(addresses_examples)
          | 'Parser' >> beam.ParDo(ParseAddress())
          | beam.Map(print))

if __name__ == '__main__':
    run()
  3. Ran the script using the command:
python3 -m pipeline --runner=PortableRunner --environment_type="DOCKER" --environment_config="beam-text:0.1" --job_endpoint=embed

(beam-text:0.1 is my image name)

But I am still receiving the error message:

No module named 'apache_beam'

It seems that Beam is ignoring my custom container arguments.

2 Answers


  1. Add apache_beam to your requirements.txt. You need apache_beam to be installed both inside and outside of the container.
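
A quick way to see whether the launching environment (outside the container) has what the pipeline imports is a small check like the one below. This is only a sketch: the helper name `missing_modules` is made up for illustration, and the module list is taken from the imports in the pipeline.py shown in the question.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found here."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Top-level modules imported by pipeline.py in the question.
    required = ["apache_beam", "postal", "fuzzywuzzy"]
    missing = missing_modules(required)
    if missing:
        print("Missing in this environment:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Run it with the same interpreter that launches the pipeline (here, `python3`). If `apache_beam` shows up as missing, the "No module named 'apache_beam'" error is probably coming from the launching side rather than from the container.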

  2. If you use a virtual env to launch your Beam job, you need to have the same Python packages installed in your virtual env and in the Docker image (including, of course, the Beam Python SDK with its GCP extra).

    The packages for the virtual env can be managed by the requirements.txt file or tools like Pipenv or Poetry.

    The runner in the virtual env instantiates the job; then, in the execution phase, the job uses the Docker image.
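
To check that the two environments really match, one option is to compare `pip freeze` output from the virtual env with `pip freeze` output from the image. The sketch below assumes you have already captured both outputs as text (for the image, something like `docker run --rm --entrypoint pip beam-text:0.1 freeze` should work, since the Beam SDK image's default entrypoint is its boot program). The function names `parse_pins` and `diff_pins` and the version numbers in the example are illustrative, not taken from the question.

```python
def parse_pins(freeze_text):
    """Parse `pip freeze` style text into {normalized_name: version}."""
    pins = {}
    for line in freeze_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and entries that are not plain "name==version" pins.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower().replace("_", "-")] = version.strip()
    return pins


def diff_pins(host_text, image_text):
    """Return {name: (host_version, image_version)} for packages that differ.

    A version of None means the package is absent on that side.
    """
    host, image = parse_pins(host_text), parse_pins(image_text)
    report = {}
    for name in sorted(set(host) | set(image)):
        if host.get(name) != image.get(name):
            report[name] = (host.get(name), image.get(name))
    return report


if __name__ == "__main__":
    # Illustrative data only; real input would come from `pip freeze`.
    host_freeze = "apache-beam==2.25.0\nfuzzywuzzy==0.18.0\n"
    image_freeze = "apache-beam==2.25.0\nfuzzywuzzy==0.17.0\npostal==1.1.10\n"
    for name, (host_v, image_v) in diff_pins(host_freeze, image_freeze).items():
        print(f"{name}: virtual env={host_v}, image={image_v}")
```

Any package reported with `None` on one side is installed in only one of the two environments, which is the situation this answer warns about.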
