
I'm using Docker to test and develop an ETL data pipeline with Airflow and AWS Glue, following this blog post as a guide for launching the containers: https://towardsdatascience.com/develop-glue-jobs-locally-using-docker-containers-bffc9d95bd1 (Dockerfile on GitHub: https://github.com/jnshubham/aws-glue-local-etl-docker/blob/master/Dockerfile).

When I run docker build -t glue:latest . I get the error below. It is caused by this line in the Dockerfile:

RUN pip install 'apache-airflow[postgres]'==1.10.10 --constraint https://raw.githubusercontent.com/apache/airflow/1.10.10/requirements/requirements-python3.7.txt

I've googled the first error and tried adding RUN yum install -y python3-devel to the Dockerfile, but I still got the same error. I've also read that it may have to do with the gcc version. On my machine it is currently:

Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/4.2.1
Apple clang version 11.0.3 (clang-1103.0.32.62)
Target: x86_64-apple-darwin19.4.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
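
Since the build runs inside the container, I suspect the host toolchain isn't actually what matters. As a sanity check (assuming the stock centos base image), this shows whether the image ships gcc at all, rather than checking the macOS host:

# check for a C compiler inside the base image, not on the host
docker run --rm centos bash -c 'command -v gcc || echo "gcc not found"'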

Error output:

    Running setup.py install for psutil: started
    Running setup.py install for psutil: finished with status 'error'
    ERROR: Command errored out with exit status 1:
     command: /usr/bin/python3.6 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-ndmkn_ag/psutil/setup.py'"'"'; __file__='"'"'/tmp/pip-install-ndmkn_ag/psutil/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-nduz8awp/install-record.txt --single-version-externally-managed --compile --install-headers /usr/local/include/python3.6m/psutil
    gcc -pthread -Wno-unused-result -Wsign-compare -DDYNAMIC_ANNOTATIONS_ENABLED=1 -DNDEBUG -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -D_GNU_SOURCE -fPIC -fwrapv -fPIC -DPSUTIL_POSIX=1 -DPSUTIL_SIZEOF_PID_T=4 -DPSUTIL_VERSION=570 -DPSUTIL_LINUX=1 -DPSUTIL_ETHTOOL_MISSING_TYPES=1 -I/usr/include/python3.6m -c psutil/_psutil_common.c -o build/temp.linux-x86_64-3.6/psutil/_psutil_common.o
    unable to execute 'gcc': No such file or directory
    Traceback (most recent call last):
      File "/usr/lib64/python3.6/distutils/unixccompiler.py", line 127, in _compile
        extra_postargs)
      File "/usr/lib64/python3.6/distutils/ccompiler.py", line 909, in spawn
        spawn(cmd, dry_run=self.dry_run)
      File "/usr/lib64/python3.6/distutils/spawn.py", line 36, in spawn
        _spawn_posix(cmd, search_path, dry_run=dry_run)
      File "/usr/lib64/python3.6/distutils/spawn.py", line 159, in _spawn_posix
        % (cmd, exit_status))
    distutils.errors.DistutilsExecError: command 'gcc' failed with exit status 1
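
The failure that seems to matter is unable to execute 'gcc': No such file or directory. If it helps, this appears to be a minimal reproduction of just the failing step (assuming psutil has to compile from source on this image):

# reproduce the failing compile step in a bare centos container
docker run --rm centos bash -c 'yum install -y python3 python3-devel && pip3 install psutil'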
    

My Dockerfile consists of:

FROM centos as glue
# initialize package env variables
ENV MAVEN=https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-common/apache-maven-3.6.0-bin.tar.gz
ENV SPARK=https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-1.0/spark-2.4.3-bin-hadoop2.8.tgz
ENV GLUE=https://github.com/awslabs/aws-glue-libs.git
#install required packages needed for aws glue
RUN yum install -y python3 java-1.8.0-openjdk java-1.8.0-openjdk-devel tar git wget zip

RUN yum install -y python3-devel

# point python and pip at the python3 tooling
RUN ln -s /usr/bin/python3 /usr/bin/python
RUN ln -s /usr/bin/pip3 /usr/bin/pip
RUN mkdir /usr/local/glue
WORKDIR /usr/local/glue
RUN git clone -b glue-1.0 $GLUE
RUN wget $SPARK
RUN wget $MAVEN
RUN tar zxfv apache-maven-3.6.0-bin.tar.gz
RUN tar zxfv spark-2.4.3-bin-hadoop2.8.tgz
RUN rm spark-2.4.3-bin-hadoop2.8.tgz
RUN rm apache-maven-3.6.0-bin.tar.gz
# locate the JDK directory from the rpm file list and move it to a stable path
RUN mv $(rpm -q -l java-1.8.0-openjdk-devel | grep "/bin$" | rev | cut -d"/" -f2- | rev) /usr/lib/jvm/jdk
ENV SPARK_HOME /usr/local/glue/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8
ENV MAVEN_HOME /usr/local/glue/apache-maven-3.6.0
ENV JAVA_HOME /usr/lib/jvm/jdk
ENV GLUE_HOME /usr/local/glue/aws-glue-libs
ENV PATH $PATH:$MAVEN_HOME/bin:$SPARK_HOME/bin:$JAVA_HOME/bin:$GLUE_HOME/bin
RUN sh aws-glue-libs/bin/glue-setup.sh
#compile dependencies with maven build
RUN sed -i '/mvn -f/a rm /usr/local/glue/aws-glue-libs/jarsv1/netty-*' /usr/local/glue/aws-glue-libs/bin/glue-setup.sh
RUN sed -i '/mvn -f/a rm /usr/local/glue/aws-glue-libs/jarsv1/javax.servlet-3.*' /usr/local/glue/aws-glue-libs/bin/glue-setup.sh
#clean tmp dirs
RUN yum clean all
RUN rm -rf /var/cache/yum

ENV AIRFLOW_HOME /usr/local/airflow

WORKDIR /usr/local/src

COPY requirements.txt ./

RUN pip install --upgrade pip && \
    pip install --no-cache-dir -r requirements.txt && \
    pip install 'apache-airflow[postgres]'==1.10.10 \
    --constraint https://raw.githubusercontent.com/apache/airflow/1.10.10/requirements/requirements-python3.7.txt

RUN mkdir glue_etl_scripts
COPY glue_etl_scripts/log_data.py glue_etl_scripts/log_data.py

RUN mkdir config
COPY config/aws.cfg /config/aws.cfg
COPY config/airflow.cfg $AIRFLOW_HOME/airflow.cfg

RUN mkdir scripts
COPY scripts/entrypoint.sh scripts/entrypoint.sh
COPY scripts/connections.sh scripts/connections.sh

ENTRYPOINT ["scripts/entrypoint.sh"]
CMD ["webserver"]

2 Answers


  1. Chosen as BEST ANSWER

    Adding the line below to the Dockerfile did the trick.

    RUN yum install -y gcc python3-devel
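
    In context, the relevant part of the Dockerfile becomes (gcc has to be
    installed before the pip step so that psutil's C extension can compile):

    RUN yum install -y gcc python3-devel

    RUN pip install --upgrade pip && \
        pip install --no-cache-dir -r requirements.txt && \
        pip install 'apache-airflow[postgres]'==1.10.10 \
        --constraint https://raw.githubusercontent.com/apache/airflow/1.10.10/requirements/requirements-python3.7.txt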


  2. To resolve the error, I had to run this on openSUSE Leap 15.3:

    sudo zypper install -t pattern devel_basis
    

    Which is equivalent to running this on Ubuntu:

    sudo apt-get install build-essential
    

    https://stackoverflow.com/a/58680740/3405291
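
    On CentOS, the base image used in the question, the rough equivalent is:

    yum groupinstall -y "Development Tools"

    though for this particular failure installing gcc and python3-devel, as in
    the accepted answer, is enough.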
