I have to run a Python program on Red Hat 8, so I pull the Red Hat Docker image and write a Dockerfile, which is the following:
FROM redhat/ubi8:latest
RUN echo "nameserver 9.9.9.9" >> /etc/resolv.conf && mkdir /home/spark && mkdir /home/spark/spark && mkdir /home/spark/ETL && mkdir /usr/lib/java && mkdir /usr/share/oracle
# set environment vars
ENV SPARK_HOME /home/spark/spark
ENV JAVA_HOME /usr/lib/java
# install packages
RUN echo "nameserver 9.9.9.9" >> /etc/resolv.conf && \
    yum install -y rsync && yum install -y wget && yum install -y python3-pip && \
    yum install -y openssh-server && yum install -y openssh-clients && \
    yum install -y unzip && yum install -y python38 && yum install -y nano
# create ssh keys
RUN echo "nameserver 9.9.9.9" >> /etc/resolv.conf && \
    ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa && \
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys && \
    chmod 0600 ~/.ssh/authorized_keys
# copy ssh config
COPY ssh_config /root/.ssh/config
COPY spark-3.1.2-bin-hadoop3.2.tgz /home/
COPY jdk-8u25-linux-x64.tar.gz /home/
COPY instantclient-basic-linux.x64-19.8.0.0.0dbru.zip /home
COPY etl /home/ETL/
RUN tar -zxvf /home/spark-3.1.2-bin-hadoop3.2.tgz -C /home/spark && \
    mv -v /home/spark/spark-3.1.2-bin-hadoop3.2/* $SPARK_HOME && \
    tar -zxvf /home/jdk-8u25-linux-x64.tar.gz -C /home/spark && \
    mv -v /home/spark/jdk1.8.0_25/* $JAVA_HOME && \
    unzip /home/instantclient-basic-linux.x64-19.8.0.0.0dbru.zip -d /home/spark && \
    mv -v /home/spark/instantclient_19_8 /usr/share/oracle && \
    echo "export JAVA_HOME=$JAVA_HOME" >> ~/.bashrc && \
    echo "export PATH=$PATH:$JAVA_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin:/usr/share/oracle/instantclient_19_8" >> ~/.bashrc && \
    echo "export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/share/oracle/instantclient_19_8" >> ~/.bashrc && \
    echo "PYTHONPATH = $PYTHONPATH:/usr/bin/python3.8" >> ~/.bashrc && \
    echo "alias python=/usr/bin/python3.8" >> ~/.bashrc
#WARNING: Running pip install with root privileges is generally not a good idea. Try `python3.8 -m pip install --user` instead.
# so I have to create a user
RUN echo "nameserver 9.9.9.9" >> /etc/resolv.conf
RUN useradd -d /home/spark/myuser myuser
USER myuser
WORKDIR /home/spark/myuser
ENV PATH="/home/spark/myuser/.local/bin:$PATH"
RUN python3.8 -m pip install --user pandas && \
    python3.8 -m pip install --user cx-Oracle && \
    python3.8 -m pip install --user persiantools && \
    python3.8 -m pip install --user pyspark && \
    python3.8 -m pip install --user py4j && \
    python3.8 -m pip install --user python-dateutil && \
    python3.8 -m pip install --user pytz && \
    python3.8 -m pip install --user setuptools && \
    python3.8 -m pip install --user six && \
    python3.8 -m pip install --user numpy
# copy spark configs
ADD spark-env.sh $SPARK_HOME/conf/
ADD workers $SPARK_HOME/conf/
# expose various ports
EXPOSE 7012 7013 7014 7015 7016 8881 8081 7077
Also, I copy the required config files and build the Dockerfile with this script:
#!/bin/bash
cp /etc/ssh/ssh_config .
cp /opt/spark/conf/spark-env.sh .
cp /opt/spark/conf/workers .
sudo docker build -t my_docker .
echo "Script Finished."
The Dockerfile built without any error. Then I make a tar file from the resulting image with this command:
sudo docker save my_docker > my_docker.tar
After that, I copy my_docker.tar to another computer, then load it and run it:
sudo docker load < my_docker.tar
sudo docker run -it my_docker
Unfortunately, when I run my program inside the Docker container, I receive errors about Python packages such as numpy, pyspark, and pandas:
File "/home/spark/ETL/test/main.py", line 3, in <module>
import cst_utils as cu
File "/home/spark/ETL/test/cst_utils.py", line 5, in <module>
import group_state as gs
File "/home/spark/ETL/test/group_state.py", line 1, in <module>
import numpy as np
ModuleNotFoundError: No module named 'numpy'
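One way to narrow this down is to check which interpreter inside the loaded image actually has the packages, for example:

# Illustrative checks against the loaded image; both run as the image's default user
sudo docker run --rm my_docker python3.8 -m pip list --user
sudo docker run --rm my_docker python3.8 -c "import numpy, pandas, pyspark"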
I also tried installing the Python packages inside the running container and then committing the container. But when I exit the container and enter it again, no Python packages are installed.
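For reference, such an install-and-commit flow looks roughly like this (container and tag names here are placeholders); the committed image must then be run by its new tag, otherwise the original, unchanged my_docker image starts again:

# start a container from the image and install the packages interactively
sudo docker run -it --name etl_fix my_docker
#   ...inside the container: python3.8 -m pip install --user numpy pandas ...
# commit the container to a new tag from the host
sudo docker commit etl_fix my_docker:fixed
# the packages only exist in the new tag, so run that one
sudo docker run -it my_docker:fixed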
Would you please guide me on what is wrong with my approach?
Any help is really appreciated.
3 Answers
Problem solved. I changed the Dockerfile: first, I did not define any user; then, I set PYSPARK_PYTHON, so there were no more errors about importing packages. The Dockerfile is like this:
I hope it is useful for others.
Aside from any issues with the Dockerfile setup itself: in your spark-env.sh, set these to make sure Spark is using the same environment that pip installed the packages into.
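For illustration, a minimal sketch of what those lines could look like, assuming the packages were installed for /usr/bin/python3.8 as in the Dockerfile above (the exact interpreter path is an assumption):

# spark-env.sh -- point both the driver and the workers at the interpreter that has the pip packages
export PYSPARK_PYTHON=/usr/bin/python3.8
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.8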
Keep in mind that Spark SQL DataFrames should really be used instead of numpy, and you don't need to pip install pyspark since it is already part of the downloaded Spark package.

I played around with your code, removing most of the stuff that seemed (to me) irrelevant to the problem.
I found that moving the RUN line that writes to ~/.bashrc down, after USER myuser, solved it. Before that I got python not found, and python3 turned out not to have numpy either, whereas python3.8 did. So there was some confusion there; maybe in your full example something happens that obscures this even more. But try to move that statement, because ~/.bashrc is NOT the same when you change user.
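A quick sketch of what that means in practice, using the image name from the question (the commands themselves are only illustrative):

# RUN lines placed before USER myuser execute as root, so ~ expands to /root there.
sudo docker run --rm --user root my_docker bash -c 'echo $HOME'   # /root
# The image's default user is myuser, whose home (and ~/.bashrc) is a different path.
sudo docker run --rm my_docker bash -c 'echo $HOME'               # /home/spark/myuser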