Using the RunConfiguration() class, I passed my custom Dockerfile to set up the environment for the python script as follows:
from azureml.core.runconfig import RunConfiguration, DockerConfiguration

rc = RunConfiguration()
#rc.environment.use_docker = True
rc.docker = DockerConfiguration(use_docker=True)
rc.environment.from_dockerfile("webscraping_env", "./Dockerfile")
In the serialized config of my rc, I can see that the docker section is:
"docker": {
"arguments": [],
"baseDockerfile": "FROM python:3.8nnRUN apt-get update nRUN apt-get install -y gconf-service libasound2 libatk1.0-0 libcairo2 libcups2 libfontconfig1 libgdk-pixbuf2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libxss1 fonts-liberation libappindicator1 libnss3 lsb-release xdg-utilsnn#download and install chromenRUN wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.debnRUN dpkg -i google-chrome-stable_current_amd64.deb; apt-get -fy installnn# install chromedrivernRUN apt-get install -yqq unzipnRUN wget -O /tmp/chromedriver.zip http://chromedriver.storage.googleapis.com/`curl -sS chromedriver.storage.googleapis.com/LATEST_RELEASE`/chromedriver_linux64.zipnRUN unzip /tmp/chromedriver.zip chromedriver -d /usr/local/bin/nnENV DISPLAY=:99nnRUN pip install selenium pandas bs4 lxml",
"baseImage": null,
"baseImageRegistry": {
"address": null,
"password": null,
"registryIdentity": null,
"username": null
},
"buildContext": null,
"enabled": false,
"platform": {
"architecture": "amd64",
"os": "Linux"
},
"sharedVolumes": true,
"shmSize": "2g"
}
and the python section looks like this:
"python": {
"baseCondaEnvironment": null,
"condaDependencies": {
"channels": [
"anaconda",
"conda-forge"
],
"dependencies": [
"python=3.8.13",
{
"pip": [
"azureml-defaults"
]
}
],
"name": "project_environment"
},
"condaDependenciesFile": null,
"interpreterPath": "python",
"userManagedDependencies": true
}
And when I submit my pipeline, which consists of a single step that performs web scraping using selenium and bs4:
step = PythonScriptStep(
script_name="./webscraping-script.py",
source_directory=".",
arguments=["--output_path", webscrape_ouput],
outputs=[webscrape_ouput],
compute_target=AmlCompute(ws, "webscrape-nb"),
runconfig=rc,
allow_reuse=False)
I get an ImportError inside webscraping-script.py informing me that selenium cannot be found, and the pipeline run stops. I suspect that my Dockerfile is not being used as the environment for running the script.

My question: how do I achieve this? I cannot find any argument on PythonScriptStep that accepts an environment directly, the way you pass an environment argument to ParallelRunConfig when setting up a ParallelRunStep. I was expecting my Dockerfile to be used as the environment for the python script.
2 Answers
You can try adding the selenium package to the conda_dependencies of your RunConfiguration object before submitting the pipeline.

Alternatively, you can register your custom environment from a Dockerfile and then configure RunConfiguration to use this custom environment. What is nice about registering your environment is that you can easily reuse it.
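As a sketch of that first suggestion, assuming the SDK v1 CondaDependencies class (note this only takes effect when AzureML manages the environment, i.e. user_managed_dependencies is False):

```python
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.runconfig import RunConfiguration

rc = RunConfiguration()

# Declare the pip packages the script imports.
deps = CondaDependencies()
deps.add_pip_package("selenium")
deps.add_pip_package("bs4")
rc.environment.python.conda_dependencies = deps

# AzureML only materializes these conda dependencies when it manages
# the environment, so user-managed dependencies must stay disabled:
rc.environment.python.user_managed_dependencies = False
```

This only covers the pip packages, though; it does not install Chrome or chromedriver the way the Dockerfile does, so the second approach below is the more complete fix.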
Here is the python script to create an environment:
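A sketch of such a registration script, assuming SDK v1's Environment.from_docker_build_context and DockerBuildContext helpers (the path value is a placeholder you must fill in yourself):

```python
from azureml.core import Workspace, Environment
from azureml.core.environment import DockerBuildContext

ws = Workspace.from_config()

# The build context folder must contain the Dockerfile plus any files it copies.
build_context = DockerBuildContext.from_local_directory(
    workspace=ws,
    path="<path-to-your-docker-build-context>",
    dockerfile_path="Dockerfile",
)

env = Environment.from_docker_build_context(
    name="webscraping_env",
    docker_build_context=build_context,
)

# The Dockerfile already installs selenium, bs4, etc., so tell AzureML not to
# build a conda environment on top of the image:
env.python.user_managed_dependencies = True

env.register(workspace=ws)
```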
path=<path-to-your-docker-build-context> should be the path of the folder containing the files required to build your image, together with your Dockerfile. After running this script, you should be able to see your environment in the Environments tab of AzureML studio. Then you can retrieve this environment and set up the
rc.environment
value:
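For example (a sketch, assuming the environment was registered as webscraping_env in the same workspace):

```python
from azureml.core import Workspace, Environment
from azureml.core.runconfig import RunConfiguration

ws = Workspace.from_config()
env = Environment.get(workspace=ws, name="webscraping_env")

rc = RunConfiguration()
rc.environment = env
```

With rc.environment pointing at the registered environment, the same rc can then be passed to PythonScriptStep via its runconfig argument, which is how the step ends up running inside your custom image.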