
Using the RunConfiguration() class, I passed my custom Dockerfile as follows to set up the environment for my Python script:

rc = RunConfiguration()
#rc.environment.use_docker = True
rc.docker = DockerConfiguration(use_docker=True)
rc.environment.from_dockerfile("webscraping_env", "./Dockerfile")

In the config of my rc, the docker section looks like this:

"docker": {
        "arguments": [],
        "baseDockerfile": "FROM python:3.8\n\nRUN apt-get update \nRUN apt-get install -y gconf-service libasound2 libatk1.0-0 libcairo2 libcups2 libfontconfig1 libgdk-pixbuf2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libxss1 fonts-liberation libappindicator1 libnss3 lsb-release xdg-utils\n\n#download and install chrome\nRUN wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb\nRUN dpkg -i google-chrome-stable_current_amd64.deb; apt-get -fy install\n\n# install chromedriver\nRUN apt-get install -yqq unzip\nRUN wget -O /tmp/chromedriver.zip http://chromedriver.storage.googleapis.com/`curl -sS chromedriver.storage.googleapis.com/LATEST_RELEASE`/chromedriver_linux64.zip\nRUN unzip /tmp/chromedriver.zip chromedriver -d /usr/local/bin/\n\nENV DISPLAY=:99\n\nRUN pip install selenium pandas bs4 lxml",
        "baseImage": null,
        "baseImageRegistry": {
            "address": null,
            "password": null,
            "registryIdentity": null,
            "username": null
        },
        "buildContext": null,
        "enabled": false,
        "platform": {
            "architecture": "amd64",
            "os": "Linux"
        },
        "sharedVolumes": true,
        "shmSize": "2g"
    }

and the python section looks like this:

"python": {
        "baseCondaEnvironment": null,
        "condaDependencies": {
            "channels": [
                "anaconda",
                "conda-forge"
            ],
            "dependencies": [
                "python=3.8.13",
                {
                    "pip": [
                        "azureml-defaults"
                    ]
                }
            ],
            "name": "project_environment"
        },
        "condaDependenciesFile": null,
        "interpreterPath": "python",
        "userManagedDependencies": true
    }

I then submit a pipeline consisting of a single step that performs web scraping with selenium and bs4:

step = PythonScriptStep(
    script_name="./webscraping-script.py",  
    source_directory=".",
    arguments=["--output_path", webscrape_ouput],
    outputs=[webscrape_ouput],
    compute_target=AmlCompute(ws, "webscrape-nb"),
    runconfig=rc,
    allow_reuse=False)

The pipeline run stops with an import error saying that selenium cannot be found in webscraping-script.py.

I suspect that my Dockerfile is not being used as the environment for running the script.

My question:
How do I achieve this? I cannot find an argument on PythonScriptStep that accepts an environment directly, the way ParallelRunConfig takes an environment argument when setting up a ParallelRunStep.

I expected my Dockerfile to be used as the environment for the Python script.

2 Answers


  1. You can try adding the selenium package to the conda_dependencies of your RunConfiguration object before submitting the pipeline. Here’s an example:

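    from azureml.core.conda_dependencies import CondaDependencies
    from azureml.core.runconfig import RunConfiguration
    
    # List the pip packages the script imports
    cd = CondaDependencies.create(pip_packages=['selenium', 'beautifulsoup4'])
    rc = RunConfiguration()
    # Attach the dependencies to the run configuration's environment
    rc.environment.python.conda_dependencies = cd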
    
  2. You can register your custom environment from a Dockerfile and then configure RunConfiguration to use this custom environment. What is nice about registering your environment is that you can easily reuse it.

    Here is a Python script that creates the environment (it uses the SDK v2 package azure.ai.ml):

    from azure.ai.ml import MLClient
    from azure.identity import DefaultAzureCredential
    from azure.ai.ml.entities import Environment, BuildContext
    
    credential = DefaultAzureCredential()
    
    ml_client = MLClient(
        credential=credential,
        subscription_id="<your-subscription-id>",
        resource_group_name="<resource-group-name>",
        workspace_name="<your-workspace-name>",
    )
    
    env_docker_context = Environment(
        build=BuildContext(path="<path-to-your-docker-build-context>"),
        name="my-custom-environment",
        description="Environment created from a Docker context.",
    )
    ml_client.environments.create_or_update(env_docker_context)
    

    The path argument of BuildContext should point to the folder that contains your Dockerfile together with any other files needed to build the image. After running this script, you should see your environment in the Environments tab in AzureML studio.

    Then you can retrieve this environment and set up the rc.environment value:

    # Make sure you import the right modules (these come from the SDK v1 packages)
    from azureml.core import Environment
    from azureml.core.compute import AmlCompute
    from azureml.core.runconfig import RunConfiguration
    from azureml.pipeline.steps import PythonScriptStep
    
    rc = RunConfiguration()
    env = Environment.get(workspace=ws, name='my-custom-environment', version='1')
    rc.environment = env
    
    step = PythonScriptStep(
        script_name="./webscraping-script.py",  
        source_directory=".",
        arguments=["--output_path", webscrape_ouput],
        outputs=[webscrape_ouput],
        compute_target=AmlCompute(ws, "webscrape-nb"),
        runconfig=rc,
        allow_reuse=False)
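
A possible reason the Dockerfile was ignored in the question: as far as I know, `Environment.from_dockerfile` in the v1 SDK is a static factory that returns a new Environment and does not modify the object it is called on, so `rc.environment.from_dockerfile(...)` would build an environment and then throw it away. The sketch below illustrates that pitfall with hypothetical stand-in classes (`FakeEnvironment` and `FakeRunConfiguration` are not part of azureml), so it runs without a workspace.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeEnvironment:
    """Hypothetical stand-in for azureml.core.Environment (not the real class)."""
    name: Optional[str] = None
    base_dockerfile: Optional[str] = None

    @staticmethod
    def from_dockerfile(name: str, dockerfile: str) -> "FakeEnvironment":
        # Builds and returns a NEW object; the instance it is called on is untouched.
        return FakeEnvironment(name=name, base_dockerfile=dockerfile)


@dataclass
class FakeRunConfiguration:
    """Hypothetical stand-in for azureml.core.runconfig.RunConfiguration."""
    environment: FakeEnvironment = field(default_factory=FakeEnvironment)


# Pattern from the question: the returned environment is discarded.
rc_discarded = FakeRunConfiguration()
rc_discarded.environment.from_dockerfile("webscraping_env", "FROM python:3.8")

# Suggested pattern: assign the returned environment back to the run configuration.
rc_assigned = FakeRunConfiguration()
rc_assigned.environment = FakeEnvironment.from_dockerfile(
    "webscraping_env", "FROM python:3.8"
)

print(rc_discarded.environment.base_dockerfile)  # still unset
print(rc_assigned.environment.base_dockerfile)   # holds the Dockerfile content
```

If that assumption holds for the real SDK, the equivalent change in the question's code would be `rc.environment = Environment.from_dockerfile("webscraping_env", "./Dockerfile")`, which skips the registration step but also loses the reuse that a registered environment gives you.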
    