We are trying to develop an MLflow pipeline. Our development environment lives entirely in a set of Docker containers (no local Python environment whatsoever), so we have set up a Docker container with MLflow and all the requirements needed to run pipelines. The issue is that when we write our MLflow project file we need to use "docker_env" to specify the environment. The figure below illustrates what we want to achieve:
MLflow inside the container needs to access the Docker daemon/service so that it can either use the docker image named in the MLflow project file or pull it from Docker Hub. We are aware of the possibility of using "conda_env" in the MLflow project file but wish to avoid this.
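For reference, a project file along these lines is what we are trying to run; the image name, script, and parameter below are just placeholders:

    name: my-project

    docker_env:
      image: my-project-image   # placeholder; built locally or pulled from Docker Hub

    entry_points:
      main:
        parameters:
          alpha: {type: float, default: 0.5}
        command: "python train.py --alpha {alpha}"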
Our questions are:
Do we need to set up some sort of "docker in docker" solution to achieve our goal?
Is it possible to configure the Docker container in which MLflow is running so that it can access the "host machine" Docker daemon?
I have been all over Google and MLflow's documentation, but I can't seem to find anything that can guide us. Thanks a lot in advance for any help or pointers!
Answers
I managed to create my pipeline using Docker and docker_env in MLflow. It is not necessary to run Docker-in-Docker (DinD); the "sibling containers" approach is sufficient. This approach is described here:
https://jpetazzo.github.io/2015/09/03/do-not-use-docker-in-docker-for-ci/
and it is the preferred way to avoid DinD: you mount the host's Docker socket into the MLflow container, so the project containers that MLflow launches are created by the host daemon as siblings of the MLflow container.
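For example, this is roughly how the MLflow container can be started so that it talks to the host daemon (the image name and paths are placeholders, and the Docker CLI must be installed inside the image):

    docker run -it \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v /home/user/my-project:/workspace \
      -w /workspace \
      my-mlflow-dev-image \
      mlflow run .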
One needs to be very careful when mounting volumes between the primary (MLflow) container and the secondary (project) container: because the sibling containers are all created by the host daemon, every volume source path is resolved on the host machine, not inside the primary container.
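For example, if docker_env mounts extra volumes, the source side of each mapping must be a path that exists on the host (the paths below are placeholders):

    docker_env:
      image: my-project-image            # placeholder image name
      volumes: ["/data/on/host:/data"]   # source path is resolved by the host daemon,
                                         # not inside the MLflow container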
In this case, I would like to suggest a simple alternative: omit docker_env (and conda_env) from the project file and run the project with MLflow's local environment manager; then MLflow will reuse everything already cached in the container's base environment each time the project runs.
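A minimal sketch of what I mean, assuming the MLflow container image already holds all of the project's dependencies (the project name and script are placeholders):

    # MLproject -- no docker_env / conda_env section
    name: my-project
    entry_points:
      main:
        command: "python train.py"

    # run inside the MLflow container, reusing its installed packages
    mlflow run . --env-manager=local    # on older MLflow versions: --no-conda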