I am having an interesting and weird issue.
When I start a Docker container with GPU support, it works fine and I can see all the GPUs inside the container. However, a few hours or a few days later, I can no longer use the GPUs in Docker.
When I run nvidia-smi inside the container, I see this message:
"Failed to initialize NVML: Unknown Error"
However, on the host machine, nvidia-smi still shows all the GPUs. Also, when I restart the container, it works fine again and shows all the GPUs.
My inference container has to stay up all the time and run inference depending on incoming server requests. Does anyone have the same issue, or a solution to this problem?
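For reference, a minimal way to reproduce the symptom (the image tag is just an example; it assumes the NVIDIA Container Toolkit is installed):

```bash
# Start a long-running container with all GPUs exposed
docker run -d --gpus all --name inference \
  nvidia/cuda:11.8.0-base-ubuntu22.04 sleep infinity

# Works right after startup and lists all GPUs
docker exec inference nvidia-smi

# Hours or days later the same command fails with
# "Failed to initialize NVML: Unknown Error"
docker exec inference nvidia-smi
```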
3 Answers
I had the same issue. I just ran

screen watch -n 1 nvidia-smi

in the container, and now it works continuously.

I had the same error. As a temporary solution, I used Docker's health check: when nvidia-smi fails, the container is marked unhealthy and gets restarted by willfarrell/autoheal.
Docker-compose Version:
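A minimal sketch, assuming the inference service is defined in a compose file (service and image names are placeholders):

```yaml
services:
  inference:
    image: my-inference-image        # placeholder for your inference image
    runtime: nvidia                  # expose the GPUs to the container
    restart: always
    labels:
      - autoheal=true                # lets the autoheal daemon restart it
    healthcheck:
      test: ["CMD", "nvidia-smi"]    # fails once NVML can no longer initialize
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s
```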
Dockerfile Version:
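A sketch of the equivalent health check in the Dockerfile (the base image is just an example):

```dockerfile
# Example base image; replace with whatever your inference service builds on
FROM nvidia/cuda:11.8.0-base-ubuntu22.04

# Mark the container unhealthy once nvidia-smi can no longer reach the driver
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
  CMD nvidia-smi || exit 1
```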
with autoheal daemon:
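A sketch of running the autoheal daemon itself, following the common willfarrell/autoheal usage (it restarts containers labeled autoheal=true when they turn unhealthy):

```bash
docker run -d \
  --name autoheal \
  --restart=always \
  -e AUTOHEAL_CONTAINER_LABEL=autoheal \
  -v /var/run/docker.sock:/var/run/docker.sock \
  willfarrell/autoheal
```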
I had the same weird issue. Based on your description, it is most likely related to this issue on the official nvidia-docker repo:
https://github.com/NVIDIA/nvidia-docker/issues/1618
I plan to try the solution mentioned in the related thread, which suggests upgrading the cgroup version on the host machine from v1 to v2.
PS: We have verified this solution in our production environment and it really works! Unfortunately, it requires at least Linux kernel 4.5. If upgrading the kernel is not possible, the method mentioned by sih4sing5hog5 can also serve as a workaround.
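For anyone checking whether their host is already on cgroup v2 and how to switch, here is a minimal sketch for a systemd/GRUB-based host (paths and commands may differ per distro):

```bash
# Check which cgroup version the host is using:
# "cgroup2fs" means v2, "tmpfs" means v1.
stat -fc %T /sys/fs/cgroup/

# To switch to cgroup v2, add the kernel parameter below to
# GRUB_CMDLINE_LINUX in /etc/default/grub:
#   systemd.unified_cgroup_hierarchy=1
# then regenerate the GRUB config and reboot:
sudo update-grub
sudo reboot
```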