I am having an interesting and weird issue.

When I start a Docker container with GPU support, it works fine and I can see all the GPUs inside the container. However, a few hours or a few days later, I can no longer use the GPUs in Docker.

When I run nvidia-smi inside the container, I see this message:

"Failed to initialize NVML: Unknown Error"

However, on the host machine, nvidia-smi still shows all the GPUs. Also, when I restart the container, it works fine again and shows all the GPUs.

My inference container needs to stay up all the time and run inference depending on incoming server requests. Does anyone have the same issue, or a solution for this problem?
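For reference, the container is started and checked roughly like this (a sketch only; the image tag nvidia/cuda:11.8.0-base-ubuntu22.04 and the container name inference are placeholders, not my exact setup):

    docker run -d --gpus all --name inference nvidia/cuda:11.8.0-base-ubuntu22.04 sleep infinity
    docker exec inference nvidia-smi   # works right after start, fails later with the NVML error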

3 Answers


  1. I had the same issue. I just ran screen watch -n 1 nvidia-smi in the container and now it keeps working continuously.
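
    A minimal version of this workaround, assuming the container is named inference (a placeholder) and has screen installed:

    # from the host, open a shell in the container
    docker exec -it inference bash
    # inside the container, keep nvidia-smi polling in a detached screen session
    screen -dmS keep-nvml watch -n 1 nvidia-smi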

  2. I had the same error. I used Docker's health check as a temporary solution: when nvidia-smi fails, the container is marked unhealthy and restarted by willfarrell/autoheal.

    Docker-compose Version:

    services:
      gpu_container:
        ...
        healthcheck:
          test: ["CMD-SHELL", "test -s `which nvidia-smi` && nvidia-smi || exit 1"]
          start_period: 1s
          interval: 20s
          timeout: 5s
          retries: 2
        labels:
          - autoheal=true
          - autoheal.stop.timeout=1
        restart: always
      autoheal:
        image: willfarrell/autoheal
        environment:
          - AUTOHEAL_CONTAINER_LABEL=all
        volumes:
          - /var/run/docker.sock:/var/run/docker.sock
        restart: always
    

    Dockerfile Version:

    # labels read by the willfarrell/autoheal container
    LABEL autoheal=true
    LABEL autoheal.stop.timeout=1
    HEALTHCHECK \
        --start-period=60s \
        --interval=20s \
        --timeout=10s \
        --retries=2 \
        CMD nvidia-smi || exit 1
    

    With the autoheal daemon:

    docker run -d \
        --name autoheal \
        --restart=always \
        -e AUTOHEAL_CONTAINER_LABEL=all \
        -v /var/run/docker.sock:/var/run/docker.sock \
        willfarrell/autoheal
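
    To confirm the health check is actually being evaluated, the container's health status can be inspected (assuming the container ends up named gpu_container; adjust to your Compose project's naming):

    docker inspect --format '{{.State.Health.Status}}' gpu_container
    # prints "healthy" or "unhealthy"; autoheal restarts containers that report unhealthy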
    
  3. I had the same weird issue. Based on your description, it is most likely related to this issue on the official nvidia-docker repo:

    https://github.com/NVIDIA/nvidia-docker/issues/1618

    I plan to try the solution mentioned in the related thread, which suggests upgrading the cgroup version on the host machine from v1 to v2.

    PS: We have verified this solution in our production environment and it really works! Unfortunately, it requires at least Linux kernel 4.5. If upgrading the kernel is not possible, the method mentioned by sih4sing5hog5 could also serve as a workaround.
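
    A rough sketch of checking and switching the cgroup version on a systemd-based, Debian/Ubuntu-style host (the GRUB steps are an assumption; adjust for your distro):

    # check which cgroup version the host uses: prints "cgroup2fs" for v2, "tmpfs" for v1
    stat -fc %T /sys/fs/cgroup/
    # enable the unified cgroup v2 hierarchy via a kernel boot parameter, then reboot
    sudo sed -i 's/GRUB_CMDLINE_LINUX="/GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=1 /' /etc/default/grub
    sudo update-grub
    sudo reboot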
