I am having an interesting and weird issue.

When I start a Docker container with GPU support, it works fine and I can see all the GPUs inside the container. However, a few hours or a few days later, I can no longer use the GPUs in Docker.

When I run nvidia-smi inside the container, I see this message:

"Failed to initialize NVML: Unknown Error"

However, on the host machine, nvidia-smi still shows all the GPUs. Also, when I restart the container, it works fine again and shows all the GPUs.

My inference container needs to stay up all the time and run inference depending on incoming server requests. Does anyone have the same issue, or a solution for this problem?
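For reference, the container is started and checked roughly like this (a sketch only; the image tag nvidia/cuda:11.8.0-base-ubuntu22.04 and the container name inference are placeholders, not my exact setup):

    docker run -d --gpus all --name inference nvidia/cuda:11.8.0-base-ubuntu22.04 sleep infinity
    docker exec inference nvidia-smi   # works right after start, fails later with the NVML error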

3 Answers


  1. I had the same issue. I just ran screen watch -n 1 nvidia-smi in the container and now it keeps working continuously.
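
    A minimal version of this workaround, assuming the container is named inference (a placeholder) and has screen installed:

    # from the host, open a shell in the container
    docker exec -it inference bash
    # inside the container, keep nvidia-smi polling in a detached screen session
    screen -dmS keep-nvml watch -n 1 nvidia-smi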

  2. I had the same error. I used Docker's health check as a temporary solution: when nvidia-smi fails, the container is marked unhealthy and restarted by willfarrell/autoheal.

    Docker-compose Version:

    services:
      gpu_container:
        ...
        healthcheck:
          test: ["CMD-SHELL", "test -s `which nvidia-smi` && nvidia-smi || exit 1"]
          start_period: 1s
          interval: 20s
          timeout: 5s
          retries: 2
        labels:
          - autoheal=true
          - autoheal.stop.timeout=1
        restart: always
      autoheal:
        image: willfarrell/autoheal
        environment:
          - AUTOHEAL_CONTAINER_LABEL=all
        volumes:
          - /var/run/docker.sock:/var/run/docker.sock
        restart: always
    

    Dockerfile Version:

    # labels read by the willfarrell/autoheal container
    LABEL autoheal=true
    LABEL autoheal.stop.timeout=1
    HEALTHCHECK \
        --start-period=60s \
        --interval=20s \
        --timeout=10s \
        --retries=2 \
        CMD nvidia-smi || exit 1
    

    With the autoheal daemon:

    docker run -d \
        --name autoheal \
        --restart=always \
        -e AUTOHEAL_CONTAINER_LABEL=all \
        -v /var/run/docker.sock:/var/run/docker.sock \
        willfarrell/autoheal
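
    To confirm the health check is actually being evaluated, the container's health status can be inspected (assuming the container ends up named gpu_container; adjust to your Compose project's naming):

    docker inspect --format '{{.State.Health.Status}}' gpu_container
    # prints "healthy" or "unhealthy"; autoheal restarts containers that report unhealthy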
    
  3. I had the same weird issue. Based on your description, it is most likely related to this issue on the official nvidia-docker repo:

    https://github.com/NVIDIA/nvidia-docker/issues/1618

    I plan to try the solution mentioned in the related thread, which suggests upgrading the cgroup version on the host machine from v1 to v2.

    PS: We have verified this solution in our production environment and it really works! Unfortunately, it requires at least Linux kernel 4.5. If upgrading the kernel is not possible, the method mentioned by sih4sing5hog5 could also serve as a workaround.
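
    A rough sketch of checking and switching the cgroup version on a systemd-based, Debian/Ubuntu-style host (the GRUB steps are an assumption; adjust for your distro):

    # check which cgroup version the host uses: prints "cgroup2fs" for v2, "tmpfs" for v1
    stat -fc %T /sys/fs/cgroup/
    # enable the unified cgroup v2 hierarchy via a kernel boot parameter, then reboot
    sudo sed -i 's/GRUB_CMDLINE_LINUX="/GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=1 /' /etc/default/grub
    sudo update-grub
    sudo reboot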
