  • Sometimes, when I come back to my workplace from home, I can no longer communicate with my Nvidia GPUs inside a Docker container, even though a previously launched process that uses the GPUs is still running well. The running process (training a neural network via PyTorch) is not affected by the disconnection, but I cannot launch a new process.

  • nvidia-smi gives Failed to initialize NVML: Unknown Error, and torch.cuda.is_available() likewise returns False.

  • I have run into two different cases:

    1. nvidia-smi works fine when run on the host machine. In this case, the situation can be resolved by restarting the Docker container, i.e. docker stop $MYCONTAINER followed by docker start $MYCONTAINER on the host machine.
    2. nvidia-smi does not work on the host machine either, and neither does nvcc --version, throwing Failed to initialize NVML: Driver/library version mismatch and Command 'nvcc' not found, but can be installed with: sudo apt install nvidia-cuda-toolkit errors. The strange point is that the current process still runs fine. In this case, reinstalling the driver or rebooting the machine solves the problem.
  • However, these solutions require stopping all currently running processes, which is not an option when the current process must not be interrupted. (The two recovery paths are sketched below for reference.)
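A minimal sketch of those two recovery paths; the driver-reinstall command is only an approximation of "installing the driver again", and the package name nvidia-driver-510 is an assumption based on the driver version listed below:

    # Case 1: nvidia-smi still works on the host -> restart the container
    docker stop $MYCONTAINER
    docker start $MYCONTAINER

    # Case 2: nvidia-smi fails on the host as well -> reinstall the driver or reboot
    sudo apt install --reinstall nvidia-driver-510   # package name is an assumption
    sudo reboot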

Does anybody have a suggestion for solving this situation?

Many thanks.

(Software)

  • Docker version: 20.10.14, build a224086
  • OS: Ubuntu 22.04
  • Nvidia driver version: 510.73.05
  • CUDA version: 11.6

(Hardware)

  • Supermicro server
  • Nvidia A5000 * 8

  • (pic1) nvidia-smi not working inside the Docker container, although it works fine on the host machine.

  • (pic2) nvidia-smi working again after restarting the Docker container (case 1 above).


Additionally,

  • Failed to initialize NVML: Unknown Error is reproducible by calling systemctl daemon-reload on the host machine after starting a container; a minimal reproduction is sketched below.
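A reproduction sketch, using a CUDA base image mentioned later in this thread and default runtime settings (the container name nvml-repro is illustrative):

    # Start a GPU container; nvidia-smi works at this point
    docker run -d --gpus all --name nvml-repro nvidia/cuda:11.4.2-base-ubuntu18.04 sleep infinity
    docker exec nvml-repro nvidia-smi      # OK

    # Reload systemd units on the host
    sudo systemctl daemon-reload

    # The same container now loses GPU access
    docker exec nvml-repro nvidia-smi      # Failed to initialize NVML: Unknown Error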

2 Answers


  1. For the problem of Failed to initialize NVML: Unknown Error and having to restart the container, please see this ticket and post your system/package information there as well:
    https://github.com/NVIDIA/nvidia-docker/issues/1671

    There’s a workaround on the ticket, but it would be good to have others post their configuration to help fix the issue.

    Downgrading containerd.io to 1.6.6 works as long as you set no-cgroups = true in /etc/nvidia-container-runtime/config.toml and pass the devices to docker run explicitly, for example:

        docker run --gpus all \
          --device /dev/nvidia0:/dev/nvidia0 \
          --device /dev/nvidia-modeset:/dev/nvidia-modeset \
          --device /dev/nvidia-uvm:/dev/nvidia-uvm \
          --device /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools \
          --device /dev/nvidiactl:/dev/nvidiactl \
          --rm -it nvidia/cuda:11.4.2-base-ubuntu18.04 bash

    So run sudo apt-get install -y --allow-downgrades containerd.io=1.6.6-1 and then sudo apt-mark hold containerd.io to prevent the package from being updated. In short: downgrade containerd.io, edit the config file (sketched below), and pass all of the /dev/nvidia* devices to docker run.
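    A minimal sketch of the config edit, assuming the stock config ships no-cgroups commented out or set to false (you can also just edit the file by hand):

        # Enable no-cgroups for the NVIDIA container runtime
        sudo sed -i 's/^#\?no-cgroups = false/no-cgroups = true/' /etc/nvidia-container-runtime/config.toml

        # Restart Docker so the change takes effect
        sudo systemctl restart docker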

    For the Failed to initialize NVML: Driver/library version mismatch issue, that is caused by the driver packages having been updated without the machine being rebooted yet. If this is a production machine, I would also hold the driver package to stop it from auto-updating. You should be able to figure out the package name from something like sudo dpkg --get-selections "*nvidia*", as sketched below.
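    For example (the metapackage name nvidia-driver-510 is an assumption based on the reported driver version; confirm it against the dpkg output):

        # List installed NVIDIA packages to find the driver metapackage
        sudo dpkg --get-selections "*nvidia*"

        # Pin the driver so automatic upgrades cannot replace it
        sudo apt-mark hold nvidia-driver-510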

  2. You need to install an appropriate version of the NVIDIA drivers. The recommended driver can be found with the following command:

    ubuntu-drivers devices
    

    An inappropriate driver version can cause multiple issues such as those described in the question: even if the GPU instance can be forwarded into the container, CUDA programs may still fail. Once you know the recommended driver, install it (see the sketch below).
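    A sketch of installing the recommended driver (nvidia-driver-510 is illustrative; use whichever package ubuntu-drivers marks as recommended, or let it choose automatically):

        # Install the recommended driver automatically
        sudo ubuntu-drivers autoinstall

        # ...or install a specific driver package, then reboot
        sudo apt install nvidia-driver-510
        sudo reboot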

    After installing Docker, we followed this guide to forward a GPU instance from the host to the container.
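    As a quick check that GPU forwarding works (assuming the NVIDIA Container Toolkit is installed; the image tag is the one mentioned earlier in this thread):

        # Run nvidia-smi inside a throwaway CUDA container
        docker run --rm --gpus all nvidia/cuda:11.4.2-base-ubuntu18.04 nvidia-smi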
