
[TL;DR] First, wait a couple of minutes and check whether the NVIDIA driver starts working properly. If it does not, stop and start the VM instance again.
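For reference, a minimal sketch of that stop/start cycle with gcloud (the instance name and zone below are placeholders):

gcloud compute instances stop my-dl-vm --zone=us-central1-a
gcloud compute instances start my-dl-vm --zone=us-central1-a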

I created a Deep Learning VM (Google Click to Deploy) with an A100 GPU. After stopping and starting the instance, when I run nvidia-smi, I get the following error message:

NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

But if I type which nvidia-smi, I get

/usr/bin/nvidia-smi

It seems the driver is there but cannot be used. Can someone suggest how to enable the NVIDIA driver after stopping and starting a Deep Learning VM? The first time I created and started the instance, the driver was installed automatically.
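A quick way to confirm whether the kernel module is actually loaded (assuming the standard nvidia module name) is:

lsmod | grep nvidia

If this prints nothing, the module is not loaded even though /usr/bin/nvidia-smi exists.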

The system information (from uname -m && cat /etc/*release) is:

x86_64
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

I tried the installation script from GCP. First I ran

curl https://raw.githubusercontent.com/GoogleCloudPlatform/compute-gpu-installation/main/linux/install_gpu_driver.py --output install_gpu_driver.py

and then ran

sudo python3 install_gpu_driver.py

which gave the following message:

Executing: which nvidia-smi
/usr/bin/nvidia-smi
Already installed.

2 Answers


  1. Chosen as BEST ANSWER

    After posting the question, the NVIDIA driver started to work properly after I waited a couple of minutes.

    In the following days, I tried stopping/starting the VM instance multiple times. Sometimes nvidia-smi works right away; sometimes it still does not work after more than 20 minutes of waiting. My current best answer to this question is to first wait several minutes. If nvidia-smi still does not work, stop and start the instance again.
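    As a rough sketch of that wait-and-check routine (the 10-minute window is only my own choice):

    for i in $(seq 1 10); do
        if nvidia-smi > /dev/null 2>&1; then echo "driver is up"; break; fi
        sleep 60   # wait one more minute before the next check
    done

    If the loop ends without printing "driver is up", I stop and start the instance again.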


  2. What worked for me (not sure whether it will hold up across future restarts) was to remove all the drivers with sudo apt remove --purge '*nvidia*' and then force the installation with sudo python3 install_gpu_driver.py.

    In install_gpu_driver.py, change line 230, inside the check_driver_installed function, to return False. Then run the script (the full sequence is sketched at the end of this answer).

    Anyone who uses Docker may also face the error docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]] and have to reinstall Docker too. This thread helped me.
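    Roughly, the whole sequence looks like this (the exact line number of the edit may differ between versions of the script):

    sudo apt remove --purge '*nvidia*'
    # edit install_gpu_driver.py so that check_driver_installed() returns False
    sudo python3 install_gpu_driver.py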
