[TL;DR] First, wait a couple of minutes and check whether the NVIDIA driver starts working. If not, stop and start the VM instance again.
I created a Deep Learning VM (Google Click to Deploy) with an A100 GPU. After stopping and starting the instance, when I run nvidia-smi, I get the following error message:
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
But if I type which nvidia-smi, I get
/usr/bin/nvidia-smi
It seems the driver is there but cannot be used. Can someone suggest how to enable the NVIDIA driver after stopping and starting a Deep Learning VM? The first time I created and opened the instance, the driver was installed automatically.
The system information (from uname -m && cat /etc/*release) is:
x86_64
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
I tried the installation script from GCP. First I ran
curl https://raw.githubusercontent.com/GoogleCloudPlatform/compute-gpu-installation/main/linux/install_gpu_driver.py --output install_gpu_driver.py
and then ran
sudo python3 install_gpu_driver.py
which gives the following message:
Executing: which nvidia-smi
/usr/bin/nvidia-smi
Already installed.
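The output above suggests that the script treats a successful which nvidia-smi as proof that the driver is installed, while the binary being on the PATH does not mean it can talk to the driver. A minimal sketch of a check that separates the two cases is below. The function name probe is hypothetical, and the sketch assumes that nvidia-smi exits with a non-zero status when it cannot communicate with the driver.

```python
import shutil
import subprocess


def probe(cmd):
    """Classify a command as 'missing', 'present but failing', or 'working'.

    cmd is a list such as ["nvidia-smi"]; probe is a hypothetical helper,
    not part of install_gpu_driver.py.
    """
    # Same kind of check as `which`: is the executable on the PATH at all?
    if shutil.which(cmd[0]) is None:
        return "missing"
    # Stricter check: actually run it and look at the exit status.
    # Assumption: a non-zero status means the tool could not do its job.
    result = subprocess.run(
        cmd,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return "working" if result.returncode == 0 else "present but failing"
```

On the VM described here, probe(["nvidia-smi"]) would be expected to report the binary as present but failing, which matches the symptoms in the question.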
2 Answers
After posting the question, the NVIDIA driver started to work properly once I had waited a couple of minutes.
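Waiting and re-checking by hand can be scripted. The sketch below is only an illustration: the function name wait_for_command, the 5-minute timeout, and the 15-second interval are arbitrary choices, and it assumes that nvidia-smi exits with a non-zero status while it cannot reach the driver.

```python
import subprocess
import time


def wait_for_command(cmd, timeout_s=300.0, interval_s=15.0):
    """Re-run cmd until it exits with status 0 or timeout_s has elapsed.

    Returns True if the command succeeded in time, False otherwise.
    wait_for_command is a hypothetical helper; the defaults are arbitrary.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            result = subprocess.run(
                cmd,
                stdin=subprocess.DEVNULL,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
            if result.returncode == 0:
                return True
        except FileNotFoundError:
            # The binary is not on the PATH; count it as a failed attempt.
            pass
        # Give up if another sleep would take us past the deadline.
        if time.monotonic() + interval_s > deadline:
            return False
        time.sleep(interval_s)
```

Calling wait_for_command(["nvidia-smi"]) right after the VM starts returns True as soon as the driver responds. A False result is the cue to stop and start the instance again.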
In the following days, I tried stopping and starting the VM instance multiple times. Sometimes nvidia-smi works right away; sometimes it still fails after more than 20 minutes of waiting. My current best answer is to first wait several minutes. If nvidia-smi still does not work, stop and start the instance again.

What worked for me (I am not sure whether it will survive the next starts) was to remove all drivers:
sudo apt remove --purge '*nvidia*'
and then force the installation: in install_gpu_driver.py, change line 230 to return False inside the check_driver_installed function, then run the script with
sudo python3 install_gpu_driver.py
Anyone who uses Docker may face this error:
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]
and have to reinstall Docker too. This thread helped me.