I deployed a virtual machine using the Deep Learning VM image with a Tesla A100 GPU, TensorFlow Enterprise 2.5, and CUDA 11.0. However, I have no access to the GPU/CUDA and get the following error.
E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to
cuInit: CUDA_ERROR_UNKNOWN: unknown error
At the time of deployment, I received this warning:
tensorflow has resource level warnings.
The resource ‘projects/click-to-deploy-images/global/images/tf-2-5-cu110-v20210619-debian-10’ is deprecated. A suggested replacement is ‘projects/click-to-deploy-images/global/images/tf-2-5-cu110-v20210624-debian-10’.
It is an already existing image generated by Google, and many people are using it, so why can't I access the GPU or CUDA with it?
import tensorflow as tf
2021-07-05 17:05:14.901743: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
tf.__version__
'2.5.0'
print(tf.config.list_physical_devices())
2021-07-05 17:05:44.757638: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-07-05 17:05:44.840142: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2021-07-05 17:05:44.840245: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: deeplearning-1-vm
2021-07-05 17:05:44.840258: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: deeplearning-1-vm
2021-07-05 17:05:44.841760: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 450.80.2
2021-07-05 17:05:44.841820: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 450.80.2
2021-07-05 17:05:44.841833: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 450.80.2
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
The following details may help in figuring out the problem.
a_k@deeplearning-1-vm:~$ nvidia-smi
Mon Jul 5 17:03:43 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB Off | 00000000:00:04.0 Off | 0 |
| N/A 42C P0 56W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
a_k@deeplearning-1-vm:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Thu_Jun_11_22:26:38_PDT_2020
Cuda compilation tools, release 11.0, V11.0.194
Build cuda_11.0_bu.TC445_37.28540450_0
a_k@deeplearning-1-vm:~$ cat /usr/local/cuda/version.txt
CUDA Version 11.0.207
2 Answers
The problem is that the versions of the NVIDIA driver, CUDA, and TensorFlow on all the pre-built instances provided by Google Cloud Platform are not compatible (TF 2.5 requires CUDA >= 11.2). I solved this problem by reinstalling the latest version of CUDA on the pre-built instance (TensorFlow Enterprise 2.5, CUDA 11.0), and now it's working even after restarting the instance. Google should update their pre-built VM instances to solve this.
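To see the mismatch on a given instance, you can compare the CUDA version the installed TensorFlow wheel was built against with what the image actually ships. A minimal check (it assumes a TF 2.x wheel, where tf.sysconfig.get_build_info() is available):

import tensorflow as tf

# CUDA/cuDNN versions this TensorFlow wheel was compiled against
build = tf.sysconfig.get_build_info()
print("TF built for CUDA :", build.get("cuda_version"))    # 11.2 for stock TF 2.5 wheels
print("TF built for cuDNN:", build.get("cudnn_version"))

# GPUs TensorFlow can actually initialise at runtime
print("Visible GPUs      :", tf.config.list_physical_devices("GPU"))

If the reported build version is newer than what nvcc -V shows on the image (11.0 here), the wheel and the installed toolkit/driver stack are out of sync.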
This discussion helped me find the solution. To reinstall CUDA, I didn't uninstall anything; I just followed exactly these 6 instructions (for Debian 10). Although I have Ubuntu 18.04, it still worked. It also asks whether you want to uninstall the previous CUDA version (yes!).
Now, I have the following:
Per the fix provided in this Google Cloud Platform public forum, we can mitigate the issue as follows:
Run the following via an SSH session on the affected instance:
This only needs to be done once, and does not need to be rerun each time the instance is rebooted.
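Whichever route you take (the CUDA reinstall above or the forum fix), a quick sanity check from Python confirms that the driver now initialises and the A100 is visible again; a minimal sketch:

import tensorflow as tf

# If cuInit succeeds, the GPU appears next to the CPU device
gpus = tf.config.list_physical_devices("GPU")
print(gpus)  # expect something like [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

if gpus:
    # Run a tiny op on the GPU to prove kernels actually launch
    with tf.device("/GPU:0"):
        x = tf.random.normal((1024, 1024))
        print(tf.reduce_sum(tf.matmul(x, x)).numpy())
else:
    print("Still CPU-only - the driver/CUDA setup is not fixed yet.")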