I deployed a virtual machine using the Deep Learning VM image with a Tesla A100 GPU, TensorFlow Enterprise 2.5, and CUDA 11.0. However, I have no access to the GPU/CUDA and get the following error.
E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to
cuInit: CUDA_ERROR_UNKNOWN: unknown error
At the time of deployment, I received this warning:
tensorflow has resource level warnings.
The resource ‘projects/click-to-deploy-images/global/images/tf-2-5-cu110-v20210619-debian-10’ is deprecated. A suggested replacement is ‘projects/click-to-deploy-images/global/images/tf-2-5-cu110-v20210624-debian-10’.
It is an already existing image generated by Google, and many people are using it, so why can't I access the GPU or CUDA with it?
import tensorflow as tf
2021-07-05 17:05:14.901743: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
tf.__version__
'2.5.0'
print(tf.config.list_physical_devices())
2021-07-05 17:05:44.757638: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-07-05 17:05:44.840142: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2021-07-05 17:05:44.840245: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: deeplearning-1-vm
2021-07-05 17:05:44.840258: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: deeplearning-1-vm
2021-07-05 17:05:44.841760: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 450.80.2
2021-07-05 17:05:44.841820: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 450.80.2
2021-07-05 17:05:44.841833: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 450.80.2
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
The following details may help in figuring out the problem.
a_k@deeplearning-1-vm:~$ nvidia-smi
Mon Jul 5 17:03:43 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB Off | 00000000:00:04.0 Off | 0 |
| N/A 42C P0 56W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
a_k@deeplearning-1-vm:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Thu_Jun_11_22:26:38_PDT_2020
Cuda compilation tools, release 11.0, V11.0.194
Build cuda_11.0_bu.TC445_37.28540450_0
a_k@deeplearning-1-vm:~$ cat /usr/local/cuda/version.txt
CUDA Version 11.0.207
2 Answers
The problem is that the versions of the NVIDIA driver, CUDA, and TensorFlow on all the pre-built instances provided by Google Cloud Platform are not compatible (TF 2.5 requires CUDA >= 11.2). I solved this problem by reinstalling the latest version of CUDA on the pre-built instance (TensorFlow Enterprise 2.5, CUDA 11.0), and now it's working even after restarting the instance. Google should update their pre-built VM instances to solve this.
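To see the mismatch on a given instance, you can compare the CUDA version the installed TensorFlow wheel was built against with what the image actually ships. A minimal check (it assumes a TF 2.x wheel, where tf.sysconfig.get_build_info() is available):

import tensorflow as tf

# CUDA/cuDNN versions this TensorFlow wheel was compiled against
build = tf.sysconfig.get_build_info()
print("TF built for CUDA :", build.get("cuda_version"))    # 11.2 for stock TF 2.5 wheels
print("TF built for cuDNN:", build.get("cudnn_version"))

# GPUs TensorFlow can actually initialise at runtime
print("Visible GPUs      :", tf.config.list_physical_devices("GPU"))

If the reported build version is newer than what nvcc -V shows on the image (11.0 here), the wheel and the installed toolkit/driver stack are out of sync.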
This discussion helped me find the solution. To reinstall CUDA, I didn't uninstall anything; I just followed exactly these 6 instructions (for Debian 10). Although I have Ubuntu 18.04, it still worked. It also asks whether you want to uninstall the previous CUDA version (yes!).
Now, I have the following:
Per the fix provided in this Google Cloud Platform public forum, we can mitigate the issue as follows:
Run the following via an SSH session on the affected instance:
This only needs to be done once, and does not need to be rerun each time the instance is rebooted.
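Whichever route you take (the CUDA reinstall above or the forum fix), a quick sanity check from Python confirms that the driver now initialises and the A100 is visible again; a minimal sketch:

import tensorflow as tf

# If cuInit succeeds, the GPU appears next to the CPU device
gpus = tf.config.list_physical_devices("GPU")
print(gpus)  # expect something like [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

if gpus:
    # Run a tiny op on the GPU to prove kernels actually launch
    with tf.device("/GPU:0"):
        x = tf.random.normal((1024, 1024))
        print(tf.reduce_sum(tf.matmul(x, x)).numpy())
else:
    print("Still CPU-only - the driver/CUDA setup is not fixed yet.")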