Starting from CUDA 11.x, NVIDIA should in theory guarantee compatibility between the CUDA Toolkit libraries (typically shipped inside Docker containers) and the driver library libcuda.so (installed on the host). This should hold at least as long as we stay within the minor versions of one major release (11.0 to 11.8).
It should therefore be possible to run containers with newer versions of CUDA on hosts whose pre-installed GPU drivers were built for older CUDA versions. In practice, though, this does not work: CUDA-enabled containers (including the official nvidia/cuda images) fail to run in such scenarios.
Are there any configuration workarounds that would at least let such containers start (so we can test whether apps have GPU access), given that upgrading the driver libraries on the host is not feasible, and downgrading the containerized CUDA Toolkit is time-consuming and could reduce functionality?
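For concreteness, a minimal reproduction of the failure might look like the following (the image tag and the host driver version are only illustrative):

```
# Host driver installed for CUDA 11.2; the image ships the CUDA 11.8 toolkit.
# Official nvidia/cuda images declare a CUDA requirement that the NVIDIA
# container runtime checks against the host driver, so this command is
# expected to be rejected at startup with a "requirement error".
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu20.04 nvidia-smi
```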
2 Answers
According to the NVIDIA docs, setting the environment variable NVIDIA_DISABLE_REQUIRE to true (or 1) should disable the CUDA version check at startup, and it should work within the same major CUDA version (thanks to minor version compatibility).
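For example (a minimal sketch; the image tag is just an illustration, and the variable can equally be set in a compose file or Kubernetes manifest):

```
# Ask the NVIDIA container runtime to skip its CUDA requirement check,
# so the container starts even though the host driver targets an older CUDA.
docker run --rm --gpus all \
  -e NVIDIA_DISABLE_REQUIRE=true \
  nvidia/cuda:11.8.0-base-ubuntu20.04 nvidia-smi
```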
I must warn you, however, that this workaround works only superficially: it lets your container with the mismatched (newer) CUDA Toolkit start, no longer crashing on the failing CUDA version check. In my case the workaround helped start a container with the CUDA 11.8 Toolkit on a machine with CUDA 11.2 driver libraries. But it ultimately fails as soon as you try to test some ML algorithms on the GPU: they fail to train their models, printing error messages with various levels of specificity (LightGBM even apparently "working", but at... 0% GPU utilization, i.e. silently failing). The most specific error message was given by CatBoost, while XGBoost errored with a rather misleading one.
(Both of the above algorithms start working correctly on the GPU after the CUDA Toolkit in the container is downgraded to match the CUDA version of the host driver.)
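One quick way to catch the silent-failure case (a library that claims to train on the GPU but uses 0% of it, as described above for LightGBM) is to watch GPU utilization while the training job runs; for example:

```
# Poll GPU utilization and memory once per second while a training job runs.
# Sustained 0% utilization means the library fell back to the CPU or is
# silently failing rather than actually training on the GPU.
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1
```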
Workarounds such as NVIDIA_DISABLE_REQUIRE, as recommended by an NVIDIA employee on GitHub here, will ultimately fail (as documented here) to deliver GPU access to your apps. You need to synchronize the CUDA versions between the driver libraries (on the host) and the CUDA Toolkit (in the container) by either of two things: upgrading the driver libraries on the host, or downgrading the CUDA Toolkit in the container to the CUDA version supported by the host driver.
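To see which versions need to be synchronized, you can compare the CUDA version supported by the host driver with the Toolkit version shipped in the container (the commands below assume an official nvidia/cuda image; paths may differ in other images):

```
# The "CUDA Version" field printed by nvidia-smi is the highest CUDA version
# supported by the installed host driver (it works inside the container too,
# because the driver library is injected by the NVIDIA container runtime).
nvidia-smi

# The CUDA Toolkit version shipped inside the container: nvcc is only present
# in -devel images; base/runtime images usually expose the version in
# /usr/local/cuda/version.json instead.
nvcc --version || cat /usr/local/cuda/version.json
```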