Introduction:
I have to create a pip wheel of TensorFlow 2.2.0 with the CUDA libraries dynamically linked (specifically cudart.so). To accomplish this I am currently using the tensorflow-dev docker image.
I am able to build the tf wheel file, and able to install and use it while inside the build container.
Issue:
The issue is that when importing the generated wheel file on a CentOS server, I get the following error:
ImportError: /lib64/libm.so.6: version `GLIBC_2.27' not found (required by /home1/private/mavridis/Vineyard/tensorflowshared/test/lib64/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so)
Having looked around, I found that the issue is caused by the build container using a newer libc:
ldd --version
ldd (Ubuntu GLIBC 2.27-3ubuntu1) 2.27
Compared to CentOS older version:
ldd --version
ldd (GNU libc) 2.17
Expected behavior:
Having already tried the ‘vanilla’ tensorflow 2.2.0 version with no issues, installed using pip:
pip install tensorflow==2.2.0
I expected my own build to also work.
So I assume there is some configuration option or docker configuration that would allow me to use the docker-built wheel file in a CentOS setup, just like the pip-installed version. As this wheel file is intended to be deployed to setups beyond my control, solutions involving alternate OSes and/or libc replacement are not applicable.
Build configuration:
During build i use the following configuration/ command line:
export TF_NEED_CUDA=1
export TF_USE_XLA=0
export TF_SET_ANDROID_WORKSPACE=0
export TF_NEED_OPENCL_SYCL=0
export TF_NEED_ROCM=0
bazel build --config=opt --config=cuda --output_filter=DONT_MATCH_ANYTHING --linkopt=-L/usr/local/cuda/lib64 --linkopt=-lcudart --linkopt=-static-libstdc++ //tensorflow/tools/pip_package:build_pip_package
Regarding options used:
--output_filter=DONT_MATCH_ANYTHING : Silence warnings
--linkopt=-L/usr/local/cuda/lib64 --linkopt=-lcudart : Dynamic linking of cudart.so (a quick check is sketched after this list)
--linkopt=-static-libstdc++ : Statically link libstdc++, as libstdc++ also caused the libc error; this is however not possible for libm
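To confirm what actually ends up dynamically linked, one quick check (a sketch; the path is the extension module of the installed wheel, as seen in the error message above) is to run ldd against it from inside the build container:

ldd <site-packages>/tensorflow/python/_pywrap_tensorflow_internal.so | grep -E 'cudart|stdc'

libcudart.so should appear as a dynamic dependency, while libstdc++.so should not, since it is linked statically.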
2 Answers:
That expectation is (obviously) incorrect. The symbols your program or library requires from GLIBC depend on exactly which functions you call.
Consider the following program:
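A minimal sketch of such a program, based on the description further down (it calls nothing newer than exit()):

#include <stdlib.h>

int main(void)
{
    /* exit() carries the GLIBC_2.2.5 version tag, the x86-64 baseline */
    exit(0);
}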
When compiled/linked on a GLIBC-2.30 system, this program only depends on GLIBC_2.2.5 (because it doesn't call any newer symbols).
Now change the program slightly:
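A sketch of the modified program, which additionally calls gettid():

#define _GNU_SOURCE
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* gettid() was only added to glibc in version 2.30 */
    gettid();
    exit(0);
}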
Compile/link it again, and all of a sudden this program now requires GLIBC_2.30 (because that is where gettid() was added to GLIBC), and will not work on any system which has an older GLIBC.
Sure: your Docker image must have a GLIBC that is not newer than what your target system has, i.e. GLIBC-2.17. Your current image contains GLIBC-2.27 (or newer).
You need a different Docker image, and you’ll likely have to build it yourself, since GLIBC-2.17 is over 7 years old, and predates TensorFlow by many years.
Update:
It works by accident, just like my first program would work on CentOS, but the second one wouldn’t.
That is an impossible goal: Linux predates GLIBC, and it is impossible to build a single package that will work on a Linux distribution which didn’t include GLIBC and on a distribution that did.
You have to draw a line somewhere. The developers of tensorflow-dev docker image drew a line at GLIBC-2.27. Packages built on this image should work on any system with 2.27 or later, and might (but are not at all guaranteed to) work on older systems.
You claim that the pip installed version has no "only GLIBC-xx or later" requirement, but that is not true. I am 99.9% sure that it requires at least GLIBC-2.14.
To find which GLIBC versions that package requires, run this command:
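One way to do that (a sketch; objdump -T prints each dynamic symbol together with the GLIBC version it is bound to, and the path is the extension module of the installed package, as in the error message above):

objdump -T <site-packages>/tensorflow/python/_pywrap_tensorflow_internal.so | grep -o 'GLIBC_[0-9.]*' | sort -u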
That is quite likely. And like I said, it happens to work on CentOS, but minute changes may make it not work anymore.
Update 2:
It doesn't "use" an older version, it uses whatever version is available.
It requires a minimum version of 2.12, while your build requires a minimum version of 2.27.
You are still not getting it.
The version that your program requires depends on exactly which functions you call. In my example program, if I only call exit, my program requires version 2.2.5, but if I also call gettid, then my program requires version 2.30. Note: these two programs are built on the same system with the same flags.
So no: they (most likely) didn't use a different Docker image, and didn't use "magic" bazel flags. They just happened to not call any functions which require GLIBC version > 2.12, and you did.
P.S. You can find which symbol(s) are causing the "bad" dependency in your build like so:
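A sketch of such a check (the version filter assumes anything newer than the target's GLIBC-2.17 counts as "bad"; replace the binary name with your own executable or .so):

objdump -T your_binary | grep -E 'GLIBC_2\.(1[89]|[2-9][0-9])'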
This would produce output similar to (using my second program):
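(Illustrative output only; the exact columns vary with the binutils version, but the version tag and symbol name are what matter:)

0000000000000000      DF *UND*  0000000000000000  GLIBC_2.30  gettid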
The output above shows that the only symbol my binary requires from GLIBC 2.20 or above is gettid.
To make a counterpoint to what Employed Russian wrote:
I don't think that's quite accurate. My understanding, which is corroborated by https://github.com/wheybags/glibc_version_header, is that glibc uses symbol versioning, and that by default the linker binds every symbol your code references to the newest version of that symbol available on the build machine, even when an older version would have sufficed.
So my guess (I have not checked) would be that the TensorFlow releases are built against an older glibc (perhaps by way of being built on an older release of their target Linux distro).
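For a concrete illustration of that mechanism (a sketch; it relies on the fact that on x86-64 memcpy gained a new default symbol version, GLIBC_2.14, in glibc 2.14):

#include <string.h>

int main(int argc, char **argv)
{
    char src[64], dst[64] = {0};
    (void)argv;
    memset(src, 'x', sizeof src);
    /* a non-constant length keeps the compiler from folding the call away */
    memcpy(dst, src, (size_t)(argc % (int)sizeof src));
    return dst[0];
}

Linked on any x86-64 system with glibc >= 2.14, objdump -T on the resulting binary shows memcpy bound to GLIBC_2.14, even though memcpy has existed since the GLIBC_2.2.5 baseline; linked against an older glibc, the same source would only require GLIBC_2.2.5.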