Introduction:

I have to create a pip wheel of TensorFlow 2.2.0 with the CUDA libraries dynamically linked (specifically cudart.so). To accomplish this I am currently using the tensorflow-dev Docker image.
I am able to build the TF wheel file, and I am able to install and use it while inside the build container.

Issue:

The issue is that when I import the generated wheel on a CentOS server, I get the following error:

ImportError: /lib64/libm.so.6: version `GLIBC_2.27' not found (required by /home1/private/mavridis/Vineyard/tensorflowshared/test/lib64/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so)

Having looked around, the issue is caused by the build container using a newer libc:

ldd --version
ldd (Ubuntu GLIBC 2.27-3ubuntu1) 2.27

Compared to the older version on CentOS:

ldd --version
ldd (GNU libc) 2.17

Expected behavior:

Having already tried the ‘vanilla’ TensorFlow 2.2.0 version with no issues, installed using pip:

pip install tensorflow==2.2.0

I expected my own build to also work.

So I assume there is some build option or Docker configuration that would allow me to use the Docker-built wheel file on a CentOS setup, just like the pip-installed version. As this wheel file is intended to be deployed to setups beyond my control, solutions involving alternate OSes and/or libc replacement are not applicable.

Build configuration:

During the build I use the following configuration and command line:

export TF_NEED_CUDA=1
export TF_USE_XLA=0
export TF_SET_ANDROID_WORKSPACE=0
export TF_NEED_OPENCL_SYCL=0
export TF_NEED_ROCM=0
bazel build --config=opt --config=cuda --output_filter=DONT_MATCH_ANYTHING --linkopt=-L/usr/local/cuda/lib64 --linkopt=-lcudart --linkopt=-static-libstdc++ //tensorflow/tools/pip_package:build_pip_package

Regarding options used:
--output_filter=DONT_MATCH_ANYTHING : Silence warnings
--linkopt=-L/usr/local/cuda/lib64 --linkopt=-lcudart : Dynamic linking of cudart.so
--linkopt=-static-libstdc++ : Statically link libstdc++, as libstdc++ also caused the libc error; this, however, is not possible for libm

2 Answers


  1. I expected my own build to also work.

    That expectation is (obviously) incorrect. The symbols your program or library requires from GLIBC depend on exactly which functions you call.

    Consider the following program:

    int main() { exit(0); }
    

    When compiled/linked on a GLIBC-2.30 system, this program only depends on GLIBC_2.2.5 (because it doesn’t call any newer symbols).

    Now change the program slightly:

    int main() { gettid(); exit(0); }
    

    Compile/link it again, and all of a sudden this program now requires GLIBC_2.30 (because that’s where gettid() was added to GLIBC), and will not work on any system which has older GLIBC.
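
    The rule can be sketched as a toy model: the minimum GLIBC a binary needs is the highest version attached to any symbol it references. The table and function names below are made up for illustration, the two versions are the x86-64 ones quoted above, and no real binary is inspected.

```python
# Toy model only: symbol -> version of its default definition on the
# build machine (values taken from the two example programs above).
DEFAULT_SYMBOL_VERSION = {
    "exit": (2, 2, 5),
    "gettid": (2, 30),
}


def required_glibc(called_symbols):
    """Return the lowest glibc version that can resolve every called symbol."""
    return max(DEFAULT_SYMBOL_VERSION[name] for name in called_symbols)


def fmt(version):
    """Render a version tuple such as (2, 30) as '2.30'."""
    return ".".join(str(part) for part in version)


print(fmt(required_glibc(["exit"])))            # 2.2.5
print(fmt(required_glibc(["gettid", "exit"])))  # 2.30
```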

    So I assume there is some configuration option or docker configuration

    Sure: your Docker image must have a GLIBC that is not newer than what your target system has, i.e. GLIBC-2.17. Your current image contains GLIBC-2.27 (or newer).

    You need a different Docker image, and you’ll likely have to build it yourself, since GLIBC-2.17 is over 7 years old, and predates TensorFlow by many years.

    Update:

    What I don’t understand is how come the pip tensorflow package (which I assumed was built with the Docker image I am using) works with CentOS?

    It works by accident, just like my first program would work on CentOS, but the second one wouldn’t.

    In short I wanted to generate a pip package that would work on ‘any’ Linux/libc version

    That is an impossible goal: Linux predates GLIBC, and it is impossible to build a single package that will work on a Linux distribution which didn’t include GLIBC and on a distribution that did.

    You have to draw a line somewhere. The developers of tensorflow-dev docker image drew a line at GLIBC-2.27. Packages built on this image should work on any system with 2.27 or later, and might (but are not at all guaranteed to) work on older systems.

    just like the pip installed version.

    You claim that the pip installed version has no "only GLIBC-xx or later" requirement, but that is not true. I am 99.9% sure that it requires at least GLIBC-2.14.

    To find which GLIBC versions that package requires, run this command:

    readelf -WV _pywrap_tensorflow_internal.so | grep GLIBC_
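
    If you would rather get a single number than scan the grep output, a small script can pick the highest GLIBC_x.y tag out of the readelf text. This is only a sketch: the function names are made up here, and it assumes `readelf` is on the PATH and that version tags appear in the output as `GLIBC_<digits>.<digits>`.

```python
import re
import subprocess
from typing import Optional, Tuple

# Matches tags such as GLIBC_2.2.5 or GLIBC_2.27 (GLIBC_PRIVATE is ignored).
_GLIBC_TAG = re.compile(r"GLIBC_(\d+(?:\.\d+)*)")


def max_glibc_version(readelf_output: str) -> Optional[Tuple[int, ...]]:
    """Return the highest GLIBC version mentioned in the text, or None."""
    versions = [
        tuple(int(part) for part in tag.split("."))
        for tag in _GLIBC_TAG.findall(readelf_output)
    ]
    return max(versions) if versions else None


def required_glibc_of(path: str) -> Optional[Tuple[int, ...]]:
    """Run `readelf -WV` on a shared object and return its highest GLIBC tag."""
    result = subprocess.run(
        ["readelf", "-WV", path],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return max_glibc_version(result.stdout)
```

    Calling `required_glibc_of("_pywrap_tensorflow_internal.so")` from the directory containing the library would then return a tuple such as `(2, 27)`.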
    

    I assumed, the pip installed version was built using the publicly available tensorflow-devel docker image.

    That is quite likely. And like I said, it happens to work on CentOS, but minute changes may make it not work anymore.

    Update 2:

    So running the readelf command as you suggested does show the most recent required versions to be:
    - pip version: GLIBC_2.12
    - mine: GLIBC_2.27
    So from what I understand, the pip version uses an older version than even CentOS has, which explains why it works.

    It doesn’t "use" older version, it uses whatever version is available.

    It requires a minimum version 2.12, while your build requires a minimum version 2.27.
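
    Put differently, a library loads only if the host GLIBC is at least as new as the library's minimum. A rough sketch of that comparison, using the versions from this thread (the helper names are made up; `platform.libc_ver()` is the standard-library call, and it may return empty strings on non-glibc systems):

```python
import platform


def parse_version(text):
    """Turn '2.17' into (2, 17) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.split("."))


def can_load(required, available):
    """A binary loads only if the host glibc is at least the required version."""
    return available >= required


def host_glibc():
    """Return the running interpreter's glibc version, or None if not glibc."""
    lib, version = platform.libc_ver()
    return parse_version(version) if lib == "glibc" and version else None


centos = parse_version("2.17")
print(can_load(parse_version("2.12"), centos))  # True: the pip wheel
print(can_load(parse_version("2.27"), centos))  # False: the custom build
```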

    How do they achieve this? Do they use a different image that has an older libc? If so, where can I get it? Or do they use the public image, but build with some bazel flag that ‘limits’ symbols to the ones contained up to libc 2.12?

    You are still not getting it.

    The version that your program requires depends on exactly which functions you call. In my example program, if I only call exit, my program requires version 2.2.5, but if I also call gettid, then my program requires version 2.30. Note: these two programs are built on the same system with the same flags.

    So no: they (most likely) didn’t use a different Docker image, and didn’t use "magic" bazel flags. They just happened to not call any functions which require GLIBC version > 2.12, and you did.

    P.S. You can find which symbol(s) are causing "bad" dependency in your build like so:

    readelf -Ws  _pywrap_tensorflow_internal.so | egrep 'GLIBC_2.2[0-9]'
    readelf -Ws  _pywrap_tensorflow_internal.so | egrep 'GLIBC_2.1[89]'
    

    This would produce output similar to (using my second program):

    readelf -Ws a.out | egrep 'GLIBC_2.[23][0-9]'
         2: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND gettid@GLIBC_2.30 (2)
        48: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND gettid@@GLIBC_2.30
    

    The output above shows that the only symbol my binary requires from GLIBC 2.20 or above is gettid.
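
    The same filtering can be done against an explicit target version instead of hand-written regex ranges. The sketch below is an assumption-laden helper, not a standard tool: the function name is made up, the default target of 2.17 is the CentOS version from the question, and the `exit` line in the sample is invented to show a symbol that passes the check.

```python
import re

# Captures "name@GLIBC_x.y" and "name@@GLIBC_x.y" from `readelf -Ws` output.
_VERSIONED = re.compile(r"(\w+)@{1,2}GLIBC_(\d+(?:\.\d+)*)")


def symbols_newer_than(readelf_output, target=(2, 17)):
    """Return sorted (symbol, version) pairs that need a glibc newer than target."""
    found = set()
    for name, tag in _VERSIONED.findall(readelf_output):
        version = tuple(int(part) for part in tag.split("."))
        if version > target:
            found.add((name, version))
    return sorted(found)


sample = """
     2: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND gettid@GLIBC_2.30 (2)
     3: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND exit@GLIBC_2.2.5 (3)
    48: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND gettid@@GLIBC_2.30
"""
print(symbols_newer_than(sample))  # [('gettid', (2, 30))]
```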

  2. To make a counter point to what Employed Russian wrote:

    The version that your program requires depends on exactly which functions you call. In my example program, if I only call exit, my program requires version 2.2.5, but if I also call gettid, then my program requires version 2.30. Note: these two programs are built on the same system with the same flags.

    I don’t think that’s quite accurate. My understanding, which is corroborated by https://github.com/wheybags/glibc_version_header, is that things work like so (quoting that project, emphasis mine):

    Glibc uses something called symbol versioning. This means that when you use e.g., malloc in your program, the symbol the linker will actually link against is malloc@GLIBC_YOUR_INSTALLED_VERSION (actually, it will link to malloc from the most recent version of glibc that changed the implementation of malloc, but you get the idea).

    So my guess (I have not checked) would be that the TensorFlow releases are built against an older glibc (perhaps by being built on an older release of their target Linux distro).
