
I have a fairly robust CI test suite for a C++ library; the tests (around 50) run in the same Docker image but on different machines.

On one machine ("A"), all the memcheck (valgrind) tests pass (i.e., no memory leaks are reported).
On the other ("B"), all tests fail with the same valgrind error below.

51/56 MemCheck #51: combinations.cpp.x ....................***Exception: SegFault  0.14 sec
valgrind: m_libcfile.c:66 (vgPlain_safe_fd): Assertion 'newfd >= VG_(fd_hard_limit)' failed.
Cannot find memory tester output file: /builds/user/boost-multi/build/Testing/Temporary/MemoryChecker.51.log

The machines are very similar; both are Intel i7.
The only difference I can think of is that one is:

A. Ubuntu 22.10, Linux 5.19.0-29, docker 20.10.16

and the other:

B. Fedora 37, Linux 6.1.7-200.fc37.x86_64, docker 20.10.23

and perhaps some Docker configuration I don't know about.

Is there some Docker configuration that might cause the difference? Or a kernel setting? Or some valgrind option to work around this problem?

I know for a fact that on real machines (not in Docker) valgrind doesn't produce this error.

The options I use for valgrind are always --leak-check=yes --num-callers=51 --trace-children=yes --leak-check=full --track-origins=yes --gen-suppressions=all.
The valgrind version in the image is 3.19.0-1, from the debian:testing image.
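For reference, reproducing one failing test by hand outside CTest would look roughly like this (the test binary is taken from the CTest line above; the exact path is illustrative):

valgrind --leak-check=full --track-origins=yes --gen-suppressions=all \
         --trace-children=yes --num-callers=51 ./combinations.cpp.x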

Note that this isn't an error reported by valgrind; it is an error within valgrind itself.
Perhaps, after all, the only difference is that the Ubuntu build of valgrind is compiled in release mode and the assertion is simply ignored. (This doesn't make sense, though: valgrind is the same in both cases because the Docker image is the same.)

I tried removing --num-callers=51 and also setting it to 12 (the default value), to no avail.

2 Answers


  1. Chosen as BEST ANSWER

    I found a difference between the images and the real machines, and a workaround. It has to do with the number of file descriptors. (This was pointed out briefly in one of the valgrind bug threads, about Mac OS: https://bugs.kde.org/show_bug.cgi?id=381815#c0)

    Inside the Docker container running on Ubuntu 22.10:

    ulimit -n
    1048576
    

    Inside the Docker container running on Fedora 37:

    ulimit -n
    1073741816
    

    (which looks like a ridiculous number or an overflow)

    On the real Fedora 37 and Ubuntu 22.10 machines (no Docker):

    ulimit -n
    1024
    

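    To compare quickly, the limit a fresh container sees can be checked directly (assuming the CI image is based on debian:testing, as stated in the question):

    docker run --rm debian:testing bash -c 'ulimit -n'
    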
    So, doing this in the CI recipe "solved" the problem:

    - ulimit -n       # reports current value
    - ulimit -n 1024  # workaround needed by valgrind in docker running in Fedora 37
    - ctest ... (with memcheck)
    

    I have no idea why this workaround works.
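    An alternative, assuming you control how the container is started, is to cap the limit from the Docker side with docker run's --ulimit flag instead of calling ulimit inside the CI script (the image name here is hypothetical):

    docker run --ulimit nofile=1024:1024 my-ci-image ctest -T memcheck
    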

    For reference:

    $ ulimit --help
    ...
          -n    the maximum number of open file descriptors
    

  2. First off, "you are doing it wrong" with your Valgrind arguments. For CI I recommend a two-stage approach. Use as many default arguments as possible for the CI run (--trace-children=yes may well be necessary, but not the others). If your codebase is leaky then you may need to check for leaks, but if you can maintain a zero-leak policy (or only suppressed leaks) then you can tell from the summary whether there are new leaks. Once your CI detects an issue, you can run again with the kitchen-sink options to get full information. Your runs will be significantly faster without all those options.
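    A minimal sketch of the two stages (the test binary is illustrative, and --error-exitcode=1 is an extra flag, not among the original ones, so that the fast run still fails the CI job on errors):

    # stage 1: fast CI run with near-default options
    valgrind --trace-children=yes --error-exitcode=1 ./combinations.cpp.x
    
    # stage 2: rerun a failing test with the kitchen-sink options
    valgrind --trace-children=yes --leak-check=full --track-origins=yes \
             --gen-suppressions=all --num-callers=51 ./combinations.cpp.x
    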

    Back to the question.

    Valgrind is trying to dup() some file (the guest exe, a tempfile or something like that). The system call to do this is failing.

    A billion files is ridiculous.

    Valgrind will try to call prlimit with RLIMIT_NOFILE, with a fallback call to rlimit, and a second fallback of setting the limit to 1024.
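    The same limit can be inspected and changed from a shell with util-linux's prlimit, which wraps the prlimit64 system call that appears in the strace output below ($$ is the current shell's PID):

    prlimit --pid $$ --nofile             # show current soft/hard RLIMIT_NOFILE
    prlimit --pid $$ --nofile=1024:1024   # lower both to 1024
    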

    To really see what is going on, you need to modify the Valgrind source (m_main.c, setup_file_descriptors: set the local variable show to True). With this change I see

    fd limits: host, before: cur 65535 max 65535
    fd limits: host,  after: cur 65535 max 65535
    fd limits: guest       : cur 65523 max 65523
    

    Otherwise with strace I see

    2049  prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=65535, rlim_max=65535}) = 0
    2049  prlimit64(0, RLIMIT_NOFILE, {rlim_cur=65535, rlim_max=65535}, NULL) = 0
    

    (all the above on RHEL 7.6 amd64)

    EDIT: Note that the above shows Valgrind querying and setting the resource limit. If you use ulimit to lower the limit before running Valgrind, then Valgrind will try to honour that limit. Also note that Valgrind reserves a small number (8) of file descriptors for its own use.
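    So, to apply the lowered limit only to the Valgrind run rather than to the whole CI shell, a subshell is enough (a sketch; the test binary is illustrative):

    (ulimit -n 1024 && valgrind --trace-children=yes ./combinations.cpp.x)
    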

    EDIT 2: I've made a change to Valgrind so that when it fails to create a file descriptor in its reserved range, it prints a message suggesting that you set the descriptor limit in your shell before running Valgrind. Since this is a bug in Docker, it is not possible to fix it in Valgrind: we can't tell that the bogus Docker descriptor resource limit really is bogus, and even if we could, what would we do? A binary search to try to find the real descriptor resource limit?
