#include <malloc.h>
#include <vector>
#include <iostream>
#include <chrono>
#include <thread>

void test_openmp(const size_t for_n, const size_t vec_n)
{
    std::cout << "running openmp" << std::endl;
    // Each iteration builds and destroys a large vector on whichever thread runs it.
#pragma omp parallel for
    for (size_t i = 0; i < for_n; ++i)
    {
        std::vector<int> local_v;
        for (size_t j = 0; j < vec_n; ++j)
        {
            local_v.push_back(j);
        }
        // local_v goes out of scope here, so all of its heap memory is freed
    }
    std::cout << "finished openmp" << std::endl;
}

int main()
{
    std::cout << "sleeping" << std::endl;
    std::this_thread::sleep_for(std::chrono::seconds(10)); // time to note the baseline RSS in htop
    test_openmp(20, 5000000);
    malloc_trim(0); // ask glibc to give freed heap memory back to the OS
    std::cout << "sleeping" << std::endl;
    std::this_thread::sleep_for(std::chrono::seconds(10)); // time to note the RSS after the parallel loop

    return 0;
}

It looks like there is a memory leak when using STL containers or other heap allocations together with OpenMP multithreading.
On my Ubuntu 22 platform with 8 cores, it leaks ~200MB (observed by htop) after running test_openmp(20, 5000000);.
Is this a known issue, meaning I should not allocate any heap memory in multithreaded code?
How should I use it correctly?

int local_v[vec_n] (no heap memory) is fine.
#pragma omp parallel for num_threads(1) is fine.
And it appears to leak more as num_threads increases.

2 Answers


  1. it leaks ~200MB (observed by htop)…

    You have not shared any evidence to back the claim of a ~200MB memory leak.

    I ran your program with valgrind and it does not report anything as definitely lost or indirectly lost. However, I could see possibly lost and still reachable blocks reported against the OpenMP library (the first function on the stack was GOMP_parallel()).

    valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --verbose --log-file=v_report.txt ./exefile 
    

    Run your program as it is (with no changes):

    ==11651== LEAK SUMMARY:
    ==11651==    definitely lost: 0 bytes in 0 blocks
    ==11651==    indirectly lost: 0 bytes in 0 blocks
    ==11651==      possibly lost: 304 bytes in 1 blocks
    ==11651==    still reachable: 2,000 bytes in 4 blocks
    ==11651==         suppressed: 0 bytes in 0 blocks
    ==11651== 
    

    Run your program with #pragma omp parallel for num_threads(1) :

    ==10640== LEAK SUMMARY:
    ==10640==    definitely lost: 0 bytes in 0 blocks
    ==10640==    indirectly lost: 0 bytes in 0 blocks
    ==10640==      possibly lost: 0 bytes in 0 blocks
    ==10640==    still reachable: 200 bytes in 2 blocks
    ==10640==         suppressed: 0 bytes in 0 blocks
    ==10640== 
    

    Run your program with #pragma omp parallel for num_threads(20):

    ==10070== LEAK SUMMARY:
    ==10070==    definitely lost: 0 bytes in 0 blocks
    ==10070==    indirectly lost: 0 bytes in 0 blocks
    ==10070==      possibly lost: 5,776 bytes in 19 blocks
    ==10070==    still reachable: 6,032 bytes in 4 blocks
    ==10070==         suppressed: 0 bytes in 0 blocks
    ==10070== 
    

    There is no definitely lost memory in any of the above cases.

    I dug around a little into the bytes reported as possibly lost and still reachable and found an old thread (which says this is not a memory leak): gomp contains small memoryleak

  2. There is no memory leak in your program. What you are observing is behaviour typical of modern memory allocators: they do not immediately release freed memory back to the OS, but rather cache it in the arena for future reuse. This is because memory management happens on two layers.

    The first layer is the OS layer. When a process needs memory, it asks the OS to map some physical pages into its address space. Those pages could be backed by the contents of a disk file (a.k.a. memory-mapping a file) or by the system swap space (a.k.a. anonymous maps). This is a very expensive operation (see the sketch after this list), as it involves:

    • making a system call, i.e., switching from the application context to the kernel
    • modifying the process page tables
    • flushing the CPU’s TLB (translation look-aside buffer) cache
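    As a rough illustration of this first layer, here is a minimal sketch (Linux/POSIX assumed, not part of the original program) of what the allocator ultimately does when it needs a fresh chunk from the OS; every such round trip is a system call:

    #include <sys/mman.h>
    #include <cstddef>
    #include <cstdio>

    int main()
    {
        const std::size_t len = 64 * 1024 * 1024; // a 64 MiB anonymous mapping
        // Ask the kernel to map anonymous (swap-backed) pages into our address space.
        void *p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { std::perror("mmap"); return 1; }

        // Only touching the pages makes them resident, i.e., makes RSS grow.
        for (std::size_t i = 0; i < len; i += 4096)
            static_cast<char *>(p)[i] = 1;

        // Giving the region back to the OS is another system call.
        munmap(p, len);
        return 0;
    }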

    That is why most memory managers only ask the OS for large chunks of memory and then perform a second level of subdivision of those chunks, which happens entirely in userspace and is therefore much faster. When you allocate memory with malloc (or with new in C++, which ultimately calls malloc), the allocator looks for free space in what is called a memory arena. If there is space, it carves a chunk out of it and gives you back a pointer to it. If there is no space, it asks the OS for a big chunk of memory, adds it to the arena and then carves a chunk out of it. Once you free the memory, its place in the arena is marked as free and can be reused for further allocations.
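    Here is a minimal sketch of that second, userspace layer (glibc assumed; the address reuse is typical behaviour, not guaranteed): after free() the chunk goes back into the arena, and a following allocation of a similar size is served from that cache without asking the OS for anything, often even at the exact same address.

    #include <cstdio>
    #include <cstdlib>

    int main()
    {
        void *a = std::malloc(4096);
        std::printf("first  malloc: %p\n", a);
        std::free(a);                       // chunk is returned to the arena's free lists

        void *b = std::malloc(4096);        // carved out of the cached chunk, no new OS request
        std::printf("second malloc: %p\n", b); // with glibc this usually prints the same address
        std::free(b);
        return 0;
    }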

    This may seem like a memory leak by design, but the allocator actually relies on the virtual memory mechanism to tell the OS that it does not currently need parts of its mapped memory and the OS is free to remove the backing without unmapping the memory region. You can force that release with malloc_trim, which uses the madvise system call to tell the OS that it doesn’t really need all those pages in the mappings. The memory mapping remains valid and if the process touches that memory again, the OS will bring in new physical memory pages to back it. This is way faster than fully unmapping and then mapping a new memory region. But the thing with madvise is that it simply gives a hint to the OS as to how to treat the process memory in case there is a memory pressure by another process. If there is plenty of free physical memory, the OS will simply never reclaim the freed pages. And even if it does, there is a minimum arena size that the allocator will never go below.
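    To watch this effect, here is a minimal sketch (Linux/glibc assumed; rss_kib is a hypothetical helper that parses VmRSS from /proc/self/status): many small, touched allocations stay resident after free() because a later allocation pins the top of the heap, and malloc_trim(0) then madvises those free pages back to the OS:

    #include <malloc.h>
    #include <cstdlib>
    #include <cstring>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    // Hypothetical helper: read the VmRSS line from /proc/self/status (in KiB).
    static long rss_kib()
    {
        std::ifstream status("/proc/self/status");
        std::string line;
        while (std::getline(status, line))
            if (line.rfind("VmRSS:", 0) == 0)
                return std::stol(line.substr(6));
        return -1;
    }

    int main()
    {
        std::vector<void *> blocks;
        for (int i = 0; i < 100000; ++i)
        {
            blocks.push_back(std::malloc(1024));
            std::memset(blocks.back(), 1, 1024);  // touch the pages so they become resident
        }
        void *pin = std::malloc(1024);            // keeps the heap from shrinking on its own
        std::cout << "allocated:  " << rss_kib() << " KiB\n";

        for (void *p : blocks)
            std::free(p);                         // freed, but the pages typically stay resident
        std::cout << "after free: " << rss_kib() << " KiB\n";

        malloc_trim(0);                           // madvise the free pages back to the OS
        std::cout << "after trim: " << rss_kib() << " KiB\n";

        std::free(pin);
        return 0;
    }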

    There is yet another source of apparent memory leaks in multithreaded applications. The memory allocator uses a complex set of in-memory structures to manage the subdivision of the arena, which is why it needs to lock those structures and cannot serve requests from different threads at the same time, making multithreaded allocation inefficient. This is why modern memory allocators (including the one in glibc) use several arenas, usually up to a multiple of the number of CPU cores. With arenas created dynamically and each one having a minimum size, you may observe a higher RSS once all memory has been freed in a multithreaded process.
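    If retained memory matters more than allocation throughput, one possible mitigation is to cap the number of arenas so the threads share one, at the cost of more lock contention. A minimal sketch (glibc-specific: mallopt(M_ARENA_MAX, ...) and the MALLOC_ARENA_MAX environment variable are glibc features; the loop sizes simply mirror the question and are not a recommendation):

    #include <malloc.h>
    #include <cstdio>
    #include <vector>

    int main()
    {
        // Must run before the worker threads start allocating. The same cap can be
        // applied without recompiling, e.g. MALLOC_ARENA_MAX=1 ./exefile
        if (mallopt(M_ARENA_MAX, 1) == 0)
            std::fprintf(stderr, "mallopt(M_ARENA_MAX) failed\n");

    #pragma omp parallel for
        for (int i = 0; i < 20; ++i)
        {
            std::vector<int> local_v;
            for (int j = 0; j < 5000000; ++j)
                local_v.push_back(j);   // every thread now allocates from the same arena
        }

        malloc_trim(0);                 // with one arena, trimming can release most of the cache
        return 0;
    }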
