#include <malloc.h>   // malloc_trim
#include <cstddef>
#include <vector>
#include <iostream>
#include <chrono>
#include <thread>

// Each iteration of the parallel loop builds and destroys a large std::vector,
// so all heap memory has been freed by the time the function returns.
void test_openmp(const size_t for_n, const size_t vec_n)
{
    std::cout << "running openmp" << std::endl;
    #pragma omp parallel for
    for (size_t i = 0; i < for_n; ++i)
    {
        std::vector<int> local_v;
        for (size_t j = 0; j < vec_n; ++j)
        {
            local_v.push_back(j);
        }
    }
    std::cout << "finished openmp" << std::endl;
}

int main()
{
    std::cout << "sleeping" << std::endl;   // time to observe RSS in htop before the run
    std::this_thread::sleep_for(std::chrono::seconds(10));
    test_openmp(20, 5000000);
    malloc_trim(0);                         // ask glibc to return freed memory to the OS
    std::cout << "sleeping" << std::endl;   // time to observe RSS in htop after the run
    std::this_thread::sleep_for(std::chrono::seconds(10));
    return 0;
}
It looks like there is a memory leak when using the STL (or any heap allocation) together with OpenMP multithreading. On my Ubuntu 22 platform with 8 cores, the program above leaks ~200 MB (observed with htop) after running test_openmp(20, 5000000).

Is this a known issue, meaning that I should not allocate any heap memory in multithreaded code? How should I use it correctly?

Using int local_v[vec_n] (no heap memory) instead of the vector is fine, and so is #pragma omp parallel for num_threads(1). It also looks like the leak grows with the number of threads used.
2 Answers
You have not shared any evidence to back the claim of a ~200 MB memory leak. I ran your program under valgrind and it does not report anything definitely lost or indirectly lost. It does, however, report some possibly lost and still reachable bytes attributed to the OpenMP library (the first function on the stack was GOMP_parallel()).
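For reference, an invocation along these lines reports the leak-kind categories mentioned above (the source file name, binary name and build flags here are my assumptions, not taken from the post):

g++ -O2 -fopenmp test_openmp.cpp -o test_openmp
valgrind --leak-check=full --show-leak-kinds=all ./test_openmp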
I ran the program as posted (with no changes), with #pragma omp parallel for num_threads(1), and with #pragma omp parallel for num_threads(20). There is no definitely lost memory in any of these cases.
Digging a little into the bytes reported as possibly lost and still reachable, I found an old thread (which says this is not a memory leak): gomp contains small memory leak

There is no memory leak in your program. What you are observing is behaviour typical of modern memory allocators: they do not immediately release freed memory back to the operating system, but cache it in the arena for future reuse. This is because memory management happens on two layers.
The first layer is the OS layer. When a process needs memory, it asks the OS to map some physical pages into its address space. Those pages can be backed by the contents of a disk file (a.k.a. memory-mapping a file) or by the system swap space (a.k.a. anonymous maps). This is a very expensive operation, as it involves a system call and work inside the kernel, such as updating the process page tables and faulting in pages when they are first touched.
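To make the OS layer concrete, here is a minimal sketch (my own illustration, not code from the answer) of using it directly on Linux: an anonymous mapping obtained with mmap, its physical backing released as a hint with madvise, and finally the mapping removed with munmap. This is the kind of work the allocator does on your behalf when an arena runs out of space.

#include <sys/mman.h>   // mmap, madvise, munmap (POSIX/Linux)
#include <cstddef>
#include <cstdio>

int main()
{
    const std::size_t len = 16 * 1024 * 1024;   // 16 MiB

    // Ask the OS for anonymous (swap-backed) pages mapped into our address space.
    void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { std::perror("mmap"); return 1; }

    // Touching the pages is what actually faults in physical memory.
    auto* bytes = static_cast<unsigned char*>(p);
    for (std::size_t i = 0; i < len; i += 4096) bytes[i] = 1;

    // Hint: the OS may drop the physical backing; the mapping stays valid and
    // will be repopulated with zeroed pages if touched again.
    madvise(p, len, MADV_DONTNEED);

    // Fully remove the mapping from the address space.
    munmap(p, len);
    return 0;
}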
That is why most memory managers only ask the OS for large chunks of memory and then perform a second level of subdivision of those chunks, which happens entirely in userspace and is therefore much faster. When you allocate memory with malloc (or with new in C++, which ultimately calls malloc), the allocator looks for free space in what is called a memory arena. If there is space, it carves a chunk out of it and gives you back a pointer to it. If there is no space, it asks the OS for a big chunk of memory, adds it to the arena and then carves your chunk out of that. Once you free the memory, its place in the arena is marked as free and can be reused for further allocations.

This may seem like a memory leak by design, but the allocator actually relies on the virtual memory mechanism: it can tell the OS that it does not currently need parts of its mapped memory, and the OS is free to remove the physical backing without unmapping the memory region. You can force that release with malloc_trim, which uses the madvise system call to tell the OS that it doesn't really need all those pages in the mappings. The memory mapping remains valid, and if the process touches that memory again, the OS will bring in new physical pages to back it. This is way faster than fully unmapping and then mapping a new memory region. But the thing with madvise is that it simply gives the OS a hint about how to treat the process memory in case there is memory pressure from another process. If there is plenty of free physical memory, the OS will simply never reclaim the freed pages. And even if it does, there is a minimum arena size that the allocator will never go below.

There is yet another source of apparent memory leaks in multithreaded applications. The memory allocator uses a complex set of in-memory structures to manage the subdivision of the arena, which is why it needs to lock those structures and cannot serve requests from different threads at the same time, making multithreaded allocation inefficient. This is why modern memory allocators (including the one in glibc) use several arenas, usually up to a multiple of the number of CPU cores. With arenas created dynamically and each one having a minimum size, you may observe a higher RSS once all memory has been freed in a multithreaded process than in a single-threaded one.
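As an illustration of the arena behaviour, here is a small sketch of my own (not code from the answer; it relies on glibc-specific knobs and assumes the program is built with -fopenmp): cap the number of arenas with mallopt(M_ARENA_MAX, ...) and watch the allocator statistics with malloc_stats() before and after malloc_trim(0).

#include <malloc.h>     // mallopt, M_ARENA_MAX, malloc_stats, malloc_trim (glibc)
#include <vector>
#include <cstdio>

int main()
{
    // Cap glibc at a single arena so worker threads do not each grow their own
    // (lower idle RSS, at the cost of more lock contention during allocation).
    mallopt(M_ARENA_MAX, 1);

    std::puts("--- before allocations ---");
    malloc_stats();                        // glibc: prints per-arena statistics to stderr

    #pragma omp parallel for
    for (int i = 0; i < 20; ++i)
    {
        std::vector<int> v;                // freed at the end of each iteration,
        for (int j = 0; j < 5000000; ++j)  // but the pages stay cached in the arena
            v.push_back(j);
    }

    std::puts("--- after the parallel loop ---");
    malloc_stats();

    malloc_trim(0);                        // ask glibc to release free arena memory to the OS
    std::puts("--- after malloc_trim(0) ---");
    malloc_stats();
    return 0;
}

With M_ARENA_MAX left at its default, you would expect the post-loop numbers (and the RSS shown by htop) to grow with the number of threads, which matches the behaviour described above.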