I am debugging A segfault reported by TSAN in the CI of Boost.Beast.
I strongly believe it to be a false positive, but I don’t know what to look for in order to suppress it.
It seems to me from the stack trace that the code is correctly instrumented.
The code passes all other tests, inclding valgrind, ubsan, etc.
I’m hoping some kind expert may put me out of my misery.
Here is the output:
====== BEGIN OUTPUT ======
beast.http.read
ThreadSanitizer:DEADLYSIGNAL
==132842==ERROR: ThreadSanitizer: SEGV on unknown address 0x7ff5d9cff000 (pc 0x7ff5dceba0d0 bp 0x000000000000 sp 0x7ff5d9c3d910 T132844)
==132842==The signal is caused by a READ memory access.
#0 __sanitizer::StackDepotBase<__sanitizer::StackDepotNode, 1, 20>::Put(__sanitizer::StackTrace, bool*) <null> (libtsan.so.2+0xba0d0)
#1 __tsan::CurrentStackId(__tsan::ThreadState*, unsigned long) <null> (libtsan.so.2+0x8c48f)
#2 __sanitizer::DD::MutexInit(__sanitizer::DDCallback*, __sanitizer::DDMutex*) <null> (libtsan.so.2+0xac534)
#3 __tsan::DDMutexInit(__tsan::ThreadState*, unsigned long, __tsan::SyncVar*) <null> (libtsan.so.2+0x9a3f8)
#4 __tsan::MetaMap::GetSync(__tsan::ThreadState*, unsigned long, unsigned long, bool, bool) <null> (libtsan.so.2+0xa85dc)
#5 __tsan_atomic32_fetch_add <null> (libtsan.so.2+0x783e9)
#6 __gnu_cxx::__exchange_and_add(int volatile*, int) /usr/include/c++/12/ext/atomicity.h:66 (read+0x4188f8)
#7 __gnu_cxx::__exchange_and_add_dispatch(int*, int) /usr/include/c++/12/ext/atomicity.h:101 (read+0x4188f8)
#8 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release_last_use() /usr/include/c++/12/bits/shared_ptr_base.h:187 (read+0x4188f8)
#9 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() /usr/include/c++/12/bits/shared_ptr_base.h:361 (read+0x40c592)
#10 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() /usr/include/c++/12/bits/shared_ptr_base.h:1071 (read+0x418fee)
#11 std::__shared_ptr<boost::asio::detail::strand_executor_service::strand_impl, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr() /usr/include/c++/12/bits/shared_ptr_base.h:1524 (read+0x420f83)
#12 std::shared_ptr<boost::asio::detail::strand_executor_service::strand_impl>::~shared_ptr() /usr/include/c++/12/bits/shared_ptr.h:175 (read+0x420faf)
#13 boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> const, void>::~invoker() <null> (read+0x471f5f)
#14 boost::asio::detail::executor_op<boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> const, void>, boost::asio::detail::recycling_allocator<void, boost::asio::detail::thread_info_base::default_tag>, boost::asio::detail::scheduler_operation>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) <null> (read+0x48b0ac)
#15 boost::asio::detail::scheduler_operation::complete(void*, boost::system::error_code const&, unsigned long) boost/asio/detail/scheduler_operation.hpp:40 (read+0x4fa38e)
#16 boost::asio::detail::scheduler::do_run_one(boost::asio::detail::conditionally_enabled_mutex::scoped_lock&, boost::asio::detail::scheduler_thread_info&, boost::system::error_code const&) boost/asio/detail/impl/scheduler.ipp:492 (read+0x4e8835)
#17 boost::asio::detail::scheduler::run(boost::system::error_code&) boost/asio/detail/impl/scheduler.ipp:210 (read+0x4e74fb)
#18 boost::asio::io_context::run() boost/asio/impl/io_context.ipp:63 (read+0x4dc122)
#19 boost::beast::test::enable_yield_to::enable_yield_to(unsigned long)::{lambda()#1}::operator()() const <null> (read+0x412363)
#20 void std::__invoke_impl<void, boost::beast::test::enable_yield_to::enable_yield_to(unsigned long)::{lambda()#1}>(std::__invoke_other, boost::beast::test::enable_yield_to::enable_yield_to(unsigned long)::{lambda()#1}&&) <null> (read+0x4a1f92)
#21 std::__invoke_result<boost::beast::test::enable_yield_to::enable_yield_to(unsigned long)::{lambda()#1}>::type std::__invoke<boost::beast::test::enable_yield_to::enable_yield_to(unsigned long)::{lambda()#1}>(boost::beast::test::enable_yield_to::enable_yield_to(unsigned long)::{lambda()#1}&&) <null> (read+0x49f9dc)
#22 void std::thread::_Invoker<std::tuple<boost::beast::test::enable_yield_to::enable_yield_to(unsigned long)::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) <null> (read+0x49c90a)
#23 std::thread::_Invoker<std::tuple<boost::beast::test::enable_yield_to::enable_yield_to(unsigned long)::{lambda()#1}> >::operator()() <null> (read+0x49941e)
#24 std::thread::_State_impl<std::thread::_Invoker<std::tuple<boost::beast::test::enable_yield_to::enable_yield_to(unsigned long)::{lambda()#1}> > >::_M_run() <null> (read+0x494c2a)
#25 execute_native_thread_routine <null> (libstdc++.so.6+0xdbb72)
#26 __tsan_thread_start_func <null> (libtsan.so.2+0x393ef)
#27 start_thread <null> (libc.so.6+0x8ce2c)
#28 clone3 <null> (libc.so.6+0x1121af)
ThreadSanitizer can not provide additional info.
SUMMARY: ThreadSanitizer: SEGV (/lib64/libtsan.so.2+0xba0d0) in __sanitizer::StackDepotBase<__sanitizer::StackDepotNode, 1, 20>::Put(__sanitizer::StackTrace, bool*)
==132842==ABORTING
The code being tested is latest master
branch.
Command line to reproduce:
$ ./b2 toolset=gcc thread-sanitizer=norecover link=static variant=debug libs/beast/test -q -d+2 -j1
My compiler info:
$ gcc --version
gcc (GCC) 12.1.1 20220507 (Red Hat 12.1.1-1)
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
My OS is Fedora 36. But we see this happen on Ubuntu as well.
2
Answers
Thanks everyone who chipped in on this. I was was able to get some help from the legendary Chris Kohlhoff, author of the Asio library.
Quoting:
The issue is in the thread sanitiser library itself.
The workaround in this case is to restructure the code so that the initiation of the fiber takes place on the same thread as the one in which it makes progress.
Here's the old (correct but not TSAN-compatible) code:
And here's the code with the workaround:
Using the
b2
line I could repro the SEGV on linux using master (00293a6adb5 from the superproject).I started with a – to me – more convenient setup based on CMake. I modified the CMake to use
thread,undefined
sanitizers instead ofaddress,undefined
forVARIANT=ubasan
.Interestingly, it doesn’t segfault. It does however seem to have a legit TSAN violation in
basic_stream.cpp
, where the effective flags are:Breaking it down for readability:
The reported diagnostic: https://paste.ubuntu.com/p/6SKjmZ9wFT/ (lines truncated for SO):
Given this observation, I thought to see whether excluding the offending TU (
basic_stream.cpp
) from the b2 build removes the SEGV. No such luck.On the contrary, the SEGV manfifest with the following TUs:
buffered_read_stream.cpp
read.cpp
write.cpp
close.cpp
handshake.cpp
ping.cpp
read2.cpp
write.cpp
http_examples.cpp
Dropping these TUs from their respective
test/**/Jamfile
allows all remaining tests to pass TSAN under b2. Now, I did some soul searching and e.g. unique include diving using a script like:Which uncovers a common subset of includes of 854 includes. 40 of the beast headers are in that consistent set, but 208 are asio headers.
Questions from here:
Choosing to address these 3., 1., 2. (optimizing for return-on-effort)
3. Is a recent Asio change involved? [YES]
Doing the
b2
test with only Asio reverted to 1.79.0 (e929e5cf Merge asio from ‘develop’) passes all the tests cleanly.Just to check that no compiler flags were harmed in the process e.g.
buffered_read_stream.cpp
showed the same exact commands:So, now we know that something inside the 208 Asio headers must be involved.
1. Why is SEGV not happening in the CMake build?
Surely, here we should be able to spot difference in compilation flags? Ever-so-slightly redacted to remove spelling differences (left = CMake, right = b2):
I used the process of elimination, figured out that the culprit is
-fno-inline -O0
. Somehow it leads to recursive TSAN errors:Observations without
-fno-inline
:-O2
the symptoms go away-O1
the symptoms go away ifNDEBUG
is defined-O0
the symptom is there regardlessThis suggests that a NDEBUG-sensitive piece of code could be involved. This might be a lead to guide minimization.
Comparing preprocessed sources may highlight specific possible causes. My main suspicions are the revamped
spawned_thread_base
inasio/spawn.hpp
, source-locations and or cancellation slots.As a courtesy, here’s the preprocessed-buffered-stream-reader.zip containing 4 files (
/home/sehe/{with,without}-NDEBUG.i.asio1.{79,80}.0
).