
I’ve written this implementation of a double buffer:

// ping_pong_buffer.hpp

#include <cstddef>
#include <vector>
#include <mutex>
#include <condition_variable>

template <typename T>
class ping_pong_buffer {
public:

    using single_buffer_type = std::vector<T>;
    using pointer = typename single_buffer_type::pointer;
    using const_pointer = typename single_buffer_type::const_pointer;

    ping_pong_buffer(std::size_t size)
        : _read_buffer{ size }
        , _read_valid{ false }
        , _write_buffer{ size }
        , _write_valid{ false } {}

    // Block until the producer has published data, then expose the read buffer.
    const_pointer get_buffer_read() {
        {
            std::unique_lock<std::mutex> lk(_mtx);
            _cv.wait(lk, [this] { return _read_valid; });
        }
        return _read_buffer.data();
    }

    // Mark the read buffer as consumed and wake the waiting producer.
    void end_reading() {
        {
            std::lock_guard<std::mutex> lk(_mtx);
            _read_valid = false;
        }
        _cv.notify_one();
    }

    // Mark the write buffer as holding data and expose it for the producer to fill.
    pointer get_buffer_write() {
        _write_valid = true;
        return _write_buffer.data();
    }

    // Wait until the reader has released its buffer, swap the buffers, and wake the reader.
    void end_writing() {
        {
            std::unique_lock<std::mutex> lk(_mtx);
            _cv.wait(lk, [this] { return !_read_valid; });
            std::swap(_read_buffer, _write_buffer);
            std::swap(_read_valid, _write_valid);
        }
        _cv.notify_one();
    }

private:

    single_buffer_type _read_buffer;
    bool _read_valid;
    single_buffer_type _write_buffer;
    bool _write_valid;
    mutable std::mutex _mtx;
    mutable std::condition_variable _cv;

};

Using this dummy test, which performs just swaps, its performance is about 20 times worse on Linux than on Windows:

#include <thread>
#include <iostream>
#include <chrono>

#include "ping_pong_buffer.hpp"

constexpr std::size_t n = 100000;

int main() {

    ping_pong_buffer<std::size_t> ppb(1);

    std::thread producer([&ppb] {
        for (std::size_t i = 0; i < n; ++i) {
            auto p = ppb.get_buffer_write();
            p[0] = i;
            ppb.end_writing();
        }
    });

    const auto t_begin = std::chrono::steady_clock::now();

    for (;;) {
        auto p = ppb.get_buffer_read();
        if (p[0] == n - 1)
            break;
        ppb.end_reading();
    }

    const auto t_end = std::chrono::steady_clock::now();

    producer.join();

    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t_end - t_begin).count() << '\n';

    return 0;

}

The test environments are:

  • Linux (Debian Stretch): Intel Xeon E5-2650 v4, GCC: 900 to 1000 ms
    • GCC flags: -O3 -pthread
  • Windows (10): Intel i7 10700K, VS2019: 45 to 55 ms
    • VS2019 flags: /O2

You may find the code here on godbolt, with ASM output for both GCC and VS2019 using the compiler flags actually used.

This huge gap has also been observed on other machines and seems to be due to the OS.

What could be the reason for this surprising difference?

UPDATE:

The test has also been performed on Linux on the same i7 10700K, and it is still a factor of 8 slower than on Windows.

  • Linux (Ubuntu 18.04.5): Intel i7 10700K, GCC: 290 to 300 ms
    • GCC flags: -O3 -pthread

If the number of iterations is increased by a factor of 10, I get 2900 ms.

3 Answers


  1. A performance gap as large as this one probably has to do with the respective implementations of locking. A profiler should be able to break down the reasons why the process is being forced to wait; the lock semantics and features are not at all the same between these two OSes.
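
    One low-effort way to get such a breakdown, assuming a POSIX system (the fields below are filled in on Linux), is getrusage(): it reports voluntary context switches (the thread gave up the CPU, e.g. by sleeping in a lock or condition-variable wait) separately from involuntary ones (preemption). A minimal sketch, not part of the original code:

    #include <sys/resource.h>
    #include <iostream>

    // Hypothetical helper: print how often this process has been context-switched so far.
    void print_context_switches() {
        rusage ru{};
        if (getrusage(RUSAGE_SELF, &ru) == 0) {
            std::cout << "voluntary ctx switches:   " << ru.ru_nvcsw  << '\n'
                      << "involuntary ctx switches: " << ru.ru_nivcsw << '\n';
        }
    }

    Calling this at the end of the test’s main() should show a voluntary count on the same order as the number of iterations if every hand-off really puts a thread to sleep.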

  2. As Mike Robinson answered, this is likely to do with the different locking implementations on Windows and Linux.
    We can get a quick idea of the overhead by profiling how often each implementation switches contexts. I can do the Linux profile; I’m curious whether anyone else can try the same on Windows.


    I’m running Ubuntu 18.04 on an Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz.

    I compiled with g++ -O3 -pthread -g test.cpp -o ping_pong and recorded the context switches with this command:

    sudo perf record -s -e sched:sched_switch -g --call-graph dwarf -- ./ping_pong

    Then I extracted a report from the perf counts with this command:

    sudo perf report -n --header --stdio > linux_ping_pong_report.sched
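
    A quicker sanity check, assuming the same perf installation, is to count the software event directly:

    sudo perf stat -e context-switches -- ./ping_pong

    which prints a single total instead of a full recording.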

    The report is large, but I’m only interested in the section showing that about 200,000 context switches were recorded:

    # Total Lost Samples: 0
    #
    # Samples: 198K of event 'sched:sched_switch'
    # Event count (approx.): 198860
    #
    

    I think that indicates really bad performance: in the test there are n = 100000 items pushed and popped through the double buffer, so there is a context switch almost every time we call end_reading() or end_writing(), which is what I’d expect from using std::condition_variable.
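
    One way to test this explanation would be a hypothetical experiment: replace the condition-variable hand-off with a busy-wait on an atomic flag, so that neither thread ever sleeps in the kernel. A minimal single-producer/single-consumer sketch (not the code from the question):

    #include <atomic>
    #include <cstddef>
    #include <utility>
    #include <vector>

    // Same interface as ping_pong_buffer, but the reader/writer hand-off is a
    // spin on an atomic flag instead of a mutex + condition variable.
    template <typename T>
    class spinning_ping_pong_buffer {
    public:
        explicit spinning_ping_pong_buffer(std::size_t size)
            : _read_buffer(size), _write_buffer(size) {}

        const T* get_buffer_read() {
            // Spin until the producer has published a full buffer.
            while (!_read_valid.load(std::memory_order_acquire)) {}
            return _read_buffer.data();
        }

        void end_reading() {
            // Release the buffer back to the producer.
            _read_valid.store(false, std::memory_order_release);
        }

        T* get_buffer_write() {
            return _write_buffer.data();
        }

        void end_writing() {
            // Spin until the consumer has released the previous buffer, then publish.
            while (_read_valid.load(std::memory_order_acquire)) {}
            std::swap(_read_buffer, _write_buffer);
            _read_valid.store(true, std::memory_order_release);
        }

    private:
        std::vector<T> _read_buffer;
        std::vector<T> _write_buffer;
        std::atomic<bool> _read_valid{ false };
    };

    If the sched:sched_switch count collapses with such a variant, that would confirm that the futex sleep/wake inside std::condition_variable dominates the original timings; the trade-off is that each waiting thread now burns a full core while it spins.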

  3. As answered by @GandhiGandhi for Linux, I ran the same measurements on Windows 10.

    I used PIX to generate a sampling profile for the application run in x64 Release on MSVC VS 2019. For this sampling profile I filtered the capture down to the two threads contending for the buffer.

    Then, I exported this profile as a .wpix file. Since .wpix files use SQLite for storage, I opened the exported file with an SQLite browser and queried the ReadyThread table, which contains one row of data per context switch based on "readying the thread".

    I then ran:

    SELECT COUNT(*) FROM ReadyThread
    

    This gave me 27332, which means the Windows run of this snippet of code performed roughly 27 thousand context switches. Compared to the roughly 200 thousand observed in Gandhi’s answer under Linux, this is most likely the culprit for the timing differences you’re seeing.

    I ran this under Windows 10 Pro Version 20H2 19042.867 on an AMD Ryzen 9 5950X 16-Core @ 3.40 GHz.
