skip to Main Content

I am running a simple benchmarking exercise of 2d array access of reading and writing.

I have const int XDim = 1000; and const int YDim = 1000; and create a 1000 x 1000 matrix.

In C-style, I initialize using

double* doubleArray[XDim];
for (int i = 0; i < XDim; i++)
    doubleArray[i] = (double*)malloc(YDim * sizeof(double));
for (int i = 0; i < XDim; i++)
    for (int j = 0; j < YDim; j++)
        doubleArray[i][j] = 0.;

In C++ std::vector<std::vector<double>> I use:

std::vector<std::vector<double>> doubleArray;
std::vector<double> vecdbl;
for (int i = 0; i < XDim; i++)
    doubleArray.push_back(vecdbl);
for (int i = 0; i < XDim; i++)
    for (int j = 0; j < YDim; j++)
        doubleArray[i].push_back(0.);

In boost style, I use:

typedef boost::multi_array<double, 2> OuterArray;
typedef boost::array<OuterArray::index, 2> OuterArrayIndices;
OuterArray::index rows = XDim;
OuterArray::index cols = YDim;
OuterArrayIndices shape = { rows, cols };
OuterArray doubleArray(shape);
for (int i = 0; i < XDim; i++)
    for (int j = 0; j < YDim; j++)
        doubleArray[i][j] = 0.;

Then, I randomly pick 999999 elements from doubleArray[][] and set them to a random integer value chosen in [-5, 5]. Then, I further pick randomly 999999 elements, add them up, then find their avarage. The average is found and the time taken for all of this is displayed. This exercise is repeated many times.

The Godbolt link that does all this is available over here

On the above link, Godbolt seems to only handle const int total_iterations = 2;. Larger values give an error that the time has exceeded. On running this on my computer with const int total_iterations = 100;, a consistent pattern I observe is that the boost method is unable to match the C method of malloc or the C++ method of std::vector<std::vector<double>>. Between the latter two, mostly, the C method beats the C++ method while there are a handful of instances of the reverse. All of this is compiled and run in Visual Studio in Release mode using default settings.

I am unable to see the value of boost::multi_array given this over raw C pointer of pointer or std::vector<std::vector<double>>. Is there a way boost::multi_array can be made faster?


ETA: Based on comments, with sequential access to array elements, and release mode settings on Godbolt, boost::multi_array wins majority of the times. See here 9 out of 10 times.

With nonsequential, random access to elements, and release mode settings on Godbolt, it is close, with C style slightly better than boost::multi_array. See here

Thanks to tstanisl in the comments, for sequential access, C style seems to be faster in 10 out of 10 times. His link is accessible here

There is an overall caveat pointed out by user n-m-could-be-an-ai that this profiling could be profiling the random number generator. Controlling for that seems to be beyond the purview of this question. However, it appears that regardless, since the same number of RNGs are generated for each case, and since the number of iterations is also not 1 or 2, but larger, that could be a wash. I could be wrong.


ETA(2): Based on answer provided by chqrlie, I have modified the code which is available here. I have commented it hopefully sufficiently to see what is going on. It is still not fully clear to me which way is faster, but perhaps someone else can use this code as a starting point to better benchmark these various data structures.

2

Answers


  1. I’m afraid your benchmark has serious flaws that make the results almost meaningless:

    • you compile the C++ code without optimisations. Add -O2 or -O3 in the command option field of the compilation window after uncovering it.
    • you use Godbolt’s infrastructure to run the benchmark: the CPUs in the cloud are wildly overloaded, which may explain the inconsistent results between runs and even from one method to another.
    • it is unclear how efficient the random number generators are compared to the matrix access times: you should use a separate array to pre-compute these values or at least measure the time without array access, summing all such computations to avoid optimisation side effects.

    Here is a modified version with an additional matrix implementation using C99 VLAs, which are supported by gcc: https://godbolt.org/z/v5o8P5M6d

    Try it on a standalone idle computer to get meaningful results.

    Login or Signup to reply.
  2. You wouldn’t expect to see much gains as the access patterns don’t afford much gains.

    However you can separate the scenario building from the perf measurements, and then do that properly, so you get:

    Live On Compiler Explorer

    #define NDEBUG
    #define BOOST_DISABLE_ASSERTS 1
    #include <boost/multi_array.hpp>
    #include <chrono>
    #include <iostream>
    #include <random>
    using namespace std::chrono_literals;
    
    template <typename F, typename... Args>
    auto timed(std::string_view caption, F f, Args const&... args) {
        using C = std::chrono::high_resolution_clock;
        struct Report {
            std::string_view caption;
            C::time_point    s;
            ~Report() { std::cout << caption << ": " << (C::now() - s) / 1.s << "sn"; }
        } r{caption, C::now()};
    
        return f(args...);
    }
    
    static size_t constexpr XDim             = 1'000;
    static size_t constexpr YDim             = 1'000;
    static size_t constexpr total_iterations = 2;
    static size_t constexpr nsamples         = 999'999;
    using Value                              = int;
    
    namespace Generate {
        static std::mt19937                            rng{std::random_device{}()}; // or handpick a seed?
        static std::uniform_int_distribution<Value>    pm5{-5, 5};
        static std::uniform_int_distribution<unsigned> genX{0, XDim - 1}, genY{0, YDim - 1};
        using Value = int;
        struct Scenario {
            struct {
                unsigned x = genX(rng), y = genY(rng);
                Value    val = pm5(rng);
            } data[nsamples];
            struct {
                unsigned x = genX(rng), y = genY(rng);
            } probes[nsamples];
        };
    } // namespace Generate
    
    template <typename Array>
    void RunScenario(Generate::Scenario const& scenario, Array&& arr) {
        for (auto [x, y, val] : scenario.data)
            arr[x][y] = val;
        double sum = 0;
        for (auto [x, y] : scenario.probes)
            sum += arr[x][y];
        printf("Average is %.4ft", sum / nsamples);
    }
    
    static inline void c_style(Generate::Scenario const& scenario) {
        double* arr[XDim];
        for (size_t i = 0; i < XDim; i++)
            arr[i] = (double*)malloc(YDim * sizeof(double));
        for (size_t i = 0; i < XDim; i++)
            for (size_t j = 0; j < YDim; j++)
                arr[i][j] = 0.;
    
        RunScenario(scenario, arr);
        for (size_t i = 0; i < XDim; i++)
            free(arr[i]);
    };
    
    static inline void cxx_style(Generate::Scenario const& scenario) {
        using Row = std::vector<double>;
        std::vector<Row> arr(XDim, Row(YDim, 0.));
    
        RunScenario(scenario, arr);
    };
    
    static inline void multi_array_style(Generate::Scenario const& scenario) {
        using Array = boost::multi_array<double, 2>;
        static constexpr std::array<Array::index, 2> shape{XDim, YDim};
    
        Array arr(shape);
        RunScenario(scenario, arr);
    };
    
    int main() {
        /*
         * In each iteration, we will populate 999'999 randomly chosen entries of
         * the 1000 x 1000 array of doubles with random integer entries between
         * [-5, 5]
         *
         * Then, we will find the average of 999'999 randomly chosen entries of the
         * 1000 x 1000 array of doubles by summing them up and dividing by 999'999
         */
        for (size_t iteration = 0; iteration < total_iterations; ++iteration) {
            std::cout << "n----Iteration " << iteration << "----n";
    
            auto scenario = timed("New scenario", []{ return std::make_unique<Generate::Scenario>(); });
    
            timed("C          style", c_style,           *scenario);
            timed("C++        style", cxx_style,         *scenario);
            timed("MultiArray style", multi_array_style, *scenario);
        }
    }
    

    The output being something like

    ----Iteration 0----
    New scenario: 0.0425284s
    Average is 0.0018   C          style: 0.0152797s
    Average is 0.0018   C++        style: 0.0143449s
    Average is 0.0018   MultiArray style: 0.0135915s
    
    ----Iteration 1----
    New scenario: 0.0393743s
    Average is -0.0007  C          style: 0.0138404s
    Average is -0.0007  C++        style: 0.0107113s
    Average is -0.0007  MultiArray style: 0.00985549s
    

    enter image description here

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search