Ubuntu - C++ implementation of a simple map slower than equivalent implementation in Java: Code/Benchmark Issue?

JoasCoder
December 15, 2024
65 views
0 votes
2 Answers

The goal of this research is to explore the performance differences between JIT (just-in-time compilation) and AOT (ahead-of-time compilation) strategies and to understand their respective advantages and disadvantages. The intent is not to claim that one language is slower or worse than the other.
In our tests, we observed better results with the HotSpot JVM 23 using JIT compilation (JVMCI and C2). We got slower results with C++ (compiled with Clang 18), GraalVM 23 (compiled with native-image), and HotSpot JVM 23 with the -Xcomp flag (JVMCI and C2). We are seeking to understand why this happens, and if possible, identify ways to improve the performance of the C++ version to match the results of Java’s JIT compilation.

Our benchmark involves comparing a simple and small hash table (map) implementation in Java to an equivalent (line-by-line) implementation in C++. We made every effort to ensure consistency between the two implementations. The goal is not to compare the standard library implementations of hash table (java.util.HashMap and std::unordered_map) as they are of course implemented with non-equivalent source code.

The benchmark creates a hash table with 20,000 buckets and inserts 2,000,000 objects. We insert once, clear the map and then insert again, to take advantage of the Entry object internal object pool of the hash table. In other words, the first time you insert, objects are going to be allocated in the heap. The second time you insert objects are going to be re-used from the internal object pool.

The Bench class, which handles the measurements, should also be equivalent in both languages.

Given these details, does anyone have insights into why the C++ map implementation was slower than the Java map implementation? Could there be something we overlooked or an aspect of the C++ implementation that could be optimized further? Perhaps specific Clang optimization options we should explore?

All source code, which is not much, with scripts to compile and execute together with our results are in the Github of the project.

From the comments below, it looks like a major source of improvement for the C++ code could be custom allocators for the hash table internal Entry objects. It has been mentioned that Java has an easier/faster time allocating objects in the heap than C++, through the new keyword, but that this shortcoming can be addressed in C++ by using custom allocators. Unfortunately my knowledge of C++ is limited and I will have to do some research to understand how to do that. An answer here with such a change/improvement in the C++ code would be great.

It is important to note that on the second put benchmark, the C++ code will not be allocating any Entry in the heap, as the objects will be all available in the internal object pool (from the first put). Therefore I’m not sure why C++ is still slower on the second put benchmark.

FTR, below are our current results for Java and C++:

HotSpotVM JIT: (with Graal JVMCI JIT)

PUT => Avg: 371 ns | Min: 28 ns | 99.9% = [avg: 367 ns, max: 1.743 micros]
PUT => Avg: 613 ns | Min: 27 ns | 99.9% = [avg: 606 ns, max: 2.184 micros]
GET => Avg: 615 ns | Min: 14 ns | 99.9% = [avg: 607 ns, max: 2.549 micros]
DEL => Avg: 662 ns | Min: 18 ns | 99.9% = [avg: 658 ns, max: 2.538 micros]

HotSpotVM JIT: (with C2 JIT)

PUT => Avg: 342 ns | Min: 29 ns | 99.9% = [avg: 338 ns, max: 1.661 micros]
PUT => Avg: 596 ns | Min: 28 ns | 99.9% = [avg: 589 ns, max: 2.161 micros]
GET => Avg: 599 ns | Min: 20 ns | 99.9% = [avg: 592 ns, max: 2.275 micros]
DEL => Avg: 826 ns | Min: 23 ns | 99.9% = [avg: 817 ns, max: 3.420 micros]

C++ LLVM: (clang)

PUT => Avg: 726 ns | Min: 30 ns | 99.9% = [avg: 720 ns, max: 4.097 micros]
PUT => Avg: 857 ns | Min: 18 ns | 99.9% = [avg: 848 ns, max: 2.933 micros]
GET => Avg: 874 ns | Min: 18 ns | 99.9% = [avg: 865 ns, max: 3.010 micros]
DEL => Avg: 875 ns | Min: 19 ns | 99.9% = [avg: 871 ns, max: 2.810 micros]

GraalVM: (native-image)

PUT => Avg: 190 ns | Min: 21 ns | 99.9% = [avg: 183 ns, max: 814 ns]
PUT => Avg: 659 ns | Min: 23 ns | 99.9% = [avg: 656 ns, max: 2.762 micros]
GET => Avg: 399 ns | Min: 21 ns | 99.9% = [avg: 396 ns, max: 2.124 micros]
DEL => Avg: 323 ns | Min: 27 ns | 99.9% = [avg: 321 ns, max: 1.850 micros]

Benchmark Environment:

$ uname -a
Linux hivelocity 4.15.0-20-generic #21-Ubuntu SMP Tue Apr 24 06:16:15 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ cat /etc/issue | head -n 1
Ubuntu 18.04.6 LTS n l

$ cat /proc/cpuinfo | grep "model name" | head -n 1 | awk -F ": " '{print $NF}'
Intel(R) Xeon(R) E-2288G CPU @ 3.70GHz

$ arch
x86_64

$ clang++ --version
Ubuntu clang version 18.1.0 (++20240220094926+390dcd4cbbf5-1~exp1~20240220214944.50)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

$ java -version
java version "23.0.1" 2024-10-15
Java(TM) SE Runtime Environment Oracle GraalVM 23.0.1+11.1 (build 23.0.1+11-jvmci-b01)
Java HotSpot(TM) 64-Bit Server VM Oracle GraalVM 23.0.1+11.1 (build 23.0.1+11-jvmci-b01, mixed mode, sharing)

$ native-image --version
native-image 23.0.1 2024-10-15
GraalVM Runtime Environment Oracle GraalVM 23.0.1+11.1 (build 23.0.1+11-jvmci-b01)
Substrate VM Oracle GraalVM 23.0.1+11.1 (build 23.0.1+11, serial gc, compressed references)

To compile the C++ code:

rm -f target/cpp/int_map_benchmark target/cpp/int_map.o target/cpp/bench.o target/cpp/int_map_benchmark.o

mkdir -p target/cpp

clang++ -Ofast -march=native -flto -std=c++17 -I./src/main/c -c ./src/main/c/int_map.cpp -o ./target/cpp/int_map.o
clang++ -Ofast -march=native -flto -std=c++17 -I./src/main/c -c ./src/main/c/bench.cpp -o ./target/cpp/bench.o
clang++ -Ofast -march=native -flto -std=c++17 -I./src/main/c -c ./src/main/c/int_map_benchmark.cpp -o ./target/cpp/int_map_benchmark.o

clang++ -Ofast -march=native -flto -std=c++17 -o ./target/cpp/int_map_benchmark ./target/cpp/int_map.o ./target/cpp/bench.o ./target/cpp/int_map_benchmark.o

To run the C++ code:

#!/bin/bash

WARMUP=${1:-0}
MEASUREMENTS=${2:-2000000}
CAPACITY=${3:-20000}

./target/cpp/int_map_benchmark $WARMUP $MEASUREMENTS $CAPACITY

Below the C++ code:

// =====> bench.hpp
#ifndef BENCH_HPP
#define BENCH_HPP

#include <chrono>
#include <iostream>
#include <limits>
#include <iomanip>
#include <string>
#include <cmath>
#include <map>
#include <sstream>

class Bench {
public:
    Bench(int warmupCount = 0);
    ~Bench();

    void mark();
    void measure();
    bool measure(long long);
    void reset();
    void reset(bool);
    void printResults() const;
    void printResults(bool) const;
    bool isWarmingUp() const;
    int getIterations() const;
    int getMeasurements() const;
    double getAverage() const;

private:
    int warmupCount;
    int measurementCount;
    long long sum;
    long long minTime;
    long long maxTime;
    int size;
    std::map<long long, long long>* results;
    std::chrono::steady_clock::time_point startTime;
    
    static std::string formatWithCommas(long long value);
    static std::pair<double, std::string> formatTime(double nanos);
    static std::string formatPercentage(double perc);
    static double roundToDecimals(double d, int decimals);
    void printPercentiles() const;
    void addPercentile(double perc) const;
    double avg() const;
};

#endif // BENCH_HPP

// =====> bench.cpp
#include "bench.hpp"
using namespace std;

Bench::Bench(int warmupCount)
    : warmupCount(warmupCount),
      measurementCount(0),
      sum(0),
      minTime(numeric_limits<long long>::max()),
      maxTime(numeric_limits<long long>::min()),
      size(0) {

        results = new map<long long, long long>();

}

Bench::~Bench() {
    delete results;
}

void Bench::mark() {
    startTime = chrono::steady_clock::now();
}

void Bench::measure() {
    auto endTime = chrono::steady_clock::now();
    auto elapsed = chrono::duration_cast<chrono::nanoseconds>(endTime - startTime).count();
    measure(elapsed);
}

bool Bench::measure(long long elapsed) {

    bool isToMeasure = ++measurementCount > warmupCount;

    if (isToMeasure) {
        sum += elapsed;
        if (elapsed < minTime) minTime = elapsed;
        if (elapsed > maxTime) maxTime = elapsed;

        // Increment the frequency of this elapsed time
        auto it = results->find(elapsed);
        if (it == results->end()) {
            results->insert({elapsed, 1});
        } else {
            it->second++;
        }
        size++;
    }
    
    return isToMeasure;
}

int Bench::getIterations() const {
    return measurementCount;
}

int Bench::getMeasurements() const {
    return size;
}

void Bench::reset() {
    reset(false);
}

void Bench::reset(bool repeatWarmup) {
    measurementCount = 0;
    sum = 0;
    if (!repeatWarmup) warmupCount = 0;
    minTime = numeric_limits<long long>::max();
    maxTime = numeric_limits<long long>::min();
    results->clear();
    size = 0;
}

bool Bench::isWarmingUp() const {
    return warmupCount <= measurementCount;
}

double Bench::avg() const {
    const int effectiveCount = measurementCount - warmupCount;
    if (effectiveCount <= 0) {
        return 0;
    }
    const double avg = static_cast<double>(sum) / effectiveCount;
    const double rounded = round(avg * 100.0) / 100.0;
    return rounded;
}

double Bench::getAverage() const {
    return avg();
}   

void Bench::printResults() const {
    printResults(true);
}

void Bench::printResults(bool includePercentiles) const {

    int effectiveCount = measurementCount - warmupCount;

    string effCountStr = formatWithCommas(effectiveCount);
    string warmupStr = formatWithCommas(warmupCount);
    string totalStr = formatWithCommas(measurementCount);

    cout << "Measurements: " << effCountStr
         << " | Warm-Up: " << warmupStr
         << " | Iterations: " << totalStr << endl;
         
    if (effectiveCount > 0) {

        auto [avgVal, avgUnit] = formatTime(avg());
        auto [minVal, minUnit] = formatTime(static_cast<double>(minTime));
        auto [maxVal, maxUnit] = formatTime(static_cast<double>(maxTime));
    
        cout << fixed << setprecision(3);
        cout << "Avg Time: " << avgVal << " " << avgUnit << " | "
             << "Min Time: " << minVal << " " << minUnit << " | "
             << "Max Time: " << maxVal << " " << maxUnit << endl;
    
        if (includePercentiles) printPercentiles();
    }
    
    cout << endl;
}

string Bench::formatWithCommas(long long value) {
    string numStr = to_string(value);
    int insertPosition = static_cast<int>(numStr.length()) - 3;
    while (insertPosition > 0) {
        numStr.insert(insertPosition, ",");
        insertPosition -= 3;
    }
    return numStr;
}

pair<double, string> Bench::formatTime(double nanos) {
    if (nanos >= 1'000'000'000.0) {
        double seconds = nanos / 1'000'000'000.0;
        return {roundToDecimals(seconds, 3), seconds > 1 ? "seconds" : "second"};
    } else if (nanos >= 1'000'000.0) {
        double millis = nanos / 1'000'000.0;
        return {roundToDecimals(millis, 3), millis > 1 ? "millis" : "milli"};
    } else if (nanos >= 1000.0) {
        double micros = nanos / 1000.0;
        return {roundToDecimals(micros, 3), micros > 1 ? "micros" : "micro"};
    } else {
        double ns = nanos;
        return {roundToDecimals(ns, 3), ns > 1 ? "nanos" : "nano"};
    }
}

double Bench::roundToDecimals(double d, int decimals) {
    double pow10 = pow(10.0, decimals);
    return round(d * pow10) / pow10;
}

void Bench::printPercentiles() const {

    if (size == 0) return;

    double percentiles[] = {0.75, 0.90, 0.99, 0.999, 0.9999, 0.99999};

    for (double p : percentiles) {
        addPercentile(p);
    }
}

string Bench::formatPercentage(double perc) {
    double p = perc * 100.0;

    ostringstream oss;
    oss << fixed << setprecision(6) << p;

    string s = oss.str();
    // remove trailing zeros
    while (s.back() == '0') {
        s.pop_back();
    }

    // if the last character is now a '.', remove it
    if (s.back() == '.') {
        s.pop_back();
    }

    // Append the '%' sign
    s += "%";

    return s;
}

void Bench::addPercentile(double perc) const {

    if (results->empty()) return;

    long long target = static_cast<long long>(llround(perc * size));
    if (target == 0) return;
    if (target > size) target = size;

    // Iterate through the map to find the top element at position target
    long long iTop = 0;
    long long sumTop = 0;
    long long maxTop = -1;

    for (auto &entry : *results) {
        long long time = entry.first;
        long long count = entry.second;

        for (int i = 0; i < count; i++) {
            iTop++;
            sumTop += time;
            if (iTop == target) {
                maxTop = time;
                goto FOUND;
            }
        }
    }

FOUND:;

    double avgTop = static_cast<double>(sumTop) / iTop;
    auto [avgVal, avgUnit] = formatTime(avgTop);
    auto [maxVal, maxUnit] = formatTime(static_cast<double>(maxTop));

    cout << fixed << setprecision(3);
    cout << formatPercentage(perc) << " = [avg: " << avgVal << " " << avgUnit
         << ", max: " << maxVal << " " << maxUnit << "]n";
}

// =====> int_map.hpp
#ifndef INT_MAP_HPP
#define INT_MAP_HPP

#include <optional>
#include <cstddef>

template <typename E>
class IntMap {

private:

    template <typename T>
    struct Entry {
        int key;
        T* value; // removed unnecessary std::optional (thanks to Paul McKenzie at https://stackoverflow.com/users/3133316/paulmckenzie)
        Entry<T>* next;
    };

    Entry<E>** data;
    int count;
    Entry<E>* head;
    const int capacity;

    Entry<E>* newEntry(int key, const E& value, Entry<E>* next) {
        Entry<E>* newEntry = head;

        if (newEntry != nullptr) {
            head = newEntry->next;
        } else {
            newEntry = new Entry<E>();
        }

        newEntry->key = key;
        newEntry->value = const_cast<E*>(&value);
        newEntry->next = next;

        return newEntry;
    }

    void free(Entry<E>* e) {
        e->value = nullptr;
        e->next = head;
        head = e;
    }

    int toArrayIndex(int key) const {
        return (key & 0x7FFFFFFF) % capacity;
    }

public:

    IntMap(int capacity)
        : capacity(capacity), count(0), head(nullptr) {
        data = new Entry<E>*[capacity];
        for (int i = 0; i < capacity; i++) {
            data[i] = nullptr;
        }
    }

    ~IntMap() {
        clear();

        while (head != nullptr) {
            Entry<E>* temp = head;
            head = head->next;
            delete temp;
        }

        delete[] data;
    }

    int size() const {
        return count;
    }

    E* get(int key) const {
        Entry<E>* e = data[toArrayIndex(key)];
        while (e != nullptr) {
            if (e->key == key) {
                return e->value;
            }
            e = e->next;
        }
        return nullptr;
    }

    E* put(int key, const E& value) {
        int index = toArrayIndex(key);
        Entry<E>* e = data[index];

        while (e != nullptr) {
            if (e->key == key) {
                E* old = e->value;
                e->value = const_cast<E*>(&value);
                return old;
            }
            e = e->next;
        }

        data[index] = newEntry(key, value, data[index]);
        count++;
        return nullptr;
    }

    E* remove(int key) {
        int index = toArrayIndex(key);
        Entry<E>* e = data[index];
        Entry<E>* prev = nullptr;

        while (e != nullptr) {
            if (e->key == key) {
                if (prev != nullptr) {
                    prev->next = e->next;
                } else {
                    data[index] = e->next;
                }

                E* oldValue = e->value;
                free(e);
                count--;
                return oldValue;
            }
            prev = e;
            e = e->next;
        }
        return nullptr;
    }

    void clear() {
        for (int index = capacity - 1; index >= 0; index--) {
            while (data[index] != nullptr) {
                Entry<E>* next = data[index]->next;
                free(data[index]);
                data[index] = next;
            }
        }
        count = 0;
    }
};

#endif // INT_MAP_HPP

// =====> int_map.cpp
#include "int_map.hpp"

// NOOP

// =====> int_map_benchmark.cpp
#include "int_map.hpp"
#include "bench.hpp"

using namespace std;

struct Dummy {};

int main(int argc, char* argv[]) {

    int warmupCount = 1000000;
    int measureCount = 1000000;
    int capacity = 100000;

    if (argc > 1) {
        warmupCount = atoi(argv[1]);
    }
    if (argc > 2) {
        measureCount = atoi(argv[2]);
    }
    if (argc > 3) {
        capacity = atoi(argv[3]);
    }

    cout << "nArguments: warmup=" << warmupCount << " measurements=" << measureCount << " mapCapacity=" << capacity << endl << endl;

    int iterations = warmupCount + measureCount;

    IntMap<Dummy>* map = new IntMap<Dummy>(capacity);
    Bench bench(warmupCount);
    Dummy dummy;

    cout << "Benchmarking put on empty map... (1) => creating new Entry objects" << endl;
    for (int i = 0; i < iterations; i++) {
        bench.mark();
        map->put(i, dummy);
        bench.measure();
    }
    bench.printResults();
    
    cout << "Benchmarking put after clear()... (2) => hitting the pool of Entry objects" << endl;
    map->clear(); // clear the map, but the entry pool will be there
    bench.reset(true);
    for (int i = 0; i < iterations; i++) {
        bench.mark();
        map->put(i, dummy);
        bench.measure();
    }
    bench.printResults();

    cout << "Benchmarking get..." << endl;
    bench.reset(true);
    for (int i = 0; i < iterations; i++) {
        bench.mark();
        volatile auto val = map->get(i); // volatile to prevent optimization out
        bench.measure();
    }
    bench.printResults();

    cout << "Benchmarking remove..." << endl;
    bench.reset(true);
    for (int i = 0; i < iterations; i++) {
        bench.mark();
        map->remove(i);
        bench.measure();
    }
    bench.printResults();

    delete map; 
    return 0;
}

Tags: benchmarking c#java jit

Answers

- viraltaco_
- December 14, 2024 at 9:23 pm
- 0 votes
0
You’re using dynamic (heap) allocation (with new) way too much.
I made a very simple change to your bench.hpp, and bench.cpp files:
I changed the type of the results variable from a heap allocated pointer to the value type (see this commit).
I’ll let you run the benchmarks yourself.

Login or Signup to reply.

- Hack06
- December 15, 2024 at 12:15 am
- 0 votes
0
First of all, your code/project is way too big in my opinion for someone to retry and give a concise answer to your several questions, which may amount to redoing it for you.

Regarding to your questions, it’s not quite clear what’s your issue, but I suppose you’re concerned more about the "average" times that you are observing while running your Java and C++ implementations. In this regards, I would like to draw your attention to the "max" value of your first "PUT", which is way too big for C++, and it requires further and thorough analysis.

Concerning to Java and C++ compilers, to my knowledge Java is much more optimized for performance than any other C++ compiler can do out-of-the box. That being said, it doesn’t mean that C++ can’t beat Java in terms of performance. It certainly can, but it also requires some more care from the developers, for example by adding couple of more const keywords or applying some other syntactic formalities.

Generally, to optimize any software for performance, one shall first understand what are the bottlenecks and which parts are the most time-consuming. Let alone the benchmarking techniques and language-specifics, you should not compare apples with oranges, which you are doing here in my opinion by trying to benchmark map-implementations of Java and C++ in a way which is not just comparing their implementations to be honest but also lot of things around each of those, like for example memory allocation/de-allocation and compiler optimization.

What I want to highlight with this my answer is that you need a considerable analysis to be done in order to find the real reason behind your observation. It may be a simple syntactic sugar missing, or it may be complex memory-management, or maybe your benchmarks are skipping the garbage-collection part from Java? What about the process priorities that are being launched in each of those cases? Are you testing both on the same machine? Finally, as it was already pointed out by others in the comments, custom benchmarking is not so reliable.

Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.

Ubuntu – C++ implementation of a simple map slower than equivalent implementation in Java: Code/Benchmark Issue?

Answers