I hear from colleagues that C++ is faster than Java and that, when looking for top performance, especially for finance applications, it's the route to go. But my observations differ a bit. Can anyone point out flaws in my experiment, or add some scientific rigor to the discussion?
Note1: I am using -O3 (maximum optimization) and -O2 with the C++ compiler.
Note2: The short and simple complete source codes for each language are included. Feel free to run on your own machine, make changes, draw conclusions and share.
Note3: If you put both source codes side by side in an editor, you will see that their implementations are equivalent.
UPDATE: I've tried clang++ and g++ with a variety of optimization options (-O2, -O3, -Os, -march=native, etc.) and they have all produced slower results than Java. I think at this point, to make C++ faster, I would have to dive into the generated assembly code and do some assembly programming. I'm wondering how practical that approach (assembly programming and assembly debugging) is when coding a large real-life application.
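For reference, comparing what the compilers actually emit doesn't require writing any assembly by hand first; g++ can dump annotated assembly per optimization level:

$ g++ -S -O2 -fverbose-asm -std=c++11 TimeBubbleSort.cpp -o TimeBubbleSort-O2.s
$ g++ -S -O3 -fverbose-asm -std=c++11 TimeBubbleSort.cpp -o TimeBubbleSort-O3.s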
What does the benchmark do?
- Create an int array on the heap (not on the stack)
- Start the clock
- Populate the array
- Sort the array with bubble sort
- Stop the clock
Do that 10 million times, discard the first 1 million as warm-up, and output the average, min and max times.
For C++ I get: (with -O3 and -O2)
$ g++ --version
g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
$ g++ TimeBubbleSort.cpp -o TimeBubbleSort -std=c++11 -O3
$ ./TimeBubbleSort 10000000 1000000 60
Value computed: 18300000000
Iterations: 9000000 | Avg Time: 1202 | Min Time: 1158 | Max Time: 212189
$ g++ TimeBubbleSort.cpp -o TimeBubbleSort -std=c++11 -O2
$ ./TimeBubbleSort 10000000 1000000 60
Value computed: 18300000000
Iterations: 9000000 | Avg Time: 1337 | Min Time: 1307 | Max Time: 36650
For Java I get:
$ java -version
java version "17.0.1" 2021-10-19 LTS
Java(TM) SE Runtime Environment (build 17.0.1+12-LTS-39)
Java HotSpot(TM) 64-Bit Server VM (build 17.0.1+12-LTS-39, mixed mode, sharing)
$ javac -cp . TimeBubbleSort.java
$ java -cp . TimeBubbleSort 10000000 1000000 60
Value computed: 18300000000
Iterations: 9000000 | Avg Time: 837.0 | Min Time: 812 | Max Time: 37196
Full C++ code:
#include <iostream>
#include <limits>
#include <sstream>
#include <algorithm> // std::min, std::max
#include <cstdlib>   // malloc, free
#include <ctime>     // clock_gettime, timespec
using namespace std;
// TO COMPILE: g++ TimeBubbleSort.cpp -o TimeBubbleSort -std=c++11 -O3
// TO EXECUTE: ./TimeBubbleSort 10000000 1000000 60
long get_nano_ts(timespec* ts) {
clock_gettime(CLOCK_MONOTONIC, ts);
return ts->tv_sec * 1000000000 + ts->tv_nsec;
}
struct mi {
long value;
};
void swapping(int &a, int &b) {
int temp;
temp = a;
a = b;
b = temp;
}
void bubbleSort(int *array, int size) {
for(int i = 0; i < size; i++) {
bool swaps = false;
for(int j = 0; j < size - i - 1; j++) {
if(array[j] > array[j+1]) {
swapping(array[j], array[j+1]);
swaps = true;
}
}
if (!swaps) break;
}
}
void doSomething(int *array, int size) {
for(int z = 0; z < size; z++) {
array[z] = size - z;
}
bubbleSort(array, size);
}
int main(int argc, char* argv[]) {
int iterations = stoi(argv[1]);
int warmup = stoi(argv[2]);
int arraySize = stoi(argv[3]);
struct timespec ts;
long long x = 0;
long long totalTime = 0;
int minTime = numeric_limits<int>::max();
int maxTime = numeric_limits<int>::min();
int * array = (int*) malloc(arraySize * sizeof(int));
for(int i = 0; i < iterations; i++) {
long start = get_nano_ts(&ts);
doSomething(array, arraySize);
long end = get_nano_ts(&ts);
for(int j = 0; j < arraySize; j++) {
x += array[j];
}
int res = end - start;
if (res <= 0) res = 1;
if (i >= warmup) {
totalTime += res;
minTime = min(minTime, res);
maxTime = max(maxTime, res);
}
}
int count = iterations - warmup;
double avg = (double) totalTime / count;
cout << "Value computed: " << x << endl;
stringstream ss;
ss << "Iterations: " << count << " | Avg Time: " << avg;
if (count > 0) {
ss << " | Min Time: " << minTime << " | Max Time: " << maxTime;
}
cout << ss.str() << endl << endl;
free(array);
return 0;
}
Full Java code:
public class TimeBubbleSort {
// javac -cp . TimeBubbleSort.java
// java -cp . TimeBubbleSort 10000000 1000000 60
private static void swapping(int[] array, int x, int y) {
int temp = array[x];
array[x] = array[y];
array[y] = temp;
}
private static void bubbleSort(int[] array, int size) {
for(int i = 0; i < size; i++) {
int swaps = 0; // flag to detect whether any swap happened in this pass
for(int j = 0; j < size - i - 1; j++) {
if (array[j] > array[j + 1]) { // when the current item is bigger than next
swapping(array, j, j + 1);
swaps = 1;
}
}
if (swaps == 0) break; // No swap in this pass, so array is sorted
}
}
private final static void doSomething(int[] array, int size) {
for(int z = 0; z < size; z++) {
array[z] = size - z;
}
bubbleSort(array, size);
}
public static void main(String[] args) {
int iterations = Integer.parseInt(args[0]);
int warmup = Integer.parseInt(args[1]);
int arraySize = Integer.parseInt(args[2]);
long x = 0;
long totalTime = 0;
long minTime = Long.MAX_VALUE;
long maxTime = Long.MIN_VALUE;
int[] array = new int[arraySize];
for(int i = 0; i < iterations; i++) {
long start = System.nanoTime();
doSomething(array, arraySize);
long end = System.nanoTime();
for(int j = 0; j < arraySize; j++) {
x += array[j];
}
int res = (int) (end - start);
if (res <= 0) res = 1;
if (i >= warmup) {
totalTime += res;
minTime = Math.min(minTime, res);
maxTime = Math.max(maxTime, res);
}
}
int count = iterations - warmup;
double avg = (double) totalTime / count;
StringBuilder sb = new StringBuilder();
sb.append("Value computed: ").append(x).append("n");
sb.append("Iterations: ").append(count).append(" | Avg Time: ").append(avg);
if (count > 0) {
sb.append(" | Min Time: ").append(minTime).append(" | Max Time: ").append(maxTime);
}
System.out.println(sb.toString() + "\n");
}
}
2 Answers
In both cases the array is filled with numbers in descending order and then bubble sorted, so the code should behave the same in both languages. But the way you allocate the array is different.

In C++ you malloc. That just causes the kernel to record that you have requested some memory, but doesn't map any physical memory to the addresses. So once the clock starts and you start initializing the array, every page causes a page fault, and the kernel then maps a physical page at the right address.

In Java, though, you allocate the array and it is initialized to 0. This causes all the physical pages to be mapped, and the memory is now also in the CPU cache. So when you start the clock, the initialization of the array is much faster.
But I guess that is what the warmup should take care of.
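One way to take the allocation difference out of the measurement is to touch the whole buffer once before the timed loop, so the page faults happen up front. A minimal sketch against the question's C++ code (a calloc would achieve the same thing):

#include <cstdlib>
#include <cstring>

int* allocate_prefaulted(int arraySize) {
    int* array = (int*) malloc(arraySize * sizeof(int));
    // Write every byte once so the kernel maps physical pages now,
    // before the clock starts, instead of on first use inside the loop.
    memset(array, 0, arraySize * sizeof(int));
    return array;
}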
That said, your test method is flawed. The C++ compiler could optimize the whole loop away with the exception of the get_nano_ts() calls, leaving nothing but back-to-back clock reads (see the sketch below). Your result would then be very close to:

minTime = 1; maxTime = 1; totalTime = iterations - warmup;
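To make that failure mode concrete, here is a hedged sketch of what main()'s loop collapses to if the optimizer eliminates the fill and the sort entirely; this is an illustration, not a claim about what this particular GCC version emits (the printed sum x does keep some of the work alive):

// Hypothetical: the timing loop with all the measured work optimized out.
for (int i = 0; i < iterations; i++) {
    long start = get_nano_ts(&ts);
    long end = get_nano_ts(&ts); // nothing left between the clock reads
    int res = end - start;       // usually 0, clamped to 1 below
    if (res <= 0) res = 1;
    if (i >= warmup) {
        totalTime += res;
        minTime = min(minTime, res);
        maxTime = max(maxTime, res);
    }
}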
Why do you count a time of 0 as 1? If the sorting doesn't even take a nanosecond, you should abort, because your test case is far too small for the accuracy of your clock.
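On POSIX systems you can query the clock's granularity directly rather than guessing; a minimal sketch (the reported resolution is system-dependent):

#include <cstdio>
#include <ctime>

int main() {
    timespec res;
    // Ask for the resolution of the same clock the benchmark uses.
    clock_getres(CLOCK_MONOTONIC, &res);
    printf("CLOCK_MONOTONIC resolution: %ld ns\n", (long) res.tv_nsec);
    return 0;
}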
Let's discuss the results you measured a bit:

You sort the exact same array with exactly the same numbers 9,000,000 times. So the algorithm should behave the same every time, and on its own every single run should take the exact same time. And yet the times you measure differ by more than two orders of magnitude: the sort took 200 times longer in some cases than in others (40 times for Java).
Let's see what happens when I repeat the test multiple times.

Just doing multiple runs shows the max time changing by 50%. At least the min time and avg time are relatively stable. So it seems that, rarely, the OS will interrupt the process and shuffle it to a different CPU core, causing all the caches to be lost, and then execution time suffers.
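One way to test that theory is to pin the benchmark to a single core, either with taskset -c 0 ./TimeBubbleSort from the shell, or in code. A sketch using the Linux-specific pthread_setaffinity_np (compile with -pthread; g++ defines _GNU_SOURCE by default, other compilers may need it defined before the includes):

#include <pthread.h>
#include <sched.h>

// Pin the calling thread to core 0 so the scheduler cannot migrate it
// mid-benchmark and throw away the warmed-up cache contents.
void pin_to_core_0() {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}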
Let's play with the compiler flags a bit:

Hey, optimizing less makes it 5 times faster. Note that originally Java was just barely faster than C++. Surely the C++ code must now beat the hell out of Java.
Let's go even further:

Optimizing for size barely makes a difference in speed, if at all. I would say it's below the level of noise; it might just be an artefact of different cache alignment of the code or something.

Or let's try clang++:
Reading back over my answer, I totally missed pointing out that with gcc, -O3 code is frequently slower than -O2. For the most part, the reason a lot of optimizer options are in -O3 is that they are not generally faster; otherwise they would be in -O2 (excluding any experimental optimization that isn't considered stable enough yet). Don't use -O3 unless you have tested that it is actually beneficial, and then be very selective about which parts of the code you compile with -O3.

Looking at the clang++ output makes me rethink this, though. Different compiler, different optimizer, different behavior for -Os / -O2 / -O3.
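Being selective about -O3 doesn't require anything exotic: compile the translation unit that regresses at -O2 and the rest at -O3, then link them together (the file names here are hypothetical):

$ g++ -c -O2 -std=c++11 hot_sort.cpp    # the unit where -O3 hurts
$ g++ -c -O3 -std=c++11 main.cpp        # everything else
$ g++ hot_sort.o main.o -o TimeBubbleSort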
Now the real work begins: what code do the compilers generate that makes such a difference? 5 times slower for "gcc -O3" and twice as fast for "clang++ -O3".
For my GCC11, the answer is Bubble sort slower with -O3 than -O2 with GCC. The -O3 slowdown here is a pretty specific anti-optimization that would often help, or at least not hurt much, but here it hurts a lot in a bubble sort that doesn't keep array[j+1] around in a temporary to be the next iteration's array[j], instead reloading it from memory as part of a pair of loads that it does with one wide load, creating a store-forwarding stall.

Your GCC version doesn't have that problem, though; only GCC11 and newer do. So don't expect a big speedup: your GCC7 -O3 should already be making asm with no major problems, except for possible things like How can I mitigate the impact of the Intel jcc erratum on gcc? if you're using a Skylake CPU.

(Store and reload of both elements will still create a loop-carried dependency when bubbling an element from one end of the array to the other, though, worse than just updating a register for the next iteration.)
Whatever clang is doing is better than GCC’s best version, though, so you can probably get a big speedup with that.
Given how good the JIT is at spotting and firing dead-code elimination, I strongly suggest rewriting both benchmarks using proper benchmark harness tools, e.g. https://github.com/openjdk/jmh for the Java side.
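On the C++ side, the analogous harness would be something like Google Benchmark (https://github.com/google/benchmark). A sketch assuming that library and the question's bubbleSort, with DoNotOptimize used to defeat exactly the dead-code elimination discussed above:

#include <benchmark/benchmark.h>
#include <vector>

void bubbleSort(int *array, int size); // the question's sort

static void BM_BubbleSort(benchmark::State& state) {
    const int n = (int) state.range(0);
    std::vector<int> array(n);
    for (auto _ : state) {
        for (int z = 0; z < n; z++) array[z] = n - z; // descending fill
        bubbleSort(array.data(), n);
        benchmark::DoNotOptimize(array.data()); // keep the result observable
        benchmark::ClobberMemory();             // force the stores to happen
    }
}
BENCHMARK(BM_BubbleSort)->Arg(60);
BENCHMARK_MAIN();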