Description
I created a sample to study the TLB access/misses statistics. Sample writes 1
to every 4096-th element of the array. Array has 10'000 * 4096
bytes. I expect to see 10'000
TLB stores only, but generated assembly loads beginning of the array every iteration, resulting in 10'000
TLB loads in addition to stores. -O3
optimization is applied
When I looked into the assembly, I noticed that the for-loop looks like that:
- move beginning of the array to the register
- set to 1 a shifted beginning of the array
- increase index
- jump to step 1
Question: Why step 1 is executed every single iteration? Beginning of the array is not changing. I expect the beginning to be loaded once and the jump to be to step 2
C code
(main just calls this test_function
10K times):
#include <stdlib.h>
#define REPEAT 10000
#define PAGESIZE 4096
#define PAGES 10000
char *data = NULL;
inline void test_function()
{
for (int i = 0; (i < PAGES * PAGESIZE); i += (PAGESIZE)) {
data[i] = 1;
}
}
int main()
{
data = (char *) malloc(PAGES * PAGESIZE);
for (int idx = 0; idx < REPEAT; idx++)
test_function();
free(data);
return 0;
}
Generated assembly with gcc and -O3
1090: mov rdx,QWORD PTR [rip+0x2f81] # 4018 <data>
1097: mov BYTE PTR [rdx+rax*1],0x1
109b: add rax,0x1000
10a1: cmp rax,0x2710000
10a7: jne 1090 <main+0x30>
perf stats for 100’000 repetitions
Per function call we can see:
- 10K L1 cache loads, 10K L1 cache stores
- 10K TLB loads, 10K TLB stores
- ~0 TLB load misses, 10K TLB store misses
So load of the array’s beginning is always cached in TLB, but it’s still accessed. Why?
1002500646 L1-dcache-load:u (66.50%)
998346580 L1-dcache-stores:u (66.68%)
995750153 dTLB-loads:u (66.81%)
998020884 dTLB-stores:u (66.74%)
34586 dTLB-loads-misses:u (66.68%)
1003649391 dTLB-stores-misses:u (66.58%)
Platform
Intel(R) Core(TM) i7-10610U CPU
Ubuntu 22.04.3 LTS
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Compilation line: g++ ./tlb.cpp -O3 -g -o gcc.out
2
Answers
The code does not know what
data
holds when the code runs. Therefore, it is conceivable thatdata[i]
overlaysdata
:when this happens, assigning to
data[i]
could modifydata
itself. Thus the compiler cannot simply assume they do not alias and must reloaddata
after each access. Granted, with some more code it could fix this, but the compiler writers have not implemented that.To avoid this problem, make sure to cache global variables in local variables if this sort of problem could occur:
The problem you faced conceptually is very similar to aliasing, but it actually isn’t.
Firstly, let me separate the two:
What actually happens in your case: when compiling this translation unit (.c/.cpp) and seeing a global and non-static variable, the compiler doesn’t have a guarantee that it is not write-accessed outside of this translation unit (since non-static globals have external linkage).
What is the ‘aliasing’?: it is a situation (real or assumed by compiler) when the same memory can be accessed through different symbolic names (or different pointers in case of ‘the pointer aliasing’). Here is an example:
If you define this function on one translation unit and call it in another translation unit, then the compiler will not have guarantees that
out
andin
do not overlap:vecadd(in, in + 1, len - 1);
. So it have to loadin[i]
every time because of this. It will also break vectorization.The difference is that in your case it is one symbolic name, not multiple. However two cases are very similar and ‘treated’ in almost the same way.
Secondly, how can we resolve it?
For your case:
Solution that was mentioned: make a local copy of a variable. It will work in your case, since compiler will know that accessed address will not be changed.
Solution applicable to global variables: use
static
to makedata
have internal linkage. In this case compiler will know all the code sites wheredata
is accessed.Overkill solution: if you use Link-time optimization (
-flto
flag for GCC), compiler will be able to see that the global non-static variable is not accessed outside of the translation unit where it is defined and can optimize out reload on each iteration. Note that for big projects it will greatly increase compilation/linkage time.NB: In all these cases the compiler will likely optimize out all your repeats completely.
For the pointer aliasing in general
The best-practice solution:
restrict
keyword for C or__restrict
compiler builtin for C++.So, for your case becomes
For
vecadd
caseFor more info on
rectrict
: https://en.cppreference.com/w/c/language/restrictOverkill solution: use
-fstrict-aliasing
, it will make GCC assume that you do not have pointer aliasing in a specified translation unit. Will not work in your case.I hope this helps!