skip to Main Content

Description

I created a sample to study the TLB access/misses statistics. Sample writes 1 to every 4096-th element of the array. Array has 10'000 * 4096 bytes. I expect to see 10'000 TLB stores only, but generated assembly loads beginning of the array every iteration, resulting in 10'000 TLB loads in addition to stores. -O3 optimization is applied

When I looked into the assembly, I noticed that the for-loop looks like that:

  1. move beginning of the array to the register
  2. set to 1 a shifted beginning of the array
  3. increase index
  4. jump to step 1

Question: Why step 1 is executed every single iteration? Beginning of the array is not changing. I expect the beginning to be loaded once and the jump to be to step 2

C code

(main just calls this test_function 10K times):

#include <stdlib.h>

#define REPEAT 10000
#define PAGESIZE 4096
#define PAGES 10000

char *data = NULL;

inline void test_function()
{
    for (int i = 0; (i < PAGES * PAGESIZE); i += (PAGESIZE)) {
        data[i] = 1;
    }
}


int main()
{
    data = (char *) malloc(PAGES * PAGESIZE);
    for (int idx = 0; idx < REPEAT; idx++)
        test_function();
    free(data);
    return 0;
}

Generated assembly with gcc and -O3

    1090:       mov    rdx,QWORD PTR [rip+0x2f81]        # 4018 <data>
    1097:       mov    BYTE PTR [rdx+rax*1],0x1
    109b:       add    rax,0x1000
    10a1:       cmp    rax,0x2710000
    10a7:       jne    1090 <main+0x30>

perf stats for 100’000 repetitions

Per function call we can see:

  • 10K L1 cache loads, 10K L1 cache stores
  • 10K TLB loads, 10K TLB stores
  • ~0 TLB load misses, 10K TLB store misses
    So load of the array’s beginning is always cached in TLB, but it’s still accessed. Why?
1002500646      L1-dcache-load:u          (66.50%)
 998346580      L1-dcache-stores:u        (66.68%)
 995750153      dTLB-loads:u              (66.81%)
 998020884      dTLB-stores:u             (66.74%)
     34586      dTLB-loads-misses:u       (66.68%)
1003649391      dTLB-stores-misses:u      (66.58%)

Platform

Intel(R) Core(TM) i7-10610U CPU

Ubuntu 22.04.3 LTS

g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

Compilation line: g++ ./tlb.cpp -O3 -g -o gcc.out

2

Answers


  1. The code does not know what data holds when the code runs. Therefore, it is conceivable that data[i] overlays data:

    data = (char *)&data;
    

    when this happens, assigning to data[i] could modify data itself. Thus the compiler cannot simply assume they do not alias and must reload data after each access. Granted, with some more code it could fix this, but the compiler writers have not implemented that.

    To avoid this problem, make sure to cache global variables in local variables if this sort of problem could occur:

    inline void test_function()
    {
        char *d;
    
        d  = data;
        for (int i = 0; (i < PAGES * PAGESIZE); i += (PAGESIZE)) {
            d[i] = 1;
        }
    }
    
    Login or Signup to reply.
  2. The problem you faced conceptually is very similar to aliasing, but it actually isn’t.

    Firstly, let me separate the two:

    • What actually happens in your case: when compiling this translation unit (.c/.cpp) and seeing a global and non-static variable, the compiler doesn’t have a guarantee that it is not write-accessed outside of this translation unit (since non-static globals have external linkage).

    • What is the ‘aliasing’?: it is a situation (real or assumed by compiler) when the same memory can be accessed through different symbolic names (or different pointers in case of ‘the pointer aliasing’). Here is an example:

      
          void vecadd(int* out, int* in, int len) {
              for (int i = 0; i < len; i++) {
                  out[i] += in[i];
              }
          }
      
      

      If you define this function on one translation unit and call it in another translation unit, then the compiler will not have guarantees that out and in do not overlap: vecadd(in, in + 1, len - 1);. So it have to load in[i] every time because of this. It will also break vectorization.

    • The difference is that in your case it is one symbolic name, not multiple. However two cases are very similar and ‘treated’ in almost the same way.

    Secondly, how can we resolve it?

    • For your case:

      1. Solution that was mentioned: make a local copy of a variable. It will work in your case, since compiler will know that accessed address will not be changed.

      2. Solution applicable to global variables: use static to make data have internal linkage. In this case compiler will know all the code sites where data is accessed.

      3. Overkill solution: if you use Link-time optimization (-flto flag for GCC), compiler will be able to see that the global non-static variable is not accessed outside of the translation unit where it is defined and can optimize out reload on each iteration. Note that for big projects it will greatly increase compilation/linkage time.

      NB: In all these cases the compiler will likely optimize out all your repeats completely.

    • For the pointer aliasing in general

      1. The best-practice solution: restrict keyword for C or __restrict compiler builtin for C++.

        So, for your case becomes

            char * restrict data = NULL;
        
                  66,300      L1-dcache-load                    (66.34%)
              99,924,283      L1-dcache-stores                  (66.34%)
                  75,526      dTLB-loads                        (66.64%)
              99,427,300      dTLB-stores                       (67.26%)
                   8,887      dTLB-load-misses                  (67.03%)
              98,743,055      dTLB-stores-misses                (66.40%)
        
            # gcc ./tlb.c -O1 -g -o tlb.bin
            117e:       48 89 05 93 2e 00 00    mov    QWORD PTR [rip+0x2e93],rax
            1185:       b9 10 27 00 00          mov    ecx,0x2710
            118a:       48 8d 90 00 00 71 02    lea    rdx,[rax+0x2710000]
            1191:       48 89 f8                mov    rax,rdi
            1194:       c6 00 01                mov    BYTE PTR [rax],0x1 <---
            1197:       48 05 00 10 00 00       add    rax,0x1000             |
            119d:       48 39 d0                cmp    rax,rdx                |
            11a0:       75 f2                   jne    1194 <main+0x2b>   ----
        

        For vecadd case

        
            void vecadd(int* restrict out, int* restrict in, int len) {
                for (int i = 0; i < len; i++) {
                    out[i] += in[i];
                }
            }
        
        

        For more info on rectrict: https://en.cppreference.com/w/c/language/restrict

      2. Overkill solution: use -fstrict-aliasing, it will make GCC assume that you do not have pointer aliasing in a specified translation unit. Will not work in your case.

    I hope this helps!

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search