When trying to disassemble every instruction of a program as this SO answer suggests, I found out that gdb
will never finish processing after stepping instruction into gettimeofday
. Here is a minimal reproducible example:
In main.c
:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
int main()
{
struct timeval tv;
gettimeofday(&tv, NULL);
return 0;
}
Compile with gcc main.c
To run: gdb -silent a.out
Reading symbols from a.out...
(gdb) set height 0
(gdb) b main
Breakpoint 1 at 0x1175: file main.c, line 7.
(gdb) set logging file ./log.txt
(gdb) set logging redirect on
(gdb) set logging on
Redirecting output to ./log.txt.
Copying debug output to ./log.txt.
(gdb) r
(gdb) while 1
>si
>end
And log.txt
shows:
Starting program: /home/user/tst-gdb-step/a.out
Breakpoint 1, main () at main.c:7
7 {
0x000055555555517e 7 {
0x0000555555555182 7 {
10 gettimeofday(&tv, NULL);
0x0000555555555188 10 gettimeofday(&tv, NULL);
0x000055555555518d 10 gettimeofday(&tv, NULL);
0x0000555555555190 10 gettimeofday(&tv, NULL);
0x0000555555555070 in gettimeofday@plt ()
0x0000555555555074 in gettimeofday@plt ()
0x00007ffff7fcd690 in gettimeofday ()
0x00007ffff7fcd691 in gettimeofday ()
0x00007ffff7fcd698 in gettimeofday ()
0x00007ffff7fcd69b in gettimeofday ()
0x00007ffff7fcd69d in gettimeofday ()
0x00007ffff7fcd69e in gettimeofday ()
0x00007ffff7fcd6a1 in gettimeofday ()
0x00007ffff7fcd6a7 in gettimeofday ()
0x00007ffff7fcd6aa in gettimeofday ()
0x00007ffff7fcd6ae in gettimeofday ()
0x00007ffff7fcd6b4 in gettimeofday ()
0x00007ffff7fcd6ba in gettimeofday ()
0x00007ffff7fcd6bd in gettimeofday ()
0x00007ffff7fcd6c3 in gettimeofday ()
0x00007ffff7fcd6c6 in gettimeofday ()
0x00007ffff7fcd6c8 in gettimeofday ()
0x00007ffff7fcd6cc in gettimeofday ()
0x00007ffff7fcd6cf in gettimeofday ()
0x00007ffff7fcd6d2 in gettimeofday ()
0x00007ffff7fcd6d8 in gettimeofday ()
0x00007ffff7fcd6df in gettimeofday ()
0x00007ffff7fcd6e6 in gettimeofday ()
0x00007ffff7fcd6ed in gettimeofday ()
0x00007ffff7fcd6f0 in gettimeofday ()
0x00007ffff7fcd6f2 in gettimeofday ()
0x00007ffff7fcd6f5 in gettimeofday ()
0x00007ffff7fcd6f9 in gettimeofday ()
0x00007ffff7fcd6fc in gettimeofday ()
0x00007ffff7fcd702 in gettimeofday ()
0x00007ffff7fcd709 in gettimeofday ()
0x00007ffff7fcd70c in gettimeofday ()
0x00007ffff7fcd70f in gettimeofday ()
0x00007ffff7fcd6a7 in gettimeofday ()
0x00007ffff7fcd6aa in gettimeofday ()
0x00007ffff7fcd6ae in gettimeofday ()
0x00007ffff7fcd6b4 in gettimeofday ()
0x00007ffff7fcd6ba in gettimeofday ()
0x00007ffff7fcd6bd in gettimeofday ()
0x00007ffff7fcd6c3 in gettimeofday ()
0x00007ffff7fcd6c6 in gettimeofday ()
0x00007ffff7fcd6c8 in gettimeofday ()
0x00007ffff7fcd6cc in gettimeofday ()
0x00007ffff7fcd6cf in gettimeofday ()
0x00007ffff7fcd6d2 in gettimeofday ()
0x00007ffff7fcd6d8 in gettimeofday ()
0x00007ffff7fcd6df in gettimeofday ()
...
After 3 whole minutes of exeuction, gdb
writes nearly 500,000 lines to log.txt
(and without any sign to stop), which is obviously not the normal case, since vdso
is meant to be fast. The log also shows an infinite loop.
But if n
instead of si
is used, the program can exit without any problem.
Tools I’m using:
$ uname -a
Linux 5.15.0-69-generic #76~20.04.1-Ubuntu SMP Mon Mar 20 15:54:19 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
$ gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
$ gdb --version
GNU gdb (Ubuntu 10.2-0ubuntu1~20.04~1) 10.2
I’m wondering why does this happen. Thanks!
2
Answers
Here is the gdb script I use to bypass this problem:
This does not work correctly for all cases (e.g., branching on time-related events) and is obviously slow, but it does the job for me.
This is a guess based on source diving: I believe single stepping through the guts of
gettimeofday
is slowing those guts down so much that they get stuck in an endless loop trying and failing to meet their internal accuracy requirement for the result.The code for
gettimeofday
on x86-64/Linux is inlinux/lib/vdso/gettimeofday.c
. The "guts" I’m talking about are thedo_hres
function. (I’ve linked to kernel 5.15.0 because that’s what you have, but this file doesn’t change often.) Notice the loop in this function. Inside the loop, it’s calculating the time as, essentially, (coarse counter maintained by kernel) + (hardware high-precision clock that overflows often). Every time the kernel updates the coarse counter, it also updates the "sequence number" being read in the loop condition. The designer expected this loop to cycle at most twice — the point is that if you enter the loop at just the wrong moment and read inconsistent data about the coarse counter, thus computing a nonsense time, you’ll retry and get it right the second time around.But stepping through this code instruction by instruction, even in an automated way, makes it so slow that the kernel always changes the coarse counter and the sequence number while the loop body is executing, and so you get stuck in the loop.
(Do not underestimate just how slow it is to step through machine code instruction by instruction, compared to normal execution. You’re adding the time cost of two context switches and several commands in GDB’s not particularly speedy scripting language to every instruction. I wouldn’t be surprised if the context switches and attendant cache and TLB thrashing were enough to trigger this livelock all by themselves.)