I discovered that on my Ubuntu 22 server, writing to a file often induces a delay of around 20 ms, even when writing only a few bytes.
Here is some basic code that demonstrates the problem:
#include <iostream>
#include <fstream>
#include <chrono>
#include <filesystem>
#include <thread>
#include <cstdio>   // FILE, fopen, fputs, fclose
using namespace std;

void go_c() {
    FILE *out = fopen("hello.txt", "w");
    fputs("hello", out);
    fclose(out);
}

void go_cpp() {
    ofstream out("hello.txt");
    out << "hello" << endl;
}

double test(void (*f)()) {
    typedef chrono::time_point<chrono::steady_clock> tp;
    tp t0 = chrono::steady_clock::now();
    f();
    tp t1 = chrono::steady_clock::now();
    return chrono::duration<double>(t1 - t0).count() * 1000; // milliseconds
}

void bench(void (*f)(), const char *txt, int delay_ms) {
    filesystem::remove("hello.txt");
    for (int i = 0; i < 5; i++) {
        double t = test(f);
        cerr << i << ": " << txt << ", time = " << t << " ms" << endl;
        this_thread::sleep_for(std::chrono::milliseconds(delay_ms));
    }
    cerr << endl;
}

int main() {
    bench(go_c, "C Write", 0);
    bench(go_cpp, "C++ Write", 0);
    bench(go_c, "C Write with delay", 2500);
    bench(go_cpp, "C++ Write with delay", 2500);
    return 0;
}
And here is the output:
ubuntu@captain:~/scratch$ g++ -o write3 write3.cpp -O2 -Wall
ubuntu@captain:~/scratch$ ./write3
0: C Write, time = 0.09978 ms
1: C Write, time = 21.9316 ms
2: C Write, time = 0.185957 ms
3: C Write, time = 0.140212 ms
4: C Write, time = 0.139051 ms
0: C++ Write, time = 0.145766 ms
1: C++ Write, time = 0.091845 ms
2: C++ Write, time = 0.139618 ms
3: C++ Write, time = 0.130834 ms
4: C++ Write, time = 0.132217 ms
0: C Write with delay, time = 0.048674 ms
1: C Write with delay, time = 0.23875 ms
2: C Write with delay, time = 20.8626 ms
3: C Write with delay, time = 8.4307 ms
4: C Write with delay, time = 19.4026 ms
0: C++ Write with delay, time = 17.1555 ms
1: C++ Write with delay, time = 17.5887 ms
2: C++ Write with delay, time = 18.9792 ms
3: C++ Write with delay, time = 25.8653 ms
4: C++ Write with delay, time = 20.7998 ms
It seems more likely to happen if there is a bit of a delay between attempts, and also more likely to happen if the file already exists.
(It seems I can improve my server's performance by deleting a file before I write to it, as in the sketch below. This seems illogical.)
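To make the workaround concrete, here is a minimal sketch of the delete-before-write pattern (write_fresh is a hypothetical helper name, not part of my benchmark):

#include <cstdio>
#include <filesystem>

// Hypothetical helper: remove the file first so that fopen() creates a
// fresh file instead of truncating an existing one; in my tests, writes
// to a pre-existing file were the ones hitting the ~20 ms stall.
void write_fresh(const char *path, const char *text) {
    std::filesystem::remove(path);  // returns false if the file did not exist
    if (FILE *out = std::fopen(path, "w")) {
        std::fputs(text, out);
        std::fclose(out);
    }
}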
The hardware is a 6-core Xeon E-2386G with dual NVMe SSDs in software RAID, running almost nothing else and with no other users.
See the attached screenshot showing the results of uname -a and dpkg --list | grep -E "libc6?-(dev|bin)".
3 Answers
Your issue with the C++ implementation is likely due to std::endl being not only a "\n" but also a std::flush, which forces the OS to write to the peripheral. However, I could not reproduce your issue on my machine, so it may not be code related. See the sketch below.
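For instance, here is a minimal sketch of the go_cpp function without the flush (go_cpp_noflush is just an illustrative name):

#include <fstream>

// Variant of go_cpp() without std::endl: '\n' appends a newline but does
// not flush, so the write stays buffered until the ofstream is destroyed
// at the end of the function (one implicit flush instead of an explicit one).
void go_cpp_noflush() {
    std::ofstream out("hello.txt");
    out << "hello" << '\n';
}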
It should be easy to confirm that the delay is in the kernel with strace. If so, you can turn to tools like perf top and sysdig to identify the cause.

This answer is meant to provide information about the observed behavior and to inspect what is going on.
TL;DR: the problem clearly comes from the active power state of the NVMe device, which can be tuned so as not to pay a huge wake-up overhead. This overhead is deferred because of asynchronous IO operations and delayed waits in the Linux kernel. Please read the section "NVMe device power management" (at the end) for the fix.
Under the hood: basic profiling
First of all, I can partially reproduce the effect on my machine (Debian with Linux kernel 5.15.0-2, a 6-core Intel i5-9600KF and a "Samsung 970 Evo Plus" NVMe SSD) under some conditions. More specifically, the latency of ~6 ms is visible only during the second C test (2: C Write, time =) when I do not run GCC just before to generate the binary. All timings are smaller than 1 ms otherwise (generally in the range 0.1~0.4 ms). Updating the kernel to version 5.18.0-2 reduced the latency to 2.5 ms (+/- 0.5 ms). The timings are sufficiently stable and deterministic to report such information.

A quick analysis using the great Linux perf tool shows that a significant portion of the time is spent in kernel scheduling-related functions when the latency spike happens (using perf record). Fortunately, we can analyze what the scheduler does with perf too. The command perf sched record ./a.out records the scheduling information, and perf sched timehist -Vg shows the schedule (note that the perf command is sometimes suffixed with its version on some systems, e.g. perf_5.4). In the resulting timeline on my machine, the time on the left is in seconds; the 0123456 column shows the schedule of the active tasks on the cores, where s means the task is scheduled and i means it is interrupted. I added | symbols to make it easier to see when the tracked process is running (a.out is the program executing your code). The execution time printed by the program is 2.68278 ms, so we are basically searching for a 0.0027-second gap in the timestamps (but I find the idle timestamps unreliable, since they appear to indicate the end of the idle time instead of its beginning, i.e. when the task is suspended).

The schedule shows that the process runs for 0.988 ms, is then interrupted for 2.639 ms, runs again for 0.056 ms, is interrupted again for 0.018 ms, runs again, etc. (the rest of the execution is not shown for the sake of clarity). The first interruption matches the reported spike timing very well (especially since we should include the time for the process to warm up again and for the scheduler to do the context switch).
During the interruption of the program, two tasks are awakened: a kernel thread called kworker/u12:2-e and a task called gmain, which is certainly gnome-shell. The kernel thread starts when the program is interrupted and is interrupted when the program is resumed (with a 7 us delay). Also, please note that the kernel thread takes 2.539 ms to run.

Perf can provide some information about kernel calls, by adding the options --kernel-callchains --call-graph dwarf to sched record. Unfortunately, the results are not very useful in this case. The only useful information is that the kernel call chain io_schedule <- folio_wait_bit_common <- folio_wait_writeback <- truncate_inode_partial_folio is executed when the program is being interrupted during the spike. This proves the program is interrupted because of an IO operation. You can also add the --wakeups flag to see the wakeup events, which show that the suspicious slow kernel thread is awakened by the target program (the previous ones are awakened by other tasks, typically gmain or gnome-terminal).

strace -T ./a.out can be used to track the time of the system calls, and it clearly shows that the third call to openat is slow on my machine.

Based on the gathered information, we can clearly say that system calls like openat or close do not always cause the program to 1. be interrupted and 2. start a kernel thread doing the actual operation on the SSD. Instead, IO calls appear to be somehow aggregated/cached in RAM, and the completion/synchronization on the SSD is only done at specific moments. The latency spike only happens when a kernel thread does the work and the task is interrupted. My guess is that IO operations are done in RAM (certainly asynchronously), and the kernel sometimes flushes/syncs the in-RAM data to the SSD, which is what takes a few milliseconds. The reason for such a delay is unclear. Anyway, it means the operation is probably latency bound.
In the abysses: kernel profiling
To understand what exactly is going on, one unfortunately needs to trace kernel threads and possibly even the SSD driver stack, which is a bit tricky. The simplest way to do that seems to be the Linux Function Tracer (aka ftrace). Note that tracing all kernel functions is very expensive and hides the cost of the expensive ones, so the granularity should be adjusted. Kernel traces quickly become monstrously big, and function names are often not very helpful. On top of that, tracing kernel threads is not easy because their pid is unknown before the request is made, and operations are done in a multi-threaded context (and concurrently on each core due to context switches). I tried it on my machine and recorded a simplified profiling trace of the program (keeping only calls >= 1 us and no IRQ/fault kernel calls for the sake of clarity).
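For reference, here is a minimal sketch (my own, with assumed tracefs paths) of the kind of ftrace configuration this corresponds to: the function_graph tracer with tracing_thresh set so that only functions running longer than 1 us are recorded. It assumes tracefs is mounted at /sys/kernel/tracing and requires root:

#include <fstream>
#include <string>

// Hedged sketch: configure ftrace via tracefs. With the function_graph
// tracer, a nonzero tracing_thresh (in microseconds) keeps only functions
// that run longer than the threshold, which keeps the trace manageable.
static void write_tracefs(const std::string &file, const std::string &value) {
    std::ofstream(std::string("/sys/kernel/tracing/") + file) << value;
}

int main() {
    write_tracefs("tracing_thresh", "1");              // keep calls >= 1 us
    write_tracefs("current_tracer", "function_graph");
    write_tracefs("tracing_on", "1");
    // ... run the benchmark, then inspect /sys/kernel/tracing/trace ...
    return 0;
}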
The kernel trace proves that a context switch happens during __x64_sys_openat (i.e. the syscall done by the fopen call) and that the latency spike also happens at this time. A deeper trace points to the function responsible for the wait, in line with the io_schedule <- folio_wait_bit_common chain reported by perf above.
Meanwhile, the first call to openat calls truncate_inode_pages_range but not truncate_inode_partial_folio, so there is no task interruption and no kernel thread needed to complete the task. In fact, all calls to openat on "hello.txt" cause truncate_inode_pages_range to be called, but only two calls to truncate_inode_partial_folio are made over the first 5 calls to fopen. This function always calls schedule in practice, but only the first call is slow (subsequent calls take 20-700 us, with an average time of 30 us). The truncate_pagecache function tends to confirm that there is a cache, but this does not explain why subsequent calls to schedule are faster.
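As a side note (my own illustration, not from the trace): the truncation happens because fopen(path, "w") maps to an open with the O_TRUNC flag, so reopening an existing, cached file forces the kernel to drop or wait on its page-cache folios. A minimal sketch of the equivalent low-level call:

#include <fcntl.h>   // open(), O_* flags

// Hedged sketch: roughly what fopen(path, "w") does under the hood.
// O_TRUNC on an existing file is what leads the kernel into the
// truncate_inode_pages_range / truncate_inode_partial_folio path above.
// The caller is responsible for close()-ing the returned descriptor.
int open_for_writing(const char *path) {
    return open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
}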
When tracing the kernel threads themselves, I ended up with traces in which the most important part (>95%) of the time has no profiling data at all. Unfortunately, tracing functions like the above (as well as using eBPF tools like ext4slower-bpfcc) resulted in very inconsistent timings, mainly because of the way the time is measured (the absolute time is missing). One needs to trace the IO-based subsystems to understand this further.

Event profiling of the lower-level NVMe stack (using the ftrace-based trace-cmd tool) shows that the first NVMe request is slow while the subsequent ones are fast. The fact that the second openat call is slow is probably due to synchronization (the IO scheduler waits for the previous request to be completed). The most probable reason is that the NVMe device enters its sleeping mode when no requests have been sent for a relatively long time, and it takes time to wake up.
NVMe device power management
The sleeping mode states are called Active Power States. They can be seen using the command smartctl -a /dev/nvme0 (in the smartmontools package). On my device, the latency of the S3 and S4 states is quite big, but their power consumption is also far smaller than that of the others. This can be controlled using the nvme command (in the nvme-cli package), more specifically its get-feature and set-feature sub-commands. You can get more information about this here. In my case, I just wrote 1000 into the file /sys/class/nvme/nvme0/power/pm_qos_latency_tolerance_us so that the latency cannot be bigger than 1 ms (this requires root privileges). Note that this file is reset when the machine reboots; a sketch of doing this programmatically follows below.

WARNING: preventing the SSD from switching to deep sleeping modes can reduce battery life on notebooks and increase the device temperature. In pathological cases (i.e. a poor NVMe controller), this could damage the device if it is not properly cooled. That being said, most devices are protected against this with a throttling strategy.
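For completeness, here is a minimal sketch (my own illustration, assuming the nvme0 device path above) of applying the same setting from C++; it is equivalent to echoing 1000 into the sysfs file and must be rerun after each reboot:

#include <fstream>
#include <iostream>

// Hedged sketch: cap the NVMe power-state wake-up latency at 1000 us by
// writing to the pm_qos file mentioned above. Requires root; the value
// does not persist across reboots.
int main() {
    const char *path = "/sys/class/nvme/nvme0/power/pm_qos_latency_tolerance_us";
    std::ofstream qos(path);
    if (!qos) {
        std::cerr << "cannot open " << path << " (are you root?)\n";
        return 1;
    }
    qos << 1000 << std::endl;  // endl flushes so we can check for write errors
    return qos.good() ? 0 : 1;
}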
Once the power QoS was modified, the latency spike was gone. Et voilà!
Note that this explains why the latency is not the same from one machine to another (and also why the waiting time needed to enter sleep mode differs), and why running GCC just before did not cause a latency spike.