My system is CentOS 8 with kernel 4.18.0-240.22.1.el8_3.x86_64, and I am using DPDK 20.11.1.
I want to calculate the round trip time in an optimized manner: a packet sent from Machine A to Machine B is looped back from Machine B to A, and the time is measured. While this is being done, Machine B has a DPDK forwarding application running (like testpmd or l2fwd/l3fwd).
One approach could be to use the DPDK pktgen application (https://pktgen-dpdk.readthedocs.io/en/latest/), but I could not find it calculating the Round Trip Time in such a way. Ping is another option, but when Machine B receives a ping packet from Machine A, it has to process the packet and then respond back to Machine A, which adds some cycles (which is undesired in my case).
I am open to suggestions and approaches to calculate this time. A benchmark comparing the RTT (Round Trip Time) of a DPDK-based application versus a non-DPDK setup would also make for a better comparison.
Edit: There is a way to enable latency measurement in DPDK pktgen. Can anyone share some information on how this latency is calculated and what it signifies? (I could not find solid information regarding the latency page in the documentation.)
2 Answers
It really depends on the kind of round trip you want to measure. Consider the following timestamps along the path from host A to host B and back:

- t1: software timestamp on host A, taken right before the transmit call
- t2: hardware TX timestamp at host A's NIC
- t3: hardware RX timestamp at host B's NIC
- t3': hardware TX timestamp at host B's NIC, after forwarding
- t2': hardware RX timestamp at host A's NIC
- t1': software timestamp on host A, taken right after the receive call
Do you want to measure t1' - t1? Then it's just a matter of writing a small DPDK program that stores the TSC value right before/after each transmit/receive function call on host A (while host B runs a forwarding application). See also rte_rdtsc_precise() and rte_get_tsc_hz() for converting the TSC deltas to nanoseconds. For non-DPDK programs you can read out the TSC value/frequency by other means. Depending on your resolution needs you could also just call clock_gettime(CLOCK_REALTIME), which has an overhead of 18 ns or so.

This works for single-packet transmits via rte_eth_tx_burst() and single-packet receives, which aren't necessarily realistic for your target application. For larger bursts you would have to get a timestamp before the first transmit and after the last receive and compute the average delta.
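For illustration, here is a minimal sketch of the t1' - t1 measurement. It assumes EAL, port/queue and mempool initialization have already been done elsewhere, and that host B (e.g. testpmd in io forwarding mode) simply loops the frame back; measure_rtt_ns is a placeholder name.

```c
#include <rte_cycles.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Measure t1' - t1 for one packet on host A and return it in nanoseconds.
 * EAL init, rte_eth_dev_configure()/queue setup and the mbuf pool are
 * assumed to be done elsewhere; error handling is kept minimal. */
static double
measure_rtt_ns(uint16_t port, struct rte_mempool *pool)
{
    struct rte_mbuf *tx = rte_pktmbuf_alloc(pool);
    if (tx == NULL)
        return -1.0;
    /* Fill in a minimal Ethernet/UDP frame here (headers + payload). */
    tx->data_len = tx->pkt_len = 64;

    uint64_t t1 = rte_rdtsc_precise();           /* t1: right before transmit */
    if (rte_eth_tx_burst(port, 0, &tx, 1) != 1) {
        rte_pktmbuf_free(tx);
        return -1.0;
    }

    struct rte_mbuf *rx = NULL;
    while (rte_eth_rx_burst(port, 0, &rx, 1) == 0)
        ;                                        /* busy-poll until it comes back */
    uint64_t t1p = rte_rdtsc_precise();          /* t1': right after receive */
    rte_pktmbuf_free(rx);

    /* Convert the TSC delta to nanoseconds. */
    return (double)(t1p - t1) * 1e9 / (double)rte_get_tsc_hz();
}
```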
Timestamps t2, t3, t2', t3' are hardware transmit/receive timestamps provided by (more serious) NICs. If you want to compute the roundtrip t2' - t2, then you first need to discipline the NIC's clock (e.g. with phc2sys), enable timestamping and get those timestamps. However, AFAICS DPDK doesn't support obtaining the TX timestamps, in general.

Thus, when using SFP transceivers, an alternative is to install passive optical TAPs on the RX/TX end of NIC_A and connect the monitor ports to a packet capture NIC that supports receive hardware timestamping. With such a setup, computing the t2' - t2 roundtrip is just a matter of writing a script that reads the timestamps of the matching packets from your pcap and computes the deltas between them.
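As an illustration, a small libpcap-based reader could do that post-processing. This sketch assumes (hypothetically) that every probe frame carries a unique 32-bit sequence number at a fixed offset SEQ_OFF, so each sequence number appears twice in the capture (TX tap copy first, RX tap copy second); for nanosecond-resolution captures you would open the file with pcap_open_offline_with_tstamp_precision() instead.

```c
#include <pcap/pcap.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SEQ_OFF 42        /* assumed offset of the sequence number in the frame */
#define MAX_SEQ 65536     /* illustrative upper bound on the number of probes */

int main(int argc, char **argv)
{
    static double tx_ts[MAX_SEQ];   /* timestamp of the TX tap copy */
    static int seen[MAX_SEQ];
    char errbuf[PCAP_ERRBUF_SIZE];

    if (argc < 2)
        return 1;
    pcap_t *p = pcap_open_offline(argv[1], errbuf);
    if (p == NULL) {
        fprintf(stderr, "pcap_open_offline: %s\n", errbuf);
        return 1;
    }

    struct pcap_pkthdr *hdr;
    const u_char *data;
    while (pcap_next_ex(p, &hdr, &data) == 1) {
        uint32_t seq;
        if (hdr->caplen < SEQ_OFF + sizeof(seq))
            continue;
        memcpy(&seq, data + SEQ_OFF, sizeof(seq));
        if (seq >= MAX_SEQ)
            continue;
        double ts = hdr->ts.tv_sec + hdr->ts.tv_usec / 1e6;
        if (!seen[seq]) {           /* first copy: tapped on TX (t2) */
            tx_ts[seq] = ts;
            seen[seq] = 1;
        } else {                    /* second copy: tapped on RX (t2') */
            printf("seq %u: t2' - t2 = %.9f s\n", seq, ts - tx_ts[seq]);
        }
    }
    pcap_close(p);
    return 0;
}
```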
The ideal way to measure the latency for sending and receiving packets through an interface is to set up an external loopback device on the Machine A NIC port. This will ensure the packet sent is received back on the same NIC without any processing.
The next best alternative is to enable internal loopback. This will ensure the desired packet is converted to a PCIe payload and DMAed to the hardware packet buffer; based on the PCIe config, the packet buffer is fed back to the RX descriptors, leading to RX of the sent packet. But for this one needs a NIC that supports internal loopback; a configuration sketch is shown below.
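A minimal configuration sketch, assuming the PMD honours the lpbk_mode field of struct rte_eth_conf (support and accepted values are NIC/driver specific), might look like:

```c
#include <string.h>
#include <rte_ethdev.h>
#include <rte_mempool.h>

/* Configure one RX and one TX queue on the port with NIC-internal loopback
 * requested, so transmitted frames are looped back to the RX path. */
static int
configure_port_internal_loopback(uint16_t port, struct rte_mempool *pool)
{
    struct rte_eth_conf conf;
    int ret;

    memset(&conf, 0, sizeof(conf));
    conf.lpbk_mode = 1;   /* request internal loopback; behaviour is PMD specific */

    ret = rte_eth_dev_configure(port, 1, 1, &conf);
    if (ret != 0)
        return ret;
    ret = rte_eth_rx_queue_setup(port, 0, 512,
                                 rte_eth_dev_socket_id(port), NULL, pool);
    if (ret != 0)
        return ret;
    ret = rte_eth_tx_queue_setup(port, 0, 512,
                                 rte_eth_dev_socket_id(port), NULL);
    if (ret != 0)
        return ret;
    return rte_eth_dev_start(port);
}
```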
Another way is to use a PCIe port-to-port cross connect. In DPDK, we can run rte_eth_rx_burst for port 1 on core A and rte_eth_rx_burst for port 2 on core B. This will give an almost accurate Round Trip Time.
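One possible arrangement for the cross-connect measurement is sketched below (illustrative only: TX on port 0 from the main lcore, port 1 polled on a worker lcore; EAL/port/mempool setup is assumed to be done elsewhere and error handling is reduced).

```c
#include <rte_cycles.h>
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>

static volatile uint64_t rx_tsc;    /* written by the worker on packet arrival */

/* Worker lcore: busy-poll the RX port and record the TSC when the frame arrives. */
static int
rx_worker(void *arg)
{
    uint16_t port = *(uint16_t *)arg;
    struct rte_mbuf *m = NULL;

    while (rte_eth_rx_burst(port, 0, &m, 1) == 0)
        ;
    rx_tsc = rte_rdtsc_precise();
    rte_pktmbuf_free(m);
    return 0;
}

/* Main lcore: launch the RX worker on port 1, transmit one frame on port 0
 * and return the cross-connect round trip time in nanoseconds. */
static double
cross_connect_rtt_ns(struct rte_mempool *pool)
{
    uint16_t rx_port = 1;
    unsigned int worker = rte_get_next_lcore(-1, 1, 0);

    rx_tsc = 0;
    rte_eal_remote_launch(rx_worker, &rx_port, worker);

    struct rte_mbuf *m = rte_pktmbuf_alloc(pool);
    if (m == NULL)
        return -1.0;
    m->data_len = m->pkt_len = 64;  /* minimal frame; fill in real headers as needed */

    uint64_t tx_tsc = rte_rdtsc_precise();
    rte_eth_tx_burst(0, 0, &m, 1);

    rte_eal_wait_lcore(worker);     /* worker returns once the frame was seen */
    return (double)(rx_tsc - tx_tsc) * 1e9 / (double)rte_get_tsc_hz();
}
```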
Note: newer hardware supports a doorbell mechanism, so on both TX and RX we can enable the HW to send a callback to the driver/PMD, which can then be used to fetch HW-assisted PTP timestamps for nanosecond accuracy.
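Where the PMD exposes this, DPDK's per-port IEEE 1588 helpers can be used to read such hardware timestamps. A rough sketch follows; support varies by NIC/PMD, usually only PTP frames are timestamped on RX, and the RX flags argument is device specific.

```c
#include <stdio.h>
#include <time.h>
#include <rte_ethdev.h>

/* Enable HW timestamping on the port and read back the TX/RX timestamps of
 * the last timestamped (PTP) frame. Support is NIC/PMD dependent. */
static void
read_hw_timestamps(uint16_t port)
{
    struct timespec tx_ts, rx_ts;

    rte_eth_timesync_enable(port);

    /* ... send a PTP frame and receive the looped-back copy here ... */

    if (rte_eth_timesync_read_tx_timestamp(port, &tx_ts) == 0 &&
        rte_eth_timesync_read_rx_timestamp(port, &rx_ts, 0) == 0) {
        double delta_ns = (double)(rx_ts.tv_sec - tx_ts.tv_sec) * 1e9 +
                          (double)(rx_ts.tv_nsec - tx_ts.tv_nsec);
        printf("HW timestamp delta: %.0f ns\n", delta_ns);
    }
}
```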
But in my recommendation, using an external machine (Machine B) is not desirable because of the extra processing it adds on Machine B. If Machine B is used anyway, record T1 on Machine A right before transmit and call rte_eth_tx_buffer_flush to ensure the packet is sent out to the NIC; record T2 and T3 on Machine B when the packet is received and re-transmitted, and T4 on Machine A when it comes back. With these changes, a dummy UDP packet can be created, where Round Trip Time = (T4 - T1) - (T3 - T2), where T4 and T1 give the receive and transmit times on Machine A and T3 - T2 gives the processing overhead on Machine B.
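As a small worked sketch of that formula (hypothetical helper: T1/T4 are TSC samples taken on Machine A before transmit and after receive, T2/T3 are TSC samples taken on Machine B at receive and re-transmit and are assumed to be echoed back in the dummy UDP payload):

```c
#include <stdint.h>

/* Round Trip Time = (T4 - T1) - (T3 - T2), with each delta converted using
 * the TSC frequency (rte_get_tsc_hz()) of the machine it was sampled on. */
static inline double
rtt_ns(uint64_t t1, uint64_t t4, uint64_t hz_a,
       uint64_t t2, uint64_t t3, uint64_t hz_b)
{
    double on_a = (double)(t4 - t1) * 1e9 / (double)hz_a;  /* total time on Machine A */
    double on_b = (double)(t3 - t2) * 1e9 / (double)hz_b;  /* processing on Machine B  */
    return on_a - on_b;
}
```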
Note: depending upon the processor and generation, an invariant (non-stop) TSC is available; this ensures that the tick count from rte_get_tsc_cycles does not vary with CPU frequency and power states.

[Edit-1] as mentioned in comments