
My system is CentOS 8 with kernel 4.18.0-240.22.1.el8_3.x86_64, and I am using DPDK 20.11.1.

I want to calculate the round trip time in an optimized manner: a packet is sent from Machine A to Machine B, looped back from Machine B to A, and the elapsed time is measured. While this is being done, Machine B runs a DPDK forwarding application (such as testpmd or l2fwd/l3fwd).

One approach could be the DPDK pktgen application (https://pktgen-dpdk.readthedocs.io/en/latest/), but I could not find it calculating the round trip time in this way. Ping is another option, but when Machine B receives a ping packet from Machine A, it has to process the packet and then respond to Machine A, which adds cycles (undesired in my case).

I am open to suggestions and approaches for calculating this time. A benchmark comparing the RTT (round trip time) of a DPDK-based application against a non-DPDK setup would also help put the numbers in context.

Edit: There is a way to enable latency measurement in DPDK pktgen. Can anyone share how this latency is calculated and what it signifies? (I could not find solid information regarding this latency in the documentation.)

2 Answers


  1. It really depends on the kind of round trip you want to measure. Consider the following timestamps:

      -> t1  -> send() -> NIC_A -> t2  --link--> t3  -> NIC_B -> recv() -> t4
    host_A                                                              host_B
      <- t1' <- recv() <- NIC_A <- t2' <--link-- t3' <- NIC_B <- send() <- t4' 
    

    Do you want to measure t1' - t1? Then it’s just a matter of writing a small DPDK program that stores the TSC value right before/after each transmit/receive function call on host A (while a forwarding application runs on host B). See also rte_rdtsc_precise() and rte_get_tsc_hz() for converting the TSC deltas to nanoseconds.
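
    For the single-packet case, a minimal sketch of that measurement on host A could look as follows (assuming port 0 is already configured and started, the probe is the only traffic on the link, and build_probe_packet() is a hypothetical helper that creates the frame host B's forwarder loops back):

      #include <rte_cycles.h>
      #include <rte_ethdev.h>
      #include <rte_mbuf.h>

      /* Hypothetical helper: fills in Ethernet/IP headers so that host B's
       * forwarding application sends the frame back to host A. */
      struct rte_mbuf *build_probe_packet(struct rte_mempool *pool);

      /* Measure t1' - t1 for a single probe packet on host A. */
      static uint64_t measure_rtt_cycles(struct rte_mempool *pool)
      {
          struct rte_mbuf *tx_pkt = build_probe_packet(pool);
          struct rte_mbuf *rx_pkts[1];

          uint64_t t1 = rte_rdtsc_precise();       /* timestamp before send() */
          while (rte_eth_tx_burst(0, 0, &tx_pkt, 1) == 0)
              ;                                    /* retry until the packet is queued */

          uint16_t nb_rx = 0;
          while (nb_rx == 0)                       /* poll until the probe comes back */
              nb_rx = rte_eth_rx_burst(0, 0, rx_pkts, 1);
          uint64_t t1p = rte_rdtsc_precise();      /* timestamp after recv() */

          rte_pktmbuf_free(rx_pkts[0]);
          return t1p - t1;                         /* RTT in TSC cycles */
      }

      /* Convert a cycle delta to nanoseconds using the TSC frequency. */
      static double cycles_to_ns(uint64_t cycles)
      {
          return (double)cycles * 1e9 / (double)rte_get_tsc_hz();
      }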

    For non-DPDK programs you can read out the TSC values/frequency by other means. Depending on your resolution needs, you could also just call clock_gettime(CLOCK_REALTIME), which has an overhead of around 18 ns.
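
    For the non-DPDK baseline, the equivalent measurement over a plain kernel socket could look like this (the connected UDP socket and an echo/forward service on Machine B are assumed):

      #include <stdint.h>
      #include <sys/socket.h>
      #include <time.h>

      /* RTT over an already-connected UDP socket `fd`; Machine B is assumed
       * to echo the payload back (e.g. via a trivial echo server). */
      static int64_t rtt_ns_socket(int fd, const void *buf, size_t len)
      {
          char rx[2048];
          struct timespec t1, t2;

          clock_gettime(CLOCK_REALTIME, &t1);
          send(fd, buf, len, 0);
          recv(fd, rx, sizeof(rx), 0);
          clock_gettime(CLOCK_REALTIME, &t2);

          return (int64_t)(t2.tv_sec - t1.tv_sec) * 1000000000LL
               + (t2.tv_nsec - t1.tv_nsec);
      }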

    This works for single-packet transmits via rte_eth_tx_burst() and single-packet receives, which aren’t necessarily realistic for your target application. For larger bursts you would have to get a timestamp before the first transmit and after the last receive, and then compute the average delta.


    Timestamps t2, t3, t2', t3' are hardware transmit/receive timestamps provided by (more serious) NICs.

    If you want to compute the roundtrip t2' - t2, then you first need to discipline the NIC’s clock (e.g. with phc2sys), enable timestamping, and read out those timestamps. However, AFAICS DPDK doesn’t support obtaining the TX timestamps, in general.

    Thus, when using SFP transceivers, an alternative is to install passive optical TAPs on the RX/TX side of NIC_A and connect the monitor ports to a packet-capture NIC that supports hardware receive timestamping. With such a setup, computing the t2' - t2 roundtrip is just a matter of writing a script that reads the timestamps of the matching packets from your pcap and computes the deltas between them.
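
    Such a post-processing step could look roughly like this, assuming both TAP monitor ports are merged into one nanosecond-precision capture file and each probe carries a sequence number at a fixed frame offset (SEQ_OFFSET and the probe layout are assumptions, not a standard):

      #include <pcap/pcap.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>

      /* Assumed offset of a 32-bit probe sequence number inside the frame. */
      #define SEQ_OFFSET 42

      int main(int argc, char **argv)
      {
          char errbuf[PCAP_ERRBUF_SIZE];
          if (argc < 2)
              return 1;
          pcap_t *p = pcap_open_offline_with_tstamp_precision(
                  argv[1], PCAP_TSTAMP_PRECISION_NANO, errbuf);
          if (!p) { fprintf(stderr, "pcap: %s\n", errbuf); return 1; }

          static double first_ts[65536];   /* TX-side (t2) timestamp per sequence number */
          static int    seen[65536];

          struct pcap_pkthdr *hdr;
          const u_char *data;
          while (pcap_next_ex(p, &hdr, &data) == 1) {
              if (hdr->caplen < SEQ_OFFSET + 4)
                  continue;
              uint32_t seq;
              memcpy(&seq, data + SEQ_OFFSET, sizeof(seq));
              seq &= 0xffff;
              /* With nanosecond precision, tv_usec actually carries nanoseconds. */
              double ts = (double)hdr->ts.tv_sec * 1e9 + (double)hdr->ts.tv_usec;
              if (!seen[seq]) {            /* first copy: TAP on the TX fiber (t2)   */
                  first_ts[seq] = ts;
                  seen[seq] = 1;
              } else {                     /* second copy: TAP on the RX fiber (t2') */
                  printf("seq %u: rtt %.0f ns\n", seq, ts - first_ts[seq]);
                  seen[seq] = 0;
              }
          }
          pcap_close(p);
          return 0;
      }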

  2. The ideal way to measure latency for packets sent and received through an interface is to set up an external loopback device on the Machine-A NIC port. This ensures the packet sent is received back on the same NIC without any processing.

    The next best alternative is to enable internal loopback. This ensures the desired packet is converted to a PCIe payload and DMA'd to the hardware packet buffer; based on the PCIe config, the packet buffer is then fed back to the RX descriptors, leading to RX of the sent packet. But for this, one needs a NIC that

    1. supports internal Loopback
    2. and can suppress Loopback error handlers.

    Another way is to use a PCIe port-to-port cross connect. In DPDK, we can run rx_burst for port 1 on core A and rx_burst for port 2 on core B. This will give an almost accurate round trip time.

    Note: newer hardware supports a doorbell mechanism, so on both TX and RX we can have the HW send a callback to the driver/PMD, which can then be used to fetch HW-assisted PTP timestamps for nanosecond accuracy.
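
    Where the NIC/PMD supports IEEE 1588, DPDK's ethdev timesync API can be used to pull such hardware timestamps; a hedged sketch (availability and the RX `flags` value depend on the PMD):

      #include <rte_ethdev.h>
      #include <time.h>

      /* Enable PTP/IEEE 1588 timestamping on a port; depends on PMD support. */
      static int enable_hw_timestamping(uint16_t port_id)
      {
          return rte_eth_timesync_enable(port_id);
      }

      /* After a timestamped frame has been received/transmitted, read back the
       * hardware timestamps; `flags` selects the RX timestamp register (often 0). */
      static int read_hw_timestamps(uint16_t port_id, uint32_t flags,
                                    struct timespec *rx_ts, struct timespec *tx_ts)
      {
          int ret = rte_eth_timesync_read_rx_timestamp(port_id, rx_ts, flags);
          if (ret != 0)
              return ret;
          return rte_eth_timesync_read_tx_timestamp(port_id, tx_ts);
      }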

    But in my recommendation, using an external machine (Machine-B) is not desirable because:

    1. Depending upon the quality of the transfer medium, the latency varies.
    2. Machine-B has to be configured to the ideal settings (for almost zero added latency).
    3. Even if Machine-A and Machine-B have the same physical configuration, they need to be maintained and run at the same thermal settings to allow the right clocking.
    4. Both Machine-A and Machine-B have to run with the same PTP grandmaster to synchronize the clocks.
    5. If DPDK is used, either modify the PMD or use rte_eth_tx_buffer_flush to ensure the packet is sent out to the NIC.

    With these changes, a dummy UDP packet can be created, where

    • the first 8 bytes carry the actual TX time, taken just before tx_burst on Machine-A (T1).
    • the second 8 bytes are added by Machine-B when it actually receives the packet in SW via rx_burst (T2).
    • the third 8 bytes are added by Machine-B when tx_burst is completed (T3).
    • the fourth 8 bytes are filled in on Machine-A when the packet is actually received via rx_burst (T4).

    With these, Round Trip Time = (T4 - T1) - (T3 - T2), where T4 - T1 gives the transmit-to-receive interval seen on Machine-A and T3 - T2 gives the processing overhead on Machine-B.
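
    A sketch of what that payload and the final computation could look like (the field layout, the raw TSC-cycle encoding, and the per-machine TSC frequencies are assumptions, not a fixed format):

      #include <stdint.h>

      /* Timestamps carried in the dummy UDP payload as the packet travels
       * A -> B -> A. Each field holds raw TSC cycles of the machine that
       * wrote it, so T4-T1 and T3-T2 are each deltas on a single clock. */
      struct rtt_payload {
          uint64_t t1;   /* Machine-A: just before rte_eth_tx_burst()        (T1) */
          uint64_t t2;   /* Machine-B: right after rte_eth_rx_burst()        (T2) */
          uint64_t t3;   /* Machine-B: when tx_burst/flush has completed     (T3) */
          uint64_t t4;   /* Machine-A: right after the looped packet is rx'd (T4) */
      };

      /* On Machine A, once the looped-back packet has been received;
       * tsc_hz_a/tsc_hz_b are the rte_get_tsc_hz() values of each machine. */
      static double rtt_ns(const struct rtt_payload *p,
                           uint64_t tsc_hz_a, uint64_t tsc_hz_b)
      {
          double a_ns = (double)(p->t4 - p->t1) * 1e9 / (double)tsc_hz_a; /* T4 - T1 */
          double b_ns = (double)(p->t3 - p->t2) * 1e9 / (double)tsc_hz_b; /* T3 - T2 */
          return a_ns - b_ns;   /* (T4 - T1) - (T3 - T2) */
      }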

    Note: depending upon the processor and generation, an invariant TSC is available. This ensures the tick count from rte_get_tsc_cycles does not vary with frequency and power states.

    [Edit-1] As mentioned in the comments:

    1. @AmmerUsman, I highly recommend editing your question to reflect the real intention, namely how the round trip time is measured, rather than the TX-RX latency of the DUT. This is because you are referring to the DPDK latency stats/metrics, but those measure the min/max/avg latency between RX and TX on the same DUT.
    2. @AmmerUsman, the latency library in DPDK produces stats representing the difference between the TX callback and the RX callback, not the use case you describe. As the explanation from Keith pointed out, the packet sent out by the traffic generator should carry a timestamp in the payload, and the receiver application should forward it back on the same port; the generator can then measure the difference between the receive time and the timestamp embedded in the packet. For this, you need to send it back on the same port, which does not match your setup diagram.