* tg3 RX packet re-order in queue 0 with RSS
@ 2021-09-20 13:29 Vitaly Bursov
  2021-09-22  6:40 ` Siva Reddy Kallam
  0 siblings, 1 reply; 13+ messages in thread
From: Vitaly Bursov @ 2021-09-20 13:29 UTC (permalink / raw)
  To: Siva Reddy Kallam, Prashant Sreedharan, Michael Chan, netdev

Hi,

We found an occasional and random (sometimes happens, sometimes not)
packet re-order when the NIC is involved in UDP multicast reception,
which is sensitive to packet re-ordering. A network capture with
tcpdump sometimes shows the re-order and sometimes does not (e.g. no
re-order on the host while there is re-order in a container at the
same time). In a pcap file the re-ordered packets have correct
timestamps - the delayed packet has an earlier timestamp than the
packet before it:
     1.00s packet1
     1.20s packet3
     1.10s packet2
     1.30s packet4

There is about 300 Mbps of traffic on this NIC, and the server is busy
(hyper-threading enabled, about 50% overall idle) with its
computational application workload.

The NIC is HPE's 4-port 331i adapter (BCM5719), in the default ring and
coalescing configuration: 1 TX queue, 4 RX queues.

After further investigation, I believe there are two separate issues
in the tg3.c driver. Both can be reproduced with iperf3 and unicast
UDP.

Here are the details of how I understand this behavior.

1. Packet re-order.

The driver calls napi_schedule(&tnapi->napi) when handling the
interrupt; however, it sometimes also calls
napi_schedule(&tp->napi[1].napi), which handles RX queue 0 as well:

     https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/broadcom/tg3.c#L6802-L7007

     static int tg3_rx(struct tg3_napi *tnapi, int budget)
     {
             struct tg3 *tp = tnapi->tp;

             ...

             /* Refill RX ring(s). */
             if (!tg3_flag(tp, ENABLE_RSS)) {
                     ....
             } else if (work_mask) {
                     ...

                     if (tnapi != &tp->napi[1]) {
                             tp->rx_refill = true;
                             napi_schedule(&tp->napi[1].napi);
                     }
             }
             ...
     }

Judging from the napi_schedule() code, this schedules RX queue 0
traffic handling on the current CPU - the one handling queues RX1-3 at
that moment.
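
For reference, napi_schedule() ends up in __napi_schedule() in
net/core/dev.c, which queues the NAPI context on the poll list of
whichever CPU it is called from (excerpt; it may differ slightly
between kernel versions):

     void __napi_schedule(struct napi_struct *n)
     {
             unsigned long flags;

             local_irq_save(flags);
             /* add n to the current CPU's poll list, raise NET_RX_SOFTIRQ */
             ____napi_schedule(this_cpu_ptr(&softnet_data), n);
             local_irq_restore(flags);
     }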

At least two traffic flows are required - one on RX queue 0 and the
other on any other queue (1-3). Re-ordering may happen only on the
flow from queue 0; the second flow works fine.

No idea how to fix this.

There are two ways to mitigate this:

   1. Enable RPS by writing any non-zero mask to
      /sys/class/net/enp2s0f0/queues/rx-0/rps_cpus. This enforces the
      CPU used for processing the traffic and overrides whatever the
      "current" CPU for RX queue 0 happens to be at that moment.

   2. Configure the RX flow hash indirection with "ethtool -X enp2s0f0
      weight 0 1 1 1" to exclude RX queue 0 from handling the traffic.


2. RPS configuration

Before the napi_gro_receive() call, there is no call to skb_record_rx_queue():

     static int tg3_rx(struct tg3_napi *tnapi, int budget)
     {
             struct tg3 *tp = tnapi->tp;
             u32 work_mask, rx_std_posted = 0;
             u32 std_prod_idx, jmb_prod_idx;
             u32 sw_idx = tnapi->rx_rcb_ptr;
             u16 hw_idx;
             int received;
             struct tg3_rx_prodring_set *tpr = &tnapi->prodring;

             ...

                     napi_gro_receive(&tnapi->napi, skb);


                     received++;
                     budget--;
             ...


As a result, queue_mapping is always 0 (not set), and RPS treats all
traffic as originating from queue 0.

           <idle>-0     [013] ..s. 14030782.234664: napi_gro_receive_entry: dev=enp2s0f0 napi_id=0x0 queue_mapping=0 ...

RPS configuration for rx-1 to rx-3 has no effect.
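
For reference, RPS picks the per-queue rps_cpus map in get_rps_cpu()
(net/core/dev.c) based on the RX queue recorded in the skb, so without
skb_record_rx_queue() it always falls back to the rx-0 map (excerpt;
it may differ slightly between kernel versions):

     /* in get_rps_cpu() */
     struct netdev_rx_queue *rxqueue = dev->_rx;

     if (skb_rx_queue_recorded(skb)) {
             u16 index = skb_get_rx_queue(skb);
             ...
             rxqueue += index;  /* select that queue's rps_map */
     }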


NIC:
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
     Subsystem: Hewlett-Packard Company Ethernet 1Gb 4-port 331i Adapter
     Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
     Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
     Latency: 0, Cache Line Size: 64 bytes
     Interrupt: pin A routed to IRQ 16
     NUMA node: 0
     Region 0: Memory at d9d90000 (64-bit, prefetchable) [size=64K]
     Region 2: Memory at d9da0000 (64-bit, prefetchable) [size=64K]
     Region 4: Memory at d9db0000 (64-bit, prefetchable) [size=64K]
     [virtual] Expansion ROM at d9c00000 [disabled] [size=256K]
     Capabilities: <access denied>
     Kernel driver in use: tg3
     Kernel modules: tg3

Linux kernel:
     CentOS 7 - 3.10.0-1160.15.2
     Ubuntu - 5.4.0-80.90

Network configuration:
     iperf3 (sender) - [LAN] - NIC (enp2s0f0) - Bridge (br0) - veth (v1) - namespace veth (v2) - iperf3 (receiver 1)

     brctl addbr br0
     ip l set up dev br0
     ip a a 10.10.10.10/24 dev br0
     ip r a default via 10.10.10.1 dev br0
     ip l set dev enp2s0f0 master br0
     ip l set up dev enp2s0f0

     ip netns add n1
     ip link add v1 type veth peer name v2
     ip l set up dev v1
     ip l set dev v1 master br0
     ip l set dev v2 netns n1

     ip netns exec n1 bash
     ip l set up dev lo
     ip l set up dev v2
     ip a a 10.10.10.11/24 dev v2

     "receiver 2" has the same configuration but different IP and different namespace.

Iperf3:

     The sender runs the iperf3 servers: iperf3 -s -p 5201 & iperf3 -s -p 5202 &
     Each receiver runs: iperf3 -c 10.10.10.1 -R -p 5201 -u -l 163 -b 200M -t 300

-- 
Thanks
Vitalii


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: tg3 RX packet re-order in queue 0 with RSS
  2021-09-20 13:29 tg3 RX packet re-order in queue 0 with RSS Vitaly Bursov
@ 2021-09-22  6:40 ` Siva Reddy Kallam
  2021-10-27  9:30   ` Pavan Chebbi
  0 siblings, 1 reply; 13+ messages in thread
From: Siva Reddy Kallam @ 2021-09-22  6:40 UTC (permalink / raw)
  To: Vitaly Bursov
  Cc: Prashant Sreedharan, Michael Chan, Linux Netdev List, Pavan Chebbi

Thank you for reporting this. Pavan (cc'd) from Broadcom is looking into this issue.
We will provide our feedback on this very soon.

On Mon, Sep 20, 2021 at 6:59 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>
> Hi,
>
> We found a occassional and random (sometimes happens, sometimes not)
> packet re-order when NIC is involved in UDP multicast reception, which
> is sensitive to a packet re-order. Network capture with tcpdump
> sometimes shows the packet re-order, sometimes not (e.g. no re-order on
> a host, re-order in a container at the same time). In a pcap file
> re-ordered packets have a correct timestamp - delayed packet had a more
> earlier timestamp compared to a previous packet:
>      1.00s packet1
>      1.20s packet3
>      1.10s packet2
>      1.30s packet4
>
> There's about 300Mbps of traffic on this NIC, and server is busy
> (hyper-threading enabled, about 50% overall idle) with its
> computational application work.
>
> NIC is HPE's 4-port 331i adapter - BCM5719, in a default ring and
> coalescing configuration, 1 TX queue, 4 RX queues.
>
> After further investigation, I believe that there are two separate
> issues in tg3.c driver. Issues can be reproduced with iperf3, and
> unicast UDP.
>
> Here are the details of how I understand this behavior.
>
> 1. Packet re-order.
>
> Driver calls napi_schedule(&tnapi->napi) when handling the interrupt,
> however, sometimes it calls napi_schedule(&tp->napi[1].napi), which
> handles RX queue 0 too:
>
>      https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/broadcom/tg3.c#L6802-L7007
>
>      static int tg3_rx(struct tg3_napi *tnapi, int budget)
>      {
>              struct tg3 *tp = tnapi->tp;
>
>              ...
>
>              /* Refill RX ring(s). */
>              if (!tg3_flag(tp, ENABLE_RSS)) {
>                      ....
>              } else if (work_mask) {
>                      ...
>
>                      if (tnapi != &tp->napi[1]) {
>                              tp->rx_refill = true;
>                              napi_schedule(&tp->napi[1].napi);
>                      }
>              }
>              ...
>      }
>
>  From napi_schedule() code, it should schedure RX 0 traffic handling on
> a current CPU, which handles queues RX1-3 right now.
>
> At least two traffic flows are required - one on RX queue 0, and the
> other on any other queue (1-3). Re-ordering may happend only on flow
> from queue 0, the second flow will work fine.
>
> No idea how to fix this.
>
> There are two ways to mitigate this:
>
>    1. Enable RPS by writting any non-zero mask to
>       /sys/class/net/enp2s0f0/queues/rx-0/rps_cpus This encorces CPU
>       when processing traffic, and overrides whatever "current" CPU for
>       RX queue 0 is in this moment.
>
>    2. Configure RX hash flow redirection with: ethtool -X enp2s0f0
>       weight 0 1 1 1 to exclude RX queue 0 from handling the traffic.
>
>
> 2. RPS configuration
>
> Before napi_gro_receive() call, there's no call to skb_record_rx_queue():
>
>      static int tg3_rx(struct tg3_napi *tnapi, int budget)
>      {
>              struct tg3 *tp = tnapi->tp;
>              u32 work_mask, rx_std_posted = 0;
>              u32 std_prod_idx, jmb_prod_idx;
>              u32 sw_idx = tnapi->rx_rcb_ptr;
>              u16 hw_idx;
>              int received;
>              struct tg3_rx_prodring_set *tpr = &tnapi->prodring;
>
>              ...
>
>                      napi_gro_receive(&tnapi->napi, skb);
>
>
>                      received++;
>                      budget--;
>              ...
>
>
> As a result, queue_mapping is always 0/not set, and RPS handles all
> traffic as originating from queue 0.
>
>            <idle>-0     [013] ..s. 14030782.234664: napi_gro_receive_entry: dev=enp2s0f0 napi_id=0x0 queue_mapping=0 ...
>
> RPS configuration for rx-1 to to rx-3 has no effect.
>
>
> NIC:
> 02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
>      Subsystem: Hewlett-Packard Company Ethernet 1Gb 4-port 331i Adapter
>      Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
>      Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>      Latency: 0, Cache Line Size: 64 bytes
>      Interrupt: pin A routed to IRQ 16
>      NUMA node: 0
>      Region 0: Memory at d9d90000 (64-bit, prefetchable) [size=64K]
>      Region 2: Memory at d9da0000 (64-bit, prefetchable) [size=64K]
>      Region 4: Memory at d9db0000 (64-bit, prefetchable) [size=64K]
>      [virtual] Expansion ROM at d9c00000 [disabled] [size=256K]
>      Capabilities: <access denied>
>      Kernel driver in use: tg3
>      Kernel modules: tg3
>
> Linux kernel:
>      CentOS 7 - 3.10.0-1160.15.2
>      Ubuntu - 5.4.0-80.90
>
> Network configuration:
>      iperf3 (sender) - [LAN] - NIC (enp2s0f0) - Bridge (br0) - veth (v1) - namespace veth (v2) - iperf3 (receiver 1)
>
>      brctl addbr br0
>      ip l set up dev br0
>      ip a a 10.10.10.10/24 dev br0
>      ip r a default via 10.10.10.1 dev br0
>      ip l set dev enp2s0f0 master br0
>      ip l set up dev enp2s0f0
>
>      ip netns add n1
>      ip link add v1 type veth peer name v2
>      ip l set up dev v1
>      ip l set dev v1 master br0
>      ip l set dev v2 netns n1
>
>      ip netns exec n1 bash
>      ip l set up dev lo
>      ip l set up dev v2
>      ip a a 10.10.10.11/24 dev v2
>
>      "receiver 2" has the same configuration but different IP and different namespace.
>
> Iperf3:
>
>      Sender runs iperfs: iperf3 -s -p 5201 & iperf3 -s -p 5202 &
>      Receiver's iperf3 -c 10.10.10.1 -R -p 5201 -u -l 163 -b 200M -t 300
>
> --
> Thanks
> Vitalii
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: tg3 RX packet re-order in queue 0 with RSS
  2021-09-22  6:40 ` Siva Reddy Kallam
@ 2021-10-27  9:30   ` Pavan Chebbi
  2021-10-27 10:31     ` Vitaly Bursov
  0 siblings, 1 reply; 13+ messages in thread
From: Pavan Chebbi @ 2021-10-27  9:30 UTC (permalink / raw)
  To: Siva Reddy Kallam
  Cc: Vitaly Bursov, Prashant Sreedharan, Michael Chan, Linux Netdev List

On Wed, Sep 22, 2021 at 12:10 PM Siva Reddy Kallam
<siva.kallam@broadcom.com> wrote:
>
> Thank you for reporting this. Pavan(cc'd) from Broadcom looking into this issue.
> We will provide our feedback very soon on this.
>
> On Mon, Sep 20, 2021 at 6:59 PM Vitaly Bursov <vitaly@bursov.com> wrote:
> >
> > Hi,
> >
> > We found a occassional and random (sometimes happens, sometimes not)
> > packet re-order when NIC is involved in UDP multicast reception, which
> > is sensitive to a packet re-order. Network capture with tcpdump
> > sometimes shows the packet re-order, sometimes not (e.g. no re-order on
> > a host, re-order in a container at the same time). In a pcap file
> > re-ordered packets have a correct timestamp - delayed packet had a more
> > earlier timestamp compared to a previous packet:
> >      1.00s packet1
> >      1.20s packet3
> >      1.10s packet2
> >      1.30s packet4
> >
> > There's about 300Mbps of traffic on this NIC, and server is busy
> > (hyper-threading enabled, about 50% overall idle) with its
> > computational application work.
> >
> > NIC is HPE's 4-port 331i adapter - BCM5719, in a default ring and
> > coalescing configuration, 1 TX queue, 4 RX queues.
> >
> > After further investigation, I believe that there are two separate
> > issues in tg3.c driver. Issues can be reproduced with iperf3, and
> > unicast UDP.
> >
> > Here are the details of how I understand this behavior.
> >
> > 1. Packet re-order.
> >
> > Driver calls napi_schedule(&tnapi->napi) when handling the interrupt,
> > however, sometimes it calls napi_schedule(&tp->napi[1].napi), which
> > handles RX queue 0 too:
> >
> >      https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/broadcom/tg3.c#L6802-L7007
> >
> >      static int tg3_rx(struct tg3_napi *tnapi, int budget)
> >      {
> >              struct tg3 *tp = tnapi->tp;
> >
> >              ...
> >
> >              /* Refill RX ring(s). */
> >              if (!tg3_flag(tp, ENABLE_RSS)) {
> >                      ....
> >              } else if (work_mask) {
> >                      ...
> >
> >                      if (tnapi != &tp->napi[1]) {
> >                              tp->rx_refill = true;
> >                              napi_schedule(&tp->napi[1].napi);
> >                      }
> >              }
> >              ...
> >      }
> >
> >  From napi_schedule() code, it should schedure RX 0 traffic handling on
> > a current CPU, which handles queues RX1-3 right now.
> >
> > At least two traffic flows are required - one on RX queue 0, and the
> > other on any other queue (1-3). Re-ordering may happend only on flow
> > from queue 0, the second flow will work fine.
> >
> > No idea how to fix this.

In the case of RSS, the actual RX rings are 1 to 4.
The NAPI contexts of those rings are indeed processing the packets.
The explicit napi_schedule() of napi[1] only re-fills the RX BD
producer ring, because it is shared with return rings 1-4.
I tried to reproduce this but I am not seeing the issue. If you are
receiving packets on RX 0 then RSS must have been disabled.
Can you please check?


> >
> > There are two ways to mitigate this:
> >
> >    1. Enable RPS by writting any non-zero mask to
> >       /sys/class/net/enp2s0f0/queues/rx-0/rps_cpus This encorces CPU
> >       when processing traffic, and overrides whatever "current" CPU for
> >       RX queue 0 is in this moment.
> >
> >    2. Configure RX hash flow redirection with: ethtool -X enp2s0f0
> >       weight 0 1 1 1 to exclude RX queue 0 from handling the traffic.
> >
> >
> > 2. RPS configuration
> >
> > Before napi_gro_receive() call, there's no call to skb_record_rx_queue():
> >
> >      static int tg3_rx(struct tg3_napi *tnapi, int budget)
> >      {
> >              struct tg3 *tp = tnapi->tp;
> >              u32 work_mask, rx_std_posted = 0;
> >              u32 std_prod_idx, jmb_prod_idx;
> >              u32 sw_idx = tnapi->rx_rcb_ptr;
> >              u16 hw_idx;
> >              int received;
> >              struct tg3_rx_prodring_set *tpr = &tnapi->prodring;
> >
> >              ...
> >
> >                      napi_gro_receive(&tnapi->napi, skb);
> >
> >
> >                      received++;
> >                      budget--;
> >              ...
> >
> >
> > As a result, queue_mapping is always 0/not set, and RPS handles all
> > traffic as originating from queue 0.
> >
> >            <idle>-0     [013] ..s. 14030782.234664: napi_gro_receive_entry: dev=enp2s0f0 napi_id=0x0 queue_mapping=0 ...
> >
> > RPS configuration for rx-1 to to rx-3 has no effect.

OK, I think we could add a patch to update the skb with the queue
mapping. I will discuss it internally.
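
For illustration, a minimal sketch of such a change (untested; it
assumes tp->napi[1..4] service RX return rings 0..3, so the recorded
index is zero-based) would go in tg3_rx(), just before
napi_gro_receive():

        /* let RPS/XPS see which RX queue this packet came from */
        skb_record_rx_queue(skb, tnapi - &tp->napi[1]);

        napi_gro_receive(&tnapi->napi, skb);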

> >
> >
> > NIC:
> > 02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
> >      Subsystem: Hewlett-Packard Company Ethernet 1Gb 4-port 331i Adapter
> >      Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
> >      Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> >      Latency: 0, Cache Line Size: 64 bytes
> >      Interrupt: pin A routed to IRQ 16
> >      NUMA node: 0
> >      Region 0: Memory at d9d90000 (64-bit, prefetchable) [size=64K]
> >      Region 2: Memory at d9da0000 (64-bit, prefetchable) [size=64K]
> >      Region 4: Memory at d9db0000 (64-bit, prefetchable) [size=64K]
> >      [virtual] Expansion ROM at d9c00000 [disabled] [size=256K]
> >      Capabilities: <access denied>
> >      Kernel driver in use: tg3
> >      Kernel modules: tg3
> >
> > Linux kernel:
> >      CentOS 7 - 3.10.0-1160.15.2
> >      Ubuntu - 5.4.0-80.90
> >
> > Network configuration:
> >      iperf3 (sender) - [LAN] - NIC (enp2s0f0) - Bridge (br0) - veth (v1) - namespace veth (v2) - iperf3 (receiver 1)
> >
> >      brctl addbr br0
> >      ip l set up dev br0
> >      ip a a 10.10.10.10/24 dev br0
> >      ip r a default via 10.10.10.1 dev br0
> >      ip l set dev enp2s0f0 master br0
> >      ip l set up dev enp2s0f0
> >
> >      ip netns add n1
> >      ip link add v1 type veth peer name v2
> >      ip l set up dev v1
> >      ip l set dev v1 master br0
> >      ip l set dev v2 netns n1
> >
> >      ip netns exec n1 bash
> >      ip l set up dev lo
> >      ip l set up dev v2
> >      ip a a 10.10.10.11/24 dev v2
> >
> >      "receiver 2" has the same configuration but different IP and different namespace.
> >
> > Iperf3:
> >
> >      Sender runs iperfs: iperf3 -s -p 5201 & iperf3 -s -p 5202 &
> >      Receiver's iperf3 -c 10.10.10.1 -R -p 5201 -u -l 163 -b 200M -t 300
> >
> > --
> > Thanks
> > Vitalii
> >

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: tg3 RX packet re-order in queue 0 with RSS
  2021-10-27  9:30   ` Pavan Chebbi
@ 2021-10-27 10:31     ` Vitaly Bursov
  2021-10-28  7:33       ` Pavan Chebbi
  0 siblings, 1 reply; 13+ messages in thread
From: Vitaly Bursov @ 2021-10-27 10:31 UTC (permalink / raw)
  To: Pavan Chebbi, Siva Reddy Kallam
  Cc: Prashant Sreedharan, Michael Chan, Linux Netdev List


27.10.2021 12:30, Pavan Chebbi wrote:
> On Wed, Sep 22, 2021 at 12:10 PM Siva Reddy Kallam
> <siva.kallam@broadcom.com> wrote:
>>
>> Thank you for reporting this. Pavan(cc'd) from Broadcom looking into this issue.
>> We will provide our feedback very soon on this.
>>
>> On Mon, Sep 20, 2021 at 6:59 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>>>
>>> Hi,
>>>
>>> We found a occassional and random (sometimes happens, sometimes not)
>>> packet re-order when NIC is involved in UDP multicast reception, which
>>> is sensitive to a packet re-order. Network capture with tcpdump
>>> sometimes shows the packet re-order, sometimes not (e.g. no re-order on
>>> a host, re-order in a container at the same time). In a pcap file
>>> re-ordered packets have a correct timestamp - delayed packet had a more
>>> earlier timestamp compared to a previous packet:
>>>       1.00s packet1
>>>       1.20s packet3
>>>       1.10s packet2
>>>       1.30s packet4
>>>
>>> There's about 300Mbps of traffic on this NIC, and server is busy
>>> (hyper-threading enabled, about 50% overall idle) with its
>>> computational application work.
>>>
>>> NIC is HPE's 4-port 331i adapter - BCM5719, in a default ring and
>>> coalescing configuration, 1 TX queue, 4 RX queues.
>>>
>>> After further investigation, I believe that there are two separate
>>> issues in tg3.c driver. Issues can be reproduced with iperf3, and
>>> unicast UDP.
>>>
>>> Here are the details of how I understand this behavior.
>>>
>>> 1. Packet re-order.
>>>
>>> Driver calls napi_schedule(&tnapi->napi) when handling the interrupt,
>>> however, sometimes it calls napi_schedule(&tp->napi[1].napi), which
>>> handles RX queue 0 too:
>>>
>>>       https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/broadcom/tg3.c#L6802-L7007
>>>
>>>       static int tg3_rx(struct tg3_napi *tnapi, int budget)
>>>       {
>>>               struct tg3 *tp = tnapi->tp;
>>>
>>>               ...
>>>
>>>               /* Refill RX ring(s). */
>>>               if (!tg3_flag(tp, ENABLE_RSS)) {
>>>                       ....
>>>               } else if (work_mask) {
>>>                       ...
>>>
>>>                       if (tnapi != &tp->napi[1]) {
>>>                               tp->rx_refill = true;
>>>                               napi_schedule(&tp->napi[1].napi);
>>>                       }
>>>               }
>>>               ...
>>>       }
>>>
>>>   From napi_schedule() code, it should schedure RX 0 traffic handling on
>>> a current CPU, which handles queues RX1-3 right now.
>>>
>>> At least two traffic flows are required - one on RX queue 0, and the
>>> other on any other queue (1-3). Re-ordering may happend only on flow
>>> from queue 0, the second flow will work fine.
>>>
>>> No idea how to fix this.
> 
> In the case of RSS the actual rings for RX are from 1 to 4.
> The napi of those rings are indeed processing the packets.
> The explicit napi_schedule of napi[1] is only re-filling rx BD
> producer ring because it is shared with return rings for 1-4.
> I tried to repro this but I am not seeing the issue. If you are
> receiving packets on RX 0 then the RSS must have been disabled.
> Can you please check?
> 

# ethtool -i enp2s0f0
driver: tg3
version: 3.137
firmware-version: 5719-v1.46 NCSI v1.5.18.0
expansion-rom-version:
bus-info: 0000:02:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

# ethtool -l enp2s0f0
Channel parameters for enp2s0f0:
Pre-set maximums:
RX:		4
TX:		4
Other:		0
Combined:	0
Current hardware settings:
RX:		4
TX:		1
Other:		0
Combined:	0

# ethtool -x enp2s0f0
RX flow hash indirection table for enp2s0f0 with 4 RX ring(s):
     0:      0     1     2     3     0     1     2     3
     8:      0     1     2     3     0     1     2     3
    16:      0     1     2     3     0     1     2     3
    24:      0     1     2     3     0     1     2     3
    32:      0     1     2     3     0     1     2     3
    40:      0     1     2     3     0     1     2     3
    48:      0     1     2     3     0     1     2     3
    56:      0     1     2     3     0     1     2     3
    64:      0     1     2     3     0     1     2     3
    72:      0     1     2     3     0     1     2     3
    80:      0     1     2     3     0     1     2     3
    88:      0     1     2     3     0     1     2     3
    96:      0     1     2     3     0     1     2     3
   104:      0     1     2     3     0     1     2     3
   112:      0     1     2     3     0     1     2     3
   120:      0     1     2     3     0     1     2     3
RSS hash key:
Operation not supported
RSS hash function:
     toeplitz: on
     xor: off
     crc32: off

In /proc/interrupts there are enp2s0f0-tx-0, enp2s0f0-rx-1,
enp2s0f0-rx-2, enp2s0f0-rx-3 and enp2s0f0-rx-4 interrupts, all on
different CPU cores. The kernel also has "threadirqs" on the command
line; I didn't check whether this parameter affects the issue.

Yes, some things are counted from 0 and others from 1 - sorry for the
confusion in terminology. What I meant:
  - There are 4 RX rings/queues, which I counted starting from 0, so 0..3.
    RX0 is the first queue/ring that actually receives traffic.
    RX0 is handled by the enp2s0f0-rx-1 interrupt.
  - These correspond to tp->napi[i], but i is in 1..4, so the first
    receiving queue corresponds to tp->napi[1], the second to
    tp->napi[2], and so on. Correct?

Suppose tg3_rx() is called for tp->napi[2]. This function most likely
calls napi_gro_receive(&tnapi->napi, skb) to further process packets
in tp->napi[2]. And, under some conditions (RSS and work_mask), it
calls napi_schedule(&tp->napi[1].napi), which schedules tp->napi[1]
work on the current CPU - the one designated for tp->napi[2], not for
tp->napi[1]. Correct?

I don't understand what napi_schedule(&tp->napi[1].napi) does for the
NIC or driver; "re-filling the rx BD producer ring" sounds important.
I suspect something will break badly if I simply remove it without
replacing it with something more elaborate. I guess that along with
re-filling the rx BD producer ring it can also process incoming
packets. Is that possible?

-- 
Thanks
Vitalii

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: tg3 RX packet re-order in queue 0 with RSS
  2021-10-27 10:31     ` Vitaly Bursov
@ 2021-10-28  7:33       ` Pavan Chebbi
  2021-10-28 15:41         ` Vitaly Bursov
  0 siblings, 1 reply; 13+ messages in thread
From: Pavan Chebbi @ 2021-10-28  7:33 UTC (permalink / raw)
  To: Vitaly Bursov
  Cc: Siva Reddy Kallam, Prashant Sreedharan, Michael Chan, Linux Netdev List

On Wed, Oct 27, 2021 at 4:02 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>
>
> 27.10.2021 12:30, Pavan Chebbi wrote:
> > On Wed, Sep 22, 2021 at 12:10 PM Siva Reddy Kallam
> > <siva.kallam@broadcom.com> wrote:
> >>
> >> Thank you for reporting this. Pavan(cc'd) from Broadcom looking into this issue.
> >> We will provide our feedback very soon on this.
> >>
> >> On Mon, Sep 20, 2021 at 6:59 PM Vitaly Bursov <vitaly@bursov.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> We found a occassional and random (sometimes happens, sometimes not)
> >>> packet re-order when NIC is involved in UDP multicast reception, which
> >>> is sensitive to a packet re-order. Network capture with tcpdump
> >>> sometimes shows the packet re-order, sometimes not (e.g. no re-order on
> >>> a host, re-order in a container at the same time). In a pcap file
> >>> re-ordered packets have a correct timestamp - delayed packet had a more
> >>> earlier timestamp compared to a previous packet:
> >>>       1.00s packet1
> >>>       1.20s packet3
> >>>       1.10s packet2
> >>>       1.30s packet4
> >>>
> >>> There's about 300Mbps of traffic on this NIC, and server is busy
> >>> (hyper-threading enabled, about 50% overall idle) with its
> >>> computational application work.
> >>>
> >>> NIC is HPE's 4-port 331i adapter - BCM5719, in a default ring and
> >>> coalescing configuration, 1 TX queue, 4 RX queues.
> >>>
> >>> After further investigation, I believe that there are two separate
> >>> issues in tg3.c driver. Issues can be reproduced with iperf3, and
> >>> unicast UDP.
> >>>
> >>> Here are the details of how I understand this behavior.
> >>>
> >>> 1. Packet re-order.
> >>>
> >>> Driver calls napi_schedule(&tnapi->napi) when handling the interrupt,
> >>> however, sometimes it calls napi_schedule(&tp->napi[1].napi), which
> >>> handles RX queue 0 too:
> >>>
> >>>       https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/broadcom/tg3.c#L6802-L7007
> >>>
> >>>       static int tg3_rx(struct tg3_napi *tnapi, int budget)
> >>>       {
> >>>               struct tg3 *tp = tnapi->tp;
> >>>
> >>>               ...
> >>>
> >>>               /* Refill RX ring(s). */
> >>>               if (!tg3_flag(tp, ENABLE_RSS)) {
> >>>                       ....
> >>>               } else if (work_mask) {
> >>>                       ...
> >>>
> >>>                       if (tnapi != &tp->napi[1]) {
> >>>                               tp->rx_refill = true;
> >>>                               napi_schedule(&tp->napi[1].napi);
> >>>                       }
> >>>               }
> >>>               ...
> >>>       }
> >>>
> >>>   From napi_schedule() code, it should schedure RX 0 traffic handling on
> >>> a current CPU, which handles queues RX1-3 right now.
> >>>
> >>> At least two traffic flows are required - one on RX queue 0, and the
> >>> other on any other queue (1-3). Re-ordering may happend only on flow
> >>> from queue 0, the second flow will work fine.
> >>>
> >>> No idea how to fix this.
> >
> > In the case of RSS the actual rings for RX are from 1 to 4.
> > The napi of those rings are indeed processing the packets.
> > The explicit napi_schedule of napi[1] is only re-filling rx BD
> > producer ring because it is shared with return rings for 1-4.
> > I tried to repro this but I am not seeing the issue. If you are
> > receiving packets on RX 0 then the RSS must have been disabled.
> > Can you please check?
> >
>
> # ethtool -i enp2s0f0
> driver: tg3
> version: 3.137
> firmware-version: 5719-v1.46 NCSI v1.5.18.0
> expansion-rom-version:
> bus-info: 0000:02:00.0
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: yes
> supports-register-dump: yes
> supports-priv-flags: no
>
> # ethtool -l enp2s0f0
> Channel parameters for enp2s0f0:
> Pre-set maximums:
> RX:             4
> TX:             4
> Other:          0
> Combined:       0
> Current hardware settings:
> RX:             4
> TX:             1
> Other:          0
> Combined:       0
>
> # ethtool -x enp2s0f0
> RX flow hash indirection table for enp2s0f0 with 4 RX ring(s):
>      0:      0     1     2     3     0     1     2     3
>      8:      0     1     2     3     0     1     2     3
>     16:      0     1     2     3     0     1     2     3
>     24:      0     1     2     3     0     1     2     3
>     32:      0     1     2     3     0     1     2     3
>     40:      0     1     2     3     0     1     2     3
>     48:      0     1     2     3     0     1     2     3
>     56:      0     1     2     3     0     1     2     3
>     64:      0     1     2     3     0     1     2     3
>     72:      0     1     2     3     0     1     2     3
>     80:      0     1     2     3     0     1     2     3
>     88:      0     1     2     3     0     1     2     3
>     96:      0     1     2     3     0     1     2     3
>    104:      0     1     2     3     0     1     2     3
>    112:      0     1     2     3     0     1     2     3
>    120:      0     1     2     3     0     1     2     3
> RSS hash key:
> Operation not supported
> RSS hash function:
>      toeplitz: on
>      xor: off
>      crc32: off
>
> In /proc/interrupts there are enp2s0f0-tx-0, enp2s0f0-rx-1,
> enp2s0f0-rx-2, enp2s0f0-rx-3, enp2s0f0-rx-4 interrupts, all on
> different CPU cores. Kernel also has "threadirqs" enabled in
> command line, I didn't check if this parameter affects the issue.
>
> Yes, some things start with 0, and others with 1, sorry for a confusion
> in terminology, what I meant:
>   - There are 4 RX rings/queues, I counted starting from 0, so: 0..3.
>     RX0 is the first queue/ring that actually receives the traffic.
>     RX0 is handled by enp2s0f0-rx-1 interrupt.
>   - These are related to (tp->napi[i]), but i is in 1..4, so the first
>     receiving queue relates to tp->napi[1], the second relates to
>     tp->napi[2], and so on. Correct?
>
> Suppose, tg3_rx() is called for tp->napi[2], this function most likely
> calls napi_gro_receive(&tnapi->napi, skb) to further process packets in
> tp->napi[2]. And, under some conditions (RSS and work_mask), it calls
> napi_schedule(&tp->napi[1].napi), which schedules tp->napi[1] work
> on a currect CPU, which is designated for tp->napi[2], but not for
> tp->napi[1]. Correct?
>
> I don't understand what napi_schedule(&tp->napi[1].napi) does for the
> NIC or driver, "re-filling rx BD producer ring" sounds important. I
> suspect something will break badly if I simply remove it without
> replacing with something more elaborate. I guess along with re-filling
> rx BD producer ring it also can process incoming packets. Is it possible?
>

Yes, napi[1] work may be called on napi[2]'s CPU, but it generally
won't process any RX packets because the producer index of napi[1] has
not changed. If the producer count did change, then we get a poll from
the ISR for napi[1] to process packets. So when called explicitly it
is mostly used to re-fill RX buffers. However, there could be a small
window where the producer index has been incremented but the ISR has
not fired yet; in that case it may process a small number of packets.
But I don't think this should lead to a re-order problem.
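
For context, tg3_poll_work() only calls tg3_rx() when the return-ring
producer index has moved (excerpt from tg3.c; it may differ slightly
between kernel versions):

        /* run RX thread, within the bounds set by NAPI */
        if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr)
                work_done += tg3_rx(tnapi, budget - work_done);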


> --
> Thanks
> Vitalii

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: tg3 RX packet re-order in queue 0 with RSS
  2021-10-28  7:33       ` Pavan Chebbi
@ 2021-10-28 15:41         ` Vitaly Bursov
  2021-10-29  5:04           ` Pavan Chebbi
  0 siblings, 1 reply; 13+ messages in thread
From: Vitaly Bursov @ 2021-10-28 15:41 UTC (permalink / raw)
  To: Pavan Chebbi
  Cc: Siva Reddy Kallam, Prashant Sreedharan, Michael Chan, Linux Netdev List


28.10.2021 10:33, Pavan Chebbi wrote:
> On Wed, Oct 27, 2021 at 4:02 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>>
>>
>> 27.10.2021 12:30, Pavan Chebbi wrote:
>>> On Wed, Sep 22, 2021 at 12:10 PM Siva Reddy Kallam
>>> <siva.kallam@broadcom.com> wrote:
>>>>
>>>> Thank you for reporting this. Pavan(cc'd) from Broadcom looking into this issue.
>>>> We will provide our feedback very soon on this.
>>>>
>>>> On Mon, Sep 20, 2021 at 6:59 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> We found a occassional and random (sometimes happens, sometimes not)
>>>>> packet re-order when NIC is involved in UDP multicast reception, which
>>>>> is sensitive to a packet re-order. Network capture with tcpdump
>>>>> sometimes shows the packet re-order, sometimes not (e.g. no re-order on
>>>>> a host, re-order in a container at the same time). In a pcap file
>>>>> re-ordered packets have a correct timestamp - delayed packet had a more
>>>>> earlier timestamp compared to a previous packet:
>>>>>        1.00s packet1
>>>>>        1.20s packet3
>>>>>        1.10s packet2
>>>>>        1.30s packet4
>>>>>
>>>>> There's about 300Mbps of traffic on this NIC, and server is busy
>>>>> (hyper-threading enabled, about 50% overall idle) with its
>>>>> computational application work.
>>>>>
>>>>> NIC is HPE's 4-port 331i adapter - BCM5719, in a default ring and
>>>>> coalescing configuration, 1 TX queue, 4 RX queues.
>>>>>
>>>>> After further investigation, I believe that there are two separate
>>>>> issues in tg3.c driver. Issues can be reproduced with iperf3, and
>>>>> unicast UDP.
>>>>>
>>>>> Here are the details of how I understand this behavior.
>>>>>
>>>>> 1. Packet re-order.
>>>>>
>>>>> Driver calls napi_schedule(&tnapi->napi) when handling the interrupt,
>>>>> however, sometimes it calls napi_schedule(&tp->napi[1].napi), which
>>>>> handles RX queue 0 too:
>>>>>
>>>>>        https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/broadcom/tg3.c#L6802-L7007
>>>>>
>>>>>        static int tg3_rx(struct tg3_napi *tnapi, int budget)
>>>>>        {
>>>>>                struct tg3 *tp = tnapi->tp;
>>>>>
>>>>>                ...
>>>>>
>>>>>                /* Refill RX ring(s). */
>>>>>                if (!tg3_flag(tp, ENABLE_RSS)) {
>>>>>                        ....
>>>>>                } else if (work_mask) {
>>>>>                        ...
>>>>>
>>>>>                        if (tnapi != &tp->napi[1]) {
>>>>>                                tp->rx_refill = true;
>>>>>                                napi_schedule(&tp->napi[1].napi);
>>>>>                        }
>>>>>                }
>>>>>                ...
>>>>>        }
>>>>>
>>>>>    From napi_schedule() code, it should schedure RX 0 traffic handling on
>>>>> a current CPU, which handles queues RX1-3 right now.
>>>>>
>>>>> At least two traffic flows are required - one on RX queue 0, and the
>>>>> other on any other queue (1-3). Re-ordering may happend only on flow
>>>>> from queue 0, the second flow will work fine.
>>>>>
>>>>> No idea how to fix this.
>>>
>>> In the case of RSS the actual rings for RX are from 1 to 4.
>>> The napi of those rings are indeed processing the packets.
>>> The explicit napi_schedule of napi[1] is only re-filling rx BD
>>> producer ring because it is shared with return rings for 1-4.
>>> I tried to repro this but I am not seeing the issue. If you are
>>> receiving packets on RX 0 then the RSS must have been disabled.
>>> Can you please check?
>>>
>>
>> # ethtool -i enp2s0f0
>> driver: tg3
>> version: 3.137
>> firmware-version: 5719-v1.46 NCSI v1.5.18.0
>> expansion-rom-version:
>> bus-info: 0000:02:00.0
>> supports-statistics: yes
>> supports-test: yes
>> supports-eeprom-access: yes
>> supports-register-dump: yes
>> supports-priv-flags: no
>>
>> # ethtool -l enp2s0f0
>> Channel parameters for enp2s0f0:
>> Pre-set maximums:
>> RX:             4
>> TX:             4
>> Other:          0
>> Combined:       0
>> Current hardware settings:
>> RX:             4
>> TX:             1
>> Other:          0
>> Combined:       0
>>
>> # ethtool -x enp2s0f0
>> RX flow hash indirection table for enp2s0f0 with 4 RX ring(s):
>>       0:      0     1     2     3     0     1     2     3
>>       8:      0     1     2     3     0     1     2     3
>>      16:      0     1     2     3     0     1     2     3
>>      24:      0     1     2     3     0     1     2     3
>>      32:      0     1     2     3     0     1     2     3
>>      40:      0     1     2     3     0     1     2     3
>>      48:      0     1     2     3     0     1     2     3
>>      56:      0     1     2     3     0     1     2     3
>>      64:      0     1     2     3     0     1     2     3
>>      72:      0     1     2     3     0     1     2     3
>>      80:      0     1     2     3     0     1     2     3
>>      88:      0     1     2     3     0     1     2     3
>>      96:      0     1     2     3     0     1     2     3
>>     104:      0     1     2     3     0     1     2     3
>>     112:      0     1     2     3     0     1     2     3
>>     120:      0     1     2     3     0     1     2     3
>> RSS hash key:
>> Operation not supported
>> RSS hash function:
>>       toeplitz: on
>>       xor: off
>>       crc32: off
>>
>> In /proc/interrupts there are enp2s0f0-tx-0, enp2s0f0-rx-1,
>> enp2s0f0-rx-2, enp2s0f0-rx-3, enp2s0f0-rx-4 interrupts, all on
>> different CPU cores. Kernel also has "threadirqs" enabled in
>> command line, I didn't check if this parameter affects the issue.
>>
>> Yes, some things start with 0, and others with 1, sorry for a confusion
>> in terminology, what I meant:
>>    - There are 4 RX rings/queues, I counted starting from 0, so: 0..3.
>>      RX0 is the first queue/ring that actually receives the traffic.
>>      RX0 is handled by enp2s0f0-rx-1 interrupt.
>>    - These are related to (tp->napi[i]), but i is in 1..4, so the first
>>      receiving queue relates to tp->napi[1], the second relates to
>>      tp->napi[2], and so on. Correct?
>>
>> Suppose, tg3_rx() is called for tp->napi[2], this function most likely
>> calls napi_gro_receive(&tnapi->napi, skb) to further process packets in
>> tp->napi[2]. And, under some conditions (RSS and work_mask), it calls
>> napi_schedule(&tp->napi[1].napi), which schedules tp->napi[1] work
>> on a currect CPU, which is designated for tp->napi[2], but not for
>> tp->napi[1]. Correct?
>>
>> I don't understand what napi_schedule(&tp->napi[1].napi) does for the
>> NIC or driver, "re-filling rx BD producer ring" sounds important. I
>> suspect something will break badly if I simply remove it without
>> replacing with something more elaborate. I guess along with re-filling
>> rx BD producer ring it also can process incoming packets. Is it possible?
>>
> 
> Yes, napi[1] work may be called on the napi[2]'s CPU but it generally
> won't process
> any rx packets because the producer index of napi[1] has not changed. If the
> producer count did change, then we get a poll from the ISR for napi[1]
> to process
> packets. So it is mostly used to re-fill rx buffers when called
> explicitly. However
> there could be a small window where the prod index is incremented but the ISR
> is not fired yet. It may process some small no of packets. But I don't
> think this
> should lead to a reorder problem.
> 

I tried to reproduce this without the bridge and veth interfaces, and
it seems not to be reproducible that way, so traffic forwarding via a
bridge interface may be necessary. It also does not happen if the
traffic load is low, but a moderate load is enough - e.g. two 100 Mbps
streams with 130-byte packets. It's easier to reproduce with a higher
load.

With about the same setup as in the original message (bridge + veth,
2 network namespaces) and the irqbalance daemon stopped: if the
traffic flows via enp2s0f0-rx-2 and enp2s0f0-rx-4, there is no
re-ordering. enp2s0f0-rx-1 still gets some interrupts, but at a much
lower rate compared to rx-2 and rx-4.

namespace 1:
   # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
   - - - - - - - - - - - - - - - - - - - - - - - - -
   [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
   [  4]   0.00-300.00 sec  6.72 GBytes   192 Mbits/sec  0.008 ms  3805/55508325 (0.0069%)
   [  4] Sent 55508325 datagrams

   iperf Done.

namespace 2:
   # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
   - - - - - - - - - - - - - - - - - - - - - - - - -
   [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
   [  4]   0.00-300.00 sec  6.83 GBytes   196 Mbits/sec  0.005 ms  3873/56414001 (0.0069%)
   [  4] Sent 56414001 datagrams

   iperf Done.


With the same configuration but a different IP address, so that
enp2s0f0-rx-1 is used instead of enp2s0f0-rx-4, there is re-ordering.


namespace 1 (client IP was changed):
   # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
   - - - - - - - - - - - - - - - - - - - - - - - - -
   [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
   [  4]   0.00-300.00 sec  6.32 GBytes   181 Mbits/sec  0.007 ms  8506/52172059 (0.016%)
   [  4] Sent 52172059 datagrams
   [SUM]  0.0-300.0 sec  2452 datagrams received out-of-order

   iperf Done.

namespace 2:
   # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
   - - - - - - - - - - - - - - - - - - - - - - - - -
   [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
   [  4]   0.00-300.00 sec  6.59 GBytes   189 Mbits/sec  0.006 ms  6302/54463973 (0.012%)
   [  4] Sent 54463973 datagrams

   iperf Done.

Swapping the IP addresses between these namespaces also changes which
namespace exhibits the issue - it follows the IP address.


Is there something I could check to confirm whether this behavior is
related to the napi_schedule(&tp->napi[1].napi) call?

-- 
Thanks
Vitalii

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: tg3 RX packet re-order in queue 0 with RSS
  2021-10-28 15:41         ` Vitaly Bursov
@ 2021-10-29  5:04           ` Pavan Chebbi
  2021-10-29 15:45             ` Vitaly Bursov
  0 siblings, 1 reply; 13+ messages in thread
From: Pavan Chebbi @ 2021-10-29  5:04 UTC (permalink / raw)
  To: Vitaly Bursov
  Cc: Siva Reddy Kallam, Prashant Sreedharan, Michael Chan, Linux Netdev List

On Thu, Oct 28, 2021 at 9:11 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>
>
> 28.10.2021 10:33, Pavan Chebbi wrote:
> > On Wed, Oct 27, 2021 at 4:02 PM Vitaly Bursov <vitaly@bursov.com> wrote:
> >>
> >>
> >> 27.10.2021 12:30, Pavan Chebbi wrote:
> >>> On Wed, Sep 22, 2021 at 12:10 PM Siva Reddy Kallam
> >>> <siva.kallam@broadcom.com> wrote:
> >>>>
> >>>> Thank you for reporting this. Pavan(cc'd) from Broadcom looking into this issue.
> >>>> We will provide our feedback very soon on this.
> >>>>
> >>>> On Mon, Sep 20, 2021 at 6:59 PM Vitaly Bursov <vitaly@bursov.com> wrote:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> We found a occassional and random (sometimes happens, sometimes not)
> >>>>> packet re-order when NIC is involved in UDP multicast reception, which
> >>>>> is sensitive to a packet re-order. Network capture with tcpdump
> >>>>> sometimes shows the packet re-order, sometimes not (e.g. no re-order on
> >>>>> a host, re-order in a container at the same time). In a pcap file
> >>>>> re-ordered packets have a correct timestamp - delayed packet had a more
> >>>>> earlier timestamp compared to a previous packet:
> >>>>>        1.00s packet1
> >>>>>        1.20s packet3
> >>>>>        1.10s packet2
> >>>>>        1.30s packet4
> >>>>>
> >>>>> There's about 300Mbps of traffic on this NIC, and server is busy
> >>>>> (hyper-threading enabled, about 50% overall idle) with its
> >>>>> computational application work.
> >>>>>
> >>>>> NIC is HPE's 4-port 331i adapter - BCM5719, in a default ring and
> >>>>> coalescing configuration, 1 TX queue, 4 RX queues.
> >>>>>
> >>>>> After further investigation, I believe that there are two separate
> >>>>> issues in tg3.c driver. Issues can be reproduced with iperf3, and
> >>>>> unicast UDP.
> >>>>>
> >>>>> Here are the details of how I understand this behavior.
> >>>>>
> >>>>> 1. Packet re-order.
> >>>>>
> >>>>> Driver calls napi_schedule(&tnapi->napi) when handling the interrupt,
> >>>>> however, sometimes it calls napi_schedule(&tp->napi[1].napi), which
> >>>>> handles RX queue 0 too:
> >>>>>
> >>>>>        https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/broadcom/tg3.c#L6802-L7007
> >>>>>
> >>>>>        static int tg3_rx(struct tg3_napi *tnapi, int budget)
> >>>>>        {
> >>>>>                struct tg3 *tp = tnapi->tp;
> >>>>>
> >>>>>                ...
> >>>>>
> >>>>>                /* Refill RX ring(s). */
> >>>>>                if (!tg3_flag(tp, ENABLE_RSS)) {
> >>>>>                        ....
> >>>>>                } else if (work_mask) {
> >>>>>                        ...
> >>>>>
> >>>>>                        if (tnapi != &tp->napi[1]) {
> >>>>>                                tp->rx_refill = true;
> >>>>>                                napi_schedule(&tp->napi[1].napi);
> >>>>>                        }
> >>>>>                }
> >>>>>                ...
> >>>>>        }
> >>>>>
> >>>>>    From napi_schedule() code, it should schedure RX 0 traffic handling on
> >>>>> a current CPU, which handles queues RX1-3 right now.
> >>>>>
> >>>>> At least two traffic flows are required - one on RX queue 0, and the
> >>>>> other on any other queue (1-3). Re-ordering may happend only on flow
> >>>>> from queue 0, the second flow will work fine.
> >>>>>
> >>>>> No idea how to fix this.
> >>>
> >>> In the case of RSS the actual rings for RX are from 1 to 4.
> >>> The napi of those rings are indeed processing the packets.
> >>> The explicit napi_schedule of napi[1] is only re-filling rx BD
> >>> producer ring because it is shared with return rings for 1-4.
> >>> I tried to repro this but I am not seeing the issue. If you are
> >>> receiving packets on RX 0 then the RSS must have been disabled.
> >>> Can you please check?
> >>>
> >>
> >> # ethtool -i enp2s0f0
> >> driver: tg3
> >> version: 3.137
> >> firmware-version: 5719-v1.46 NCSI v1.5.18.0
> >> expansion-rom-version:
> >> bus-info: 0000:02:00.0
> >> supports-statistics: yes
> >> supports-test: yes
> >> supports-eeprom-access: yes
> >> supports-register-dump: yes
> >> supports-priv-flags: no
> >>
> >> # ethtool -l enp2s0f0
> >> Channel parameters for enp2s0f0:
> >> Pre-set maximums:
> >> RX:             4
> >> TX:             4
> >> Other:          0
> >> Combined:       0
> >> Current hardware settings:
> >> RX:             4
> >> TX:             1
> >> Other:          0
> >> Combined:       0
> >>
> >> # ethtool -x enp2s0f0
> >> RX flow hash indirection table for enp2s0f0 with 4 RX ring(s):
> >>       0:      0     1     2     3     0     1     2     3
> >>       8:      0     1     2     3     0     1     2     3
> >>      16:      0     1     2     3     0     1     2     3
> >>      24:      0     1     2     3     0     1     2     3
> >>      32:      0     1     2     3     0     1     2     3
> >>      40:      0     1     2     3     0     1     2     3
> >>      48:      0     1     2     3     0     1     2     3
> >>      56:      0     1     2     3     0     1     2     3
> >>      64:      0     1     2     3     0     1     2     3
> >>      72:      0     1     2     3     0     1     2     3
> >>      80:      0     1     2     3     0     1     2     3
> >>      88:      0     1     2     3     0     1     2     3
> >>      96:      0     1     2     3     0     1     2     3
> >>     104:      0     1     2     3     0     1     2     3
> >>     112:      0     1     2     3     0     1     2     3
> >>     120:      0     1     2     3     0     1     2     3
> >> RSS hash key:
> >> Operation not supported
> >> RSS hash function:
> >>       toeplitz: on
> >>       xor: off
> >>       crc32: off
> >>
> >> In /proc/interrupts there are enp2s0f0-tx-0, enp2s0f0-rx-1,
> >> enp2s0f0-rx-2, enp2s0f0-rx-3, enp2s0f0-rx-4 interrupts, all on
> >> different CPU cores. Kernel also has "threadirqs" enabled in
> >> command line, I didn't check if this parameter affects the issue.
> >>
> >> Yes, some things start with 0, and others with 1, sorry for a confusion
> >> in terminology, what I meant:
> >>    - There are 4 RX rings/queues, I counted starting from 0, so: 0..3.
> >>      RX0 is the first queue/ring that actually receives the traffic.
> >>      RX0 is handled by enp2s0f0-rx-1 interrupt.
> >>    - These are related to (tp->napi[i]), but i is in 1..4, so the first
> >>      receiving queue relates to tp->napi[1], the second relates to
> >>      tp->napi[2], and so on. Correct?
> >>
> >> Suppose, tg3_rx() is called for tp->napi[2], this function most likely
> >> calls napi_gro_receive(&tnapi->napi, skb) to further process packets in
> >> tp->napi[2]. And, under some conditions (RSS and work_mask), it calls
> >> napi_schedule(&tp->napi[1].napi), which schedules tp->napi[1] work
> >> on a currect CPU, which is designated for tp->napi[2], but not for
> >> tp->napi[1]. Correct?
> >>
> >> I don't understand what napi_schedule(&tp->napi[1].napi) does for the
> >> NIC or driver, "re-filling rx BD producer ring" sounds important. I
> >> suspect something will break badly if I simply remove it without
> >> replacing with something more elaborate. I guess along with re-filling
> >> rx BD producer ring it also can process incoming packets. Is it possible?
> >>
> >
> > Yes, napi[1] work may be called on the napi[2]'s CPU but it generally
> > won't process
> > any rx packets because the producer index of napi[1] has not changed. If the
> > producer count did change, then we get a poll from the ISR for napi[1]
> > to process
> > packets. So it is mostly used to re-fill rx buffers when called
> > explicitly. However
> > there could be a small window where the prod index is incremented but the ISR
> > is not fired yet. It may process some small no of packets. But I don't
> > think this
> > should lead to a reorder problem.
> >
>
> I tried to reproduce without using bridge and veth interfaces, and it seems
> like it's not reproducible, so traffic forwarding via a bridge interface may
> be necessary. It also does not happen if traffic load is low, but moderate
> load is enough - e.g. two 100 Mbps streams with 130-byte packets. It's easier
> to reproduce with a higher load.
>
> With about the same setup as in an original message (bridge + veth 2
> network namespaces), irqbalance daemon stopped, if traffic flows via
> enp2s0f0-rx-2 and enp2s0f0-rx-4, there's no reordering. enp2s0f0-rx-1
> still gets some interrupts, but at a much lower rate compared to 2 and
> 4.
>
> namespace 1:
>    # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
>    - - - - - - - - - - - - - - - - - - - - - - - - -
>    [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>    [  4]   0.00-300.00 sec  6.72 GBytes   192 Mbits/sec  0.008 ms  3805/55508325 (0.0069%)
>    [  4] Sent 55508325 datagrams
>
>    iperf Done.
>
> namespace 2:
>    # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
>    - - - - - - - - - - - - - - - - - - - - - - - - -
>    [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>    [  4]   0.00-300.00 sec  6.83 GBytes   196 Mbits/sec  0.005 ms  3873/56414001 (0.0069%)
>    [  4] Sent 56414001 datagrams
>
>    iperf Done.
>
>
> With the same configuration but different IP address so that instead of
> enp2s0f0-rx-4 enp2s0f0-rx-1 would be used, there is a reordering.
>
>
> namespace 1 (client IP was changed):
>    # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
>    - - - - - - - - - - - - - - - - - - - - - - - - -
>    [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>    [  4]   0.00-300.00 sec  6.32 GBytes   181 Mbits/sec  0.007 ms  8506/52172059 (0.016%)
>    [  4] Sent 52172059 datagrams
>    [SUM]  0.0-300.0 sec  2452 datagrams received out-of-order
>
>    iperf Done.
>
> namespace 2:
>    # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
>    - - - - - - - - - - - - - - - - - - - - - - - - -
>    [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>    [  4]   0.00-300.00 sec  6.59 GBytes   189 Mbits/sec  0.006 ms  6302/54463973 (0.012%)
>    [  4] Sent 54463973 datagrams
>
>    iperf Done.
>
> Swapping IP addresses in these namespaces also changes the namespace exhibiting the issue,
> it's following the IP address.
>
>
> Is there something I could check to confirm that this behavior is or is not
> related to napi_schedule(&tp->napi[1].napi) call?

In the function tg3_msi_1shot() you could store the CPU assigned to
tnapi[1] (inside struct tg3_napi), and then in tg3_poll_work() you can
add another check after

        if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr)

something like

        if (tnapi == &tp->napi[1] && tnapi->assigned_cpu == smp_processor_id())

and only then execute tg3_rx().

This may stop tnapi[1] from reading RX packets on the current CPU from
which the refill is called; see the sketch below.
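
A minimal, untested sketch of that diagnostic (assigned_cpu would be a
new, purely illustrative field in struct tg3_napi; the CPU check is
applied only to tp->napi[1] so the other rings keep working as before):

     static irqreturn_t tg3_msi_1shot(int irq, void *dev_id)
     {
             struct tg3_napi *tnapi = dev_id;
             struct tg3 *tp = tnapi->tp;

             /* remember which CPU services this vector's interrupt */
             tnapi->assigned_cpu = smp_processor_id();

             prefetch(tnapi->hw_status);
             if (tnapi->rx_rcb)
                     prefetch(&tnapi->rx_rcb[tnapi->rx_rcb_ptr]);

             if (likely(!tg3_irq_sync(tp)))
                     napi_schedule(&tnapi->napi);

             return IRQ_HANDLED;
     }

     /* in tg3_poll_work(): run tg3_rx() for napi[1] only on its own
      * CPU, so a refill-only schedule from another CPU does not pull
      * packets
      */
     if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr &&
         (tnapi != &tp->napi[1] ||
          tnapi->assigned_cpu == smp_processor_id()))
             work_done += tg3_rx(tnapi, budget - work_done);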

>
> --
> Thanks
> Vitalii

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: tg3 RX packet re-order in queue 0 with RSS
  2021-10-29  5:04           ` Pavan Chebbi
@ 2021-10-29 15:45             ` Vitaly Bursov
  2021-11-01  7:06               ` Pavan Chebbi
  0 siblings, 1 reply; 13+ messages in thread
From: Vitaly Bursov @ 2021-10-29 15:45 UTC (permalink / raw)
  To: Pavan Chebbi
  Cc: Siva Reddy Kallam, Prashant Sreedharan, Michael Chan, Linux Netdev List



29.10.2021 08:04, Pavan Chebbi wrote:
> 90On Thu, Oct 28, 2021 at 9:11 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>>
>>
>> 28.10.2021 10:33, Pavan Chebbi wrote:
>>> On Wed, Oct 27, 2021 at 4:02 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>>>>
>>>>
>>>> 27.10.2021 12:30, Pavan Chebbi wrote:
>>>>> On Wed, Sep 22, 2021 at 12:10 PM Siva Reddy Kallam
>>>>> <siva.kallam@broadcom.com> wrote:
>>>>>>
>>>>>> Thank you for reporting this. Pavan(cc'd) from Broadcom looking into this issue.
>>>>>> We will provide our feedback very soon on this.
>>>>>>
>>>>>> On Mon, Sep 20, 2021 at 6:59 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> We found a occassional and random (sometimes happens, sometimes not)
>>>>>>> packet re-order when NIC is involved in UDP multicast reception, which
>>>>>>> is sensitive to a packet re-order. Network capture with tcpdump
>>>>>>> sometimes shows the packet re-order, sometimes not (e.g. no re-order on
>>>>>>> a host, re-order in a container at the same time). In a pcap file
>>>>>>> re-ordered packets have a correct timestamp - delayed packet had a more
>>>>>>> earlier timestamp compared to a previous packet:
>>>>>>>         1.00s packet1
>>>>>>>         1.20s packet3
>>>>>>>         1.10s packet2
>>>>>>>         1.30s packet4
>>>>>>>
>>>>>>> There's about 300Mbps of traffic on this NIC, and server is busy
>>>>>>> (hyper-threading enabled, about 50% overall idle) with its
>>>>>>> computational application work.
>>>>>>>
>>>>>>> NIC is HPE's 4-port 331i adapter - BCM5719, in a default ring and
>>>>>>> coalescing configuration, 1 TX queue, 4 RX queues.
>>>>>>>
>>>>>>> After further investigation, I believe that there are two separate
>>>>>>> issues in tg3.c driver. Issues can be reproduced with iperf3, and
>>>>>>> unicast UDP.
>>>>>>>
>>>>>>> Here are the details of how I understand this behavior.
>>>>>>>
>>>>>>> 1. Packet re-order.
>>>>>>>
>>>>>>> Driver calls napi_schedule(&tnapi->napi) when handling the interrupt,
>>>>>>> however, sometimes it calls napi_schedule(&tp->napi[1].napi), which
>>>>>>> handles RX queue 0 too:
>>>>>>>
>>>>>>>         https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/broadcom/tg3.c#L6802-L7007
>>>>>>>
>>>>>>>         static int tg3_rx(struct tg3_napi *tnapi, int budget)
>>>>>>>         {
>>>>>>>                 struct tg3 *tp = tnapi->tp;
>>>>>>>
>>>>>>>                 ...
>>>>>>>
>>>>>>>                 /* Refill RX ring(s). */
>>>>>>>                 if (!tg3_flag(tp, ENABLE_RSS)) {
>>>>>>>                         ....
>>>>>>>                 } else if (work_mask) {
>>>>>>>                         ...
>>>>>>>
>>>>>>>                         if (tnapi != &tp->napi[1]) {
>>>>>>>                                 tp->rx_refill = true;
>>>>>>>                                 napi_schedule(&tp->napi[1].napi);
>>>>>>>                         }
>>>>>>>                 }
>>>>>>>                 ...
>>>>>>>         }
>>>>>>>
>>>>>>>     From napi_schedule() code, it should schedure RX 0 traffic handling on
>>>>>>> a current CPU, which handles queues RX1-3 right now.
>>>>>>>
>>>>>>> At least two traffic flows are required - one on RX queue 0, and the
>>>>>>> other on any other queue (1-3). Re-ordering may happend only on flow
>>>>>>> from queue 0, the second flow will work fine.
>>>>>>>
>>>>>>> No idea how to fix this.
>>>>>
>>>>> In the case of RSS the actual rings for RX are from 1 to 4.
>>>>> The napi of those rings are indeed processing the packets.
>>>>> The explicit napi_schedule of napi[1] is only re-filling rx BD
>>>>> producer ring because it is shared with return rings for 1-4.
>>>>> I tried to repro this but I am not seeing the issue. If you are
>>>>> receiving packets on RX 0 then the RSS must have been disabled.
>>>>> Can you please check?
>>>>>
>>>>
>>>> # ethtool -i enp2s0f0
>>>> driver: tg3
>>>> version: 3.137
>>>> firmware-version: 5719-v1.46 NCSI v1.5.18.0
>>>> expansion-rom-version:
>>>> bus-info: 0000:02:00.0
>>>> supports-statistics: yes
>>>> supports-test: yes
>>>> supports-eeprom-access: yes
>>>> supports-register-dump: yes
>>>> supports-priv-flags: no
>>>>
>>>> # ethtool -l enp2s0f0
>>>> Channel parameters for enp2s0f0:
>>>> Pre-set maximums:
>>>> RX:             4
>>>> TX:             4
>>>> Other:          0
>>>> Combined:       0
>>>> Current hardware settings:
>>>> RX:             4
>>>> TX:             1
>>>> Other:          0
>>>> Combined:       0
>>>>
>>>> # ethtool -x enp2s0f0
>>>> RX flow hash indirection table for enp2s0f0 with 4 RX ring(s):
>>>>        0:      0     1     2     3     0     1     2     3
>>>>        8:      0     1     2     3     0     1     2     3
>>>>       16:      0     1     2     3     0     1     2     3
>>>>       24:      0     1     2     3     0     1     2     3
>>>>       32:      0     1     2     3     0     1     2     3
>>>>       40:      0     1     2     3     0     1     2     3
>>>>       48:      0     1     2     3     0     1     2     3
>>>>       56:      0     1     2     3     0     1     2     3
>>>>       64:      0     1     2     3     0     1     2     3
>>>>       72:      0     1     2     3     0     1     2     3
>>>>       80:      0     1     2     3     0     1     2     3
>>>>       88:      0     1     2     3     0     1     2     3
>>>>       96:      0     1     2     3     0     1     2     3
>>>>      104:      0     1     2     3     0     1     2     3
>>>>      112:      0     1     2     3     0     1     2     3
>>>>      120:      0     1     2     3     0     1     2     3
>>>> RSS hash key:
>>>> Operation not supported
>>>> RSS hash function:
>>>>        toeplitz: on
>>>>        xor: off
>>>>        crc32: off
>>>>
>>>> In /proc/interrupts there are enp2s0f0-tx-0, enp2s0f0-rx-1,
>>>> enp2s0f0-rx-2, enp2s0f0-rx-3, enp2s0f0-rx-4 interrupts, all on
>>>> different CPU cores. Kernel also has "threadirqs" enabled in
>>>> command line, I didn't check if this parameter affects the issue.
>>>>
>>>> Yes, some things start with 0, and others with 1, sorry for a confusion
>>>> in terminology, what I meant:
>>>>     - There are 4 RX rings/queues, I counted starting from 0, so: 0..3.
>>>>       RX0 is the first queue/ring that actually receives the traffic.
>>>>       RX0 is handled by enp2s0f0-rx-1 interrupt.
>>>>     - These are related to (tp->napi[i]), but i is in 1..4, so the first
>>>>       receiving queue relates to tp->napi[1], the second relates to
>>>>       tp->napi[2], and so on. Correct?
>>>>
>>>> Suppose, tg3_rx() is called for tp->napi[2], this function most likely
>>>> calls napi_gro_receive(&tnapi->napi, skb) to further process packets in
>>>> tp->napi[2]. And, under some conditions (RSS and work_mask), it calls
>>>> napi_schedule(&tp->napi[1].napi), which schedules tp->napi[1] work
>>>> on a currect CPU, which is designated for tp->napi[2], but not for
>>>> tp->napi[1]. Correct?
>>>>
>>>> I don't understand what napi_schedule(&tp->napi[1].napi) does for the
>>>> NIC or driver, "re-filling rx BD producer ring" sounds important. I
>>>> suspect something will break badly if I simply remove it without
>>>> replacing with something more elaborate. I guess along with re-filling
>>>> rx BD producer ring it also can process incoming packets. Is it possible?
>>>>
>>>
>>> Yes, napi[1] work may be called on the napi[2]'s CPU but it generally
>>> won't process
>>> any rx packets because the producer index of napi[1] has not changed. If the
>>> producer count did change, then we get a poll from the ISR for napi[1]
>>> to process
>>> packets. So it is mostly used to re-fill rx buffers when called
>>> explicitly. However
>>> there could be a small window where the prod index is incremented but the ISR
>>> is not fired yet. It may process some small no of packets. But I don't
>>> think this
>>> should lead to a reorder problem.
>>>
>>
>> I tried to reproduce without using bridge and veth interfaces, and it seems
>> like it's not reproducible, so traffic forwarding via a bridge interface may
>> be necessary. It also does not happen if traffic load is low, but moderate
>> load is enough - e.g. two 100 Mbps streams with 130-byte packets. It's easier
>> to reproduce with a higher load.
>>
>> With about the same setup as in an original message (bridge + veth 2
>> network namespaces), irqbalance daemon stopped, if traffic flows via
>> enp2s0f0-rx-2 and enp2s0f0-rx-4, there's no reordering. enp2s0f0-rx-1
>> still gets some interrupts, but at a much lower rate compared to 2 and
>> 4.
>>
>> namespace 1:
>>     # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
>>     - - - - - - - - - - - - - - - - - - - - - - - - -
>>     [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>>     [  4]   0.00-300.00 sec  6.72 GBytes   192 Mbits/sec  0.008 ms  3805/55508325 (0.0069%)
>>     [  4] Sent 55508325 datagrams
>>
>>     iperf Done.
>>
>> namespace 2:
>>     # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
>>     - - - - - - - - - - - - - - - - - - - - - - - - -
>>     [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>>     [  4]   0.00-300.00 sec  6.83 GBytes   196 Mbits/sec  0.005 ms  3873/56414001 (0.0069%)
>>     [  4] Sent 56414001 datagrams
>>
>>     iperf Done.
>>
>>
>> With the same configuration but different IP address so that instead of
>> enp2s0f0-rx-4 enp2s0f0-rx-1 would be used, there is a reordering.
>>
>>
>> namespace 1 (client IP was changed):
>>     # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
>>     - - - - - - - - - - - - - - - - - - - - - - - - -
>>     [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>>     [  4]   0.00-300.00 sec  6.32 GBytes   181 Mbits/sec  0.007 ms  8506/52172059 (0.016%)
>>     [  4] Sent 52172059 datagrams
>>     [SUM]  0.0-300.0 sec  2452 datagrams received out-of-order
>>
>>     iperf Done.
>>
>> namespace 2:
>>     # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
>>     - - - - - - - - - - - - - - - - - - - - - - - - -
>>     [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>>     [  4]   0.00-300.00 sec  6.59 GBytes   189 Mbits/sec  0.006 ms  6302/54463973 (0.012%)
>>     [  4] Sent 54463973 datagrams
>>
>>     iperf Done.
>>
>> Swapping IP addresses in these namespaces also changes the namespace exhibiting the issue,
>> it's following the IP address.
>>
>>
>> Is there something I could check to confirm that this behavior is or is not
>> related to napi_schedule(&tp->napi[1].napi) call?
> 
> in the function tg3_msi_1shot() you could store the cpu assigned to
> tnapi1 (inside the struct tg3_napi)
> and then in tg3_poll_work() you can add another check after
>          if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr)
> something like
> if (tnapi == &tp->napi[1] && tnapi->assigned_cpu == smp_processor_id())
> only then execute tg3_rx()
> 
> This may stop tnapi 1 from reading rx pkts on the current CPU from
> which refill is called.
> 

That didn't work for me - perhaps I did something wrong. If tg3_rx() is not
called, there's an infinite loop, and even after I added "work_done = budget;"
it still doesn't work - traffic does not flow.

I added logging instead:

+		if (tnapi->assigned_cpu != smp_processor_id())
+			net_dbg_ratelimited("tg3 napi %ld cpu %d %d",
+			    tnapi - tp->napi, tnapi->assigned_cpu, smp_processor_id());
  		napi_gro_receive(&tnapi->napi, skb);

And with two iperf3 streams, there's a lot of messages:
[ 3242.007898] tg3 napi 1 cpu 10 48
[ 3242.007899] tg3 napi 1 cpu 10 48
[ 3242.007911] tg3 napi 1 cpu 10 48
[ 3242.007913] tg3 napi 1 cpu 10 48
[ 3247.011898] net_ratelimit: 546560 callbacks suppressed
[ 3247.011900] tg3 napi 1 cpu 10 48
[ 3247.011902] tg3 napi 1 cpu 10 48
[ 3247.011904] tg3 napi 1 cpu 10 48
[ 3247.011905] tg3 napi 1 cpu 10 48
[ 3247.011906] tg3 napi 1 cpu 10 48
[ 3247.011928] tg3 napi 1 cpu 10 48
[ 3247.011929] tg3 napi 1 cpu 10 48
[ 3247.011931] tg3 napi 1 cpu 10 48
[ 3247.011932] tg3 napi 1 cpu 10 48
[ 3247.011933] tg3 napi 1 cpu 10 48
[ 3252.015885] net_ratelimit: 539574 callbacks suppressed
[ 3252.015888] tg3 napi 1 cpu 10 48
[ 3252.015889] tg3 napi 1 cpu 10 48
[ 3252.015891] tg3 napi 1 cpu 10 48
[ 3252.015892] tg3 napi 1 cpu 10 48

cpu 10, enp2s0f0-rx-1
# cat /proc/irq/106/effective_affinity
00000000,00000000,00000400

cpu 48, enp2s0f0-rx-4
# cat /proc/irq/109/effective_affinity
00000000,00010000,00000000

Among all printed messages, there's only "napi 1".

There's also a difference in interrupt thread's CPU usage:
201570 root     -51   0       0      0      0 R  64.3  0.0   1:46.91 irq/109-enp2s0f
204687 root      20   0    9628   2084   1976 R  37.5  0.0   1:04.74 iperf3
205354 root      20   0    9628   2060   1948 R  36.7  0.0   1:01.06 iperf3
201567 root     -51   0       0      0      0 R  23.3  0.0   0:44.45 irq/106-enp2s0f

The sender is CPU-bound, so there's no overload on the RX side with tg3.

-- 
Thanks
Vitalii


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: tg3 RX packet re-order in queue 0 with RSS
  2021-10-29 15:45             ` Vitaly Bursov
@ 2021-11-01  7:06               ` Pavan Chebbi
  2021-11-01  8:20                 ` Vitaly Bursov
  0 siblings, 1 reply; 13+ messages in thread
From: Pavan Chebbi @ 2021-11-01  7:06 UTC (permalink / raw)
  To: Vitaly Bursov
  Cc: Siva Reddy Kallam, Prashant Sreedharan, Michael Chan, Linux Netdev List

On Fri, Oct 29, 2021 at 9:15 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>
>
>
> 29.10.2021 08:04, Pavan Chebbi wrote:
> > On Thu, Oct 28, 2021 at 9:11 PM Vitaly Bursov <vitaly@bursov.com> wrote:
> >>
> >>
> >> 28.10.2021 10:33, Pavan Chebbi wrote:
> >>> On Wed, Oct 27, 2021 at 4:02 PM Vitaly Bursov <vitaly@bursov.com> wrote:
> >>>>
> >>>>
> >>>> 27.10.2021 12:30, Pavan Chebbi wrote:
> >>>>> On Wed, Sep 22, 2021 at 12:10 PM Siva Reddy Kallam
> >>>>> <siva.kallam@broadcom.com> wrote:
> >>>>>>
> >>>>>> Thank you for reporting this. Pavan(cc'd) from Broadcom looking into this issue.
> >>>>>> We will provide our feedback very soon on this.
> >>>>>>
> >>>>>> On Mon, Sep 20, 2021 at 6:59 PM Vitaly Bursov <vitaly@bursov.com> wrote:
> >>>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> We found a occassional and random (sometimes happens, sometimes not)
> >>>>>>> packet re-order when NIC is involved in UDP multicast reception, which
> >>>>>>> is sensitive to a packet re-order. Network capture with tcpdump
> >>>>>>> sometimes shows the packet re-order, sometimes not (e.g. no re-order on
> >>>>>>> a host, re-order in a container at the same time). In a pcap file
> >>>>>>> re-ordered packets have a correct timestamp - delayed packet had a more
> >>>>>>> earlier timestamp compared to a previous packet:
> >>>>>>>         1.00s packet1
> >>>>>>>         1.20s packet3
> >>>>>>>         1.10s packet2
> >>>>>>>         1.30s packet4
> >>>>>>>
> >>>>>>> There's about 300Mbps of traffic on this NIC, and server is busy
> >>>>>>> (hyper-threading enabled, about 50% overall idle) with its
> >>>>>>> computational application work.
> >>>>>>>
> >>>>>>> NIC is HPE's 4-port 331i adapter - BCM5719, in a default ring and
> >>>>>>> coalescing configuration, 1 TX queue, 4 RX queues.
> >>>>>>>
> >>>>>>> After further investigation, I believe that there are two separate
> >>>>>>> issues in tg3.c driver. Issues can be reproduced with iperf3, and
> >>>>>>> unicast UDP.
> >>>>>>>
> >>>>>>> Here are the details of how I understand this behavior.
> >>>>>>>
> >>>>>>> 1. Packet re-order.
> >>>>>>>
> >>>>>>> Driver calls napi_schedule(&tnapi->napi) when handling the interrupt,
> >>>>>>> however, sometimes it calls napi_schedule(&tp->napi[1].napi), which
> >>>>>>> handles RX queue 0 too:
> >>>>>>>
> >>>>>>>         https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/broadcom/tg3.c#L6802-L7007
> >>>>>>>
> >>>>>>>         static int tg3_rx(struct tg3_napi *tnapi, int budget)
> >>>>>>>         {
> >>>>>>>                 struct tg3 *tp = tnapi->tp;
> >>>>>>>
> >>>>>>>                 ...
> >>>>>>>
> >>>>>>>                 /* Refill RX ring(s). */
> >>>>>>>                 if (!tg3_flag(tp, ENABLE_RSS)) {
> >>>>>>>                         ....
> >>>>>>>                 } else if (work_mask) {
> >>>>>>>                         ...
> >>>>>>>
> >>>>>>>                         if (tnapi != &tp->napi[1]) {
> >>>>>>>                                 tp->rx_refill = true;
> >>>>>>>                                 napi_schedule(&tp->napi[1].napi);
> >>>>>>>                         }
> >>>>>>>                 }
> >>>>>>>                 ...
> >>>>>>>         }
> >>>>>>>
> >>>>>>>     From napi_schedule() code, it should schedure RX 0 traffic handling on
> >>>>>>> a current CPU, which handles queues RX1-3 right now.
> >>>>>>>
> >>>>>>> At least two traffic flows are required - one on RX queue 0, and the
> >>>>>>> other on any other queue (1-3). Re-ordering may happend only on flow
> >>>>>>> from queue 0, the second flow will work fine.
> >>>>>>>
> >>>>>>> No idea how to fix this.
> >>>>>
> >>>>> In the case of RSS the actual rings for RX are from 1 to 4.
> >>>>> The napi of those rings are indeed processing the packets.
> >>>>> The explicit napi_schedule of napi[1] is only re-filling rx BD
> >>>>> producer ring because it is shared with return rings for 1-4.
> >>>>> I tried to repro this but I am not seeing the issue. If you are
> >>>>> receiving packets on RX 0 then the RSS must have been disabled.
> >>>>> Can you please check?
> >>>>>
> >>>>
> >>>> # ethtool -i enp2s0f0
> >>>> driver: tg3
> >>>> version: 3.137
> >>>> firmware-version: 5719-v1.46 NCSI v1.5.18.0
> >>>> expansion-rom-version:
> >>>> bus-info: 0000:02:00.0
> >>>> supports-statistics: yes
> >>>> supports-test: yes
> >>>> supports-eeprom-access: yes
> >>>> supports-register-dump: yes
> >>>> supports-priv-flags: no
> >>>>
> >>>> # ethtool -l enp2s0f0
> >>>> Channel parameters for enp2s0f0:
> >>>> Pre-set maximums:
> >>>> RX:             4
> >>>> TX:             4
> >>>> Other:          0
> >>>> Combined:       0
> >>>> Current hardware settings:
> >>>> RX:             4
> >>>> TX:             1
> >>>> Other:          0
> >>>> Combined:       0
> >>>>
> >>>> # ethtool -x enp2s0f0
> >>>> RX flow hash indirection table for enp2s0f0 with 4 RX ring(s):
> >>>>        0:      0     1     2     3     0     1     2     3
> >>>>        8:      0     1     2     3     0     1     2     3
> >>>>       16:      0     1     2     3     0     1     2     3
> >>>>       24:      0     1     2     3     0     1     2     3
> >>>>       32:      0     1     2     3     0     1     2     3
> >>>>       40:      0     1     2     3     0     1     2     3
> >>>>       48:      0     1     2     3     0     1     2     3
> >>>>       56:      0     1     2     3     0     1     2     3
> >>>>       64:      0     1     2     3     0     1     2     3
> >>>>       72:      0     1     2     3     0     1     2     3
> >>>>       80:      0     1     2     3     0     1     2     3
> >>>>       88:      0     1     2     3     0     1     2     3
> >>>>       96:      0     1     2     3     0     1     2     3
> >>>>      104:      0     1     2     3     0     1     2     3
> >>>>      112:      0     1     2     3     0     1     2     3
> >>>>      120:      0     1     2     3     0     1     2     3
> >>>> RSS hash key:
> >>>> Operation not supported
> >>>> RSS hash function:
> >>>>        toeplitz: on
> >>>>        xor: off
> >>>>        crc32: off
> >>>>
> >>>> In /proc/interrupts there are enp2s0f0-tx-0, enp2s0f0-rx-1,
> >>>> enp2s0f0-rx-2, enp2s0f0-rx-3, enp2s0f0-rx-4 interrupts, all on
> >>>> different CPU cores. Kernel also has "threadirqs" enabled in
> >>>> command line, I didn't check if this parameter affects the issue.
> >>>>
> >>>> Yes, some things start with 0, and others with 1, sorry for a confusion
> >>>> in terminology, what I meant:
> >>>>     - There are 4 RX rings/queues, I counted starting from 0, so: 0..3.
> >>>>       RX0 is the first queue/ring that actually receives the traffic.
> >>>>       RX0 is handled by enp2s0f0-rx-1 interrupt.
> >>>>     - These are related to (tp->napi[i]), but i is in 1..4, so the first
> >>>>       receiving queue relates to tp->napi[1], the second relates to
> >>>>       tp->napi[2], and so on. Correct?
> >>>>
> >>>> Suppose, tg3_rx() is called for tp->napi[2], this function most likely
> >>>> calls napi_gro_receive(&tnapi->napi, skb) to further process packets in
> >>>> tp->napi[2]. And, under some conditions (RSS and work_mask), it calls
> >>>> napi_schedule(&tp->napi[1].napi), which schedules tp->napi[1] work
> >>>> on a currect CPU, which is designated for tp->napi[2], but not for
> >>>> tp->napi[1]. Correct?
> >>>>
> >>>> I don't understand what napi_schedule(&tp->napi[1].napi) does for the
> >>>> NIC or driver, "re-filling rx BD producer ring" sounds important. I
> >>>> suspect something will break badly if I simply remove it without
> >>>> replacing with something more elaborate. I guess along with re-filling
> >>>> rx BD producer ring it also can process incoming packets. Is it possible?
> >>>>
> >>>
> >>> Yes, napi[1] work may be called on the napi[2]'s CPU but it generally
> >>> won't process
> >>> any rx packets because the producer index of napi[1] has not changed. If the
> >>> producer count did change, then we get a poll from the ISR for napi[1]
> >>> to process
> >>> packets. So it is mostly used to re-fill rx buffers when called
> >>> explicitly. However
> >>> there could be a small window where the prod index is incremented but the ISR
> >>> is not fired yet. It may process some small no of packets. But I don't
> >>> think this
> >>> should lead to a reorder problem.
> >>>
> >>
> >> I tried to reproduce without using bridge and veth interfaces, and it seems
> >> like it's not reproducible, so traffic forwarding via a bridge interface may
> >> be necessary. It also does not happen if traffic load is low, but moderate
> >> load is enough - e.g. two 100 Mbps streams with 130-byte packets. It's easier
> >> to reproduce with a higher load.
> >>
> >> With about the same setup as in an original message (bridge + veth 2
> >> network namespaces), irqbalance daemon stopped, if traffic flows via
> >> enp2s0f0-rx-2 and enp2s0f0-rx-4, there's no reordering. enp2s0f0-rx-1
> >> still gets some interrupts, but at a much lower rate compared to 2 and
> >> 4.
> >>
> >> namespace 1:
> >>     # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
> >>     - - - - - - - - - - - - - - - - - - - - - - - - -
> >>     [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
> >>     [  4]   0.00-300.00 sec  6.72 GBytes   192 Mbits/sec  0.008 ms  3805/55508325 (0.0069%)
> >>     [  4] Sent 55508325 datagrams
> >>
> >>     iperf Done.
> >>
> >> namespace 2:
> >>     # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
> >>     - - - - - - - - - - - - - - - - - - - - - - - - -
> >>     [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
> >>     [  4]   0.00-300.00 sec  6.83 GBytes   196 Mbits/sec  0.005 ms  3873/56414001 (0.0069%)
> >>     [  4] Sent 56414001 datagrams
> >>
> >>     iperf Done.
> >>
> >>
> >> With the same configuration but different IP address so that instead of
> >> enp2s0f0-rx-4 enp2s0f0-rx-1 would be used, there is a reordering.
> >>
> >>
> >> namespace 1 (client IP was changed):
> >>     # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
> >>     - - - - - - - - - - - - - - - - - - - - - - - - -
> >>     [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
> >>     [  4]   0.00-300.00 sec  6.32 GBytes   181 Mbits/sec  0.007 ms  8506/52172059 (0.016%)
> >>     [  4] Sent 52172059 datagrams
> >>     [SUM]  0.0-300.0 sec  2452 datagrams received out-of-order
> >>
> >>     iperf Done.
> >>
> >> namespace 2:
> >>     # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
> >>     - - - - - - - - - - - - - - - - - - - - - - - - -
> >>     [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
> >>     [  4]   0.00-300.00 sec  6.59 GBytes   189 Mbits/sec  0.006 ms  6302/54463973 (0.012%)
> >>     [  4] Sent 54463973 datagrams
> >>
> >>     iperf Done.
> >>
> >> Swapping IP addresses in these namespaces also changes the namespace exhibiting the issue,
> >> it's following the IP address.
> >>
> >>
> >> Is there something I could check to confirm that this behavior is or is not
> >> related to napi_schedule(&tp->napi[1].napi) call?
> >
> > in the function tg3_msi_1shot() you could store the cpu assigned to
> > tnapi1 (inside the struct tg3_napi)
> > and then in tg3_poll_work() you can add another check after
> >          if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr)
> > something like
> > if (tnapi == &tp->napi[1] && tnapi->assigned_cpu == smp_processor_id())
> > only then execute tg3_rx()
> >
> > This may stop tnapi 1 from reading rx pkts on the current CPU from
> > which refill is called.
> >
>
> Didn't work for me, perhaps I did something wrong - if tg3_rx() is not called,
> there's an infinite loop, and after I added "work_done = budget;", it still doesn't
> work - traffic does not flow.
>

I think the easiest way is to modify the tg3_rx() calling condition
like below, inside tg3_poll_work():

if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr) {
        if (tnapi != &tp->napi[1] ||
            (tnapi == &tp->napi[1] && !tp->rx_refill)) {
                work_done += tg3_rx(tnapi, budget - work_done);
        }
}

This will prevent reading rx packets when napi[1] is scheduled only for refill.
Can you see if this works?
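As a purely cosmetic aside: since the outer disjunct already covers every
ring other than napi[1], the same check can be written a little more
compactly with identical behaviour:

        if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr) {
                if (tnapi != &tp->napi[1] || !tp->rx_refill)
                        work_done += tg3_rx(tnapi, budget - work_done);
        }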

> I added logging instead:
>
> +               if (tnapi->assigned_cpu != smp_processor_id())
> +                       net_dbg_ratelimited("tg3 napi %ld cpu %d %d",
> +                           tnapi - tp->napi, tnapi->assigned_cpu, smp_processor_id());
>                 napi_gro_receive(&tnapi->napi, skb);
>
> And with two iperf3 streams, there's a lot of messages:
> [ 3242.007898] tg3 napi 1 cpu 10 48
> [ 3242.007899] tg3 napi 1 cpu 10 48
> [ 3242.007911] tg3 napi 1 cpu 10 48
> [ 3242.007913] tg3 napi 1 cpu 10 48
> [ 3247.011898] net_ratelimit: 546560 callbacks suppressed
> [ 3247.011900] tg3 napi 1 cpu 10 48
> [ 3247.011902] tg3 napi 1 cpu 10 48
> [ 3247.011904] tg3 napi 1 cpu 10 48
> [ 3247.011905] tg3 napi 1 cpu 10 48
> [ 3247.011906] tg3 napi 1 cpu 10 48
> [ 3247.011928] tg3 napi 1 cpu 10 48
> [ 3247.011929] tg3 napi 1 cpu 10 48
> [ 3247.011931] tg3 napi 1 cpu 10 48
> [ 3247.011932] tg3 napi 1 cpu 10 48
> [ 3247.011933] tg3 napi 1 cpu 10 48
> [ 3252.015885] net_ratelimit: 539574 callbacks suppressed
> [ 3252.015888] tg3 napi 1 cpu 10 48
> [ 3252.015889] tg3 napi 1 cpu 10 48
> [ 3252.015891] tg3 napi 1 cpu 10 48
> [ 3252.015892] tg3 napi 1 cpu 10 48
>
> cpu 10, enp2s0f0-rx-1
> # cat /proc/irq/106/effective_affinity
> 00000000,00000000,00000400
>
> cpu 48, enp2s0f0-rx-4
> # cat /proc/irq/109/effective_affinity
> 00000000,00010000,00000000
>
> Among all printed messages, there's only "napi 1".
>
> There's also a difference in interrupt thread's CPU usage:
> 201570 root     -51   0       0      0      0 R  64.3  0.0   1:46.91 irq/109-enp2s0f
> 204687 root      20   0    9628   2084   1976 R  37.5  0.0   1:04.74 iperf3
> 205354 root      20   0    9628   2060   1948 R  36.7  0.0   1:01.06 iperf3
> 201567 root     -51   0       0      0      0 R  23.3  0.0   0:44.45 irq/106-enp2s0f
>
> The sender is CPU-bound, so there's no overload on RX side with tg3
>
> --
> Thanks
> Vitalii
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: tg3 RX packet re-order in queue 0 with RSS
  2021-11-01  7:06               ` Pavan Chebbi
@ 2021-11-01  8:20                 ` Vitaly Bursov
  2021-11-01  9:10                   ` Pavan Chebbi
  0 siblings, 1 reply; 13+ messages in thread
From: Vitaly Bursov @ 2021-11-01  8:20 UTC (permalink / raw)
  To: Pavan Chebbi
  Cc: Siva Reddy Kallam, Prashant Sreedharan, Michael Chan, Linux Netdev List



01.11.2021 09:06, Pavan Chebbi wrote:
> On Fri, Oct 29, 2021 at 9:15 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>>
>>
>>
>> 29.10.2021 08:04, Pavan Chebbi wrote:
>>> On Thu, Oct 28, 2021 at 9:11 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>>>>
>>>>
>>>> 28.10.2021 10:33, Pavan Chebbi wrote:
>>>>> On Wed, Oct 27, 2021 at 4:02 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>>>>>>
>>>>>>
>>>>>> 27.10.2021 12:30, Pavan Chebbi wrote:
>>>>>>> On Wed, Sep 22, 2021 at 12:10 PM Siva Reddy Kallam
>>>>>>> <siva.kallam@broadcom.com> wrote:
>>>>>>>>
>>>>>>>> Thank you for reporting this. Pavan(cc'd) from Broadcom looking into this issue.
>>>>>>>> We will provide our feedback very soon on this.
>>>>>>>>
>>>>>>>> On Mon, Sep 20, 2021 at 6:59 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> We found a occassional and random (sometimes happens, sometimes not)
>>>>>>>>> packet re-order when NIC is involved in UDP multicast reception, which
>>>>>>>>> is sensitive to a packet re-order. Network capture with tcpdump
>>>>>>>>> sometimes shows the packet re-order, sometimes not (e.g. no re-order on
>>>>>>>>> a host, re-order in a container at the same time). In a pcap file
>>>>>>>>> re-ordered packets have a correct timestamp - delayed packet had a more
>>>>>>>>> earlier timestamp compared to a previous packet:
>>>>>>>>>          1.00s packet1
>>>>>>>>>          1.20s packet3
>>>>>>>>>          1.10s packet2
>>>>>>>>>          1.30s packet4
>>>>>>>>>
>>>>>>>>> There's about 300Mbps of traffic on this NIC, and server is busy
>>>>>>>>> (hyper-threading enabled, about 50% overall idle) with its
>>>>>>>>> computational application work.
>>>>>>>>>
>>>>>>>>> NIC is HPE's 4-port 331i adapter - BCM5719, in a default ring and
>>>>>>>>> coalescing configuration, 1 TX queue, 4 RX queues.
>>>>>>>>>
>>>>>>>>> After further investigation, I believe that there are two separate
>>>>>>>>> issues in tg3.c driver. Issues can be reproduced with iperf3, and
>>>>>>>>> unicast UDP.
>>>>>>>>>
>>>>>>>>> Here are the details of how I understand this behavior.
>>>>>>>>>
>>>>>>>>> 1. Packet re-order.
>>>>>>>>>
>>>>>>>>> Driver calls napi_schedule(&tnapi->napi) when handling the interrupt,
>>>>>>>>> however, sometimes it calls napi_schedule(&tp->napi[1].napi), which
>>>>>>>>> handles RX queue 0 too:
>>>>>>>>>
>>>>>>>>>          https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/broadcom/tg3.c#L6802-L7007
>>>>>>>>>
>>>>>>>>>          static int tg3_rx(struct tg3_napi *tnapi, int budget)
>>>>>>>>>          {
>>>>>>>>>                  struct tg3 *tp = tnapi->tp;
>>>>>>>>>
>>>>>>>>>                  ...
>>>>>>>>>
>>>>>>>>>                  /* Refill RX ring(s). */
>>>>>>>>>                  if (!tg3_flag(tp, ENABLE_RSS)) {
>>>>>>>>>                          ....
>>>>>>>>>                  } else if (work_mask) {
>>>>>>>>>                          ...
>>>>>>>>>
>>>>>>>>>                          if (tnapi != &tp->napi[1]) {
>>>>>>>>>                                  tp->rx_refill = true;
>>>>>>>>>                                  napi_schedule(&tp->napi[1].napi);
>>>>>>>>>                          }
>>>>>>>>>                  }
>>>>>>>>>                  ...
>>>>>>>>>          }
>>>>>>>>>
>>>>>>>>>      From napi_schedule() code, it should schedure RX 0 traffic handling on
>>>>>>>>> a current CPU, which handles queues RX1-3 right now.
>>>>>>>>>
>>>>>>>>> At least two traffic flows are required - one on RX queue 0, and the
>>>>>>>>> other on any other queue (1-3). Re-ordering may happend only on flow
>>>>>>>>> from queue 0, the second flow will work fine.
>>>>>>>>>
>>>>>>>>> No idea how to fix this.
>>>>>>>
>>>>>>> In the case of RSS the actual rings for RX are from 1 to 4.
>>>>>>> The napi of those rings are indeed processing the packets.
>>>>>>> The explicit napi_schedule of napi[1] is only re-filling rx BD
>>>>>>> producer ring because it is shared with return rings for 1-4.
>>>>>>> I tried to repro this but I am not seeing the issue. If you are
>>>>>>> receiving packets on RX 0 then the RSS must have been disabled.
>>>>>>> Can you please check?
>>>>>>>
>>>>>>
>>>>>> # ethtool -i enp2s0f0
>>>>>> driver: tg3
>>>>>> version: 3.137
>>>>>> firmware-version: 5719-v1.46 NCSI v1.5.18.0
>>>>>> expansion-rom-version:
>>>>>> bus-info: 0000:02:00.0
>>>>>> supports-statistics: yes
>>>>>> supports-test: yes
>>>>>> supports-eeprom-access: yes
>>>>>> supports-register-dump: yes
>>>>>> supports-priv-flags: no
>>>>>>
>>>>>> # ethtool -l enp2s0f0
>>>>>> Channel parameters for enp2s0f0:
>>>>>> Pre-set maximums:
>>>>>> RX:             4
>>>>>> TX:             4
>>>>>> Other:          0
>>>>>> Combined:       0
>>>>>> Current hardware settings:
>>>>>> RX:             4
>>>>>> TX:             1
>>>>>> Other:          0
>>>>>> Combined:       0
>>>>>>
>>>>>> # ethtool -x enp2s0f0
>>>>>> RX flow hash indirection table for enp2s0f0 with 4 RX ring(s):
>>>>>>         0:      0     1     2     3     0     1     2     3
>>>>>>         8:      0     1     2     3     0     1     2     3
>>>>>>        16:      0     1     2     3     0     1     2     3
>>>>>>        24:      0     1     2     3     0     1     2     3
>>>>>>        32:      0     1     2     3     0     1     2     3
>>>>>>        40:      0     1     2     3     0     1     2     3
>>>>>>        48:      0     1     2     3     0     1     2     3
>>>>>>        56:      0     1     2     3     0     1     2     3
>>>>>>        64:      0     1     2     3     0     1     2     3
>>>>>>        72:      0     1     2     3     0     1     2     3
>>>>>>        80:      0     1     2     3     0     1     2     3
>>>>>>        88:      0     1     2     3     0     1     2     3
>>>>>>        96:      0     1     2     3     0     1     2     3
>>>>>>       104:      0     1     2     3     0     1     2     3
>>>>>>       112:      0     1     2     3     0     1     2     3
>>>>>>       120:      0     1     2     3     0     1     2     3
>>>>>> RSS hash key:
>>>>>> Operation not supported
>>>>>> RSS hash function:
>>>>>>         toeplitz: on
>>>>>>         xor: off
>>>>>>         crc32: off
>>>>>>
>>>>>> In /proc/interrupts there are enp2s0f0-tx-0, enp2s0f0-rx-1,
>>>>>> enp2s0f0-rx-2, enp2s0f0-rx-3, enp2s0f0-rx-4 interrupts, all on
>>>>>> different CPU cores. Kernel also has "threadirqs" enabled in
>>>>>> command line, I didn't check if this parameter affects the issue.
>>>>>>
>>>>>> Yes, some things start with 0, and others with 1, sorry for a confusion
>>>>>> in terminology, what I meant:
>>>>>>      - There are 4 RX rings/queues, I counted starting from 0, so: 0..3.
>>>>>>        RX0 is the first queue/ring that actually receives the traffic.
>>>>>>        RX0 is handled by enp2s0f0-rx-1 interrupt.
>>>>>>      - These are related to (tp->napi[i]), but i is in 1..4, so the first
>>>>>>        receiving queue relates to tp->napi[1], the second relates to
>>>>>>        tp->napi[2], and so on. Correct?
>>>>>>
>>>>>> Suppose, tg3_rx() is called for tp->napi[2], this function most likely
>>>>>> calls napi_gro_receive(&tnapi->napi, skb) to further process packets in
>>>>>> tp->napi[2]. And, under some conditions (RSS and work_mask), it calls
>>>>>> napi_schedule(&tp->napi[1].napi), which schedules tp->napi[1] work
>>>>>> on a currect CPU, which is designated for tp->napi[2], but not for
>>>>>> tp->napi[1]. Correct?
>>>>>>
>>>>>> I don't understand what napi_schedule(&tp->napi[1].napi) does for the
>>>>>> NIC or driver, "re-filling rx BD producer ring" sounds important. I
>>>>>> suspect something will break badly if I simply remove it without
>>>>>> replacing with something more elaborate. I guess along with re-filling
>>>>>> rx BD producer ring it also can process incoming packets. Is it possible?
>>>>>>
>>>>>
>>>>> Yes, napi[1] work may be called on the napi[2]'s CPU but it generally
>>>>> won't process
>>>>> any rx packets because the producer index of napi[1] has not changed. If the
>>>>> producer count did change, then we get a poll from the ISR for napi[1]
>>>>> to process
>>>>> packets. So it is mostly used to re-fill rx buffers when called
>>>>> explicitly. However
>>>>> there could be a small window where the prod index is incremented but the ISR
>>>>> is not fired yet. It may process some small no of packets. But I don't
>>>>> think this
>>>>> should lead to a reorder problem.
>>>>>
>>>>
>>>> I tried to reproduce without using bridge and veth interfaces, and it seems
>>>> like it's not reproducible, so traffic forwarding via a bridge interface may
>>>> be necessary. It also does not happen if traffic load is low, but moderate
>>>> load is enough - e.g. two 100 Mbps streams with 130-byte packets. It's easier
>>>> to reproduce with a higher load.
>>>>
>>>> With about the same setup as in an original message (bridge + veth 2
>>>> network namespaces), irqbalance daemon stopped, if traffic flows via
>>>> enp2s0f0-rx-2 and enp2s0f0-rx-4, there's no reordering. enp2s0f0-rx-1
>>>> still gets some interrupts, but at a much lower rate compared to 2 and
>>>> 4.
>>>>
>>>> namespace 1:
>>>>      # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
>>>>      - - - - - - - - - - - - - - - - - - - - - - - - -
>>>>      [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>>>>      [  4]   0.00-300.00 sec  6.72 GBytes   192 Mbits/sec  0.008 ms  3805/55508325 (0.0069%)
>>>>      [  4] Sent 55508325 datagrams
>>>>
>>>>      iperf Done.
>>>>
>>>> namespace 2:
>>>>      # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
>>>>      - - - - - - - - - - - - - - - - - - - - - - - - -
>>>>      [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>>>>      [  4]   0.00-300.00 sec  6.83 GBytes   196 Mbits/sec  0.005 ms  3873/56414001 (0.0069%)
>>>>      [  4] Sent 56414001 datagrams
>>>>
>>>>      iperf Done.
>>>>
>>>>
>>>> With the same configuration but different IP address so that instead of
>>>> enp2s0f0-rx-4 enp2s0f0-rx-1 would be used, there is a reordering.
>>>>
>>>>
>>>> namespace 1 (client IP was changed):
>>>>      # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
>>>>      - - - - - - - - - - - - - - - - - - - - - - - - -
>>>>      [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>>>>      [  4]   0.00-300.00 sec  6.32 GBytes   181 Mbits/sec  0.007 ms  8506/52172059 (0.016%)
>>>>      [  4] Sent 52172059 datagrams
>>>>      [SUM]  0.0-300.0 sec  2452 datagrams received out-of-order
>>>>
>>>>      iperf Done.
>>>>
>>>> namespace 2:
>>>>      # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
>>>>      - - - - - - - - - - - - - - - - - - - - - - - - -
>>>>      [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>>>>      [  4]   0.00-300.00 sec  6.59 GBytes   189 Mbits/sec  0.006 ms  6302/54463973 (0.012%)
>>>>      [  4] Sent 54463973 datagrams
>>>>
>>>>      iperf Done.
>>>>
>>>> Swapping IP addresses in these namespaces also changes the namespace exhibiting the issue,
>>>> it's following the IP address.
>>>>
>>>>
>>>> Is there something I could check to confirm that this behavior is or is not
>>>> related to napi_schedule(&tp->napi[1].napi) call?
>>>
>>> in the function tg3_msi_1shot() you could store the cpu assigned to
>>> tnapi1 (inside the struct tg3_napi)
>>> and then in tg3_poll_work() you can add another check after
>>>           if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr)
>>> something like
>>> if (tnapi == &tp->napi[1] && tnapi->assigned_cpu == smp_processor_id())
>>> only then execute tg3_rx()
>>>
>>> This may stop tnapi 1 from reading rx pkts on the current CPU from
>>> which refill is called.
>>>
>>
>> Didn't work for me, perhaps I did something wrong - if tg3_rx() is not called,
>> there's an infinite loop, and after I added "work_done = budget;", it still doesn't
>> work - traffic does not flow.
>>
> 
> I think the easiest way is to modify the tg3_rx() calling condition
> like below inside
> tg3_poll_work() :
> 
> if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr) {
>          if (tnapi != &tp->napi[1] || (tnapi == &tp->napi[1] &&
> !tp->rx_refill)) {
>                          work_done += tg3_rx(tnapi, budget - work_done);
>          }
> }
> 
> This will prevent reading rx packets when napi[1] is scheduled only for refill.
> Can you see if this works?
> 

It doesn't hang and can receive traffic with this change, but I don't see
a difference. I suspect that tg3_poll_work() is called again, maybe in tg3_poll_msix(),
and the refill happens first, and then the packets are processed anyway.

+static int tg3_cc;
+module_param(tg3_cc, int, 0644);
+MODULE_PARM_DESC(tg3_cc, "cpu check");
+
...
+		if (tnapi->assigned_cpu != smp_processor_id())
+			net_dbg_ratelimited("tg3 refill %d budget %d napi %ld cpu %d %d",
+			    tp->rx_refill, budget,
+			    tnapi - tp->napi, tnapi->assigned_cpu, smp_processor_id());
  		napi_gro_receive(&tnapi->napi, skb);
  
...
+        if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr) {
+                if (tnapi != &tp->napi[1] || (tnapi == &tp->napi[1] && !tp->rx_refill) || (tg3_cc == 0)) {
+                        work_done += tg3_rx(tnapi, budget - work_done);
+                }
+        }
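With 0644 permissions, module_param() also exposes the knob under sysfs, so
it can be flipped at runtime without reloading the driver, e.g.:

        echo 1 > /sys/module/tg3/parameters/tg3_cc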

with tg3_cc set to 1:

[212915.661886] net_ratelimit: 650710 callbacks suppressed
[212915.661889] tg3 refill 0 budget 64 napi 1 cpu 0 3
[212915.661890] tg3 refill 0 budget 63 napi 1 cpu 0 3
[212915.661891] tg3 refill 0 budget 62 napi 1 cpu 0 3
[212915.661892] tg3 refill 0 budget 61 napi 1 cpu 0 3
[212915.661893] tg3 refill 0 budget 60 napi 1 cpu 0 3
[212915.661915] tg3 refill 0 budget 64 napi 1 cpu 0 3
[212915.661916] tg3 refill 0 budget 63 napi 1 cpu 0 3
[212915.661917] tg3 refill 0 budget 62 napi 1 cpu 0 3
[212915.661918] tg3 refill 0 budget 61 napi 1 cpu 0 3
[212915.661919] tg3 refill 0 budget 60 napi 1 cpu 0 3
[212920.665912] net_ratelimit: 251117 callbacks suppressed
[212920.665914] tg3 refill 0 budget 64 napi 1 cpu 0 3
[212920.665915] tg3 refill 0 budget 63 napi 1 cpu 0 3
[212920.665917] tg3 refill 0 budget 62 napi 1 cpu 0 3
[212920.665918] tg3 refill 0 budget 61 napi 1 cpu 0 3
[212920.665919] tg3 refill 0 budget 60 napi 1 cpu 0 3
[212920.665932] tg3 refill 0 budget 64 napi 1 cpu 0 3
[212920.665933] tg3 refill 0 budget 63 napi 1 cpu 0 3
[212920.665935] tg3 refill 0 budget 62 napi 1 cpu 0 3
[212920.665936] tg3 refill 0 budget 61 napi 1 cpu 0 3
[212920.665937] tg3 refill 0 budget 60 napi 1 cpu 0 3

and with tg3_cc set to 0:

[213686.689867] tg3 refill 1 budget 64 napi 1 cpu 0 3
[213686.689869] tg3 refill 1 budget 63 napi 1 cpu 0 3
[213686.689870] tg3 refill 1 budget 62 napi 1 cpu 0 3
[213686.689871] tg3 refill 1 budget 61 napi 1 cpu 0 3
[213686.689872] tg3 refill 1 budget 60 napi 1 cpu 0 3
[213686.689890] tg3 refill 0 budget 64 napi 1 cpu 0 3
[213686.689891] tg3 refill 0 budget 63 napi 1 cpu 0 3
[213686.689892] tg3 refill 0 budget 62 napi 1 cpu 0 3
[213686.689893] tg3 refill 0 budget 61 napi 1 cpu 0 3

affinity:
echo 1 > /proc/irq/106/smp_affinity  # enp2s0f0-rx-1
echo 2 > /proc/irq/107/smp_affinity  # enp2s0f0-rx-2
echo 4 > /proc/irq/108/smp_affinity  # enp2s0f0-rx-3
echo 8 > /proc/irq/109/smp_affinity  # enp2s0f0-rx-4

-- 
Thanks
Vitalii


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: tg3 RX packet re-order in queue 0 with RSS
  2021-11-01  8:20                 ` Vitaly Bursov
@ 2021-11-01  9:10                   ` Pavan Chebbi
  2021-11-01 10:17                     ` Vitaly Bursov
  0 siblings, 1 reply; 13+ messages in thread
From: Pavan Chebbi @ 2021-11-01  9:10 UTC (permalink / raw)
  To: Vitaly Bursov
  Cc: Siva Reddy Kallam, Prashant Sreedharan, Michael Chan, Linux Netdev List

On Mon, Nov 1, 2021 at 1:50 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>
>
>
> 01.11.2021 09:06, Pavan Chebbi wrote:
> > On Fri, Oct 29, 2021 at 9:15 PM Vitaly Bursov <vitaly@bursov.com> wrote:
> >>
> >>
> >>
> >> 29.10.2021 08:04, Pavan Chebbi wrote:
> >>> On Thu, Oct 28, 2021 at 9:11 PM Vitaly Bursov <vitaly@bursov.com> wrote:
> >>>>
> >>>>
> >>>> 28.10.2021 10:33, Pavan Chebbi wrote:
> >>>>> On Wed, Oct 27, 2021 at 4:02 PM Vitaly Bursov <vitaly@bursov.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>> 27.10.2021 12:30, Pavan Chebbi wrote:
> >>>>>>> On Wed, Sep 22, 2021 at 12:10 PM Siva Reddy Kallam
> >>>>>>> <siva.kallam@broadcom.com> wrote:
> >>>>>>>>
> >>>>>>>> Thank you for reporting this. Pavan(cc'd) from Broadcom looking into this issue.
> >>>>>>>> We will provide our feedback very soon on this.
> >>>>>>>>
> >>>>>>>> On Mon, Sep 20, 2021 at 6:59 PM Vitaly Bursov <vitaly@bursov.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> We found a occassional and random (sometimes happens, sometimes not)
> >>>>>>>>> packet re-order when NIC is involved in UDP multicast reception, which
> >>>>>>>>> is sensitive to a packet re-order. Network capture with tcpdump
> >>>>>>>>> sometimes shows the packet re-order, sometimes not (e.g. no re-order on
> >>>>>>>>> a host, re-order in a container at the same time). In a pcap file
> >>>>>>>>> re-ordered packets have a correct timestamp - delayed packet had a more
> >>>>>>>>> earlier timestamp compared to a previous packet:
> >>>>>>>>>          1.00s packet1
> >>>>>>>>>          1.20s packet3
> >>>>>>>>>          1.10s packet2
> >>>>>>>>>          1.30s packet4
> >>>>>>>>>
> >>>>>>>>> There's about 300Mbps of traffic on this NIC, and server is busy
> >>>>>>>>> (hyper-threading enabled, about 50% overall idle) with its
> >>>>>>>>> computational application work.
> >>>>>>>>>
> >>>>>>>>> NIC is HPE's 4-port 331i adapter - BCM5719, in a default ring and
> >>>>>>>>> coalescing configuration, 1 TX queue, 4 RX queues.
> >>>>>>>>>
> >>>>>>>>> After further investigation, I believe that there are two separate
> >>>>>>>>> issues in tg3.c driver. Issues can be reproduced with iperf3, and
> >>>>>>>>> unicast UDP.
> >>>>>>>>>
> >>>>>>>>> Here are the details of how I understand this behavior.
> >>>>>>>>>
> >>>>>>>>> 1. Packet re-order.
> >>>>>>>>>
> >>>>>>>>> Driver calls napi_schedule(&tnapi->napi) when handling the interrupt,
> >>>>>>>>> however, sometimes it calls napi_schedule(&tp->napi[1].napi), which
> >>>>>>>>> handles RX queue 0 too:
> >>>>>>>>>
> >>>>>>>>>          https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/broadcom/tg3.c#L6802-L7007
> >>>>>>>>>
> >>>>>>>>>          static int tg3_rx(struct tg3_napi *tnapi, int budget)
> >>>>>>>>>          {
> >>>>>>>>>                  struct tg3 *tp = tnapi->tp;
> >>>>>>>>>
> >>>>>>>>>                  ...
> >>>>>>>>>
> >>>>>>>>>                  /* Refill RX ring(s). */
> >>>>>>>>>                  if (!tg3_flag(tp, ENABLE_RSS)) {
> >>>>>>>>>                          ....
> >>>>>>>>>                  } else if (work_mask) {
> >>>>>>>>>                          ...
> >>>>>>>>>
> >>>>>>>>>                          if (tnapi != &tp->napi[1]) {
> >>>>>>>>>                                  tp->rx_refill = true;
> >>>>>>>>>                                  napi_schedule(&tp->napi[1].napi);
> >>>>>>>>>                          }
> >>>>>>>>>                  }
> >>>>>>>>>                  ...
> >>>>>>>>>          }
> >>>>>>>>>
> >>>>>>>>>      From napi_schedule() code, it should schedure RX 0 traffic handling on
> >>>>>>>>> a current CPU, which handles queues RX1-3 right now.
> >>>>>>>>>
> >>>>>>>>> At least two traffic flows are required - one on RX queue 0, and the
> >>>>>>>>> other on any other queue (1-3). Re-ordering may happend only on flow
> >>>>>>>>> from queue 0, the second flow will work fine.
> >>>>>>>>>
> >>>>>>>>> No idea how to fix this.
> >>>>>>>
> >>>>>>> In the case of RSS the actual rings for RX are from 1 to 4.
> >>>>>>> The napi of those rings are indeed processing the packets.
> >>>>>>> The explicit napi_schedule of napi[1] is only re-filling rx BD
> >>>>>>> producer ring because it is shared with return rings for 1-4.
> >>>>>>> I tried to repro this but I am not seeing the issue. If you are
> >>>>>>> receiving packets on RX 0 then the RSS must have been disabled.
> >>>>>>> Can you please check?
> >>>>>>>
> >>>>>>
> >>>>>> # ethtool -i enp2s0f0
> >>>>>> driver: tg3
> >>>>>> version: 3.137
> >>>>>> firmware-version: 5719-v1.46 NCSI v1.5.18.0
> >>>>>> expansion-rom-version:
> >>>>>> bus-info: 0000:02:00.0
> >>>>>> supports-statistics: yes
> >>>>>> supports-test: yes
> >>>>>> supports-eeprom-access: yes
> >>>>>> supports-register-dump: yes
> >>>>>> supports-priv-flags: no
> >>>>>>
> >>>>>> # ethtool -l enp2s0f0
> >>>>>> Channel parameters for enp2s0f0:
> >>>>>> Pre-set maximums:
> >>>>>> RX:             4
> >>>>>> TX:             4
> >>>>>> Other:          0
> >>>>>> Combined:       0
> >>>>>> Current hardware settings:
> >>>>>> RX:             4
> >>>>>> TX:             1
> >>>>>> Other:          0
> >>>>>> Combined:       0
> >>>>>>
> >>>>>> # ethtool -x enp2s0f0
> >>>>>> RX flow hash indirection table for enp2s0f0 with 4 RX ring(s):
> >>>>>>         0:      0     1     2     3     0     1     2     3
> >>>>>>         8:      0     1     2     3     0     1     2     3
> >>>>>>        16:      0     1     2     3     0     1     2     3
> >>>>>>        24:      0     1     2     3     0     1     2     3
> >>>>>>        32:      0     1     2     3     0     1     2     3
> >>>>>>        40:      0     1     2     3     0     1     2     3
> >>>>>>        48:      0     1     2     3     0     1     2     3
> >>>>>>        56:      0     1     2     3     0     1     2     3
> >>>>>>        64:      0     1     2     3     0     1     2     3
> >>>>>>        72:      0     1     2     3     0     1     2     3
> >>>>>>        80:      0     1     2     3     0     1     2     3
> >>>>>>        88:      0     1     2     3     0     1     2     3
> >>>>>>        96:      0     1     2     3     0     1     2     3
> >>>>>>       104:      0     1     2     3     0     1     2     3
> >>>>>>       112:      0     1     2     3     0     1     2     3
> >>>>>>       120:      0     1     2     3     0     1     2     3
> >>>>>> RSS hash key:
> >>>>>> Operation not supported
> >>>>>> RSS hash function:
> >>>>>>         toeplitz: on
> >>>>>>         xor: off
> >>>>>>         crc32: off
> >>>>>>
> >>>>>> In /proc/interrupts there are enp2s0f0-tx-0, enp2s0f0-rx-1,
> >>>>>> enp2s0f0-rx-2, enp2s0f0-rx-3, enp2s0f0-rx-4 interrupts, all on
> >>>>>> different CPU cores. Kernel also has "threadirqs" enabled in
> >>>>>> command line, I didn't check if this parameter affects the issue.
> >>>>>>
> >>>>>> Yes, some things start with 0, and others with 1, sorry for a confusion
> >>>>>> in terminology, what I meant:
> >>>>>>      - There are 4 RX rings/queues, I counted starting from 0, so: 0..3.
> >>>>>>        RX0 is the first queue/ring that actually receives the traffic.
> >>>>>>        RX0 is handled by enp2s0f0-rx-1 interrupt.
> >>>>>>      - These are related to (tp->napi[i]), but i is in 1..4, so the first
> >>>>>>        receiving queue relates to tp->napi[1], the second relates to
> >>>>>>        tp->napi[2], and so on. Correct?
> >>>>>>
> >>>>>> Suppose, tg3_rx() is called for tp->napi[2], this function most likely
> >>>>>> calls napi_gro_receive(&tnapi->napi, skb) to further process packets in
> >>>>>> tp->napi[2]. And, under some conditions (RSS and work_mask), it calls
> >>>>>> napi_schedule(&tp->napi[1].napi), which schedules tp->napi[1] work
> >>>>>> on a currect CPU, which is designated for tp->napi[2], but not for
> >>>>>> tp->napi[1]. Correct?
> >>>>>>
> >>>>>> I don't understand what napi_schedule(&tp->napi[1].napi) does for the
> >>>>>> NIC or driver, "re-filling rx BD producer ring" sounds important. I
> >>>>>> suspect something will break badly if I simply remove it without
> >>>>>> replacing with something more elaborate. I guess along with re-filling
> >>>>>> rx BD producer ring it also can process incoming packets. Is it possible?
> >>>>>>
> >>>>>
> >>>>> Yes, napi[1] work may be called on the napi[2]'s CPU but it generally
> >>>>> won't process
> >>>>> any rx packets because the producer index of napi[1] has not changed. If the
> >>>>> producer count did change, then we get a poll from the ISR for napi[1]
> >>>>> to process
> >>>>> packets. So it is mostly used to re-fill rx buffers when called
> >>>>> explicitly. However
> >>>>> there could be a small window where the prod index is incremented but the ISR
> >>>>> is not fired yet. It may process some small no of packets. But I don't
> >>>>> think this
> >>>>> should lead to a reorder problem.
> >>>>>
> >>>>
> >>>> I tried to reproduce without using bridge and veth interfaces, and it seems
> >>>> like it's not reproducible, so traffic forwarding via a bridge interface may
> >>>> be necessary. It also does not happen if traffic load is low, but moderate
> >>>> load is enough - e.g. two 100 Mbps streams with 130-byte packets. It's easier
> >>>> to reproduce with a higher load.
> >>>>
> >>>> With about the same setup as in an original message (bridge + veth 2
> >>>> network namespaces), irqbalance daemon stopped, if traffic flows via
> >>>> enp2s0f0-rx-2 and enp2s0f0-rx-4, there's no reordering. enp2s0f0-rx-1
> >>>> still gets some interrupts, but at a much lower rate compared to 2 and
> >>>> 4.
> >>>>
> >>>> namespace 1:
> >>>>      # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
> >>>>      - - - - - - - - - - - - - - - - - - - - - - - - -
> >>>>      [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
> >>>>      [  4]   0.00-300.00 sec  6.72 GBytes   192 Mbits/sec  0.008 ms  3805/55508325 (0.0069%)
> >>>>      [  4] Sent 55508325 datagrams
> >>>>
> >>>>      iperf Done.
> >>>>
> >>>> namespace 2:
> >>>>      # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
> >>>>      - - - - - - - - - - - - - - - - - - - - - - - - -
> >>>>      [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
> >>>>      [  4]   0.00-300.00 sec  6.83 GBytes   196 Mbits/sec  0.005 ms  3873/56414001 (0.0069%)
> >>>>      [  4] Sent 56414001 datagrams
> >>>>
> >>>>      iperf Done.
> >>>>
> >>>>
> >>>> With the same configuration but different IP address so that instead of
> >>>> enp2s0f0-rx-4 enp2s0f0-rx-1 would be used, there is a reordering.
> >>>>
> >>>>
> >>>> namespace 1 (client IP was changed):
> >>>>      # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
> >>>>      - - - - - - - - - - - - - - - - - - - - - - - - -
> >>>>      [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
> >>>>      [  4]   0.00-300.00 sec  6.32 GBytes   181 Mbits/sec  0.007 ms  8506/52172059 (0.016%)
> >>>>      [  4] Sent 52172059 datagrams
> >>>>      [SUM]  0.0-300.0 sec  2452 datagrams received out-of-order
> >>>>
> >>>>      iperf Done.
> >>>>
> >>>> namespace 2:
> >>>>      # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
> >>>>      - - - - - - - - - - - - - - - - - - - - - - - - -
> >>>>      [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
> >>>>      [  4]   0.00-300.00 sec  6.59 GBytes   189 Mbits/sec  0.006 ms  6302/54463973 (0.012%)
> >>>>      [  4] Sent 54463973 datagrams
> >>>>
> >>>>      iperf Done.
> >>>>
> >>>> Swapping IP addresses in these namespaces also changes the namespace exhibiting the issue,
> >>>> it's following the IP address.
> >>>>
> >>>>
> >>>> Is there something I could check to confirm that this behavior is or is not
> >>>> related to napi_schedule(&tp->napi[1].napi) call?
> >>>
> >>> in the function tg3_msi_1shot() you could store the cpu assigned to
> >>> tnapi1 (inside the struct tg3_napi)
> >>> and then in tg3_poll_work() you can add another check after
> >>>           if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr)
> >>> something like
> >>> if (tnapi == &tp->napi[1] && tnapi->assigned_cpu == smp_processor_id())
> >>> only then execute tg3_rx()
> >>>
> >>> This may stop tnapi 1 from reading rx pkts on the current CPU from
> >>> which refill is called.
> >>>
> >>
> >> Didn't work for me, perhaps I did something wrong - if tg3_rx() is not called,
> >> there's an infinite loop, and after I added "work_done = budget;", it still doesn't
> >> work - traffic does not flow.
> >>
> >
> > I think the easiest way is to modify the tg3_rx() calling condition
> > like below inside
> > tg3_poll_work() :
> >
> > if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr) {
> >          if (tnapi != &tp->napi[1] || (tnapi == &tp->napi[1] &&
> > !tp->rx_refill)) {
> >                          work_done += tg3_rx(tnapi, budget - work_done);
> >          }
> > }
> >
> > This will prevent reading rx packets when napi[1] is scheduled only for refill.
> > Can you see if this works?
> >
>
> It doesn't hang and can receive the traffic with this change, but I don't see
> a difference. I'm suspectig that tg3_poll_work() is called again, maybe in tg3_poll_msix(),
> and refill happens first, and then packets are processed anyway.
>

OK, I see it now. Let me try this out myself; I will get back on this.
In the meantime, can you check with your debug prints whether there is any
correlation between the time and number of prints where napi[1] reads packets
on an unassigned CPU and the time and number of packets received out of order
up the stack? Do they match? If not, we may be incorrectly suspecting napi[1]
here.

> +static int tg3_cc;
> +module_param(tg3_cc, int, 0644);
> +MODULE_PARM_DESC(tg3_cc, "cpu check");
> +
> ...
> +               if (tnapi->assigned_cpu != smp_processor_id())
> +                       net_dbg_ratelimited("tg3 refill %d budget %d napi %ld cpu %d %d",
> +                           tp->rx_refill, budget,
> +                           tnapi - tp->napi, tnapi->assigned_cpu, smp_processor_id());
>                 napi_gro_receive(&tnapi->napi, skb);
>
> ...
> +        if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr) {
> +                if (tnapi != &tp->napi[1] || (tnapi == &tp->napi[1] && !tp->rx_refill) || (tg3_cc == 0)) {
> +                        work_done += tg3_rx(tnapi, budget - work_done);
> +                }
> +        }
>
> with tg3_cc set to 1:
>
> [212915.661886] net_ratelimit: 650710 callbacks suppressed
> [212915.661889] tg3 refill 0 budget 64 napi 1 cpu 0 3
> [212915.661890] tg3 refill 0 budget 63 napi 1 cpu 0 3
> [212915.661891] tg3 refill 0 budget 62 napi 1 cpu 0 3
> [212915.661892] tg3 refill 0 budget 61 napi 1 cpu 0 3
> [212915.661893] tg3 refill 0 budget 60 napi 1 cpu 0 3
> [212915.661915] tg3 refill 0 budget 64 napi 1 cpu 0 3
> [212915.661916] tg3 refill 0 budget 63 napi 1 cpu 0 3
> [212915.661917] tg3 refill 0 budget 62 napi 1 cpu 0 3
> [212915.661918] tg3 refill 0 budget 61 napi 1 cpu 0 3
> [212915.661919] tg3 refill 0 budget 60 napi 1 cpu 0 3
> [212920.665912] net_ratelimit: 251117 callbacks suppressed
> [212920.665914] tg3 refill 0 budget 64 napi 1 cpu 0 3
> [212920.665915] tg3 refill 0 budget 63 napi 1 cpu 0 3
> [212920.665917] tg3 refill 0 budget 62 napi 1 cpu 0 3
> [212920.665918] tg3 refill 0 budget 61 napi 1 cpu 0 3
> [212920.665919] tg3 refill 0 budget 60 napi 1 cpu 0 3
> [212920.665932] tg3 refill 0 budget 64 napi 1 cpu 0 3
> [212920.665933] tg3 refill 0 budget 63 napi 1 cpu 0 3
> [212920.665935] tg3 refill 0 budget 62 napi 1 cpu 0 3
> [212920.665936] tg3 refill 0 budget 61 napi 1 cpu 0 3
> [212920.665937] tg3 refill 0 budget 60 napi 1 cpu 0 3
>
> and with tg3_cc set to 0:
>
> [213686.689867] tg3 refill 1 budget 64 napi 1 cpu 0 3
> [213686.689869] tg3 refill 1 budget 63 napi 1 cpu 0 3
> [213686.689870] tg3 refill 1 budget 62 napi 1 cpu 0 3
> [213686.689871] tg3 refill 1 budget 61 napi 1 cpu 0 3
> [213686.689872] tg3 refill 1 budget 60 napi 1 cpu 0 3
> [213686.689890] tg3 refill 0 budget 64 napi 1 cpu 0 3
> [213686.689891] tg3 refill 0 budget 63 napi 1 cpu 0 3
> [213686.689892] tg3 refill 0 budget 62 napi 1 cpu 0 3
> [213686.689893] tg3 refill 0 budget 61 napi 1 cpu 0 3
>
> affinity:
> echo 1 > /proc/irq/106/smp_affinity  # enp2s0f0-rx-1
> echo 2 > /proc/irq/107/smp_affinity  # enp2s0f0-rx-2
> echo 4 > /proc/irq/108/smp_affinity  # enp2s0f0-rx-3
> echo 8 > /proc/irq/109/smp_affinity  # enp2s0f0-rx-4
>
> --
> Thanks
> Vitalii
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: tg3 RX packet re-order in queue 0 with RSS
  2021-11-01  9:10                   ` Pavan Chebbi
@ 2021-11-01 10:17                     ` Vitaly Bursov
  2022-09-21 19:04                       ` Etienne Champetier
  0 siblings, 1 reply; 13+ messages in thread
From: Vitaly Bursov @ 2021-11-01 10:17 UTC (permalink / raw)
  To: Pavan Chebbi
  Cc: Siva Reddy Kallam, Prashant Sreedharan, Michael Chan, Linux Netdev List



01.11.2021 11:10, Pavan Chebbi wrote:
> On Mon, Nov 1, 2021 at 1:50 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>>
>>
>>
>> 01.11.2021 09:06, Pavan Chebbi wrote:
>>> On Fri, Oct 29, 2021 at 9:15 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>>>>
>>>>
>>>>
>>>> 29.10.2021 08:04, Pavan Chebbi пишет:
> >>>>> On Thu, Oct 28, 2021 at 9:11 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>>>>>>
>>>>>>
>>>>>> 28.10.2021 10:33, Pavan Chebbi wrote:
>>>>>>> On Wed, Oct 27, 2021 at 4:02 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> 27.10.2021 12:30, Pavan Chebbi wrote:
>>>>>>>>> On Wed, Sep 22, 2021 at 12:10 PM Siva Reddy Kallam
>>>>>>>>> <siva.kallam@broadcom.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Thank you for reporting this. Pavan(cc'd) from Broadcom looking into this issue.
>>>>>>>>>> We will provide our feedback very soon on this.
>>>>>>>>>>
>>>>>>>>>> On Mon, Sep 20, 2021 at 6:59 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> We found a occassional and random (sometimes happens, sometimes not)
>>>>>>>>>>> packet re-order when NIC is involved in UDP multicast reception, which
>>>>>>>>>>> is sensitive to a packet re-order. Network capture with tcpdump
>>>>>>>>>>> sometimes shows the packet re-order, sometimes not (e.g. no re-order on
>>>>>>>>>>> a host, re-order in a container at the same time). In a pcap file
>>>>>>>>>>> re-ordered packets have a correct timestamp - delayed packet had a more
>>>>>>>>>>> earlier timestamp compared to a previous packet:
>>>>>>>>>>>           1.00s packet1
>>>>>>>>>>>           1.20s packet3
>>>>>>>>>>>           1.10s packet2
>>>>>>>>>>>           1.30s packet4
>>>>>>>>>>>
>>>>>>>>>>> There's about 300Mbps of traffic on this NIC, and server is busy
>>>>>>>>>>> (hyper-threading enabled, about 50% overall idle) with its
>>>>>>>>>>> computational application work.
>>>>>>>>>>>
>>>>>>>>>>> NIC is HPE's 4-port 331i adapter - BCM5719, in a default ring and
>>>>>>>>>>> coalescing configuration, 1 TX queue, 4 RX queues.
>>>>>>>>>>>
>>>>>>>>>>> After further investigation, I believe that there are two separate
>>>>>>>>>>> issues in tg3.c driver. Issues can be reproduced with iperf3, and
>>>>>>>>>>> unicast UDP.
>>>>>>>>>>>
>>>>>>>>>>> Here are the details of how I understand this behavior.
>>>>>>>>>>>
>>>>>>>>>>> 1. Packet re-order.
>>>>>>>>>>>
>>>>>>>>>>> Driver calls napi_schedule(&tnapi->napi) when handling the interrupt,
>>>>>>>>>>> however, sometimes it calls napi_schedule(&tp->napi[1].napi), which
>>>>>>>>>>> handles RX queue 0 too:
>>>>>>>>>>>
>>>>>>>>>>>           https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/broadcom/tg3.c#L6802-L7007
>>>>>>>>>>>
>>>>>>>>>>>           static int tg3_rx(struct tg3_napi *tnapi, int budget)
>>>>>>>>>>>           {
>>>>>>>>>>>                   struct tg3 *tp = tnapi->tp;
>>>>>>>>>>>
>>>>>>>>>>>                   ...
>>>>>>>>>>>
>>>>>>>>>>>                   /* Refill RX ring(s). */
>>>>>>>>>>>                   if (!tg3_flag(tp, ENABLE_RSS)) {
>>>>>>>>>>>                           ....
>>>>>>>>>>>                   } else if (work_mask) {
>>>>>>>>>>>                           ...
>>>>>>>>>>>
>>>>>>>>>>>                           if (tnapi != &tp->napi[1]) {
>>>>>>>>>>>                                   tp->rx_refill = true;
>>>>>>>>>>>                                   napi_schedule(&tp->napi[1].napi);
>>>>>>>>>>>                           }
>>>>>>>>>>>                   }
>>>>>>>>>>>                   ...
>>>>>>>>>>>           }
>>>>>>>>>>>
>>>>>>>>>>>       From napi_schedule() code, it should schedure RX 0 traffic handling on
>>>>>>>>>>> a current CPU, which handles queues RX1-3 right now.
>>>>>>>>>>>
>>>>>>>>>>> At least two traffic flows are required - one on RX queue 0, and the
>>>>>>>>>>> other on any other queue (1-3). Re-ordering may happend only on flow
>>>>>>>>>>> from queue 0, the second flow will work fine.
>>>>>>>>>>>
>>>>>>>>>>> No idea how to fix this.
>>>>>>>>>
>>>>>>>>> In the case of RSS the actual rings for RX are from 1 to 4.
>>>>>>>>> The napi of those rings are indeed processing the packets.
>>>>>>>>> The explicit napi_schedule of napi[1] is only re-filling rx BD
>>>>>>>>> producer ring because it is shared with return rings for 1-4.
>>>>>>>>> I tried to repro this but I am not seeing the issue. If you are
>>>>>>>>> receiving packets on RX 0 then the RSS must have been disabled.
>>>>>>>>> Can you please check?
>>>>>>>>>
>>>>>>>>
>>>>>>>> # ethtool -i enp2s0f0
>>>>>>>> driver: tg3
>>>>>>>> version: 3.137
>>>>>>>> firmware-version: 5719-v1.46 NCSI v1.5.18.0
>>>>>>>> expansion-rom-version:
>>>>>>>> bus-info: 0000:02:00.0
>>>>>>>> supports-statistics: yes
>>>>>>>> supports-test: yes
>>>>>>>> supports-eeprom-access: yes
>>>>>>>> supports-register-dump: yes
>>>>>>>> supports-priv-flags: no
>>>>>>>>
>>>>>>>> # ethtool -l enp2s0f0
>>>>>>>> Channel parameters for enp2s0f0:
>>>>>>>> Pre-set maximums:
>>>>>>>> RX:             4
>>>>>>>> TX:             4
>>>>>>>> Other:          0
>>>>>>>> Combined:       0
>>>>>>>> Current hardware settings:
>>>>>>>> RX:             4
>>>>>>>> TX:             1
>>>>>>>> Other:          0
>>>>>>>> Combined:       0
>>>>>>>>
>>>>>>>> # ethtool -x enp2s0f0
>>>>>>>> RX flow hash indirection table for enp2s0f0 with 4 RX ring(s):
>>>>>>>>          0:      0     1     2     3     0     1     2     3
>>>>>>>>          8:      0     1     2     3     0     1     2     3
>>>>>>>>         16:      0     1     2     3     0     1     2     3
>>>>>>>>         24:      0     1     2     3     0     1     2     3
>>>>>>>>         32:      0     1     2     3     0     1     2     3
>>>>>>>>         40:      0     1     2     3     0     1     2     3
>>>>>>>>         48:      0     1     2     3     0     1     2     3
>>>>>>>>         56:      0     1     2     3     0     1     2     3
>>>>>>>>         64:      0     1     2     3     0     1     2     3
>>>>>>>>         72:      0     1     2     3     0     1     2     3
>>>>>>>>         80:      0     1     2     3     0     1     2     3
>>>>>>>>         88:      0     1     2     3     0     1     2     3
>>>>>>>>         96:      0     1     2     3     0     1     2     3
>>>>>>>>        104:      0     1     2     3     0     1     2     3
>>>>>>>>        112:      0     1     2     3     0     1     2     3
>>>>>>>>        120:      0     1     2     3     0     1     2     3
>>>>>>>> RSS hash key:
>>>>>>>> Operation not supported
>>>>>>>> RSS hash function:
>>>>>>>>          toeplitz: on
>>>>>>>>          xor: off
>>>>>>>>          crc32: off
>>>>>>>>
>>>>>>>> In /proc/interrupts there are enp2s0f0-tx-0, enp2s0f0-rx-1,
>>>>>>>> enp2s0f0-rx-2, enp2s0f0-rx-3, enp2s0f0-rx-4 interrupts, all on
>>>>>>>> different CPU cores. Kernel also has "threadirqs" enabled in
>>>>>>>> command line, I didn't check if this parameter affects the issue.
>>>>>>>>
>>>>>>>> Yes, some things start with 0, and others with 1, sorry for a confusion
>>>>>>>> in terminology, what I meant:
>>>>>>>>       - There are 4 RX rings/queues, I counted starting from 0, so: 0..3.
>>>>>>>>         RX0 is the first queue/ring that actually receives the traffic.
>>>>>>>>         RX0 is handled by enp2s0f0-rx-1 interrupt.
>>>>>>>>       - These are related to (tp->napi[i]), but i is in 1..4, so the first
>>>>>>>>         receiving queue relates to tp->napi[1], the second relates to
>>>>>>>>         tp->napi[2], and so on. Correct?
>>>>>>>>
>>>>>>>> Suppose, tg3_rx() is called for tp->napi[2], this function most likely
>>>>>>>> calls napi_gro_receive(&tnapi->napi, skb) to further process packets in
>>>>>>>> tp->napi[2]. And, under some conditions (RSS and work_mask), it calls
>>>>>>>> napi_schedule(&tp->napi[1].napi), which schedules tp->napi[1] work
> >>>>>>>> on the current CPU, which is designated for tp->napi[2], but not for
>>>>>>>> tp->napi[1]. Correct?
>>>>>>>>
>>>>>>>> I don't understand what napi_schedule(&tp->napi[1].napi) does for the
>>>>>>>> NIC or driver, "re-filling rx BD producer ring" sounds important. I
>>>>>>>> suspect something will break badly if I simply remove it without
>>>>>>>> replacing with something more elaborate. I guess along with re-filling
>>>>>>>> rx BD producer ring it also can process incoming packets. Is it possible?
>>>>>>>>
>>>>>>>
>>>>>>> Yes, napi[1] work may be called on the napi[2]'s CPU but it generally
>>>>>>> won't process
>>>>>>> any rx packets because the producer index of napi[1] has not changed. If the
>>>>>>> producer count did change, then we get a poll from the ISR for napi[1]
>>>>>>> to process
>>>>>>> packets. So it is mostly used to re-fill rx buffers when called
>>>>>>> explicitly. However
>>>>>>> there could be a small window where the prod index is incremented but the ISR
>>>>>>> is not fired yet. It may process some small no of packets. But I don't
>>>>>>> think this
>>>>>>> should lead to a reorder problem.
>>>>>>>
>>>>>>
>>>>>> I tried to reproduce without using bridge and veth interfaces, and it seems
>>>>>> like it's not reproducible, so traffic forwarding via a bridge interface may
>>>>>> be necessary. It also does not happen if traffic load is low, but moderate
>>>>>> load is enough - e.g. two 100 Mbps streams with 130-byte packets. It's easier
>>>>>> to reproduce with a higher load.
>>>>>>
>>>>>> With about the same setup as in an original message (bridge + veth 2
>>>>>> network namespaces), irqbalance daemon stopped, if traffic flows via
>>>>>> enp2s0f0-rx-2 and enp2s0f0-rx-4, there's no reordering. enp2s0f0-rx-1
>>>>>> still gets some interrupts, but at a much lower rate compared to 2 and
>>>>>> 4.
>>>>>>
>>>>>> namespace 1:
>>>>>>       # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
>>>>>>       - - - - - - - - - - - - - - - - - - - - - - - - -
>>>>>>       [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>>>>>>       [  4]   0.00-300.00 sec  6.72 GBytes   192 Mbits/sec  0.008 ms  3805/55508325 (0.0069%)
>>>>>>       [  4] Sent 55508325 datagrams
>>>>>>
>>>>>>       iperf Done.
>>>>>>
>>>>>> namespace 2:
>>>>>>       # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
>>>>>>       - - - - - - - - - - - - - - - - - - - - - - - - -
>>>>>>       [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>>>>>>       [  4]   0.00-300.00 sec  6.83 GBytes   196 Mbits/sec  0.005 ms  3873/56414001 (0.0069%)
>>>>>>       [  4] Sent 56414001 datagrams
>>>>>>
>>>>>>       iperf Done.
>>>>>>
>>>>>>
>>>>>> With the same configuration but different IP address so that instead of
>>>>>> enp2s0f0-rx-4 enp2s0f0-rx-1 would be used, there is a reordering.
>>>>>>
>>>>>>
>>>>>> namespace 1 (client IP was changed):
>>>>>>       # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
>>>>>>       - - - - - - - - - - - - - - - - - - - - - - - - -
>>>>>>       [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>>>>>>       [  4]   0.00-300.00 sec  6.32 GBytes   181 Mbits/sec  0.007 ms  8506/52172059 (0.016%)
>>>>>>       [  4] Sent 52172059 datagrams
>>>>>>       [SUM]  0.0-300.0 sec  2452 datagrams received out-of-order
>>>>>>
>>>>>>       iperf Done.
>>>>>>
>>>>>> namespace 2:
>>>>>>       # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
>>>>>>       - - - - - - - - - - - - - - - - - - - - - - - - -
>>>>>>       [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>>>>>>       [  4]   0.00-300.00 sec  6.59 GBytes   189 Mbits/sec  0.006 ms  6302/54463973 (0.012%)
>>>>>>       [  4] Sent 54463973 datagrams
>>>>>>
>>>>>>       iperf Done.
>>>>>>
>>>>>> Swapping IP addresses in these namespaces also changes the namespace exhibiting the issue,
>>>>>> it's following the IP address.
>>>>>>
>>>>>>
>>>>>> Is there something I could check to confirm that this behavior is or is not
>>>>>> related to napi_schedule(&tp->napi[1].napi) call?
>>>>>
>>>>> in the function tg3_msi_1shot() you could store the cpu assigned to
>>>>> tnapi1 (inside the struct tg3_napi)
>>>>> and then in tg3_poll_work() you can add another check after
>>>>>            if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr)
>>>>> something like
>>>>> if (tnapi == &tp->napi[1] && tnapi->assigned_cpu == smp_processor_id())
>>>>> only then execute tg3_rx()
>>>>>
>>>>> This may stop tnapi 1 from reading rx pkts on the current CPU from
>>>>> which refill is called.
>>>>>
>>>>
>>>> Didn't work for me, perhaps I did something wrong - if tg3_rx() is not called,
>>>> there's an infinite loop, and after I added "work_done = budget;", it still doesn't
>>>> work - traffic does not flow.
>>>>
>>>
>>> I think the easiest way is to modify the tg3_rx() calling condition
>>> like below inside
>>> tg3_poll_work() :
>>>
>>> if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr) {
>>>           if (tnapi != &tp->napi[1] || (tnapi == &tp->napi[1] &&
>>> !tp->rx_refill)) {
>>>                           work_done += tg3_rx(tnapi, budget - work_done);
>>>           }
>>> }
>>>
>>> This will prevent reading rx packets when napi[1] is scheduled only for refill.
>>> Can you see if this works?
>>>
>>
>> It doesn't hang and can receive the traffic with this change, but I don't see
> >> a difference. I'm suspecting that tg3_poll_work() is called again, maybe in tg3_poll_msix(),
>> and refill happens first, and then packets are processed anyway.
>>
> 
> OK I see it now. Let me try this out myself. Will get back on this.
> However, can you see with your debug prints if there is any correlation
> between the time and number of prints where napi 1 is reading packets
> > on an unassigned CPU and the time and number of packets you received
> out of order up the stack? Do they match with each other? If not, we may be
> incorrectly suspecting napi1 here.
> 

No correlation that I can see - reordered packets are received sometimes -
10000 in 300 seconds in this test, but napi messages are logged and
rate-limited at about 100000 per second. If bandwidth is very low, then
there are no messages and no reordering. Not sure if I can isolate these
events specifically.
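
One rough idea, for the runs where the capture itself shows the inversion, is
to pull the exact times of the reorders out of a pcap and compare them against
the per-second counts of the napi prints (the file name and port below are just
placeholders for this test):

    # list capture timestamps that go backwards, i.e. a packet delivered
    # after one carrying a later timestamp
    tcpdump -r test.pcap -tt udp port 5000 2>/dev/null | \
        awk '$1 < prev { print "reorder around " $1 } { prev = $1 }'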

-- 
Thanks
Vitalii


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: tg3 RX packet re-order in queue 0 with RSS
  2021-11-01 10:17                     ` Vitaly Bursov
@ 2022-09-21 19:04                       ` Etienne Champetier
  0 siblings, 0 replies; 13+ messages in thread
From: Etienne Champetier @ 2022-09-21 19:04 UTC (permalink / raw)
  To: Vitaly Bursov
  Cc: Pavan Chebbi, Siva Reddy Kallam, Prashant Sreedharan,
	Michael Chan, Linux Netdev List

On Mon, Nov 1, 2021 at 06:17, Vitaly Bursov <vitaly@bursov.com> wrote:
>
>
>
> 01.11.2021 11:10, Pavan Chebbi wrote:
> > On Mon, Nov 1, 2021 at 1:50 PM Vitaly Bursov <vitaly@bursov.com> wrote:
> >>
> >>
> >>
> >> 01.11.2021 09:06, Pavan Chebbi wrote:
> >>> On Fri, Oct 29, 2021 at 9:15 PM Vitaly Bursov <vitaly@bursov.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> 29.10.2021 08:04, Pavan Chebbi wrote:
> >>>>> On Thu, Oct 28, 2021 at 9:11 PM Vitaly Bursov <vitaly@bursov.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>> 28.10.2021 10:33, Pavan Chebbi wrote:
> >>>>>>> On Wed, Oct 27, 2021 at 4:02 PM Vitaly Bursov <vitaly@bursov.com> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> 27.10.2021 12:30, Pavan Chebbi wrote:
> >>>>>>>>> On Wed, Sep 22, 2021 at 12:10 PM Siva Reddy Kallam
> >>>>>>>>> <siva.kallam@broadcom.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Thank you for reporting this. Pavan(cc'd) from Broadcom looking into this issue.
> >>>>>>>>>> We will provide our feedback very soon on this.
> >>>>>>>>>>
> >>>>>>>>>> On Mon, Sep 20, 2021 at 6:59 PM Vitaly Bursov <vitaly@bursov.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi,
> >>>>>>>>>>>
> >>>>>>>>>>> We found a occassional and random (sometimes happens, sometimes not)
> >>>>>>>>>>> packet re-order when NIC is involved in UDP multicast reception, which
> >>>>>>>>>>> is sensitive to a packet re-order. Network capture with tcpdump
> >>>>>>>>>>> sometimes shows the packet re-order, sometimes not (e.g. no re-order on
> >>>>>>>>>>> a host, re-order in a container at the same time). In a pcap file
> >>>>>>>>>>> re-ordered packets have a correct timestamp - delayed packet had a more
> >>>>>>>>>>> earlier timestamp compared to a previous packet:
> >>>>>>>>>>>           1.00s packet1
> >>>>>>>>>>>           1.20s packet3
> >>>>>>>>>>>           1.10s packet2
> >>>>>>>>>>>           1.30s packet4
> >>>>>>>>>>>
> >>>>>>>>>>> There's about 300Mbps of traffic on this NIC, and server is busy
> >>>>>>>>>>> (hyper-threading enabled, about 50% overall idle) with its
> >>>>>>>>>>> computational application work.
> >>>>>>>>>>>
> >>>>>>>>>>> NIC is HPE's 4-port 331i adapter - BCM5719, in a default ring and
> >>>>>>>>>>> coalescing configuration, 1 TX queue, 4 RX queues.
> >>>>>>>>>>>
> >>>>>>>>>>> After further investigation, I believe that there are two separate
> >>>>>>>>>>> issues in tg3.c driver. Issues can be reproduced with iperf3, and
> >>>>>>>>>>> unicast UDP.
> >>>>>>>>>>>
> >>>>>>>>>>> Here are the details of how I understand this behavior.
> >>>>>>>>>>>
> >>>>>>>>>>> 1. Packet re-order.
> >>>>>>>>>>>
> >>>>>>>>>>> Driver calls napi_schedule(&tnapi->napi) when handling the interrupt,
> >>>>>>>>>>> however, sometimes it calls napi_schedule(&tp->napi[1].napi), which
> >>>>>>>>>>> handles RX queue 0 too:
> >>>>>>>>>>>
> >>>>>>>>>>>           https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/broadcom/tg3.c#L6802-L7007
> >>>>>>>>>>>
> >>>>>>>>>>>           static int tg3_rx(struct tg3_napi *tnapi, int budget)
> >>>>>>>>>>>           {
> >>>>>>>>>>>                   struct tg3 *tp = tnapi->tp;
> >>>>>>>>>>>
> >>>>>>>>>>>                   ...
> >>>>>>>>>>>
> >>>>>>>>>>>                   /* Refill RX ring(s). */
> >>>>>>>>>>>                   if (!tg3_flag(tp, ENABLE_RSS)) {
> >>>>>>>>>>>                           ....
> >>>>>>>>>>>                   } else if (work_mask) {
> >>>>>>>>>>>                           ...
> >>>>>>>>>>>
> >>>>>>>>>>>                           if (tnapi != &tp->napi[1]) {
> >>>>>>>>>>>                                   tp->rx_refill = true;
> >>>>>>>>>>>                                   napi_schedule(&tp->napi[1].napi);
> >>>>>>>>>>>                           }
> >>>>>>>>>>>                   }
> >>>>>>>>>>>                   ...
> >>>>>>>>>>>           }
> >>>>>>>>>>>
> >>>>>>>>>>>       From napi_schedule() code, it should schedure RX 0 traffic handling on
> >>>>>>>>>>> a current CPU, which handles queues RX1-3 right now.
> >>>>>>>>>>>
> >>>>>>>>>>> At least two traffic flows are required - one on RX queue 0, and the
> >>>>>>>>>>> other on any other queue (1-3). Re-ordering may happend only on flow
> >>>>>>>>>>> from queue 0, the second flow will work fine.
> >>>>>>>>>>>
> >>>>>>>>>>> No idea how to fix this.
> >>>>>>>>>
> >>>>>>>>> In the case of RSS the actual rings for RX are from 1 to 4.
> >>>>>>>>> The napi of those rings are indeed processing the packets.
> >>>>>>>>> The explicit napi_schedule of napi[1] is only re-filling rx BD
> >>>>>>>>> producer ring because it is shared with return rings for 1-4.
> >>>>>>>>> I tried to repro this but I am not seeing the issue. If you are
> >>>>>>>>> receiving packets on RX 0 then the RSS must have been disabled.
> >>>>>>>>> Can you please check?
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> # ethtool -i enp2s0f0
> >>>>>>>> driver: tg3
> >>>>>>>> version: 3.137
> >>>>>>>> firmware-version: 5719-v1.46 NCSI v1.5.18.0
> >>>>>>>> expansion-rom-version:
> >>>>>>>> bus-info: 0000:02:00.0
> >>>>>>>> supports-statistics: yes
> >>>>>>>> supports-test: yes
> >>>>>>>> supports-eeprom-access: yes
> >>>>>>>> supports-register-dump: yes
> >>>>>>>> supports-priv-flags: no
> >>>>>>>>
> >>>>>>>> # ethtool -l enp2s0f0
> >>>>>>>> Channel parameters for enp2s0f0:
> >>>>>>>> Pre-set maximums:
> >>>>>>>> RX:             4
> >>>>>>>> TX:             4
> >>>>>>>> Other:          0
> >>>>>>>> Combined:       0
> >>>>>>>> Current hardware settings:
> >>>>>>>> RX:             4
> >>>>>>>> TX:             1
> >>>>>>>> Other:          0
> >>>>>>>> Combined:       0
> >>>>>>>>
> >>>>>>>> # ethtool -x enp2s0f0
> >>>>>>>> RX flow hash indirection table for enp2s0f0 with 4 RX ring(s):
> >>>>>>>>          0:      0     1     2     3     0     1     2     3
> >>>>>>>>          8:      0     1     2     3     0     1     2     3
> >>>>>>>>         16:      0     1     2     3     0     1     2     3
> >>>>>>>>         24:      0     1     2     3     0     1     2     3
> >>>>>>>>         32:      0     1     2     3     0     1     2     3
> >>>>>>>>         40:      0     1     2     3     0     1     2     3
> >>>>>>>>         48:      0     1     2     3     0     1     2     3
> >>>>>>>>         56:      0     1     2     3     0     1     2     3
> >>>>>>>>         64:      0     1     2     3     0     1     2     3
> >>>>>>>>         72:      0     1     2     3     0     1     2     3
> >>>>>>>>         80:      0     1     2     3     0     1     2     3
> >>>>>>>>         88:      0     1     2     3     0     1     2     3
> >>>>>>>>         96:      0     1     2     3     0     1     2     3
> >>>>>>>>        104:      0     1     2     3     0     1     2     3
> >>>>>>>>        112:      0     1     2     3     0     1     2     3
> >>>>>>>>        120:      0     1     2     3     0     1     2     3
> >>>>>>>> RSS hash key:
> >>>>>>>> Operation not supported
> >>>>>>>> RSS hash function:
> >>>>>>>>          toeplitz: on
> >>>>>>>>          xor: off
> >>>>>>>>          crc32: off
> >>>>>>>>
> >>>>>>>> In /proc/interrupts there are enp2s0f0-tx-0, enp2s0f0-rx-1,
> >>>>>>>> enp2s0f0-rx-2, enp2s0f0-rx-3, enp2s0f0-rx-4 interrupts, all on
> >>>>>>>> different CPU cores. Kernel also has "threadirqs" enabled in
> >>>>>>>> command line, I didn't check if this parameter affects the issue.
> >>>>>>>>
> >>>>>>>> Yes, some things start with 0, and others with 1, sorry for a confusion
> >>>>>>>> in terminology, what I meant:
> >>>>>>>>       - There are 4 RX rings/queues, I counted starting from 0, so: 0..3.
> >>>>>>>>         RX0 is the first queue/ring that actually receives the traffic.
> >>>>>>>>         RX0 is handled by enp2s0f0-rx-1 interrupt.
> >>>>>>>>       - These are related to (tp->napi[i]), but i is in 1..4, so the first
> >>>>>>>>         receiving queue relates to tp->napi[1], the second relates to
> >>>>>>>>         tp->napi[2], and so on. Correct?
> >>>>>>>>
> >>>>>>>> Suppose, tg3_rx() is called for tp->napi[2], this function most likely
> >>>>>>>> calls napi_gro_receive(&tnapi->napi, skb) to further process packets in
> >>>>>>>> tp->napi[2]. And, under some conditions (RSS and work_mask), it calls
> >>>>>>>> napi_schedule(&tp->napi[1].napi), which schedules tp->napi[1] work
> >>>>>>>> on the current CPU, which is designated for tp->napi[2], but not for
> >>>>>>>> tp->napi[1]. Correct?
> >>>>>>>>
> >>>>>>>> I don't understand what napi_schedule(&tp->napi[1].napi) does for the
> >>>>>>>> NIC or driver, "re-filling rx BD producer ring" sounds important. I
> >>>>>>>> suspect something will break badly if I simply remove it without
> >>>>>>>> replacing with something more elaborate. I guess along with re-filling
> >>>>>>>> rx BD producer ring it also can process incoming packets. Is it possible?
> >>>>>>>>
> >>>>>>>
> >>>>>>> Yes, napi[1] work may be called on the napi[2]'s CPU but it generally
> >>>>>>> won't process
> >>>>>>> any rx packets because the producer index of napi[1] has not changed. If the
> >>>>>>> producer count did change, then we get a poll from the ISR for napi[1]
> >>>>>>> to process
> >>>>>>> packets. So it is mostly used to re-fill rx buffers when called
> >>>>>>> explicitly. However
> >>>>>>> there could be a small window where the prod index is incremented but the ISR
> >>>>>>> is not fired yet. It may process some small no of packets. But I don't
> >>>>>>> think this
> >>>>>>> should lead to a reorder problem.
> >>>>>>>
> >>>>>>
> >>>>>> I tried to reproduce without using bridge and veth interfaces, and it seems
> >>>>>> like it's not reproducible, so traffic forwarding via a bridge interface may
> >>>>>> be necessary. It also does not happen if traffic load is low, but moderate
> >>>>>> load is enough - e.g. two 100 Mbps streams with 130-byte packets. It's easier
> >>>>>> to reproduce with a higher load.
> >>>>>>
> >>>>>> With about the same setup as in an original message (bridge + veth 2
> >>>>>> network namespaces), irqbalance daemon stopped, if traffic flows via
> >>>>>> enp2s0f0-rx-2 and enp2s0f0-rx-4, there's no reordering. enp2s0f0-rx-1
> >>>>>> still gets some interrupts, but at a much lower rate compared to 2 and
> >>>>>> 4.
> >>>>>>
> >>>>>> namespace 1:
> >>>>>>       # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
> >>>>>>       - - - - - - - - - - - - - - - - - - - - - - - - -
> >>>>>>       [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
> >>>>>>       [  4]   0.00-300.00 sec  6.72 GBytes   192 Mbits/sec  0.008 ms  3805/55508325 (0.0069%)
> >>>>>>       [  4] Sent 55508325 datagrams
> >>>>>>
> >>>>>>       iperf Done.
> >>>>>>
> >>>>>> namespace 2:
> >>>>>>       # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
> >>>>>>       - - - - - - - - - - - - - - - - - - - - - - - - -
> >>>>>>       [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
> >>>>>>       [  4]   0.00-300.00 sec  6.83 GBytes   196 Mbits/sec  0.005 ms  3873/56414001 (0.0069%)
> >>>>>>       [  4] Sent 56414001 datagrams
> >>>>>>
> >>>>>>       iperf Done.
> >>>>>>
> >>>>>>
> >>>>>> With the same configuration but different IP address so that instead of
> >>>>>> enp2s0f0-rx-4 enp2s0f0-rx-1 would be used, there is a reordering.
> >>>>>>
> >>>>>>
> >>>>>> namespace 1 (client IP was changed):
> >>>>>>       # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
> >>>>>>       - - - - - - - - - - - - - - - - - - - - - - - - -
> >>>>>>       [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
> >>>>>>       [  4]   0.00-300.00 sec  6.32 GBytes   181 Mbits/sec  0.007 ms  8506/52172059 (0.016%)
> >>>>>>       [  4] Sent 52172059 datagrams
> >>>>>>       [SUM]  0.0-300.0 sec  2452 datagrams received out-of-order
> >>>>>>
> >>>>>>       iperf Done.
> >>>>>>
> >>>>>> namespace 2:
> >>>>>>       # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
> >>>>>>       - - - - - - - - - - - - - - - - - - - - - - - - -
> >>>>>>       [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
> >>>>>>       [  4]   0.00-300.00 sec  6.59 GBytes   189 Mbits/sec  0.006 ms  6302/54463973 (0.012%)
> >>>>>>       [  4] Sent 54463973 datagrams
> >>>>>>
> >>>>>>       iperf Done.
> >>>>>>
> >>>>>> Swapping IP addresses in these namespaces also changes the namespace exhibiting the issue,
> >>>>>> it's following the IP address.
> >>>>>>
> >>>>>>
> >>>>>> Is there something I could check to confirm that this behavior is or is not
> >>>>>> related to napi_schedule(&tp->napi[1].napi) call?
> >>>>>
> >>>>> in the function tg3_msi_1shot() you could store the cpu assigned to
> >>>>> tnapi1 (inside the struct tg3_napi)
> >>>>> and then in tg3_poll_work() you can add another check after
> >>>>>            if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr)
> >>>>> something like
> >>>>> if (tnapi == &tp->napi[1] && tnapi->assigned_cpu == smp_processor_id())
> >>>>> only then execute tg3_rx()
> >>>>>
> >>>>> This may stop tnapi 1 from reading rx pkts on the current CPU from
> >>>>> which refill is called.
> >>>>>
> >>>>
> >>>> Didn't work for me, perhaps I did something wrong - if tg3_rx() is not called,
> >>>> there's an infinite loop, and after I added "work_done = budget;", it still doesn't
> >>>> work - traffic does not flow.
> >>>>
> >>>
> >>> I think the easiest way is to modify the tg3_rx() calling condition
> >>> like below inside
> >>> tg3_poll_work() :
> >>>
> >>> if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr) {
> >>>           if (tnapi != &tp->napi[1] || (tnapi == &tp->napi[1] &&
> >>> !tp->rx_refill)) {
> >>>                           work_done += tg3_rx(tnapi, budget - work_done);
> >>>           }
> >>> }
> >>>
> >>> This will prevent reading rx packets when napi[1] is scheduled only for refill.
> >>> Can you see if this works?
> >>>
> >>
> >> It doesn't hang and can receive the traffic with this change, but I don't see
> >> a difference. I'm suspecting that tg3_poll_work() is called again, maybe in tg3_poll_msix(),
> >> and refill happens first, and then packets are processed anyway.
> >>
> >
> > OK I see it now. Let me try this out myself. Will get back on this.
> > However, can you see with your debug prints if there is any correlation
> > between the time and number of prints where napi 1 is reading packets
> > on an unassigned CPU and the time and number of packets you received
> > out of order up the stack? Do they match with each other? If not, we may be
> > incorrectly suspecting napi1 here.
> >
>
> No correlation that I can see - reordered packets are received sometimes -
> 10000 in 300 seconds in this test, but napi messages are logged and
> rate-limited at about 100000 per second. If bandwidth is very low, then
> there are no messages and no reordering. Not sure if I can isolate these
> events specifically.

I'm facing the same issue: multicast packet reordering on traffic received by
tg3 and delivered to a macvlan. tcpdump on the NIC looks fine, while tcpdump on
the macvlan shows reordering.
I'm using Alma 8.6, and for me the only fix is to go down to 1 RX queue.
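
For reference, the workaround looks roughly like this (the interface name is
just an example taken from earlier in this thread, and which channel counts
can be changed this way depends on the driver):

    # check the current and maximum channel counts
    ethtool -l enp2s0f0
    # drop to a single RX queue
    ethtool -L enp2s0f0 rx 1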

Was there another email thread with more progress, or was there a fix
outside of tg3.c for this issue?
(Looking at the git log for tg3.c, I don't see anything relevant.)

Thanks
Etienne

>
> --
> Thanks
> Vitalii
>
>
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2022-09-21 19:05 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-20 13:29 tg3 RX packet re-order in queue 0 with RSS Vitaly Bursov
2021-09-22  6:40 ` Siva Reddy Kallam
2021-10-27  9:30   ` Pavan Chebbi
2021-10-27 10:31     ` Vitaly Bursov
2021-10-28  7:33       ` Pavan Chebbi
2021-10-28 15:41         ` Vitaly Bursov
2021-10-29  5:04           ` Pavan Chebbi
2021-10-29 15:45             ` Vitaly Bursov
2021-11-01  7:06               ` Pavan Chebbi
2021-11-01  8:20                 ` Vitaly Bursov
2021-11-01  9:10                   ` Pavan Chebbi
2021-11-01 10:17                     ` Vitaly Bursov
2022-09-21 19:04                       ` Etienne Champetier
