* tg3 RX packet re-order in queue 0 with RSS
@ 2021-09-20 13:29 Vitaly Bursov
  2021-09-22  6:40 ` Siva Reddy Kallam
  0 siblings, 1 reply; 13+ messages in thread
From: Vitaly Bursov @ 2021-09-20 13:29 UTC (permalink / raw)
  To: Siva Reddy Kallam, Prashant Sreedharan, Michael Chan, netdev

Hi,

We found an occasional and random (sometimes happens, sometimes not)
packet re-order when the NIC is involved in UDP multicast reception,
which is sensitive to packet re-ordering. A network capture with
tcpdump sometimes shows the re-order and sometimes does not (e.g. no
re-order on the host while there is re-order in a container at the
same time). In a pcap file the re-ordered packets have correct
timestamps - the delayed packet has an earlier timestamp than the
packet before it:
     1.00s packet1
     1.20s packet3
     1.10s packet2
     1.30s packet4

There is about 300 Mbps of traffic on this NIC, and the server is busy
(hyper-threading enabled, about 50% overall idle) with its
computational application workload.

The NIC is HPE's 4-port 331i adapter (BCM5719), in the default ring and
coalescing configuration: 1 TX queue, 4 RX queues.

After further investigation, I believe there are two separate issues
in the tg3.c driver. Both can be reproduced with iperf3 and unicast
UDP.

Here are the details of how I understand this behavior.

1. Packet re-order.

The driver calls napi_schedule(&tnapi->napi) when handling the
interrupt; however, it sometimes also calls
napi_schedule(&tp->napi[1].napi), which handles RX queue 0 as well:

     https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/broadcom/tg3.c#L6802-L7007

     static int tg3_rx(struct tg3_napi *tnapi, int budget)
     {
             struct tg3 *tp = tnapi->tp;

             ...

             /* Refill RX ring(s). */
             if (!tg3_flag(tp, ENABLE_RSS)) {
                     ....
             } else if (work_mask) {
                     ...

                     if (tnapi != &tp->napi[1]) {
                             tp->rx_refill = true;
                             napi_schedule(&tp->napi[1].napi);
                     }
             }
             ...
     }

Judging from the napi_schedule() code, this schedules RX queue 0
traffic handling on the current CPU - the one handling queues RX1-3 at
that moment.
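
For reference, napi_schedule() ends up in __napi_schedule() in
net/core/dev.c, which queues the NAPI context on the poll list of
whichever CPU it is called from (excerpt; it may differ slightly
between kernel versions):

     void __napi_schedule(struct napi_struct *n)
     {
             unsigned long flags;

             local_irq_save(flags);
             /* add n to the current CPU's poll list, raise NET_RX_SOFTIRQ */
             ____napi_schedule(this_cpu_ptr(&softnet_data), n);
             local_irq_restore(flags);
     }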

At least two traffic flows are required - one on RX queue 0 and the
other on any other queue (1-3). Re-ordering may happen only on the
flow from queue 0; the second flow works fine.

No idea how to fix this.

There are two ways to mitigate this:

   1. Enable RPS by writing any non-zero mask to
      /sys/class/net/enp2s0f0/queues/rx-0/rps_cpus. This enforces the
      CPU used for processing the traffic and overrides whatever the
      "current" CPU for RX queue 0 happens to be at that moment.

   2. Configure the RX flow hash indirection with "ethtool -X enp2s0f0
      weight 0 1 1 1" to exclude RX queue 0 from handling the traffic.


2. RPS configuration

Before the napi_gro_receive() call, there is no call to skb_record_rx_queue():

     static int tg3_rx(struct tg3_napi *tnapi, int budget)
     {
             struct tg3 *tp = tnapi->tp;
             u32 work_mask, rx_std_posted = 0;
             u32 std_prod_idx, jmb_prod_idx;
             u32 sw_idx = tnapi->rx_rcb_ptr;
             u16 hw_idx;
             int received;
             struct tg3_rx_prodring_set *tpr = &tnapi->prodring;

             ...

                     napi_gro_receive(&tnapi->napi, skb);


                     received++;
                     budget--;
             ...


As a result, queue_mapping is always 0 (not set), and RPS treats all
traffic as originating from queue 0.

           <idle>-0     [013] ..s. 14030782.234664: napi_gro_receive_entry: dev=enp2s0f0 napi_id=0x0 queue_mapping=0 ...

RPS configuration for rx-1 to rx-3 has no effect.
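
For reference, RPS picks the per-queue rps_cpus map in get_rps_cpu()
(net/core/dev.c) based on the RX queue recorded in the skb, so without
skb_record_rx_queue() it always falls back to the rx-0 map (excerpt;
it may differ slightly between kernel versions):

     /* in get_rps_cpu() */
     struct netdev_rx_queue *rxqueue = dev->_rx;

     if (skb_rx_queue_recorded(skb)) {
             u16 index = skb_get_rx_queue(skb);
             ...
             rxqueue += index;  /* select that queue's rps_map */
     }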


NIC:
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
     Subsystem: Hewlett-Packard Company Ethernet 1Gb 4-port 331i Adapter
     Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
     Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
     Latency: 0, Cache Line Size: 64 bytes
     Interrupt: pin A routed to IRQ 16
     NUMA node: 0
     Region 0: Memory at d9d90000 (64-bit, prefetchable) [size=64K]
     Region 2: Memory at d9da0000 (64-bit, prefetchable) [size=64K]
     Region 4: Memory at d9db0000 (64-bit, prefetchable) [size=64K]
     [virtual] Expansion ROM at d9c00000 [disabled] [size=256K]
     Capabilities: <access denied>
     Kernel driver in use: tg3
     Kernel modules: tg3

Linux kernel:
     CentOS 7 - 3.10.0-1160.15.2
     Ubuntu - 5.4.0-80.90

Network configuration:
     iperf3 (sender) - [LAN] - NIC (enp2s0f0) - Bridge (br0) - veth (v1) - namespace veth (v2) - iperf3 (receiver 1)

     brctl addbr br0
     ip l set up dev br0
     ip a a 10.10.10.10/24 dev br0
     ip r a default via 10.10.10.1 dev br0
     ip l set dev enp2s0f0 master br0
     ip l set up dev enp2s0f0

     ip netns add n1
     ip link add v1 type veth peer name v2
     ip l set up dev v1
     ip l set dev v1 master br0
     ip l set dev v2 netns n1

     ip netns exec n1 bash
     ip l set up dev lo
     ip l set up dev v2
     ip a a 10.10.10.11/24 dev v2

     "receiver 2" has the same configuration but different IP and different namespace.

Iperf3:

     The sender runs the iperf3 servers: iperf3 -s -p 5201 & iperf3 -s -p 5202 &
     Each receiver runs: iperf3 -c 10.10.10.1 -R -p 5201 -u -l 163 -b 200M -t 300

-- 
Thanks
Vitalii


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: tg3 RX packet re-order in queue 0 with RSS
  2021-09-20 13:29 tg3 RX packet re-order in queue 0 with RSS Vitaly Bursov
@ 2021-09-22  6:40 ` Siva Reddy Kallam
  2021-10-27  9:30   ` Pavan Chebbi
  0 siblings, 1 reply; 13+ messages in thread
From: Siva Reddy Kallam @ 2021-09-22  6:40 UTC (permalink / raw)
  To: Vitaly Bursov
  Cc: Prashant Sreedharan, Michael Chan, Linux Netdev List, Pavan Chebbi

Thank you for reporting this. Pavan (cc'd) from Broadcom is looking into this issue.
We will provide our feedback on this very soon.

On Mon, Sep 20, 2021 at 6:59 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>
> Hi,
>
> We found a occassional and random (sometimes happens, sometimes not)
> packet re-order when NIC is involved in UDP multicast reception, which
> is sensitive to a packet re-order. Network capture with tcpdump
> sometimes shows the packet re-order, sometimes not (e.g. no re-order on
> a host, re-order in a container at the same time). In a pcap file
> re-ordered packets have a correct timestamp - delayed packet had a more
> earlier timestamp compared to a previous packet:
>      1.00s packet1
>      1.20s packet3
>      1.10s packet2
>      1.30s packet4
>
> There's about 300Mbps of traffic on this NIC, and server is busy
> (hyper-threading enabled, about 50% overall idle) with its
> computational application work.
>
> NIC is HPE's 4-port 331i adapter - BCM5719, in a default ring and
> coalescing configuration, 1 TX queue, 4 RX queues.
>
> After further investigation, I believe that there are two separate
> issues in tg3.c driver. Issues can be reproduced with iperf3, and
> unicast UDP.
>
> Here are the details of how I understand this behavior.
>
> 1. Packet re-order.
>
> Driver calls napi_schedule(&tnapi->napi) when handling the interrupt,
> however, sometimes it calls napi_schedule(&tp->napi[1].napi), which
> handles RX queue 0 too:
>
>      https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/broadcom/tg3.c#L6802-L7007
>
>      static int tg3_rx(struct tg3_napi *tnapi, int budget)
>      {
>              struct tg3 *tp = tnapi->tp;
>
>              ...
>
>              /* Refill RX ring(s). */
>              if (!tg3_flag(tp, ENABLE_RSS)) {
>                      ....
>              } else if (work_mask) {
>                      ...
>
>                      if (tnapi != &tp->napi[1]) {
>                              tp->rx_refill = true;
>                              napi_schedule(&tp->napi[1].napi);
>                      }
>              }
>              ...
>      }
>
>  From napi_schedule() code, it should schedure RX 0 traffic handling on
> a current CPU, which handles queues RX1-3 right now.
>
> At least two traffic flows are required - one on RX queue 0, and the
> other on any other queue (1-3). Re-ordering may happend only on flow
> from queue 0, the second flow will work fine.
>
> No idea how to fix this.
>
> There are two ways to mitigate this:
>
>    1. Enable RPS by writting any non-zero mask to
>       /sys/class/net/enp2s0f0/queues/rx-0/rps_cpus This encorces CPU
>       when processing traffic, and overrides whatever "current" CPU for
>       RX queue 0 is in this moment.
>
>    2. Configure RX hash flow redirection with: ethtool -X enp2s0f0
>       weight 0 1 1 1 to exclude RX queue 0 from handling the traffic.
>
>
> 2. RPS configuration
>
> Before napi_gro_receive() call, there's no call to skb_record_rx_queue():
>
>      static int tg3_rx(struct tg3_napi *tnapi, int budget)
>      {
>              struct tg3 *tp = tnapi->tp;
>              u32 work_mask, rx_std_posted = 0;
>              u32 std_prod_idx, jmb_prod_idx;
>              u32 sw_idx = tnapi->rx_rcb_ptr;
>              u16 hw_idx;
>              int received;
>              struct tg3_rx_prodring_set *tpr = &tnapi->prodring;
>
>              ...
>
>                      napi_gro_receive(&tnapi->napi, skb);
>
>
>                      received++;
>                      budget--;
>              ...
>
>
> As a result, queue_mapping is always 0/not set, and RPS handles all
> traffic as originating from queue 0.
>
>            <idle>-0     [013] ..s. 14030782.234664: napi_gro_receive_entry: dev=enp2s0f0 napi_id=0x0 queue_mapping=0 ...
>
> RPS configuration for rx-1 to to rx-3 has no effect.
>
>
> NIC:
> 02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
>      Subsystem: Hewlett-Packard Company Ethernet 1Gb 4-port 331i Adapter
>      Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
>      Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>      Latency: 0, Cache Line Size: 64 bytes
>      Interrupt: pin A routed to IRQ 16
>      NUMA node: 0
>      Region 0: Memory at d9d90000 (64-bit, prefetchable) [size=64K]
>      Region 2: Memory at d9da0000 (64-bit, prefetchable) [size=64K]
>      Region 4: Memory at d9db0000 (64-bit, prefetchable) [size=64K]
>      [virtual] Expansion ROM at d9c00000 [disabled] [size=256K]
>      Capabilities: <access denied>
>      Kernel driver in use: tg3
>      Kernel modules: tg3
>
> Linux kernel:
>      CentOS 7 - 3.10.0-1160.15.2
>      Ubuntu - 5.4.0-80.90
>
> Network configuration:
>      iperf3 (sender) - [LAN] - NIC (enp2s0f0) - Bridge (br0) - veth (v1) - namespace veth (v2) - iperf3 (receiver 1)
>
>      brctl addbr br0
>      ip l set up dev br0
>      ip a a 10.10.10.10/24 dev br0
>      ip r a default via 10.10.10.1 dev br0
>      ip l set dev enp2s0f0 master br0
>      ip l set up dev enp2s0f0
>
>      ip netns add n1
>      ip link add v1 type veth peer name v2
>      ip l set up dev v1
>      ip l set dev v1 master br0
>      ip l set dev v2 netns n1
>
>      ip netns exec n1 bash
>      ip l set up dev lo
>      ip l set up dev v2
>      ip a a 10.10.10.11/24 dev v2
>
>      "receiver 2" has the same configuration but different IP and different namespace.
>
> Iperf3:
>
>      Sender runs iperfs: iperf3 -s -p 5201 & iperf3 -s -p 5202 &
>      Receiver's iperf3 -c 10.10.10.1 -R -p 5201 -u -l 163 -b 200M -t 300
>
> --
> Thanks
> Vitalii
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: tg3 RX packet re-order in queue 0 with RSS
  2021-09-22  6:40 ` Siva Reddy Kallam
@ 2021-10-27  9:30   ` Pavan Chebbi
  2021-10-27 10:31     ` Vitaly Bursov
  0 siblings, 1 reply; 13+ messages in thread
From: Pavan Chebbi @ 2021-10-27  9:30 UTC (permalink / raw)
  To: Siva Reddy Kallam
  Cc: Vitaly Bursov, Prashant Sreedharan, Michael Chan, Linux Netdev List

On Wed, Sep 22, 2021 at 12:10 PM Siva Reddy Kallam
<siva.kallam@broadcom.com> wrote:
>
> Thank you for reporting this. Pavan(cc'd) from Broadcom looking into this issue.
> We will provide our feedback very soon on this.
>
> On Mon, Sep 20, 2021 at 6:59 PM Vitaly Bursov <vitaly@bursov.com> wrote:
> >
> > Hi,
> >
> > We found a occassional and random (sometimes happens, sometimes not)
> > packet re-order when NIC is involved in UDP multicast reception, which
> > is sensitive to a packet re-order. Network capture with tcpdump
> > sometimes shows the packet re-order, sometimes not (e.g. no re-order on
> > a host, re-order in a container at the same time). In a pcap file
> > re-ordered packets have a correct timestamp - delayed packet had a more
> > earlier timestamp compared to a previous packet:
> >      1.00s packet1
> >      1.20s packet3
> >      1.10s packet2
> >      1.30s packet4
> >
> > There's about 300Mbps of traffic on this NIC, and server is busy
> > (hyper-threading enabled, about 50% overall idle) with its
> > computational application work.
> >
> > NIC is HPE's 4-port 331i adapter - BCM5719, in a default ring and
> > coalescing configuration, 1 TX queue, 4 RX queues.
> >
> > After further investigation, I believe that there are two separate
> > issues in tg3.c driver. Issues can be reproduced with iperf3, and
> > unicast UDP.
> >
> > Here are the details of how I understand this behavior.
> >
> > 1. Packet re-order.
> >
> > Driver calls napi_schedule(&tnapi->napi) when handling the interrupt,
> > however, sometimes it calls napi_schedule(&tp->napi[1].napi), which
> > handles RX queue 0 too:
> >
> >      https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/broadcom/tg3.c#L6802-L7007
> >
> >      static int tg3_rx(struct tg3_napi *tnapi, int budget)
> >      {
> >              struct tg3 *tp = tnapi->tp;
> >
> >              ...
> >
> >              /* Refill RX ring(s). */
> >              if (!tg3_flag(tp, ENABLE_RSS)) {
> >                      ....
> >              } else if (work_mask) {
> >                      ...
> >
> >                      if (tnapi != &tp->napi[1]) {
> >                              tp->rx_refill = true;
> >                              napi_schedule(&tp->napi[1].napi);
> >                      }
> >              }
> >              ...
> >      }
> >
> >  From napi_schedule() code, it should schedure RX 0 traffic handling on
> > a current CPU, which handles queues RX1-3 right now.
> >
> > At least two traffic flows are required - one on RX queue 0, and the
> > other on any other queue (1-3). Re-ordering may happend only on flow
> > from queue 0, the second flow will work fine.
> >
> > No idea how to fix this.

In the case of RSS, the actual RX rings are 1 to 4.
The NAPI contexts of those rings are indeed processing the packets.
The explicit napi_schedule() of napi[1] only re-fills the RX BD
producer ring, because it is shared with return rings 1-4.
I tried to reproduce this but I am not seeing the issue. If you are
receiving packets on RX 0 then RSS must have been disabled.
Can you please check?


> >
> > There are two ways to mitigate this:
> >
> >    1. Enable RPS by writting any non-zero mask to
> >       /sys/class/net/enp2s0f0/queues/rx-0/rps_cpus This encorces CPU
> >       when processing traffic, and overrides whatever "current" CPU for
> >       RX queue 0 is in this moment.
> >
> >    2. Configure RX hash flow redirection with: ethtool -X enp2s0f0
> >       weight 0 1 1 1 to exclude RX queue 0 from handling the traffic.
> >
> >
> > 2. RPS configuration
> >
> > Before napi_gro_receive() call, there's no call to skb_record_rx_queue():
> >
> >      static int tg3_rx(struct tg3_napi *tnapi, int budget)
> >      {
> >              struct tg3 *tp = tnapi->tp;
> >              u32 work_mask, rx_std_posted = 0;
> >              u32 std_prod_idx, jmb_prod_idx;
> >              u32 sw_idx = tnapi->rx_rcb_ptr;
> >              u16 hw_idx;
> >              int received;
> >              struct tg3_rx_prodring_set *tpr = &tnapi->prodring;
> >
> >              ...
> >
> >                      napi_gro_receive(&tnapi->napi, skb);
> >
> >
> >                      received++;
> >                      budget--;
> >              ...
> >
> >
> > As a result, queue_mapping is always 0/not set, and RPS handles all
> > traffic as originating from queue 0.
> >
> >            <idle>-0     [013] ..s. 14030782.234664: napi_gro_receive_entry: dev=enp2s0f0 napi_id=0x0 queue_mapping=0 ...
> >
> > RPS configuration for rx-1 to to rx-3 has no effect.

OK, I think we could add a patch to update the skb with the queue
mapping. I will discuss it internally.
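
For illustration, a minimal sketch of such a change (untested; it
assumes tp->napi[1..4] service RX return rings 0..3, so the recorded
index is zero-based) would go in tg3_rx(), just before
napi_gro_receive():

        /* let RPS/XPS see which RX queue this packet came from */
        skb_record_rx_queue(skb, tnapi - &tp->napi[1]);

        napi_gro_receive(&tnapi->napi, skb);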

> >
> >
> > NIC:
> > 02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
> >      Subsystem: Hewlett-Packard Company Ethernet 1Gb 4-port 331i Adapter
> >      Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
> >      Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> >      Latency: 0, Cache Line Size: 64 bytes
> >      Interrupt: pin A routed to IRQ 16
> >      NUMA node: 0
> >      Region 0: Memory at d9d90000 (64-bit, prefetchable) [size=64K]
> >      Region 2: Memory at d9da0000 (64-bit, prefetchable) [size=64K]
> >      Region 4: Memory at d9db0000 (64-bit, prefetchable) [size=64K]
> >      [virtual] Expansion ROM at d9c00000 [disabled] [size=256K]
> >      Capabilities: <access denied>
> >      Kernel driver in use: tg3
> >      Kernel modules: tg3
> >
> > Linux kernel:
> >      CentOS 7 - 3.10.0-1160.15.2
> >      Ubuntu - 5.4.0-80.90
> >
> > Network configuration:
> >      iperf3 (sender) - [LAN] - NIC (enp2s0f0) - Bridge (br0) - veth (v1) - namespace veth (v2) - iperf3 (receiver 1)
> >
> >      brctl addbr br0
> >      ip l set up dev br0
> >      ip a a 10.10.10.10/24 dev br0
> >      ip r a default via 10.10.10.1 dev br0
> >      ip l set dev enp2s0f0 master br0
> >      ip l set up dev enp2s0f0
> >
> >      ip netns add n1
> >      ip link add v1 type veth peer name v2
> >      ip l set up dev v1
> >      ip l set dev v1 master br0
> >      ip l set dev v2 netns n1
> >
> >      ip netns exec n1 bash
> >      ip l set up dev lo
> >      ip l set up dev v2
> >      ip a a 10.10.10.11/24 dev v2
> >
> >      "receiver 2" has the same configuration but different IP and different namespace.
> >
> > Iperf3:
> >
> >      Sender runs iperfs: iperf3 -s -p 5201 & iperf3 -s -p 5202 &
> >      Receiver's iperf3 -c 10.10.10.1 -R -p 5201 -u -l 163 -b 200M -t 300
> >
> > --
> > Thanks
> > Vitalii
> >

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: tg3 RX packet re-order in queue 0 with RSS
  2021-10-27  9:30   ` Pavan Chebbi
@ 2021-10-27 10:31     ` Vitaly Bursov
  2021-10-28  7:33       ` Pavan Chebbi
  0 siblings, 1 reply; 13+ messages in thread
From: Vitaly Bursov @ 2021-10-27 10:31 UTC (permalink / raw)
  To: Pavan Chebbi, Siva Reddy Kallam
  Cc: Prashant Sreedharan, Michael Chan, Linux Netdev List


27.10.2021 12:30, Pavan Chebbi wrote:
> On Wed, Sep 22, 2021 at 12:10 PM Siva Reddy Kallam
> <siva.kallam@broadcom.com> wrote:
>>
>> Thank you for reporting this. Pavan(cc'd) from Broadcom looking into this issue.
>> We will provide our feedback very soon on this.
>>
>> On Mon, Sep 20, 2021 at 6:59 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>>>
>>> Hi,
>>>
>>> We found a occassional and random (sometimes happens, sometimes not)
>>> packet re-order when NIC is involved in UDP multicast reception, which
>>> is sensitive to a packet re-order. Network capture with tcpdump
>>> sometimes shows the packet re-order, sometimes not (e.g. no re-order on
>>> a host, re-order in a container at the same time). In a pcap file
>>> re-ordered packets have a correct timestamp - delayed packet had a more
>>> earlier timestamp compared to a previous packet:
>>>       1.00s packet1
>>>       1.20s packet3
>>>       1.10s packet2
>>>       1.30s packet4
>>>
>>> There's about 300Mbps of traffic on this NIC, and server is busy
>>> (hyper-threading enabled, about 50% overall idle) with its
>>> computational application work.
>>>
>>> NIC is HPE's 4-port 331i adapter - BCM5719, in a default ring and
>>> coalescing configuration, 1 TX queue, 4 RX queues.
>>>
>>> After further investigation, I believe that there are two separate
>>> issues in tg3.c driver. Issues can be reproduced with iperf3, and
>>> unicast UDP.
>>>
>>> Here are the details of how I understand this behavior.
>>>
>>> 1. Packet re-order.
>>>
>>> Driver calls napi_schedule(&tnapi->napi) when handling the interrupt,
>>> however, sometimes it calls napi_schedule(&tp->napi[1].napi), which
>>> handles RX queue 0 too:
>>>
>>>       https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/broadcom/tg3.c#L6802-L7007
>>>
>>>       static int tg3_rx(struct tg3_napi *tnapi, int budget)
>>>       {
>>>               struct tg3 *tp = tnapi->tp;
>>>
>>>               ...
>>>
>>>               /* Refill RX ring(s). */
>>>               if (!tg3_flag(tp, ENABLE_RSS)) {
>>>                       ....
>>>               } else if (work_mask) {
>>>                       ...
>>>
>>>                       if (tnapi != &tp->napi[1]) {
>>>                               tp->rx_refill = true;
>>>                               napi_schedule(&tp->napi[1].napi);
>>>                       }
>>>               }
>>>               ...
>>>       }
>>>
>>>   From napi_schedule() code, it should schedure RX 0 traffic handling on
>>> a current CPU, which handles queues RX1-3 right now.
>>>
>>> At least two traffic flows are required - one on RX queue 0, and the
>>> other on any other queue (1-3). Re-ordering may happend only on flow
>>> from queue 0, the second flow will work fine.
>>>
>>> No idea how to fix this.
> 
> In the case of RSS the actual rings for RX are from 1 to 4.
> The napi of those rings are indeed processing the packets.
> The explicit napi_schedule of napi[1] is only re-filling rx BD
> producer ring because it is shared with return rings for 1-4.
> I tried to repro this but I am not seeing the issue. If you are
> receiving packets on RX 0 then the RSS must have been disabled.
> Can you please check?
> 

# ethtool -i enp2s0f0
driver: tg3
version: 3.137
firmware-version: 5719-v1.46 NCSI v1.5.18.0
expansion-rom-version:
bus-info: 0000:02:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

# ethtool -l enp2s0f0
Channel parameters for enp2s0f0:
Pre-set maximums:
RX:		4
TX:		4
Other:		0
Combined:	0
Current hardware settings:
RX:		4
TX:		1
Other:		0
Combined:	0

# ethtool -x enp2s0f0
RX flow hash indirection table for enp2s0f0 with 4 RX ring(s):
     0:      0     1     2     3     0     1     2     3
     8:      0     1     2     3     0     1     2     3
    16:      0     1     2     3     0     1     2     3
    24:      0     1     2     3     0     1     2     3
    32:      0     1     2     3     0     1     2     3
    40:      0     1     2     3     0     1     2     3
    48:      0     1     2     3     0     1     2     3
    56:      0     1     2     3     0     1     2     3
    64:      0     1     2     3     0     1     2     3
    72:      0     1     2     3     0     1     2     3
    80:      0     1     2     3     0     1     2     3
    88:      0     1     2     3     0     1     2     3
    96:      0     1     2     3     0     1     2     3
   104:      0     1     2     3     0     1     2     3
   112:      0     1     2     3     0     1     2     3
   120:      0     1     2     3     0     1     2     3
RSS hash key:
Operation not supported
RSS hash function:
     toeplitz: on
     xor: off
     crc32: off

In /proc/interrupts there are enp2s0f0-tx-0, enp2s0f0-rx-1,
enp2s0f0-rx-2, enp2s0f0-rx-3 and enp2s0f0-rx-4 interrupts, all on
different CPU cores. The kernel also has "threadirqs" on the command
line; I didn't check whether this parameter affects the issue.

Yes, some things are counted from 0 and others from 1 - sorry for the
confusion in terminology. What I meant:
  - There are 4 RX rings/queues, which I counted starting from 0, so 0..3.
    RX0 is the first queue/ring that actually receives traffic.
    RX0 is handled by the enp2s0f0-rx-1 interrupt.
  - These correspond to tp->napi[i], but i is in 1..4, so the first
    receiving queue corresponds to tp->napi[1], the second to
    tp->napi[2], and so on. Correct?

Suppose tg3_rx() is called for tp->napi[2]. This function most likely
calls napi_gro_receive(&tnapi->napi, skb) to further process packets
in tp->napi[2]. And, under some conditions (RSS and work_mask), it
calls napi_schedule(&tp->napi[1].napi), which schedules tp->napi[1]
work on the current CPU - the one designated for tp->napi[2], not for
tp->napi[1]. Correct?

I don't understand what napi_schedule(&tp->napi[1].napi) does for the
NIC or driver; "re-filling the rx BD producer ring" sounds important.
I suspect something will break badly if I simply remove it without
replacing it with something more elaborate. I guess that along with
re-filling the rx BD producer ring it can also process incoming
packets. Is that possible?

-- 
Thanks
Vitalii

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: tg3 RX packet re-order in queue 0 with RSS
  2021-10-27 10:31     ` Vitaly Bursov
@ 2021-10-28  7:33       ` Pavan Chebbi
  2021-10-28 15:41         ` Vitaly Bursov
  0 siblings, 1 reply; 13+ messages in thread
From: Pavan Chebbi @ 2021-10-28  7:33 UTC (permalink / raw)
  To: Vitaly Bursov
  Cc: Siva Reddy Kallam, Prashant Sreedharan, Michael Chan, Linux Netdev List

On Wed, Oct 27, 2021 at 4:02 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>
>
> 27.10.2021 12:30, Pavan Chebbi wrote:
> > On Wed, Sep 22, 2021 at 12:10 PM Siva Reddy Kallam
> > <siva.kallam@broadcom.com> wrote:
> >>
> >> Thank you for reporting this. Pavan(cc'd) from Broadcom looking into this issue.
> >> We will provide our feedback very soon on this.
> >>
> >> On Mon, Sep 20, 2021 at 6:59 PM Vitaly Bursov <vitaly@bursov.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> We found a occassional and random (sometimes happens, sometimes not)
> >>> packet re-order when NIC is involved in UDP multicast reception, which
> >>> is sensitive to a packet re-order. Network capture with tcpdump
> >>> sometimes shows the packet re-order, sometimes not (e.g. no re-order on
> >>> a host, re-order in a container at the same time). In a pcap file
> >>> re-ordered packets have a correct timestamp - delayed packet had a more
> >>> earlier timestamp compared to a previous packet:
> >>>       1.00s packet1
> >>>       1.20s packet3
> >>>       1.10s packet2
> >>>       1.30s packet4
> >>>
> >>> There's about 300Mbps of traffic on this NIC, and server is busy
> >>> (hyper-threading enabled, about 50% overall idle) with its
> >>> computational application work.
> >>>
> >>> NIC is HPE's 4-port 331i adapter - BCM5719, in a default ring and
> >>> coalescing configuration, 1 TX queue, 4 RX queues.
> >>>
> >>> After further investigation, I believe that there are two separate
> >>> issues in tg3.c driver. Issues can be reproduced with iperf3, and
> >>> unicast UDP.
> >>>
> >>> Here are the details of how I understand this behavior.
> >>>
> >>> 1. Packet re-order.
> >>>
> >>> Driver calls napi_schedule(&tnapi->napi) when handling the interrupt,
> >>> however, sometimes it calls napi_schedule(&tp->napi[1].napi), which
> >>> handles RX queue 0 too:
> >>>
> >>>       https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/broadcom/tg3.c#L6802-L7007
> >>>
> >>>       static int tg3_rx(struct tg3_napi *tnapi, int budget)
> >>>       {
> >>>               struct tg3 *tp = tnapi->tp;
> >>>
> >>>               ...
> >>>
> >>>               /* Refill RX ring(s). */
> >>>               if (!tg3_flag(tp, ENABLE_RSS)) {
> >>>                       ....
> >>>               } else if (work_mask) {
> >>>                       ...
> >>>
> >>>                       if (tnapi != &tp->napi[1]) {
> >>>                               tp->rx_refill = true;
> >>>                               napi_schedule(&tp->napi[1].napi);
> >>>                       }
> >>>               }
> >>>               ...
> >>>       }
> >>>
> >>>   From napi_schedule() code, it should schedure RX 0 traffic handling on
> >>> a current CPU, which handles queues RX1-3 right now.
> >>>
> >>> At least two traffic flows are required - one on RX queue 0, and the
> >>> other on any other queue (1-3). Re-ordering may happend only on flow
> >>> from queue 0, the second flow will work fine.
> >>>
> >>> No idea how to fix this.
> >
> > In the case of RSS the actual rings for RX are from 1 to 4.
> > The napi of those rings are indeed processing the packets.
> > The explicit napi_schedule of napi[1] is only re-filling rx BD
> > producer ring because it is shared with return rings for 1-4.
> > I tried to repro this but I am not seeing the issue. If you are
> > receiving packets on RX 0 then the RSS must have been disabled.
> > Can you please check?
> >
>
> # ethtool -i enp2s0f0
> driver: tg3
> version: 3.137
> firmware-version: 5719-v1.46 NCSI v1.5.18.0
> expansion-rom-version:
> bus-info: 0000:02:00.0
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: yes
> supports-register-dump: yes
> supports-priv-flags: no
>
> # ethtool -l enp2s0f0
> Channel parameters for enp2s0f0:
> Pre-set maximums:
> RX:             4
> TX:             4
> Other:          0
> Combined:       0
> Current hardware settings:
> RX:             4
> TX:             1
> Other:          0
> Combined:       0
>
> # ethtool -x enp2s0f0
> RX flow hash indirection table for enp2s0f0 with 4 RX ring(s):
>      0:      0     1     2     3     0     1     2     3
>      8:      0     1     2     3     0     1     2     3
>     16:      0     1     2     3     0     1     2     3
>     24:      0     1     2     3     0     1     2     3
>     32:      0     1     2     3     0     1     2     3
>     40:      0     1     2     3     0     1     2     3
>     48:      0     1     2     3     0     1     2     3
>     56:      0     1     2     3     0     1     2     3
>     64:      0     1     2     3     0     1     2     3
>     72:      0     1     2     3     0     1     2     3
>     80:      0     1     2     3     0     1     2     3
>     88:      0     1     2     3     0     1     2     3
>     96:      0     1     2     3     0     1     2     3
>    104:      0     1     2     3     0     1     2     3
>    112:      0     1     2     3     0     1     2     3
>    120:      0     1     2     3     0     1     2     3
> RSS hash key:
> Operation not supported
> RSS hash function:
>      toeplitz: on
>      xor: off
>      crc32: off
>
> In /proc/interrupts there are enp2s0f0-tx-0, enp2s0f0-rx-1,
> enp2s0f0-rx-2, enp2s0f0-rx-3, enp2s0f0-rx-4 interrupts, all on
> different CPU cores. Kernel also has "threadirqs" enabled in
> command line, I didn't check if this parameter affects the issue.
>
> Yes, some things start with 0, and others with 1, sorry for a confusion
> in terminology, what I meant:
>   - There are 4 RX rings/queues, I counted starting from 0, so: 0..3.
>     RX0 is the first queue/ring that actually receives the traffic.
>     RX0 is handled by enp2s0f0-rx-1 interrupt.
>   - These are related to (tp->napi[i]), but i is in 1..4, so the first
>     receiving queue relates to tp->napi[1], the second relates to
>     tp->napi[2], and so on. Correct?
>
> Suppose, tg3_rx() is called for tp->napi[2], this function most likely
> calls napi_gro_receive(&tnapi->napi, skb) to further process packets in
> tp->napi[2]. And, under some conditions (RSS and work_mask), it calls
> napi_schedule(&tp->napi[1].napi), which schedules tp->napi[1] work
> on a currect CPU, which is designated for tp->napi[2], but not for
> tp->napi[1]. Correct?
>
> I don't understand what napi_schedule(&tp->napi[1].napi) does for the
> NIC or driver, "re-filling rx BD producer ring" sounds important. I
> suspect something will break badly if I simply remove it without
> replacing with something more elaborate. I guess along with re-filling
> rx BD producer ring it also can process incoming packets. Is it possible?
>

Yes, napi[1] work may be called on napi[2]'s CPU, but it generally
won't process any RX packets because the producer index of napi[1] has
not changed. If the producer count did change, then we get a poll from
the ISR for napi[1] to process packets. So when called explicitly it
is mostly used to re-fill RX buffers. However, there could be a small
window where the producer index has been incremented but the ISR has
not fired yet; in that case it may process a small number of packets.
But I don't think this should lead to a re-order problem.
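
For context, tg3_poll_work() only calls tg3_rx() when the return-ring
producer index has moved (excerpt from tg3.c; it may differ slightly
between kernel versions):

        /* run RX thread, within the bounds set by NAPI */
        if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr)
                work_done += tg3_rx(tnapi, budget - work_done);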


> --
> Thanks
> Vitalii

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: tg3 RX packet re-order in queue 0 with RSS
  2021-10-28  7:33       ` Pavan Chebbi
@ 2021-10-28 15:41         ` Vitaly Bursov
  2021-10-29  5:04           ` Pavan Chebbi
  0 siblings, 1 reply; 13+ messages in thread
From: Vitaly Bursov @ 2021-10-28 15:41 UTC (permalink / raw)
  To: Pavan Chebbi
  Cc: Siva Reddy Kallam, Prashant Sreedharan, Michael Chan, Linux Netdev List


28.10.2021 10:33, Pavan Chebbi wrote:
> On Wed, Oct 27, 2021 at 4:02 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>>
>>
>> 27.10.2021 12:30, Pavan Chebbi wrote:
>>> On Wed, Sep 22, 2021 at 12:10 PM Siva Reddy Kallam
>>> <siva.kallam@broadcom.com> wrote:
>>>>
>>>> Thank you for reporting this. Pavan(cc'd) from Broadcom looking into this issue.
>>>> We will provide our feedback very soon on this.
>>>>
>>>> On Mon, Sep 20, 2021 at 6:59 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> We found a occassional and random (sometimes happens, sometimes not)
>>>>> packet re-order when NIC is involved in UDP multicast reception, which
>>>>> is sensitive to a packet re-order. Network capture with tcpdump
>>>>> sometimes shows the packet re-order, sometimes not (e.g. no re-order on
>>>>> a host, re-order in a container at the same time). In a pcap file
>>>>> re-ordered packets have a correct timestamp - delayed packet had a more
>>>>> earlier timestamp compared to a previous packet:
>>>>>        1.00s packet1
>>>>>        1.20s packet3
>>>>>        1.10s packet2
>>>>>        1.30s packet4
>>>>>
>>>>> There's about 300Mbps of traffic on this NIC, and server is busy
>>>>> (hyper-threading enabled, about 50% overall idle) with its
>>>>> computational application work.
>>>>>
>>>>> NIC is HPE's 4-port 331i adapter - BCM5719, in a default ring and
>>>>> coalescing configuration, 1 TX queue, 4 RX queues.
>>>>>
>>>>> After further investigation, I believe that there are two separate
>>>>> issues in tg3.c driver. Issues can be reproduced with iperf3, and
>>>>> unicast UDP.
>>>>>
>>>>> Here are the details of how I understand this behavior.
>>>>>
>>>>> 1. Packet re-order.
>>>>>
>>>>> Driver calls napi_schedule(&tnapi->napi) when handling the interrupt,
>>>>> however, sometimes it calls napi_schedule(&tp->napi[1].napi), which
>>>>> handles RX queue 0 too:
>>>>>
>>>>>        https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/broadcom/tg3.c#L6802-L7007
>>>>>
>>>>>        static int tg3_rx(struct tg3_napi *tnapi, int budget)
>>>>>        {
>>>>>                struct tg3 *tp = tnapi->tp;
>>>>>
>>>>>                ...
>>>>>
>>>>>                /* Refill RX ring(s). */
>>>>>                if (!tg3_flag(tp, ENABLE_RSS)) {
>>>>>                        ....
>>>>>                } else if (work_mask) {
>>>>>                        ...
>>>>>
>>>>>                        if (tnapi != &tp->napi[1]) {
>>>>>                                tp->rx_refill = true;
>>>>>                                napi_schedule(&tp->napi[1].napi);
>>>>>                        }
>>>>>                }
>>>>>                ...
>>>>>        }
>>>>>
>>>>>    From napi_schedule() code, it should schedure RX 0 traffic handling on
>>>>> a current CPU, which handles queues RX1-3 right now.
>>>>>
>>>>> At least two traffic flows are required - one on RX queue 0, and the
>>>>> other on any other queue (1-3). Re-ordering may happend only on flow
>>>>> from queue 0, the second flow will work fine.
>>>>>
>>>>> No idea how to fix this.
>>>
>>> In the case of RSS the actual rings for RX are from 1 to 4.
>>> The napi of those rings are indeed processing the packets.
>>> The explicit napi_schedule of napi[1] is only re-filling rx BD
>>> producer ring because it is shared with return rings for 1-4.
>>> I tried to repro this but I am not seeing the issue. If you are
>>> receiving packets on RX 0 then the RSS must have been disabled.
>>> Can you please check?
>>>
>>
>> # ethtool -i enp2s0f0
>> driver: tg3
>> version: 3.137
>> firmware-version: 5719-v1.46 NCSI v1.5.18.0
>> expansion-rom-version:
>> bus-info: 0000:02:00.0
>> supports-statistics: yes
>> supports-test: yes
>> supports-eeprom-access: yes
>> supports-register-dump: yes
>> supports-priv-flags: no
>>
>> # ethtool -l enp2s0f0
>> Channel parameters for enp2s0f0:
>> Pre-set maximums:
>> RX:             4
>> TX:             4
>> Other:          0
>> Combined:       0
>> Current hardware settings:
>> RX:             4
>> TX:             1
>> Other:          0
>> Combined:       0
>>
>> # ethtool -x enp2s0f0
>> RX flow hash indirection table for enp2s0f0 with 4 RX ring(s):
>>       0:      0     1     2     3     0     1     2     3
>>       8:      0     1     2     3     0     1     2     3
>>      16:      0     1     2     3     0     1     2     3
>>      24:      0     1     2     3     0     1     2     3
>>      32:      0     1     2     3     0     1     2     3
>>      40:      0     1     2     3     0     1     2     3
>>      48:      0     1     2     3     0     1     2     3
>>      56:      0     1     2     3     0     1     2     3
>>      64:      0     1     2     3     0     1     2     3
>>      72:      0     1     2     3     0     1     2     3
>>      80:      0     1     2     3     0     1     2     3
>>      88:      0     1     2     3     0     1     2     3
>>      96:      0     1     2     3     0     1     2     3
>>     104:      0     1     2     3     0     1     2     3
>>     112:      0     1     2     3     0     1     2     3
>>     120:      0     1     2     3     0     1     2     3
>> RSS hash key:
>> Operation not supported
>> RSS hash function:
>>       toeplitz: on
>>       xor: off
>>       crc32: off
>>
>> In /proc/interrupts there are enp2s0f0-tx-0, enp2s0f0-rx-1,
>> enp2s0f0-rx-2, enp2s0f0-rx-3, enp2s0f0-rx-4 interrupts, all on
>> different CPU cores. Kernel also has "threadirqs" enabled in
>> command line, I didn't check if this parameter affects the issue.
>>
>> Yes, some things start with 0, and others with 1, sorry for a confusion
>> in terminology, what I meant:
>>    - There are 4 RX rings/queues, I counted starting from 0, so: 0..3.
>>      RX0 is the first queue/ring that actually receives the traffic.
>>      RX0 is handled by enp2s0f0-rx-1 interrupt.
>>    - These are related to (tp->napi[i]), but i is in 1..4, so the first
>>      receiving queue relates to tp->napi[1], the second relates to
>>      tp->napi[2], and so on. Correct?
>>
>> Suppose, tg3_rx() is called for tp->napi[2], this function most likely
>> calls napi_gro_receive(&tnapi->napi, skb) to further process packets in
>> tp->napi[2]. And, under some conditions (RSS and work_mask), it calls
>> napi_schedule(&tp->napi[1].napi), which schedules tp->napi[1] work
>> on a currect CPU, which is designated for tp->napi[2], but not for
>> tp->napi[1]. Correct?
>>
>> I don't understand what napi_schedule(&tp->napi[1].napi) does for the
>> NIC or driver, "re-filling rx BD producer ring" sounds important. I
>> suspect something will break badly if I simply remove it without
>> replacing with something more elaborate. I guess along with re-filling
>> rx BD producer ring it also can process incoming packets. Is it possible?
>>
> 
> Yes, napi[1] work may be called on the napi[2]'s CPU but it generally
> won't process
> any rx packets because the producer index of napi[1] has not changed. If the
> producer count did change, then we get a poll from the ISR for napi[1]
> to process
> packets. So it is mostly used to re-fill rx buffers when called
> explicitly. However
> there could be a small window where the prod index is incremented but the ISR
> is not fired yet. It may process some small no of packets. But I don't
> think this
> should lead to a reorder problem.
> 

I tried to reproduce this without the bridge and veth interfaces, and
it seems not to be reproducible that way, so traffic forwarding via a
bridge interface may be necessary. It also does not happen if the
traffic load is low, but a moderate load is enough - e.g. two 100 Mbps
streams with 130-byte packets. It's easier to reproduce with a higher
load.

With about the same setup as in the original message (bridge + veth,
2 network namespaces) and the irqbalance daemon stopped: if the
traffic flows via enp2s0f0-rx-2 and enp2s0f0-rx-4, there is no
re-ordering. enp2s0f0-rx-1 still gets some interrupts, but at a much
lower rate compared to rx-2 and rx-4.

namespace 1:
   # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
   - - - - - - - - - - - - - - - - - - - - - - - - -
   [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
   [  4]   0.00-300.00 sec  6.72 GBytes   192 Mbits/sec  0.008 ms  3805/55508325 (0.0069%)
   [  4] Sent 55508325 datagrams

   iperf Done.

namespace 2:
   # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
   - - - - - - - - - - - - - - - - - - - - - - - - -
   [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
   [  4]   0.00-300.00 sec  6.83 GBytes   196 Mbits/sec  0.005 ms  3873/56414001 (0.0069%)
   [  4] Sent 56414001 datagrams

   iperf Done.


With the same configuration but a different IP address, so that
enp2s0f0-rx-1 is used instead of enp2s0f0-rx-4, there is re-ordering.


namespace 1 (client IP was changed):
   # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
   - - - - - - - - - - - - - - - - - - - - - - - - -
   [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
   [  4]   0.00-300.00 sec  6.32 GBytes   181 Mbits/sec  0.007 ms  8506/52172059 (0.016%)
   [  4] Sent 52172059 datagrams
   [SUM]  0.0-300.0 sec  2452 datagrams received out-of-order

   iperf Done.

namespace 2:
   # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
   - - - - - - - - - - - - - - - - - - - - - - - - -
   [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
   [  4]   0.00-300.00 sec  6.59 GBytes   189 Mbits/sec  0.006 ms  6302/54463973 (0.012%)
   [  4] Sent 54463973 datagrams

   iperf Done.

Swapping the IP addresses between these namespaces also changes which
namespace exhibits the issue - it follows the IP address.


Is there something I could check to confirm whether this behavior is
related to the napi_schedule(&tp->napi[1].napi) call?

-- 
Thanks
Vitalii

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: tg3 RX packet re-order in queue 0 with RSS
  2021-10-28 15:41         ` Vitaly Bursov
@ 2021-10-29  5:04           ` Pavan Chebbi
  2021-10-29 15:45             ` Vitaly Bursov
  0 siblings, 1 reply; 13+ messages in thread
From: Pavan Chebbi @ 2021-10-29  5:04 UTC (permalink / raw)
  To: Vitaly Bursov
  Cc: Siva Reddy Kallam, Prashant Sreedharan, Michael Chan, Linux Netdev List

On Thu, Oct 28, 2021 at 9:11 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>
>
> 28.10.2021 10:33, Pavan Chebbi wrote:
> > On Wed, Oct 27, 2021 at 4:02 PM Vitaly Bursov <vitaly@bursov.com> wrote:
> >>
> >>
> >> 27.10.2021 12:30, Pavan Chebbi wrote:
> >>> On Wed, Sep 22, 2021 at 12:10 PM Siva Reddy Kallam
> >>> <siva.kallam@broadcom.com> wrote:
> >>>>
> >>>> Thank you for reporting this. Pavan(cc'd) from Broadcom looking into this issue.
> >>>> We will provide our feedback very soon on this.
> >>>>
> >>>> On Mon, Sep 20, 2021 at 6:59 PM Vitaly Bursov <vitaly@bursov.com> wrote:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> We found a occassional and random (sometimes happens, sometimes not)
> >>>>> packet re-order when NIC is involved in UDP multicast reception, which
> >>>>> is sensitive to a packet re-order. Network capture with tcpdump
> >>>>> sometimes shows the packet re-order, sometimes not (e.g. no re-order on
> >>>>> a host, re-order in a container at the same time). In a pcap file
> >>>>> re-ordered packets have a correct timestamp - delayed packet had a more
> >>>>> earlier timestamp compared to a previous packet:
> >>>>>        1.00s packet1
> >>>>>        1.20s packet3
> >>>>>        1.10s packet2
> >>>>>        1.30s packet4
> >>>>>
> >>>>> There's about 300Mbps of traffic on this NIC, and server is busy
> >>>>> (hyper-threading enabled, about 50% overall idle) with its
> >>>>> computational application work.
> >>>>>
> >>>>> NIC is HPE's 4-port 331i adapter - BCM5719, in a default ring and
> >>>>> coalescing configuration, 1 TX queue, 4 RX queues.
> >>>>>
> >>>>> After further investigation, I believe that there are two separate
> >>>>> issues in tg3.c driver. Issues can be reproduced with iperf3, and
> >>>>> unicast UDP.
> >>>>>
> >>>>> Here are the details of how I understand this behavior.
> >>>>>
> >>>>> 1. Packet re-order.
> >>>>>
> >>>>> Driver calls napi_schedule(&tnapi->napi) when handling the interrupt,
> >>>>> however, sometimes it calls napi_schedule(&tp->napi[1].napi), which
> >>>>> handles RX queue 0 too:
> >>>>>
> >>>>>        https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/broadcom/tg3.c#L6802-L7007
> >>>>>
> >>>>>        static int tg3_rx(struct tg3_napi *tnapi, int budget)
> >>>>>        {
> >>>>>                struct tg3 *tp = tnapi->tp;
> >>>>>
> >>>>>                ...
> >>>>>
> >>>>>                /* Refill RX ring(s). */
> >>>>>                if (!tg3_flag(tp, ENABLE_RSS)) {
> >>>>>                        ....
> >>>>>                } else if (work_mask) {
> >>>>>                        ...
> >>>>>
> >>>>>                        if (tnapi != &tp->napi[1]) {
> >>>>>                                tp->rx_refill = true;
> >>>>>                                napi_schedule(&tp->napi[1].napi);
> >>>>>                        }
> >>>>>                }
> >>>>>                ...
> >>>>>        }
> >>>>>
> >>>>>    From napi_schedule() code, it should schedure RX 0 traffic handling on
> >>>>> a current CPU, which handles queues RX1-3 right now.
> >>>>>
> >>>>> At least two traffic flows are required - one on RX queue 0, and the
> >>>>> other on any other queue (1-3). Re-ordering may happend only on flow
> >>>>> from queue 0, the second flow will work fine.
> >>>>>
> >>>>> No idea how to fix this.
> >>>
> >>> In the case of RSS the actual rings for RX are from 1 to 4.
> >>> The napi of those rings are indeed processing the packets.
> >>> The explicit napi_schedule of napi[1] is only re-filling rx BD
> >>> producer ring because it is shared with return rings for 1-4.
> >>> I tried to repro this but I am not seeing the issue. If you are
> >>> receiving packets on RX 0 then the RSS must have been disabled.
> >>> Can you please check?
> >>>
> >>
> >> # ethtool -i enp2s0f0
> >> driver: tg3
> >> version: 3.137
> >> firmware-version: 5719-v1.46 NCSI v1.5.18.0
> >> expansion-rom-version:
> >> bus-info: 0000:02:00.0
> >> supports-statistics: yes
> >> supports-test: yes
> >> supports-eeprom-access: yes
> >> supports-register-dump: yes
> >> supports-priv-flags: no
> >>
> >> # ethtool -l enp2s0f0
> >> Channel parameters for enp2s0f0:
> >> Pre-set maximums:
> >> RX:             4
> >> TX:             4
> >> Other:          0
> >> Combined:       0
> >> Current hardware settings:
> >> RX:             4
> >> TX:             1
> >> Other:          0
> >> Combined:       0
> >>
> >> # ethtool -x enp2s0f0
> >> RX flow hash indirection table for enp2s0f0 with 4 RX ring(s):
> >>       0:      0     1     2     3     0     1     2     3
> >>       8:      0     1     2     3     0     1     2     3
> >>      16:      0     1     2     3     0     1     2     3
> >>      24:      0     1     2     3     0     1     2     3
> >>      32:      0     1     2     3     0     1     2     3
> >>      40:      0     1     2     3     0     1     2     3
> >>      48:      0     1     2     3     0     1     2     3
> >>      56:      0     1     2     3     0     1     2     3
> >>      64:      0     1     2     3     0     1     2     3
> >>      72:      0     1     2     3     0     1     2     3
> >>      80:      0     1     2     3     0     1     2     3
> >>      88:      0     1     2     3     0     1     2     3
> >>      96:      0     1     2     3     0     1     2     3
> >>     104:      0     1     2     3     0     1     2     3
> >>     112:      0     1     2     3     0     1     2     3
> >>     120:      0     1     2     3     0     1     2     3
> >> RSS hash key:
> >> Operation not supported
> >> RSS hash function:
> >>       toeplitz: on
> >>       xor: off
> >>       crc32: off
> >>
> >> In /proc/interrupts there are enp2s0f0-tx-0, enp2s0f0-rx-1,
> >> enp2s0f0-rx-2, enp2s0f0-rx-3, enp2s0f0-rx-4 interrupts, all on
> >> different CPU cores. Kernel also has "threadirqs" enabled in
> >> command line, I didn't check if this parameter affects the issue.
> >>
> >> Yes, some things start with 0, and others with 1, sorry for a confusion
> >> in terminology, what I meant:
> >>    - There are 4 RX rings/queues, I counted starting from 0, so: 0..3.
> >>      RX0 is the first queue/ring that actually receives the traffic.
> >>      RX0 is handled by enp2s0f0-rx-1 interrupt.
> >>    - These are related to (tp->napi[i]), but i is in 1..4, so the first
> >>      receiving queue relates to tp->napi[1], the second relates to
> >>      tp->napi[2], and so on. Correct?
> >>
> >> Suppose, tg3_rx() is called for tp->napi[2], this function most likely
> >> calls napi_gro_receive(&tnapi->napi, skb) to further process packets in
> >> tp->napi[2]. And, under some conditions (RSS and work_mask), it calls
> >> napi_schedule(&tp->napi[1].napi), which schedules tp->napi[1] work
> >> on a currect CPU, which is designated for tp->napi[2], but not for
> >> tp->napi[1]. Correct?
> >>
> >> I don't understand what napi_schedule(&tp->napi[1].napi) does for the
> >> NIC or driver, "re-filling rx BD producer ring" sounds important. I
> >> suspect something will break badly if I simply remove it without
> >> replacing with something more elaborate. I guess along with re-filling
> >> rx BD producer ring it also can process incoming packets. Is it possible?
> >>
> >
> > Yes, napi[1] work may be called on the napi[2]'s CPU but it generally
> > won't process
> > any rx packets because the producer index of napi[1] has not changed. If the
> > producer count did change, then we get a poll from the ISR for napi[1]
> > to process
> > packets. So it is mostly used to re-fill rx buffers when called
> > explicitly. However
> > there could be a small window where the prod index is incremented but the ISR
> > is not fired yet. It may process some small no of packets. But I don't
> > think this
> > should lead to a reorder problem.
> >
>
> I tried to reproduce without using bridge and veth interfaces, and it seems
> like it's not reproducible, so traffic forwarding via a bridge interface may
> be necessary. It also does not happen if traffic load is low, but moderate
> load is enough - e.g. two 100 Mbps streams with 130-byte packets. It's easier
> to reproduce with a higher load.
>
> With about the same setup as in an original message (bridge + veth 2
> network namespaces), irqbalance daemon stopped, if traffic flows via
> enp2s0f0-rx-2 and enp2s0f0-rx-4, there's no reordering. enp2s0f0-rx-1
> still gets some interrupts, but at a much lower rate compared to 2 and
> 4.
>
> namespace 1:
>    # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
>    - - - - - - - - - - - - - - - - - - - - - - - - -
>    [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>    [  4]   0.00-300.00 sec  6.72 GBytes   192 Mbits/sec  0.008 ms  3805/55508325 (0.0069%)
>    [  4] Sent 55508325 datagrams
>
>    iperf Done.
>
> namespace 2:
>    # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
>    - - - - - - - - - - - - - - - - - - - - - - - - -
>    [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>    [  4]   0.00-300.00 sec  6.83 GBytes   196 Mbits/sec  0.005 ms  3873/56414001 (0.0069%)
>    [  4] Sent 56414001 datagrams
>
>    iperf Done.
>
>
> With the same configuration but different IP address so that instead of
> enp2s0f0-rx-4 enp2s0f0-rx-1 would be used, there is a reordering.
>
>
> namespace 1 (client IP was changed):
>    # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
>    - - - - - - - - - - - - - - - - - - - - - - - - -
>    [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>    [  4]   0.00-300.00 sec  6.32 GBytes   181 Mbits/sec  0.007 ms  8506/52172059 (0.016%)
>    [  4] Sent 52172059 datagrams
>    [SUM]  0.0-300.0 sec  2452 datagrams received out-of-order
>
>    iperf Done.
>
> namespace 2:
>    # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
>    - - - - - - - - - - - - - - - - - - - - - - - - -
>    [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>    [  4]   0.00-300.00 sec  6.59 GBytes   189 Mbits/sec  0.006 ms  6302/54463973 (0.012%)
>    [  4] Sent 54463973 datagrams
>
>    iperf Done.
>
> Swapping IP addresses in these namespaces also changes the namespace exhibiting the issue,
> it's following the IP address.
>
>
> Is there something I could check to confirm that this behavior is or is not
> related to napi_schedule(&tp->napi[1].napi) call?

In the function tg3_msi_1shot() you could store the CPU assigned to
tnapi[1] (inside struct tg3_napi), and then in tg3_poll_work() you can
add another check after

        if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr)

something like

        if (tnapi == &tp->napi[1] && tnapi->assigned_cpu == smp_processor_id())

and only then execute tg3_rx().

This may stop tnapi[1] from reading RX packets on the current CPU from
which the refill is called; see the sketch below.
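
A minimal, untested sketch of that diagnostic (assigned_cpu would be a
new, purely illustrative field in struct tg3_napi; the CPU check is
applied only to tp->napi[1] so the other rings keep working as before):

     static irqreturn_t tg3_msi_1shot(int irq, void *dev_id)
     {
             struct tg3_napi *tnapi = dev_id;
             struct tg3 *tp = tnapi->tp;

             /* remember which CPU services this vector's interrupt */
             tnapi->assigned_cpu = smp_processor_id();

             prefetch(tnapi->hw_status);
             if (tnapi->rx_rcb)
                     prefetch(&tnapi->rx_rcb[tnapi->rx_rcb_ptr]);

             if (likely(!tg3_irq_sync(tp)))
                     napi_schedule(&tnapi->napi);

             return IRQ_HANDLED;
     }

     /* in tg3_poll_work(): run tg3_rx() for napi[1] only on its own
      * CPU, so a refill-only schedule from another CPU does not pull
      * packets
      */
     if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr &&
         (tnapi != &tp->napi[1] ||
          tnapi->assigned_cpu == smp_processor_id()))
             work_done += tg3_rx(tnapi, budget - work_done);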

>
> --
> Thanks
> Vitalii

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: tg3 RX packet re-order in queue 0 with RSS
  2021-10-29  5:04           ` Pavan Chebbi
@ 2021-10-29 15:45             ` Vitaly Bursov
  2021-11-01  7:06               ` Pavan Chebbi
  0 siblings, 1 reply; 13+ messages in thread
From: Vitaly Bursov @ 2021-10-29 15:45 UTC (permalink / raw)
  To: Pavan Chebbi
  Cc: Siva Reddy Kallam, Prashant Sreedharan, Michael Chan, Linux Netdev List



29.10.2021 08:04, Pavan Chebbi wrote:
> 90On Thu, Oct 28, 2021 at 9:11 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>>
>>
>> 28.10.2021 10:33, Pavan Chebbi wrote:
>>> On Wed, Oct 27, 2021 at 4:02 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>>>>
>>>>
>>>> 27.10.2021 12:30, Pavan Chebbi wrote:
>>>>> On Wed, Sep 22, 2021 at 12:10 PM Siva Reddy Kallam
>>>>> <siva.kallam@broadcom.com> wrote:
>>>>>>
>>>>>> Thank you for reporting this. Pavan(cc'd) from Broadcom looking into this issue.
>>>>>> We will provide our feedback very soon on this.
>>>>>>
>>>>>> On Mon, Sep 20, 2021 at 6:59 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> We found a occassional and random (sometimes happens, sometimes not)
>>>>>>> packet re-order when NIC is involved in UDP multicast reception, which
>>>>>>> is sensitive to a packet re-order. Network capture with tcpdump
>>>>>>> sometimes shows the packet re-order, sometimes not (e.g. no re-order on
>>>>>>> a host, re-order in a container at the same time). In a pcap file
>>>>>>> re-ordered packets have a correct timestamp - delayed packet had a more
>>>>>>> earlier timestamp compared to a previous packet:
>>>>>>>         1.00s packet1
>>>>>>>         1.20s packet3
>>>>>>>         1.10s packet2
>>>>>>>         1.30s packet4
>>>>>>>
>>>>>>> There's about 300Mbps of traffic on this NIC, and server is busy
>>>>>>> (hyper-threading enabled, about 50% overall idle) with its
>>>>>>> computational application work.
>>>>>>>
>>>>>>> NIC is HPE's 4-port 331i adapter - BCM5719, in a default ring and
>>>>>>> coalescing configuration, 1 TX queue, 4 RX queues.
>>>>>>>
>>>>>>> After further investigation, I believe that there are two separate
>>>>>>> issues in tg3.c driver. Issues can be reproduced with iperf3, and
>>>>>>> unicast UDP.
>>>>>>>
>>>>>>> Here are the details of how I understand this behavior.
>>>>>>>
>>>>>>> 1. Packet re-order.
>>>>>>>
>>>>>>> Driver calls napi_schedule(&tnapi->napi) when handling the interrupt,
>>>>>>> however, sometimes it calls napi_schedule(&tp->napi[1].napi), which
>>>>>>> handles RX queue 0 too:
>>>>>>>
>>>>>>>         https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/broadcom/tg3.c#L6802-L7007
>>>>>>>
>>>>>>>         static int tg3_rx(struct tg3_napi *tnapi, int budget)
>>>>>>>         {
>>>>>>>                 struct tg3 *tp = tnapi->tp;
>>>>>>>
>>>>>>>                 ...
>>>>>>>
>>>>>>>                 /* Refill RX ring(s). */
>>>>>>>                 if (!tg3_flag(tp, ENABLE_RSS)) {
>>>>>>>                         ....
>>>>>>>                 } else if (work_mask) {
>>>>>>>                         ...
>>>>>>>
>>>>>>>                         if (tnapi != &tp->napi[1]) {
>>>>>>>                                 tp->rx_refill = true;
>>>>>>>                                 napi_schedule(&tp->napi[1].napi);
>>>>>>>                         }
>>>>>>>                 }
>>>>>>>                 ...
>>>>>>>         }
>>>>>>>
>>>>>>>     From napi_schedule() code, it should schedure RX 0 traffic handling on
>>>>>>> a current CPU, which handles queues RX1-3 right now.
>>>>>>>
>>>>>>> At least two traffic flows are required - one on RX queue 0, and the
>>>>>>> other on any other queue (1-3). Re-ordering may happend only on flow
>>>>>>> from queue 0, the second flow will work fine.
>>>>>>>
>>>>>>> No idea how to fix this.
>>>>>
>>>>> In the case of RSS the actual rings for RX are from 1 to 4.
>>>>> The napi of those rings are indeed processing the packets.
>>>>> The explicit napi_schedule of napi[1] is only re-filling rx BD
>>>>> producer ring because it is shared with return rings for 1-4.
>>>>> I tried to repro this but I am not seeing the issue. If you are
>>>>> receiving packets on RX 0 then the RSS must have been disabled.
>>>>> Can you please check?
>>>>>
>>>>
>>>> # ethtool -i enp2s0f0
>>>> driver: tg3
>>>> version: 3.137
>>>> firmware-version: 5719-v1.46 NCSI v1.5.18.0
>>>> expansion-rom-version:
>>>> bus-info: 0000:02:00.0
>>>> supports-statistics: yes
>>>> supports-test: yes
>>>> supports-eeprom-access: yes
>>>> supports-register-dump: yes
>>>> supports-priv-flags: no
>>>>
>>>> # ethtool -l enp2s0f0
>>>> Channel parameters for enp2s0f0:
>>>> Pre-set maximums:
>>>> RX:             4
>>>> TX:             4
>>>> Other:          0
>>>> Combined:       0
>>>> Current hardware settings:
>>>> RX:             4
>>>> TX:             1
>>>> Other:          0
>>>> Combined:       0
>>>>
>>>> # ethtool -x enp2s0f0
>>>> RX flow hash indirection table for enp2s0f0 with 4 RX ring(s):
>>>>        0:      0     1     2     3     0     1     2     3
>>>>        8:      0     1     2     3     0     1     2     3
>>>>       16:      0     1     2     3     0     1     2     3
>>>>       24:      0     1     2     3     0     1     2     3
>>>>       32:      0     1     2     3     0     1     2     3
>>>>       40:      0     1     2     3     0     1     2     3
>>>>       48:      0     1     2     3     0     1     2     3
>>>>       56:      0     1     2     3     0     1     2     3
>>>>       64:      0     1     2     3     0     1     2     3
>>>>       72:      0     1     2     3     0     1     2     3
>>>>       80:      0     1     2     3     0     1     2     3
>>>>       88:      0     1     2     3     0     1     2     3
>>>>       96:      0     1     2     3     0     1     2     3
>>>>      104:      0     1     2     3     0     1     2     3
>>>>      112:      0     1     2     3     0     1     2     3
>>>>      120:      0     1     2     3     0     1     2     3
>>>> RSS hash key:
>>>> Operation not supported
>>>> RSS hash function:
>>>>        toeplitz: on
>>>>        xor: off
>>>>        crc32: off
>>>>
>>>> In /proc/interrupts there are enp2s0f0-tx-0, enp2s0f0-rx-1,
>>>> enp2s0f0-rx-2, enp2s0f0-rx-3, enp2s0f0-rx-4 interrupts, all on
>>>> different CPU cores. Kernel also has "threadirqs" enabled in
>>>> command line, I didn't check if this parameter affects the issue.
>>>>
>>>> Yes, some things start with 0, and others with 1, sorry for a confusion
>>>> in terminology, what I meant:
>>>>     - There are 4 RX rings/queues, I counted starting from 0, so: 0..3.
>>>>       RX0 is the first queue/ring that actually receives the traffic.
>>>>       RX0 is handled by enp2s0f0-rx-1 interrupt.
>>>>     - These are related to (tp->napi[i]), but i is in 1..4, so the first
>>>>       receiving queue relates to tp->napi[1], the second relates to
>>>>       tp->napi[2], and so on. Correct?
>>>>
>>>> Suppose, tg3_rx() is called for tp->napi[2], this function most likely
>>>> calls napi_gro_receive(&tnapi->napi, skb) to further process packets in
>>>> tp->napi[2]. And, under some conditions (RSS and work_mask), it calls
>>>> napi_schedule(&tp->napi[1].napi), which schedules tp->napi[1] work
>>>> on a currect CPU, which is designated for tp->napi[2], but not for
>>>> tp->napi[1]. Correct?
>>>>
>>>> I don't understand what napi_schedule(&tp->napi[1].napi) does for the
>>>> NIC or driver, "re-filling rx BD producer ring" sounds important. I
>>>> suspect something will break badly if I simply remove it without
>>>> replacing with something more elaborate. I guess along with re-filling
>>>> rx BD producer ring it also can process incoming packets. Is it possible?
>>>>
>>>
>>> Yes, napi[1] work may be called on the napi[2]'s CPU but it generally
>>> won't process
>>> any rx packets because the producer index of napi[1] has not changed. If the
>>> producer count did change, then we get a poll from the ISR for napi[1]
>>> to process
>>> packets. So it is mostly used to re-fill rx buffers when called
>>> explicitly. However
>>> there could be a small window where the prod index is incremented but the ISR
>>> is not fired yet. It may process some small no of packets. But I don't
>>> think this
>>> should lead to a reorder problem.
>>>
>>
>> I tried to reproduce without using bridge and veth interfaces, and it seems
>> like it's not reproducible, so traffic forwarding via a bridge interface may
>> be necessary. It also does not happen if traffic load is low, but moderate
>> load is enough - e.g. two 100 Mbps streams with 130-byte packets. It's easier
>> to reproduce with a higher load.
>>
>> With about the same setup as in an original message (bridge + veth 2
>> network namespaces), irqbalance daemon stopped, if traffic flows via
>> enp2s0f0-rx-2 and enp2s0f0-rx-4, there's no reordering. enp2s0f0-rx-1
>> still gets some interrupts, but at a much lower rate compared to 2 and
>> 4.
>>
>> namespace 1:
>>     # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
>>     - - - - - - - - - - - - - - - - - - - - - - - - -
>>     [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>>     [  4]   0.00-300.00 sec  6.72 GBytes   192 Mbits/sec  0.008 ms  3805/55508325 (0.0069%)
>>     [  4] Sent 55508325 datagrams
>>
>>     iperf Done.
>>
>> namespace 2:
>>     # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
>>     - - - - - - - - - - - - - - - - - - - - - - - - -
>>     [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>>     [  4]   0.00-300.00 sec  6.83 GBytes   196 Mbits/sec  0.005 ms  3873/56414001 (0.0069%)
>>     [  4] Sent 56414001 datagrams
>>
>>     iperf Done.
>>
>>
>> With the same configuration but different IP address so that instead of
>> enp2s0f0-rx-4 enp2s0f0-rx-1 would be used, there is a reordering.
>>
>>
>> namespace 1 (client IP was changed):
>>     # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
>>     - - - - - - - - - - - - - - - - - - - - - - - - -
>>     [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>>     [  4]   0.00-300.00 sec  6.32 GBytes   181 Mbits/sec  0.007 ms  8506/52172059 (0.016%)
>>     [  4] Sent 52172059 datagrams
>>     [SUM]  0.0-300.0 sec  2452 datagrams received out-of-order
>>
>>     iperf Done.
>>
>> namespace 2:
>>     # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
>>     - - - - - - - - - - - - - - - - - - - - - - - - -
>>     [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>>     [  4]   0.00-300.00 sec  6.59 GBytes   189 Mbits/sec  0.006 ms  6302/54463973 (0.012%)
>>     [  4] Sent 54463973 datagrams
>>
>>     iperf Done.
>>
>> Swapping IP addresses in these namespaces also changes the namespace exhibiting the issue,
>> it's following the IP address.
>>
>>
>> Is there something I could check to confirm that this behavior is or is not
>> related to napi_schedule(&tp->napi[1].napi) call?
> 
> in the function tg3_msi_1shot() you could store the cpu assigned to
> tnapi1 (inside the struct tg3_napi)
> and then in tg3_poll_work() you can add another check after
>          if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr)
> something like
> if (tnapi == &tp->napi[1] && tnapi->assigned_cpu == smp_processor_id())
> only then execute tg3_rx()
> 
> This may stop tnapi 1 from reading rx pkts on the current CPU from
> which refill is called.
> 

That didn't work for me - perhaps I did something wrong. If tg3_rx() is not
called, there's an infinite loop, and even after I added "work_done = budget;"
it still doesn't work - traffic does not flow.

I added logging instead:

+		if (tnapi->assigned_cpu != smp_processor_id())
+			net_dbg_ratelimited("tg3 napi %ld cpu %d %d",
+			    tnapi - tp->napi, tnapi->assigned_cpu, smp_processor_id());
  		napi_gro_receive(&tnapi->napi, skb);

And with two iperf3 streams, there's a lot of messages:
[ 3242.007898] tg3 napi 1 cpu 10 48
[ 3242.007899] tg3 napi 1 cpu 10 48
[ 3242.007911] tg3 napi 1 cpu 10 48
[ 3242.007913] tg3 napi 1 cpu 10 48
[ 3247.011898] net_ratelimit: 546560 callbacks suppressed
[ 3247.011900] tg3 napi 1 cpu 10 48
[ 3247.011902] tg3 napi 1 cpu 10 48
[ 3247.011904] tg3 napi 1 cpu 10 48
[ 3247.011905] tg3 napi 1 cpu 10 48
[ 3247.011906] tg3 napi 1 cpu 10 48
[ 3247.011928] tg3 napi 1 cpu 10 48
[ 3247.011929] tg3 napi 1 cpu 10 48
[ 3247.011931] tg3 napi 1 cpu 10 48
[ 3247.011932] tg3 napi 1 cpu 10 48
[ 3247.011933] tg3 napi 1 cpu 10 48
[ 3252.015885] net_ratelimit: 539574 callbacks suppressed
[ 3252.015888] tg3 napi 1 cpu 10 48
[ 3252.015889] tg3 napi 1 cpu 10 48
[ 3252.015891] tg3 napi 1 cpu 10 48
[ 3252.015892] tg3 napi 1 cpu 10 48

cpu 10, enp2s0f0-rx-1
# cat /proc/irq/106/effective_affinity
00000000,00000000,00000400

cpu 48, enp2s0f0-rx-4
# cat /proc/irq/109/effective_affinity
00000000,00010000,00000000

Among all printed messages, there's only "napi 1".

There's also a difference in interrupt thread's CPU usage:
201570 root     -51   0       0      0      0 R  64.3  0.0   1:46.91 irq/109-enp2s0f
204687 root      20   0    9628   2084   1976 R  37.5  0.0   1:04.74 iperf3
205354 root      20   0    9628   2060   1948 R  36.7  0.0   1:01.06 iperf3
201567 root     -51   0       0      0      0 R  23.3  0.0   0:44.45 irq/106-enp2s0f

The sender is CPU-bound, so there's no overload on the RX side with tg3.

-- 
Thanks
Vitalii


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: tg3 RX packet re-order in queue 0 with RSS
  2021-10-29 15:45             ` Vitaly Bursov
@ 2021-11-01  7:06               ` Pavan Chebbi
  2021-11-01  8:20                 ` Vitaly Bursov
  0 siblings, 1 reply; 13+ messages in thread
From: Pavan Chebbi @ 2021-11-01  7:06 UTC (permalink / raw)
  To: Vitaly Bursov
  Cc: Siva Reddy Kallam, Prashant Sreedharan, Michael Chan, Linux Netdev List

On Fri, Oct 29, 2021 at 9:15 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>
>
>
> 29.10.2021 08:04, Pavan Chebbi wrote:
> > On Thu, Oct 28, 2021 at 9:11 PM Vitaly Bursov <vitaly@bursov.com> wrote:
> >>
> >>
> >> 28.10.2021 10:33, Pavan Chebbi wrote:
> >>> On Wed, Oct 27, 2021 at 4:02 PM Vitaly Bursov <vitaly@bursov.com> wrote:
> >>>>
> >>>>
> >>>> 27.10.2021 12:30, Pavan Chebbi wrote:
> >>>>> On Wed, Sep 22, 2021 at 12:10 PM Siva Reddy Kallam
> >>>>> <siva.kallam@broadcom.com> wrote:
> >>>>>>
> >>>>>> Thank you for reporting this. Pavan(cc'd) from Broadcom looking into this issue.
> >>>>>> We will provide our feedback very soon on this.
> >>>>>>
> >>>>>> On Mon, Sep 20, 2021 at 6:59 PM Vitaly Bursov <vitaly@bursov.com> wrote:
> >>>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> We found a occassional and random (sometimes happens, sometimes not)
> >>>>>>> packet re-order when NIC is involved in UDP multicast reception, which
> >>>>>>> is sensitive to a packet re-order. Network capture with tcpdump
> >>>>>>> sometimes shows the packet re-order, sometimes not (e.g. no re-order on
> >>>>>>> a host, re-order in a container at the same time). In a pcap file
> >>>>>>> re-ordered packets have a correct timestamp - delayed packet had a more
> >>>>>>> earlier timestamp compared to a previous packet:
> >>>>>>>         1.00s packet1
> >>>>>>>         1.20s packet3
> >>>>>>>         1.10s packet2
> >>>>>>>         1.30s packet4
> >>>>>>>
> >>>>>>> There's about 300Mbps of traffic on this NIC, and server is busy
> >>>>>>> (hyper-threading enabled, about 50% overall idle) with its
> >>>>>>> computational application work.
> >>>>>>>
> >>>>>>> NIC is HPE's 4-port 331i adapter - BCM5719, in a default ring and
> >>>>>>> coalescing configuration, 1 TX queue, 4 RX queues.
> >>>>>>>
> >>>>>>> After further investigation, I believe that there are two separate
> >>>>>>> issues in tg3.c driver. Issues can be reproduced with iperf3, and
> >>>>>>> unicast UDP.
> >>>>>>>
> >>>>>>> Here are the details of how I understand this behavior.
> >>>>>>>
> >>>>>>> 1. Packet re-order.
> >>>>>>>
> >>>>>>> Driver calls napi_schedule(&tnapi->napi) when handling the interrupt,
> >>>>>>> however, sometimes it calls napi_schedule(&tp->napi[1].napi), which
> >>>>>>> handles RX queue 0 too:
> >>>>>>>
> >>>>>>>         https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/broadcom/tg3.c#L6802-L7007
> >>>>>>>
> >>>>>>>         static int tg3_rx(struct tg3_napi *tnapi, int budget)
> >>>>>>>         {
> >>>>>>>                 struct tg3 *tp = tnapi->tp;
> >>>>>>>
> >>>>>>>                 ...
> >>>>>>>
> >>>>>>>                 /* Refill RX ring(s). */
> >>>>>>>                 if (!tg3_flag(tp, ENABLE_RSS)) {
> >>>>>>>                         ....
> >>>>>>>                 } else if (work_mask) {
> >>>>>>>                         ...
> >>>>>>>
> >>>>>>>                         if (tnapi != &tp->napi[1]) {
> >>>>>>>                                 tp->rx_refill = true;
> >>>>>>>                                 napi_schedule(&tp->napi[1].napi);
> >>>>>>>                         }
> >>>>>>>                 }
> >>>>>>>                 ...
> >>>>>>>         }
> >>>>>>>
> >>>>>>>     From napi_schedule() code, it should schedure RX 0 traffic handling on
> >>>>>>> a current CPU, which handles queues RX1-3 right now.
> >>>>>>>
> >>>>>>> At least two traffic flows are required - one on RX queue 0, and the
> >>>>>>> other on any other queue (1-3). Re-ordering may happend only on flow
> >>>>>>> from queue 0, the second flow will work fine.
> >>>>>>>
> >>>>>>> No idea how to fix this.
> >>>>>
> >>>>> In the case of RSS the actual rings for RX are from 1 to 4.
> >>>>> The napi of those rings are indeed processing the packets.
> >>>>> The explicit napi_schedule of napi[1] is only re-filling rx BD
> >>>>> producer ring because it is shared with return rings for 1-4.
> >>>>> I tried to repro this but I am not seeing the issue. If you are
> >>>>> receiving packets on RX 0 then the RSS must have been disabled.
> >>>>> Can you please check?
> >>>>>
> >>>>
> >>>> # ethtool -i enp2s0f0
> >>>> driver: tg3
> >>>> version: 3.137
> >>>> firmware-version: 5719-v1.46 NCSI v1.5.18.0
> >>>> expansion-rom-version:
> >>>> bus-info: 0000:02:00.0
> >>>> supports-statistics: yes
> >>>> supports-test: yes
> >>>> supports-eeprom-access: yes
> >>>> supports-register-dump: yes
> >>>> supports-priv-flags: no
> >>>>
> >>>> # ethtool -l enp2s0f0
> >>>> Channel parameters for enp2s0f0:
> >>>> Pre-set maximums:
> >>>> RX:             4
> >>>> TX:             4
> >>>> Other:          0
> >>>> Combined:       0
> >>>> Current hardware settings:
> >>>> RX:             4
> >>>> TX:             1
> >>>> Other:          0
> >>>> Combined:       0
> >>>>
> >>>> # ethtool -x enp2s0f0
> >>>> RX flow hash indirection table for enp2s0f0 with 4 RX ring(s):
> >>>>        0:      0     1     2     3     0     1     2     3
> >>>>        8:      0     1     2     3     0     1     2     3
> >>>>       16:      0     1     2     3     0     1     2     3
> >>>>       24:      0     1     2     3     0     1     2     3
> >>>>       32:      0     1     2     3     0     1     2     3
> >>>>       40:      0     1     2     3     0     1     2     3
> >>>>       48:      0     1     2     3     0     1     2     3
> >>>>       56:      0     1     2     3     0     1     2     3
> >>>>       64:      0     1     2     3     0     1     2     3
> >>>>       72:      0     1     2     3     0     1     2     3
> >>>>       80:      0     1     2     3     0     1     2     3
> >>>>       88:      0     1     2     3     0     1     2     3
> >>>>       96:      0     1     2     3     0     1     2     3
> >>>>      104:      0     1     2     3     0     1     2     3
> >>>>      112:      0     1     2     3     0     1     2     3
> >>>>      120:      0     1     2     3     0     1     2     3
> >>>> RSS hash key:
> >>>> Operation not supported
> >>>> RSS hash function:
> >>>>        toeplitz: on
> >>>>        xor: off
> >>>>        crc32: off
> >>>>
> >>>> In /proc/interrupts there are enp2s0f0-tx-0, enp2s0f0-rx-1,
> >>>> enp2s0f0-rx-2, enp2s0f0-rx-3, enp2s0f0-rx-4 interrupts, all on
> >>>> different CPU cores. Kernel also has "threadirqs" enabled in
> >>>> command line, I didn't check if this parameter affects the issue.
> >>>>
> >>>> Yes, some things start with 0, and others with 1, sorry for a confusion
> >>>> in terminology, what I meant:
> >>>>     - There are 4 RX rings/queues, I counted starting from 0, so: 0..3.
> >>>>       RX0 is the first queue/ring that actually receives the traffic.
> >>>>       RX0 is handled by enp2s0f0-rx-1 interrupt.
> >>>>     - These are related to (tp->napi[i]), but i is in 1..4, so the first
> >>>>       receiving queue relates to tp->napi[1], the second relates to
> >>>>       tp->napi[2], and so on. Correct?
> >>>>
> >>>> Suppose, tg3_rx() is called for tp->napi[2], this function most likely
> >>>> calls napi_gro_receive(&tnapi->napi, skb) to further process packets in
> >>>> tp->napi[2]. And, under some conditions (RSS and work_mask), it calls
> >>>> napi_schedule(&tp->napi[1].napi), which schedules tp->napi[1] work
> >>>> on a currect CPU, which is designated for tp->napi[2], but not for
> >>>> tp->napi[1]. Correct?
> >>>>
> >>>> I don't understand what napi_schedule(&tp->napi[1].napi) does for the
> >>>> NIC or driver, "re-filling rx BD producer ring" sounds important. I
> >>>> suspect something will break badly if I simply remove it without
> >>>> replacing with something more elaborate. I guess along with re-filling
> >>>> rx BD producer ring it also can process incoming packets. Is it possible?
> >>>>
> >>>
> >>> Yes, napi[1] work may be called on the napi[2]'s CPU but it generally
> >>> won't process
> >>> any rx packets because the producer index of napi[1] has not changed. If the
> >>> producer count did change, then we get a poll from the ISR for napi[1]
> >>> to process
> >>> packets. So it is mostly used to re-fill rx buffers when called
> >>> explicitly. However
> >>> there could be a small window where the prod index is incremented but the ISR
> >>> is not fired yet. It may process some small no of packets. But I don't
> >>> think this
> >>> should lead to a reorder problem.
> >>>
> >>
> >> I tried to reproduce without using bridge and veth interfaces, and it seems
> >> like it's not reproducible, so traffic forwarding via a bridge interface may
> >> be necessary. It also does not happen if traffic load is low, but moderate
> >> load is enough - e.g. two 100 Mbps streams with 130-byte packets. It's easier
> >> to reproduce with a higher load.
> >>
> >> With about the same setup as in an original message (bridge + veth 2
> >> network namespaces), irqbalance daemon stopped, if traffic flows via
> >> enp2s0f0-rx-2 and enp2s0f0-rx-4, there's no reordering. enp2s0f0-rx-1
> >> still gets some interrupts, but at a much lower rate compared to 2 and
> >> 4.
> >>
> >> namespace 1:
> >>     # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
> >>     - - - - - - - - - - - - - - - - - - - - - - - - -
> >>     [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
> >>     [  4]   0.00-300.00 sec  6.72 GBytes   192 Mbits/sec  0.008 ms  3805/55508325 (0.0069%)
> >>     [  4] Sent 55508325 datagrams
> >>
> >>     iperf Done.
> >>
> >> namespace 2:
> >>     # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
> >>     - - - - - - - - - - - - - - - - - - - - - - - - -
> >>     [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
> >>     [  4]   0.00-300.00 sec  6.83 GBytes   196 Mbits/sec  0.005 ms  3873/56414001 (0.0069%)
> >>     [  4] Sent 56414001 datagrams
> >>
> >>     iperf Done.
> >>
> >>
> >> With the same configuration but different IP address so that instead of
> >> enp2s0f0-rx-4 enp2s0f0-rx-1 would be used, there is a reordering.
> >>
> >>
> >> namespace 1 (client IP was changed):
> >>     # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
> >>     - - - - - - - - - - - - - - - - - - - - - - - - -
> >>     [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
> >>     [  4]   0.00-300.00 sec  6.32 GBytes   181 Mbits/sec  0.007 ms  8506/52172059 (0.016%)
> >>     [  4] Sent 52172059 datagrams
> >>     [SUM]  0.0-300.0 sec  2452 datagrams received out-of-order
> >>
> >>     iperf Done.
> >>
> >> namespace 2:
> >>     # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
> >>     - - - - - - - - - - - - - - - - - - - - - - - - -
> >>     [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
> >>     [  4]   0.00-300.00 sec  6.59 GBytes   189 Mbits/sec  0.006 ms  6302/54463973 (0.012%)
> >>     [  4] Sent 54463973 datagrams
> >>
> >>     iperf Done.
> >>
> >> Swapping IP addresses in these namespaces also changes the namespace exhibiting the issue,
> >> it's following the IP address.
> >>
> >>
> >> Is there something I could check to confirm that this behavior is or is not
> >> related to napi_schedule(&tp->napi[1].napi) call?
> >
> > in the function tg3_msi_1shot() you could store the cpu assigned to
> > tnapi1 (inside the struct tg3_napi)
> > and then in tg3_poll_work() you can add another check after
> >          if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr)
> > something like
> > if (tnapi == &tp->napi[1] && tnapi->assigned_cpu == smp_processor_id())
> > only then execute tg3_rx()
> >
> > This may stop tnapi 1 from reading rx pkts on the current CPU from
> > which refill is called.
> >
>
> Didn't work for me, perhaps I did something wrong - if tg3_rx() is not called,
> there's an infinite loop, and after I added "work_done = budget;", it still doesn't
> work - traffic does not flow.
>

I think the easiest way is to modify the tg3_rx() calling condition
like below, inside tg3_poll_work():

if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr) {
        if (tnapi != &tp->napi[1] ||
            (tnapi == &tp->napi[1] && !tp->rx_refill)) {
                work_done += tg3_rx(tnapi, budget - work_done);
        }
}

This will prevent reading rx packets when napi[1] is scheduled only for refill.
Can you see if this works?
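As a purely cosmetic aside: since the outer disjunct already covers every
ring other than napi[1], the same check can be written a little more
compactly with identical behaviour:

        if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr) {
                if (tnapi != &tp->napi[1] || !tp->rx_refill)
                        work_done += tg3_rx(tnapi, budget - work_done);
        }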

> I added logging instead:
>
> +               if (tnapi->assigned_cpu != smp_processor_id())
> +                       net_dbg_ratelimited("tg3 napi %ld cpu %d %d",
> +                           tnapi - tp->napi, tnapi->assigned_cpu, smp_processor_id());
>                 napi_gro_receive(&tnapi->napi, skb);
>
> And with two iperf3 streams, there's a lot of messages:
> [ 3242.007898] tg3 napi 1 cpu 10 48
> [ 3242.007899] tg3 napi 1 cpu 10 48
> [ 3242.007911] tg3 napi 1 cpu 10 48
> [ 3242.007913] tg3 napi 1 cpu 10 48
> [ 3247.011898] net_ratelimit: 546560 callbacks suppressed
> [ 3247.011900] tg3 napi 1 cpu 10 48
> [ 3247.011902] tg3 napi 1 cpu 10 48
> [ 3247.011904] tg3 napi 1 cpu 10 48
> [ 3247.011905] tg3 napi 1 cpu 10 48
> [ 3247.011906] tg3 napi 1 cpu 10 48
> [ 3247.011928] tg3 napi 1 cpu 10 48
> [ 3247.011929] tg3 napi 1 cpu 10 48
> [ 3247.011931] tg3 napi 1 cpu 10 48
> [ 3247.011932] tg3 napi 1 cpu 10 48
> [ 3247.011933] tg3 napi 1 cpu 10 48
> [ 3252.015885] net_ratelimit: 539574 callbacks suppressed
> [ 3252.015888] tg3 napi 1 cpu 10 48
> [ 3252.015889] tg3 napi 1 cpu 10 48
> [ 3252.015891] tg3 napi 1 cpu 10 48
> [ 3252.015892] tg3 napi 1 cpu 10 48
>
> cpu 10, enp2s0f0-rx-1
> # cat /proc/irq/106/effective_affinity
> 00000000,00000000,00000400
>
> cpu 48, enp2s0f0-rx-4
> # cat /proc/irq/109/effective_affinity
> 00000000,00010000,00000000
>
> Among all printed messages, there's only "napi 1".
>
> There's also a difference in interrupt thread's CPU usage:
> 201570 root     -51   0       0      0      0 R  64.3  0.0   1:46.91 irq/109-enp2s0f
> 204687 root      20   0    9628   2084   1976 R  37.5  0.0   1:04.74 iperf3
> 205354 root      20   0    9628   2060   1948 R  36.7  0.0   1:01.06 iperf3
> 201567 root     -51   0       0      0      0 R  23.3  0.0   0:44.45 irq/106-enp2s0f
>
> The sender is CPU-bound, so there's no overload on RX side with tg3
>
> --
> Thanks
> Vitalii
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: tg3 RX packet re-order in queue 0 with RSS
  2021-11-01  7:06               ` Pavan Chebbi
@ 2021-11-01  8:20                 ` Vitaly Bursov
  2021-11-01  9:10                   ` Pavan Chebbi
  0 siblings, 1 reply; 13+ messages in thread
From: Vitaly Bursov @ 2021-11-01  8:20 UTC (permalink / raw)
  To: Pavan Chebbi
  Cc: Siva Reddy Kallam, Prashant Sreedharan, Michael Chan, Linux Netdev List



01.11.2021 09:06, Pavan Chebbi wrote:
> On Fri, Oct 29, 2021 at 9:15 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>>
>>
>>
>> 29.10.2021 08:04, Pavan Chebbi wrote:
>>> On Thu, Oct 28, 2021 at 9:11 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>>>>
>>>>
>>>> 28.10.2021 10:33, Pavan Chebbi wrote:
>>>>> On Wed, Oct 27, 2021 at 4:02 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>>>>>>
>>>>>>
>>>>>> 27.10.2021 12:30, Pavan Chebbi wrote:
>>>>>>> On Wed, Sep 22, 2021 at 12:10 PM Siva Reddy Kallam
>>>>>>> <siva.kallam@broadcom.com> wrote:
>>>>>>>>
>>>>>>>> Thank you for reporting this. Pavan(cc'd) from Broadcom looking into this issue.
>>>>>>>> We will provide our feedback very soon on this.
>>>>>>>>
>>>>>>>> On Mon, Sep 20, 2021 at 6:59 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> We found a occassional and random (sometimes happens, sometimes not)
>>>>>>>>> packet re-order when NIC is involved in UDP multicast reception, which
>>>>>>>>> is sensitive to a packet re-order. Network capture with tcpdump
>>>>>>>>> sometimes shows the packet re-order, sometimes not (e.g. no re-order on
>>>>>>>>> a host, re-order in a container at the same time). In a pcap file
>>>>>>>>> re-ordered packets have a correct timestamp - delayed packet had a more
>>>>>>>>> earlier timestamp compared to a previous packet:
>>>>>>>>>          1.00s packet1
>>>>>>>>>          1.20s packet3
>>>>>>>>>          1.10s packet2
>>>>>>>>>          1.30s packet4
>>>>>>>>>
>>>>>>>>> There's about 300Mbps of traffic on this NIC, and server is busy
>>>>>>>>> (hyper-threading enabled, about 50% overall idle) with its
>>>>>>>>> computational application work.
>>>>>>>>>
>>>>>>>>> NIC is HPE's 4-port 331i adapter - BCM5719, in a default ring and
>>>>>>>>> coalescing configuration, 1 TX queue, 4 RX queues.
>>>>>>>>>
>>>>>>>>> After further investigation, I believe that there are two separate
>>>>>>>>> issues in tg3.c driver. Issues can be reproduced with iperf3, and
>>>>>>>>> unicast UDP.
>>>>>>>>>
>>>>>>>>> Here are the details of how I understand this behavior.
>>>>>>>>>
>>>>>>>>> 1. Packet re-order.
>>>>>>>>>
>>>>>>>>> Driver calls napi_schedule(&tnapi->napi) when handling the interrupt,
>>>>>>>>> however, sometimes it calls napi_schedule(&tp->napi[1].napi), which
>>>>>>>>> handles RX queue 0 too:
>>>>>>>>>
>>>>>>>>>          https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/broadcom/tg3.c#L6802-L7007
>>>>>>>>>
>>>>>>>>>          static int tg3_rx(struct tg3_napi *tnapi, int budget)
>>>>>>>>>          {
>>>>>>>>>                  struct tg3 *tp = tnapi->tp;
>>>>>>>>>
>>>>>>>>>                  ...
>>>>>>>>>
>>>>>>>>>                  /* Refill RX ring(s). */
>>>>>>>>>                  if (!tg3_flag(tp, ENABLE_RSS)) {
>>>>>>>>>                          ....
>>>>>>>>>                  } else if (work_mask) {
>>>>>>>>>                          ...
>>>>>>>>>
>>>>>>>>>                          if (tnapi != &tp->napi[1]) {
>>>>>>>>>                                  tp->rx_refill = true;
>>>>>>>>>                                  napi_schedule(&tp->napi[1].napi);
>>>>>>>>>                          }
>>>>>>>>>                  }
>>>>>>>>>                  ...
>>>>>>>>>          }
>>>>>>>>>
>>>>>>>>>      From napi_schedule() code, it should schedure RX 0 traffic handling on
>>>>>>>>> a current CPU, which handles queues RX1-3 right now.
>>>>>>>>>
>>>>>>>>> At least two traffic flows are required - one on RX queue 0, and the
>>>>>>>>> other on any other queue (1-3). Re-ordering may happend only on flow
>>>>>>>>> from queue 0, the second flow will work fine.
>>>>>>>>>
>>>>>>>>> No idea how to fix this.
>>>>>>>
>>>>>>> In the case of RSS the actual rings for RX are from 1 to 4.
>>>>>>> The napi of those rings are indeed processing the packets.
>>>>>>> The explicit napi_schedule of napi[1] is only re-filling rx BD
>>>>>>> producer ring because it is shared with return rings for 1-4.
>>>>>>> I tried to repro this but I am not seeing the issue. If you are
>>>>>>> receiving packets on RX 0 then the RSS must have been disabled.
>>>>>>> Can you please check?
>>>>>>>
>>>>>>
>>>>>> # ethtool -i enp2s0f0
>>>>>> driver: tg3
>>>>>> version: 3.137
>>>>>> firmware-version: 5719-v1.46 NCSI v1.5.18.0
>>>>>> expansion-rom-version:
>>>>>> bus-info: 0000:02:00.0
>>>>>> supports-statistics: yes
>>>>>> supports-test: yes
>>>>>> supports-eeprom-access: yes
>>>>>> supports-register-dump: yes
>>>>>> supports-priv-flags: no
>>>>>>
>>>>>> # ethtool -l enp2s0f0
>>>>>> Channel parameters for enp2s0f0:
>>>>>> Pre-set maximums:
>>>>>> RX:             4
>>>>>> TX:             4
>>>>>> Other:          0
>>>>>> Combined:       0
>>>>>> Current hardware settings:
>>>>>> RX:             4
>>>>>> TX:             1
>>>>>> Other:          0
>>>>>> Combined:       0
>>>>>>
>>>>>> # ethtool -x enp2s0f0
>>>>>> RX flow hash indirection table for enp2s0f0 with 4 RX ring(s):
>>>>>>         0:      0     1     2     3     0     1     2     3
>>>>>>         8:      0     1     2     3     0     1     2     3
>>>>>>        16:      0     1     2     3     0     1     2     3
>>>>>>        24:      0     1     2     3     0     1     2     3
>>>>>>        32:      0     1     2     3     0     1     2     3
>>>>>>        40:      0     1     2     3     0     1     2     3
>>>>>>        48:      0     1     2     3     0     1     2     3
>>>>>>        56:      0     1     2     3     0     1     2     3
>>>>>>        64:      0     1     2     3     0     1     2     3
>>>>>>        72:      0     1     2     3     0     1     2     3
>>>>>>        80:      0     1     2     3     0     1     2     3
>>>>>>        88:      0     1     2     3     0     1     2     3
>>>>>>        96:      0     1     2     3     0     1     2     3
>>>>>>       104:      0     1     2     3     0     1     2     3
>>>>>>       112:      0     1     2     3     0     1     2     3
>>>>>>       120:      0     1     2     3     0     1     2     3
>>>>>> RSS hash key:
>>>>>> Operation not supported
>>>>>> RSS hash function:
>>>>>>         toeplitz: on
>>>>>>         xor: off
>>>>>>         crc32: off
>>>>>>
>>>>>> In /proc/interrupts there are enp2s0f0-tx-0, enp2s0f0-rx-1,
>>>>>> enp2s0f0-rx-2, enp2s0f0-rx-3, enp2s0f0-rx-4 interrupts, all on
>>>>>> different CPU cores. Kernel also has "threadirqs" enabled in
>>>>>> command line, I didn't check if this parameter affects the issue.
>>>>>>
>>>>>> Yes, some things start with 0, and others with 1, sorry for a confusion
>>>>>> in terminology, what I meant:
>>>>>>      - There are 4 RX rings/queues, I counted starting from 0, so: 0..3.
>>>>>>        RX0 is the first queue/ring that actually receives the traffic.
>>>>>>        RX0 is handled by enp2s0f0-rx-1 interrupt.
>>>>>>      - These are related to (tp->napi[i]), but i is in 1..4, so the first
>>>>>>        receiving queue relates to tp->napi[1], the second relates to
>>>>>>        tp->napi[2], and so on. Correct?
>>>>>>
>>>>>> Suppose, tg3_rx() is called for tp->napi[2], this function most likely
>>>>>> calls napi_gro_receive(&tnapi->napi, skb) to further process packets in
>>>>>> tp->napi[2]. And, under some conditions (RSS and work_mask), it calls
>>>>>> napi_schedule(&tp->napi[1].napi), which schedules tp->napi[1] work
>>>>>> on a currect CPU, which is designated for tp->napi[2], but not for
>>>>>> tp->napi[1]. Correct?
>>>>>>
>>>>>> I don't understand what napi_schedule(&tp->napi[1].napi) does for the
>>>>>> NIC or driver, "re-filling rx BD producer ring" sounds important. I
>>>>>> suspect something will break badly if I simply remove it without
>>>>>> replacing with something more elaborate. I guess along with re-filling
>>>>>> rx BD producer ring it also can process incoming packets. Is it possible?
>>>>>>
>>>>>
>>>>> Yes, napi[1] work may be called on the napi[2]'s CPU but it generally
>>>>> won't process
>>>>> any rx packets because the producer index of napi[1] has not changed. If the
>>>>> producer count did change, then we get a poll from the ISR for napi[1]
>>>>> to process
>>>>> packets. So it is mostly used to re-fill rx buffers when called
>>>>> explicitly. However
>>>>> there could be a small window where the prod index is incremented but the ISR
>>>>> is not fired yet. It may process some small no of packets. But I don't
>>>>> think this
>>>>> should lead to a reorder problem.
>>>>>
>>>>
>>>> I tried to reproduce without using bridge and veth interfaces, and it seems
>>>> like it's not reproducible, so traffic forwarding via a bridge interface may
>>>> be necessary. It also does not happen if traffic load is low, but moderate
>>>> load is enough - e.g. two 100 Mbps streams with 130-byte packets. It's easier
>>>> to reproduce with a higher load.
>>>>
>>>> With about the same setup as in an original message (bridge + veth 2
>>>> network namespaces), irqbalance daemon stopped, if traffic flows via
>>>> enp2s0f0-rx-2 and enp2s0f0-rx-4, there's no reordering. enp2s0f0-rx-1
>>>> still gets some interrupts, but at a much lower rate compared to 2 and
>>>> 4.
>>>>
>>>> namespace 1:
>>>>      # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
>>>>      - - - - - - - - - - - - - - - - - - - - - - - - -
>>>>      [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>>>>      [  4]   0.00-300.00 sec  6.72 GBytes   192 Mbits/sec  0.008 ms  3805/55508325 (0.0069%)
>>>>      [  4] Sent 55508325 datagrams
>>>>
>>>>      iperf Done.
>>>>
>>>> namespace 2:
>>>>      # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
>>>>      - - - - - - - - - - - - - - - - - - - - - - - - -
>>>>      [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>>>>      [  4]   0.00-300.00 sec  6.83 GBytes   196 Mbits/sec  0.005 ms  3873/56414001 (0.0069%)
>>>>      [  4] Sent 56414001 datagrams
>>>>
>>>>      iperf Done.
>>>>
>>>>
>>>> With the same configuration but different IP address so that instead of
>>>> enp2s0f0-rx-4 enp2s0f0-rx-1 would be used, there is a reordering.
>>>>
>>>>
>>>> namespace 1 (client IP was changed):
>>>>      # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
>>>>      - - - - - - - - - - - - - - - - - - - - - - - - -
>>>>      [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>>>>      [  4]   0.00-300.00 sec  6.32 GBytes   181 Mbits/sec  0.007 ms  8506/52172059 (0.016%)
>>>>      [  4] Sent 52172059 datagrams
>>>>      [SUM]  0.0-300.0 sec  2452 datagrams received out-of-order
>>>>
>>>>      iperf Done.
>>>>
>>>> namespace 2:
>>>>      # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
>>>>      - - - - - - - - - - - - - - - - - - - - - - - - -
>>>>      [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>>>>      [  4]   0.00-300.00 sec  6.59 GBytes   189 Mbits/sec  0.006 ms  6302/54463973 (0.012%)
>>>>      [  4] Sent 54463973 datagrams
>>>>
>>>>      iperf Done.
>>>>
>>>> Swapping IP addresses in these namespaces also changes the namespace exhibiting the issue,
>>>> it's following the IP address.
>>>>
>>>>
>>>> Is there something I could check to confirm that this behavior is or is not
>>>> related to napi_schedule(&tp->napi[1].napi) call?
>>>
>>> in the function tg3_msi_1shot() you could store the cpu assigned to
>>> tnapi1 (inside the struct tg3_napi)
>>> and then in tg3_poll_work() you can add another check after
>>>           if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr)
>>> something like
>>> if (tnapi == &tp->napi[1] && tnapi->assigned_cpu == smp_processor_id())
>>> only then execute tg3_rx()
>>>
>>> This may stop tnapi 1 from reading rx pkts on the current CPU from
>>> which refill is called.
>>>
>>
>> Didn't work for me, perhaps I did something wrong - if tg3_rx() is not called,
>> there's an infinite loop, and after I added "work_done = budget;", it still doesn't
>> work - traffic does not flow.
>>
> 
> I think the easiest way is to modify the tg3_rx() calling condition
> like below inside
> tg3_poll_work() :
> 
> if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr) {
>          if (tnapi != &tp->napi[1] || (tnapi == &tp->napi[1] &&
> !tp->rx_refill)) {
>                          work_done += tg3_rx(tnapi, budget - work_done);
>          }
> }
> 
> This will prevent reading rx packets when napi[1] is scheduled only for refill.
> Can you see if this works?
> 

It doesn't hang and can receive traffic with this change, but I don't see
a difference. I suspect that tg3_poll_work() is called again, maybe in tg3_poll_msix(),
and the refill happens first, and then the packets are processed anyway.

+static int tg3_cc;
+module_param(tg3_cc, int, 0644);
+MODULE_PARM_DESC(tg3_cc, "cpu check");
+
...
+		if (tnapi->assigned_cpu != smp_processor_id())
+			net_dbg_ratelimited("tg3 refill %d budget %d napi %ld cpu %d %d",
+			    tp->rx_refill, budget,
+			    tnapi - tp->napi, tnapi->assigned_cpu, smp_processor_id());
  		napi_gro_receive(&tnapi->napi, skb);
  
...
+        if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr) {
+                if (tnapi != &tp->napi[1] || (tnapi == &tp->napi[1] && !tp->rx_refill) || (tg3_cc == 0)) {
+                        work_done += tg3_rx(tnapi, budget - work_done);
+                }
+        }
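With 0644 permissions, module_param() also exposes the knob under sysfs, so
it can be flipped at runtime without reloading the driver, e.g.:

        echo 1 > /sys/module/tg3/parameters/tg3_cc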

with tg3_cc set to 1:

[212915.661886] net_ratelimit: 650710 callbacks suppressed
[212915.661889] tg3 refill 0 budget 64 napi 1 cpu 0 3
[212915.661890] tg3 refill 0 budget 63 napi 1 cpu 0 3
[212915.661891] tg3 refill 0 budget 62 napi 1 cpu 0 3
[212915.661892] tg3 refill 0 budget 61 napi 1 cpu 0 3
[212915.661893] tg3 refill 0 budget 60 napi 1 cpu 0 3
[212915.661915] tg3 refill 0 budget 64 napi 1 cpu 0 3
[212915.661916] tg3 refill 0 budget 63 napi 1 cpu 0 3
[212915.661917] tg3 refill 0 budget 62 napi 1 cpu 0 3
[212915.661918] tg3 refill 0 budget 61 napi 1 cpu 0 3
[212915.661919] tg3 refill 0 budget 60 napi 1 cpu 0 3
[212920.665912] net_ratelimit: 251117 callbacks suppressed
[212920.665914] tg3 refill 0 budget 64 napi 1 cpu 0 3
[212920.665915] tg3 refill 0 budget 63 napi 1 cpu 0 3
[212920.665917] tg3 refill 0 budget 62 napi 1 cpu 0 3
[212920.665918] tg3 refill 0 budget 61 napi 1 cpu 0 3
[212920.665919] tg3 refill 0 budget 60 napi 1 cpu 0 3
[212920.665932] tg3 refill 0 budget 64 napi 1 cpu 0 3
[212920.665933] tg3 refill 0 budget 63 napi 1 cpu 0 3
[212920.665935] tg3 refill 0 budget 62 napi 1 cpu 0 3
[212920.665936] tg3 refill 0 budget 61 napi 1 cpu 0 3
[212920.665937] tg3 refill 0 budget 60 napi 1 cpu 0 3

and with tg3_cc set to 0:

[213686.689867] tg3 refill 1 budget 64 napi 1 cpu 0 3
[213686.689869] tg3 refill 1 budget 63 napi 1 cpu 0 3
[213686.689870] tg3 refill 1 budget 62 napi 1 cpu 0 3
[213686.689871] tg3 refill 1 budget 61 napi 1 cpu 0 3
[213686.689872] tg3 refill 1 budget 60 napi 1 cpu 0 3
[213686.689890] tg3 refill 0 budget 64 napi 1 cpu 0 3
[213686.689891] tg3 refill 0 budget 63 napi 1 cpu 0 3
[213686.689892] tg3 refill 0 budget 62 napi 1 cpu 0 3
[213686.689893] tg3 refill 0 budget 61 napi 1 cpu 0 3

affinity:
echo 1 > /proc/irq/106/smp_affinity  # enp2s0f0-rx-1
echo 2 > /proc/irq/107/smp_affinity  # enp2s0f0-rx-2
echo 4 > /proc/irq/108/smp_affinity  # enp2s0f0-rx-3
echo 8 > /proc/irq/109/smp_affinity  # enp2s0f0-rx-4

-- 
Thanks
Vitalii


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: tg3 RX packet re-order in queue 0 with RSS
  2021-11-01  8:20                 ` Vitaly Bursov
@ 2021-11-01  9:10                   ` Pavan Chebbi
  2021-11-01 10:17                     ` Vitaly Bursov
  0 siblings, 1 reply; 13+ messages in thread
From: Pavan Chebbi @ 2021-11-01  9:10 UTC (permalink / raw)
  To: Vitaly Bursov
  Cc: Siva Reddy Kallam, Prashant Sreedharan, Michael Chan, Linux Netdev List

On Mon, Nov 1, 2021 at 1:50 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>
>
>
> 01.11.2021 09:06, Pavan Chebbi wrote:
> > On Fri, Oct 29, 2021 at 9:15 PM Vitaly Bursov <vitaly@bursov.com> wrote:
> >>
> >>
> >>
> >> 29.10.2021 08:04, Pavan Chebbi wrote:
> >>> On Thu, Oct 28, 2021 at 9:11 PM Vitaly Bursov <vitaly@bursov.com> wrote:
> >>>>
> >>>>
> >>>> 28.10.2021 10:33, Pavan Chebbi wrote:
> >>>>> On Wed, Oct 27, 2021 at 4:02 PM Vitaly Bursov <vitaly@bursov.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>> 27.10.2021 12:30, Pavan Chebbi wrote:
> >>>>>>> On Wed, Sep 22, 2021 at 12:10 PM Siva Reddy Kallam
> >>>>>>> <siva.kallam@broadcom.com> wrote:
> >>>>>>>>
> >>>>>>>> Thank you for reporting this. Pavan(cc'd) from Broadcom looking into this issue.
> >>>>>>>> We will provide our feedback very soon on this.
> >>>>>>>>
> >>>>>>>> On Mon, Sep 20, 2021 at 6:59 PM Vitaly Bursov <vitaly@bursov.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> We found a occassional and random (sometimes happens, sometimes not)
> >>>>>>>>> packet re-order when NIC is involved in UDP multicast reception, which
> >>>>>>>>> is sensitive to a packet re-order. Network capture with tcpdump
> >>>>>>>>> sometimes shows the packet re-order, sometimes not (e.g. no re-order on
> >>>>>>>>> a host, re-order in a container at the same time). In a pcap file
> >>>>>>>>> re-ordered packets have a correct timestamp - delayed packet had a more
> >>>>>>>>> earlier timestamp compared to a previous packet:
> >>>>>>>>>          1.00s packet1
> >>>>>>>>>          1.20s packet3
> >>>>>>>>>          1.10s packet2
> >>>>>>>>>          1.30s packet4
> >>>>>>>>>
> >>>>>>>>> There's about 300Mbps of traffic on this NIC, and server is busy
> >>>>>>>>> (hyper-threading enabled, about 50% overall idle) with its
> >>>>>>>>> computational application work.
> >>>>>>>>>
> >>>>>>>>> NIC is HPE's 4-port 331i adapter - BCM5719, in a default ring and
> >>>>>>>>> coalescing configuration, 1 TX queue, 4 RX queues.
> >>>>>>>>>
> >>>>>>>>> After further investigation, I believe that there are two separate
> >>>>>>>>> issues in tg3.c driver. Issues can be reproduced with iperf3, and
> >>>>>>>>> unicast UDP.
> >>>>>>>>>
> >>>>>>>>> Here are the details of how I understand this behavior.
> >>>>>>>>>
> >>>>>>>>> 1. Packet re-order.
> >>>>>>>>>
> >>>>>>>>> Driver calls napi_schedule(&tnapi->napi) when handling the interrupt,
> >>>>>>>>> however, sometimes it calls napi_schedule(&tp->napi[1].napi), which
> >>>>>>>>> handles RX queue 0 too:
> >>>>>>>>>
> >>>>>>>>>          https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/broadcom/tg3.c#L6802-L7007
> >>>>>>>>>
> >>>>>>>>>          static int tg3_rx(struct tg3_napi *tnapi, int budget)
> >>>>>>>>>          {
> >>>>>>>>>                  struct tg3 *tp = tnapi->tp;
> >>>>>>>>>
> >>>>>>>>>                  ...
> >>>>>>>>>
> >>>>>>>>>                  /* Refill RX ring(s). */
> >>>>>>>>>                  if (!tg3_flag(tp, ENABLE_RSS)) {
> >>>>>>>>>                          ....
> >>>>>>>>>                  } else if (work_mask) {
> >>>>>>>>>                          ...
> >>>>>>>>>
> >>>>>>>>>                          if (tnapi != &tp->napi[1]) {
> >>>>>>>>>                                  tp->rx_refill = true;
> >>>>>>>>>                                  napi_schedule(&tp->napi[1].napi);
> >>>>>>>>>                          }
> >>>>>>>>>                  }
> >>>>>>>>>                  ...
> >>>>>>>>>          }
> >>>>>>>>>
> >>>>>>>>>      From napi_schedule() code, it should schedure RX 0 traffic handling on
> >>>>>>>>> a current CPU, which handles queues RX1-3 right now.
> >>>>>>>>>
> >>>>>>>>> At least two traffic flows are required - one on RX queue 0, and the
> >>>>>>>>> other on any other queue (1-3). Re-ordering may happend only on flow
> >>>>>>>>> from queue 0, the second flow will work fine.
> >>>>>>>>>
> >>>>>>>>> No idea how to fix this.
> >>>>>>>
> >>>>>>> In the case of RSS the actual rings for RX are from 1 to 4.
> >>>>>>> The napi of those rings are indeed processing the packets.
> >>>>>>> The explicit napi_schedule of napi[1] is only re-filling rx BD
> >>>>>>> producer ring because it is shared with return rings for 1-4.
> >>>>>>> I tried to repro this but I am not seeing the issue. If you are
> >>>>>>> receiving packets on RX 0 then the RSS must have been disabled.
> >>>>>>> Can you please check?
> >>>>>>>
> >>>>>>
> >>>>>> # ethtool -i enp2s0f0
> >>>>>> driver: tg3
> >>>>>> version: 3.137
> >>>>>> firmware-version: 5719-v1.46 NCSI v1.5.18.0
> >>>>>> expansion-rom-version:
> >>>>>> bus-info: 0000:02:00.0
> >>>>>> supports-statistics: yes
> >>>>>> supports-test: yes
> >>>>>> supports-eeprom-access: yes
> >>>>>> supports-register-dump: yes
> >>>>>> supports-priv-flags: no
> >>>>>>
> >>>>>> # ethtool -l enp2s0f0
> >>>>>> Channel parameters for enp2s0f0:
> >>>>>> Pre-set maximums:
> >>>>>> RX:             4
> >>>>>> TX:             4
> >>>>>> Other:          0
> >>>>>> Combined:       0
> >>>>>> Current hardware settings:
> >>>>>> RX:             4
> >>>>>> TX:             1
> >>>>>> Other:          0
> >>>>>> Combined:       0
> >>>>>>
> >>>>>> # ethtool -x enp2s0f0
> >>>>>> RX flow hash indirection table for enp2s0f0 with 4 RX ring(s):
> >>>>>>         0:      0     1     2     3     0     1     2     3
> >>>>>>         8:      0     1     2     3     0     1     2     3
> >>>>>>        16:      0     1     2     3     0     1     2     3
> >>>>>>        24:      0     1     2     3     0     1     2     3
> >>>>>>        32:      0     1     2     3     0     1     2     3
> >>>>>>        40:      0     1     2     3     0     1     2     3
> >>>>>>        48:      0     1     2     3     0     1     2     3
> >>>>>>        56:      0     1     2     3     0     1     2     3
> >>>>>>        64:      0     1     2     3     0     1     2     3
> >>>>>>        72:      0     1     2     3     0     1     2     3
> >>>>>>        80:      0     1     2     3     0     1     2     3
> >>>>>>        88:      0     1     2     3     0     1     2     3
> >>>>>>        96:      0     1     2     3     0     1     2     3
> >>>>>>       104:      0     1     2     3     0     1     2     3
> >>>>>>       112:      0     1     2     3     0     1     2     3
> >>>>>>       120:      0     1     2     3     0     1     2     3
> >>>>>> RSS hash key:
> >>>>>> Operation not supported
> >>>>>> RSS hash function:
> >>>>>>         toeplitz: on
> >>>>>>         xor: off
> >>>>>>         crc32: off
> >>>>>>
> >>>>>> In /proc/interrupts there are enp2s0f0-tx-0, enp2s0f0-rx-1,
> >>>>>> enp2s0f0-rx-2, enp2s0f0-rx-3, enp2s0f0-rx-4 interrupts, all on
> >>>>>> different CPU cores. Kernel also has "threadirqs" enabled in
> >>>>>> command line, I didn't check if this parameter affects the issue.
> >>>>>>
> >>>>>> Yes, some things start with 0, and others with 1, sorry for a confusion
> >>>>>> in terminology, what I meant:
> >>>>>>      - There are 4 RX rings/queues, I counted starting from 0, so: 0..3.
> >>>>>>        RX0 is the first queue/ring that actually receives the traffic.
> >>>>>>        RX0 is handled by enp2s0f0-rx-1 interrupt.
> >>>>>>      - These are related to (tp->napi[i]), but i is in 1..4, so the first
> >>>>>>        receiving queue relates to tp->napi[1], the second relates to
> >>>>>>        tp->napi[2], and so on. Correct?
> >>>>>>
> >>>>>> Suppose, tg3_rx() is called for tp->napi[2], this function most likely
> >>>>>> calls napi_gro_receive(&tnapi->napi, skb) to further process packets in
> >>>>>> tp->napi[2]. And, under some conditions (RSS and work_mask), it calls
> >>>>>> napi_schedule(&tp->napi[1].napi), which schedules tp->napi[1] work
> >>>>>> on a currect CPU, which is designated for tp->napi[2], but not for
> >>>>>> tp->napi[1]. Correct?
> >>>>>>
> >>>>>> I don't understand what napi_schedule(&tp->napi[1].napi) does for the
> >>>>>> NIC or driver, "re-filling rx BD producer ring" sounds important. I
> >>>>>> suspect something will break badly if I simply remove it without
> >>>>>> replacing with something more elaborate. I guess along with re-filling
> >>>>>> rx BD producer ring it also can process incoming packets. Is it possible?
> >>>>>>
> >>>>>
> >>>>> Yes, napi[1] work may be called on the napi[2]'s CPU but it generally
> >>>>> won't process
> >>>>> any rx packets because the producer index of napi[1] has not changed. If the
> >>>>> producer count did change, then we get a poll from the ISR for napi[1]
> >>>>> to process
> >>>>> packets. So it is mostly used to re-fill rx buffers when called
> >>>>> explicitly. However
> >>>>> there could be a small window where the prod index is incremented but the ISR
> >>>>> is not fired yet. It may process some small no of packets. But I don't
> >>>>> think this
> >>>>> should lead to a reorder problem.
> >>>>>
> >>>>
> >>>> I tried to reproduce without using bridge and veth interfaces, and it seems
> >>>> like it's not reproducible, so traffic forwarding via a bridge interface may
> >>>> be necessary. It also does not happen if traffic load is low, but moderate
> >>>> load is enough - e.g. two 100 Mbps streams with 130-byte packets. It's easier
> >>>> to reproduce with a higher load.
> >>>>
> >>>> With about the same setup as in an original message (bridge + veth 2
> >>>> network namespaces), irqbalance daemon stopped, if traffic flows via
> >>>> enp2s0f0-rx-2 and enp2s0f0-rx-4, there's no reordering. enp2s0f0-rx-1
> >>>> still gets some interrupts, but at a much lower rate compared to 2 and
> >>>> 4.
> >>>>
> >>>> namespace 1:
> >>>>      # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
> >>>>      - - - - - - - - - - - - - - - - - - - - - - - - -
> >>>>      [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
> >>>>      [  4]   0.00-300.00 sec  6.72 GBytes   192 Mbits/sec  0.008 ms  3805/55508325 (0.0069%)
> >>>>      [  4] Sent 55508325 datagrams
> >>>>
> >>>>      iperf Done.
> >>>>
> >>>> namespace 2:
> >>>>      # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
> >>>>      - - - - - - - - - - - - - - - - - - - - - - - - -
> >>>>      [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
> >>>>      [  4]   0.00-300.00 sec  6.83 GBytes   196 Mbits/sec  0.005 ms  3873/56414001 (0.0069%)
> >>>>      [  4] Sent 56414001 datagrams
> >>>>
> >>>>      iperf Done.
> >>>>
> >>>>
> >>>> With the same configuration but different IP address so that instead of
> >>>> enp2s0f0-rx-4 enp2s0f0-rx-1 would be used, there is a reordering.
> >>>>
> >>>>
> >>>> namespace 1 (client IP was changed):
> >>>>      # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
> >>>>      - - - - - - - - - - - - - - - - - - - - - - - - -
> >>>>      [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
> >>>>      [  4]   0.00-300.00 sec  6.32 GBytes   181 Mbits/sec  0.007 ms  8506/52172059 (0.016%)
> >>>>      [  4] Sent 52172059 datagrams
> >>>>      [SUM]  0.0-300.0 sec  2452 datagrams received out-of-order
> >>>>
> >>>>      iperf Done.
> >>>>
> >>>> namespace 2:
> >>>>      # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
> >>>>      - - - - - - - - - - - - - - - - - - - - - - - - -
> >>>>      [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
> >>>>      [  4]   0.00-300.00 sec  6.59 GBytes   189 Mbits/sec  0.006 ms  6302/54463973 (0.012%)
> >>>>      [  4] Sent 54463973 datagrams
> >>>>
> >>>>      iperf Done.
> >>>>
> >>>> Swapping IP addresses in these namespaces also changes the namespace exhibiting the issue,
> >>>> it's following the IP address.
> >>>>
> >>>>
> >>>> Is there something I could check to confirm that this behavior is or is not
> >>>> related to napi_schedule(&tp->napi[1].napi) call?
> >>>
> >>> in the function tg3_msi_1shot() you could store the cpu assigned to
> >>> tnapi1 (inside the struct tg3_napi)
> >>> and then in tg3_poll_work() you can add another check after
> >>>           if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr)
> >>> something like
> >>> if (tnapi == &tp->napi[1] && tnapi->assigned_cpu == smp_processor_id())
> >>> only then execute tg3_rx()
> >>>
> >>> This may stop tnapi 1 from reading rx pkts on the current CPU from
> >>> which refill is called.
> >>>
> >>
> >> Didn't work for me, perhaps I did something wrong - if tg3_rx() is not called,
> >> there's an infinite loop, and after I added "work_done = budget;", it still doesn't
> >> work - traffic does not flow.
> >>
> >
> > I think the easiest way is to modify the tg3_rx() calling condition
> > like below inside
> > tg3_poll_work() :
> >
> > if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr) {
> >          if (tnapi != &tp->napi[1] || (tnapi == &tp->napi[1] &&
> > !tp->rx_refill)) {
> >                          work_done += tg3_rx(tnapi, budget - work_done);
> >          }
> > }
> >
> > This will prevent reading rx packets when napi[1] is scheduled only for refill.
> > Can you see if this works?
> >
>
> It doesn't hang and can receive the traffic with this change, but I don't see
> a difference. I'm suspectig that tg3_poll_work() is called again, maybe in tg3_poll_msix(),
> and refill happens first, and then packets are processed anyway.
>

OK, I see it now. Let me try this out myself; I will get back on this.
In the meantime, can you check with your debug prints whether there is any
correlation between the time and number of prints where napi[1] reads packets
on an unassigned CPU and the time and number of packets received out of order
up the stack? Do they match? If not, we may be incorrectly suspecting napi[1]
here.

> +static int tg3_cc;
> +module_param(tg3_cc, int, 0644);
> +MODULE_PARM_DESC(tg3_cc, "cpu check");
> +
> ...
> +               if (tnapi->assigned_cpu != smp_processor_id())
> +                       net_dbg_ratelimited("tg3 refill %d budget %d napi %ld cpu %d %d",
> +                           tp->rx_refill, budget,
> +                           tnapi - tp->napi, tnapi->assigned_cpu, smp_processor_id());
>                 napi_gro_receive(&tnapi->napi, skb);
>
> ...
> +        if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr) {
> +                if (tnapi != &tp->napi[1] || (tnapi == &tp->napi[1] && !tp->rx_refill) || (tg3_cc == 0)) {
> +                        work_done += tg3_rx(tnapi, budget - work_done);
> +                }
> +        }
>
> with tg3_cc set to 1:
>
> [212915.661886] net_ratelimit: 650710 callbacks suppressed
> [212915.661889] tg3 refill 0 budget 64 napi 1 cpu 0 3
> [212915.661890] tg3 refill 0 budget 63 napi 1 cpu 0 3
> [212915.661891] tg3 refill 0 budget 62 napi 1 cpu 0 3
> [212915.661892] tg3 refill 0 budget 61 napi 1 cpu 0 3
> [212915.661893] tg3 refill 0 budget 60 napi 1 cpu 0 3
> [212915.661915] tg3 refill 0 budget 64 napi 1 cpu 0 3
> [212915.661916] tg3 refill 0 budget 63 napi 1 cpu 0 3
> [212915.661917] tg3 refill 0 budget 62 napi 1 cpu 0 3
> [212915.661918] tg3 refill 0 budget 61 napi 1 cpu 0 3
> [212915.661919] tg3 refill 0 budget 60 napi 1 cpu 0 3
> [212920.665912] net_ratelimit: 251117 callbacks suppressed
> [212920.665914] tg3 refill 0 budget 64 napi 1 cpu 0 3
> [212920.665915] tg3 refill 0 budget 63 napi 1 cpu 0 3
> [212920.665917] tg3 refill 0 budget 62 napi 1 cpu 0 3
> [212920.665918] tg3 refill 0 budget 61 napi 1 cpu 0 3
> [212920.665919] tg3 refill 0 budget 60 napi 1 cpu 0 3
> [212920.665932] tg3 refill 0 budget 64 napi 1 cpu 0 3
> [212920.665933] tg3 refill 0 budget 63 napi 1 cpu 0 3
> [212920.665935] tg3 refill 0 budget 62 napi 1 cpu 0 3
> [212920.665936] tg3 refill 0 budget 61 napi 1 cpu 0 3
> [212920.665937] tg3 refill 0 budget 60 napi 1 cpu 0 3
>
> and with tg3_cc set to 0:
>
> [213686.689867] tg3 refill 1 budget 64 napi 1 cpu 0 3
> [213686.689869] tg3 refill 1 budget 63 napi 1 cpu 0 3
> [213686.689870] tg3 refill 1 budget 62 napi 1 cpu 0 3
> [213686.689871] tg3 refill 1 budget 61 napi 1 cpu 0 3
> [213686.689872] tg3 refill 1 budget 60 napi 1 cpu 0 3
> [213686.689890] tg3 refill 0 budget 64 napi 1 cpu 0 3
> [213686.689891] tg3 refill 0 budget 63 napi 1 cpu 0 3
> [213686.689892] tg3 refill 0 budget 62 napi 1 cpu 0 3
> [213686.689893] tg3 refill 0 budget 61 napi 1 cpu 0 3
>
> affinity:
> echo 1 > /proc/irq/106/smp_affinity  # enp2s0f0-rx-1
> echo 2 > /proc/irq/107/smp_affinity  # enp2s0f0-rx-2
> echo 4 > /proc/irq/108/smp_affinity  # enp2s0f0-rx-3
> echo 8 > /proc/irq/109/smp_affinity  # enp2s0f0-rx-4
>
> --
> Thanks
> Vitalii
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: tg3 RX packet re-order in queue 0 with RSS
  2021-11-01  9:10                   ` Pavan Chebbi
@ 2021-11-01 10:17                     ` Vitaly Bursov
  2022-09-21 19:04                       ` Etienne Champetier
  0 siblings, 1 reply; 13+ messages in thread
From: Vitaly Bursov @ 2021-11-01 10:17 UTC (permalink / raw)
  To: Pavan Chebbi
  Cc: Siva Reddy Kallam, Prashant Sreedharan, Michael Chan, Linux Netdev List



01.11.2021 11:10, Pavan Chebbi wrote:
> On Mon, Nov 1, 2021 at 1:50 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>>
>>
>>
>> 01.11.2021 09:06, Pavan Chebbi wrote:
>>> On Fri, Oct 29, 2021 at 9:15 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>>>>
>>>>
>>>>
>>>> 29.10.2021 08:04, Pavan Chebbi пишет:
> >>>>> On Thu, Oct 28, 2021 at 9:11 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>>>>>>
>>>>>>
>>>>>> 28.10.2021 10:33, Pavan Chebbi wrote:
>>>>>>> On Wed, Oct 27, 2021 at 4:02 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> 27.10.2021 12:30, Pavan Chebbi wrote:
>>>>>>>>> On Wed, Sep 22, 2021 at 12:10 PM Siva Reddy Kallam
>>>>>>>>> <siva.kallam@broadcom.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Thank you for reporting this. Pavan(cc'd) from Broadcom looking into this issue.
>>>>>>>>>> We will provide our feedback very soon on this.
>>>>>>>>>>
>>>>>>>>>> On Mon, Sep 20, 2021 at 6:59 PM Vitaly Bursov <vitaly@bursov.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> We found a occassional and random (sometimes happens, sometimes not)
>>>>>>>>>>> packet re-order when NIC is involved in UDP multicast reception, which
>>>>>>>>>>> is sensitive to a packet re-order. Network capture with tcpdump
>>>>>>>>>>> sometimes shows the packet re-order, sometimes not (e.g. no re-order on
>>>>>>>>>>> a host, re-order in a container at the same time). In a pcap file
>>>>>>>>>>> re-ordered packets have a correct timestamp - delayed packet had a more
>>>>>>>>>>> earlier timestamp compared to a previous packet:
>>>>>>>>>>>           1.00s packet1
>>>>>>>>>>>           1.20s packet3
>>>>>>>>>>>           1.10s packet2
>>>>>>>>>>>           1.30s packet4
>>>>>>>>>>>
>>>>>>>>>>> There's about 300Mbps of traffic on this NIC, and server is busy
>>>>>>>>>>> (hyper-threading enabled, about 50% overall idle) with its
>>>>>>>>>>> computational application work.
>>>>>>>>>>>
>>>>>>>>>>> NIC is HPE's 4-port 331i adapter - BCM5719, in a default ring and
>>>>>>>>>>> coalescing configuration, 1 TX queue, 4 RX queues.
>>>>>>>>>>>
>>>>>>>>>>> After further investigation, I believe that there are two separate
>>>>>>>>>>> issues in tg3.c driver. Issues can be reproduced with iperf3, and
>>>>>>>>>>> unicast UDP.
>>>>>>>>>>>
>>>>>>>>>>> Here are the details of how I understand this behavior.
>>>>>>>>>>>
>>>>>>>>>>> 1. Packet re-order.
>>>>>>>>>>>
>>>>>>>>>>> Driver calls napi_schedule(&tnapi->napi) when handling the interrupt,
>>>>>>>>>>> however, sometimes it calls napi_schedule(&tp->napi[1].napi), which
>>>>>>>>>>> handles RX queue 0 too:
>>>>>>>>>>>
>>>>>>>>>>>           https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/broadcom/tg3.c#L6802-L7007
>>>>>>>>>>>
>>>>>>>>>>>           static int tg3_rx(struct tg3_napi *tnapi, int budget)
>>>>>>>>>>>           {
>>>>>>>>>>>                   struct tg3 *tp = tnapi->tp;
>>>>>>>>>>>
>>>>>>>>>>>                   ...
>>>>>>>>>>>
>>>>>>>>>>>                   /* Refill RX ring(s). */
>>>>>>>>>>>                   if (!tg3_flag(tp, ENABLE_RSS)) {
>>>>>>>>>>>                           ....
>>>>>>>>>>>                   } else if (work_mask) {
>>>>>>>>>>>                           ...
>>>>>>>>>>>
>>>>>>>>>>>                           if (tnapi != &tp->napi[1]) {
>>>>>>>>>>>                                   tp->rx_refill = true;
>>>>>>>>>>>                                   napi_schedule(&tp->napi[1].napi);
>>>>>>>>>>>                           }
>>>>>>>>>>>                   }
>>>>>>>>>>>                   ...
>>>>>>>>>>>           }
>>>>>>>>>>>
>>>>>>>>>>>       From napi_schedule() code, it should schedure RX 0 traffic handling on
>>>>>>>>>>> a current CPU, which handles queues RX1-3 right now.
>>>>>>>>>>>
>>>>>>>>>>> At least two traffic flows are required - one on RX queue 0, and the
>>>>>>>>>>> other on any other queue (1-3). Re-ordering may happend only on flow
>>>>>>>>>>> from queue 0, the second flow will work fine.
>>>>>>>>>>>
>>>>>>>>>>> No idea how to fix this.
>>>>>>>>>
>>>>>>>>> In the case of RSS the actual rings for RX are from 1 to 4.
>>>>>>>>> The napi of those rings are indeed processing the packets.
>>>>>>>>> The explicit napi_schedule of napi[1] is only re-filling rx BD
>>>>>>>>> producer ring because it is shared with return rings for 1-4.
>>>>>>>>> I tried to repro this but I am not seeing the issue. If you are
>>>>>>>>> receiving packets on RX 0 then the RSS must have been disabled.
>>>>>>>>> Can you please check?
>>>>>>>>>
>>>>>>>>
>>>>>>>> # ethtool -i enp2s0f0
>>>>>>>> driver: tg3
>>>>>>>> version: 3.137
>>>>>>>> firmware-version: 5719-v1.46 NCSI v1.5.18.0
>>>>>>>> expansion-rom-version:
>>>>>>>> bus-info: 0000:02:00.0
>>>>>>>> supports-statistics: yes
>>>>>>>> supports-test: yes
>>>>>>>> supports-eeprom-access: yes
>>>>>>>> supports-register-dump: yes
>>>>>>>> supports-priv-flags: no
>>>>>>>>
>>>>>>>> # ethtool -l enp2s0f0
>>>>>>>> Channel parameters for enp2s0f0:
>>>>>>>> Pre-set maximums:
>>>>>>>> RX:             4
>>>>>>>> TX:             4
>>>>>>>> Other:          0
>>>>>>>> Combined:       0
>>>>>>>> Current hardware settings:
>>>>>>>> RX:             4
>>>>>>>> TX:             1
>>>>>>>> Other:          0
>>>>>>>> Combined:       0
>>>>>>>>
>>>>>>>> # ethtool -x enp2s0f0
>>>>>>>> RX flow hash indirection table for enp2s0f0 with 4 RX ring(s):
>>>>>>>>          0:      0     1     2     3     0     1     2     3
>>>>>>>>          8:      0     1     2     3     0     1     2     3
>>>>>>>>         16:      0     1     2     3     0     1     2     3
>>>>>>>>         24:      0     1     2     3     0     1     2     3
>>>>>>>>         32:      0     1     2     3     0     1     2     3
>>>>>>>>         40:      0     1     2     3     0     1     2     3
>>>>>>>>         48:      0     1     2     3     0     1     2     3
>>>>>>>>         56:      0     1     2     3     0     1     2     3
>>>>>>>>         64:      0     1     2     3     0     1     2     3
>>>>>>>>         72:      0     1     2     3     0     1     2     3
>>>>>>>>         80:      0     1     2     3     0     1     2     3
>>>>>>>>         88:      0     1     2     3     0     1     2     3
>>>>>>>>         96:      0     1     2     3     0     1     2     3
>>>>>>>>        104:      0     1     2     3     0     1     2     3
>>>>>>>>        112:      0     1     2     3     0     1     2     3
>>>>>>>>        120:      0     1     2     3     0     1     2     3
>>>>>>>> RSS hash key:
>>>>>>>> Operation not supported
>>>>>>>> RSS hash function:
>>>>>>>>          toeplitz: on
>>>>>>>>          xor: off
>>>>>>>>          crc32: off
>>>>>>>>
>>>>>>>> In /proc/interrupts there are enp2s0f0-tx-0, enp2s0f0-rx-1,
>>>>>>>> enp2s0f0-rx-2, enp2s0f0-rx-3, enp2s0f0-rx-4 interrupts, all on
>>>>>>>> different CPU cores. Kernel also has "threadirqs" enabled in
>>>>>>>> command line, I didn't check if this parameter affects the issue.
>>>>>>>>
>>>>>>>> Yes, some things start with 0, and others with 1, sorry for a confusion
>>>>>>>> in terminology, what I meant:
>>>>>>>>       - There are 4 RX rings/queues, I counted starting from 0, so: 0..3.
>>>>>>>>         RX0 is the first queue/ring that actually receives the traffic.
>>>>>>>>         RX0 is handled by enp2s0f0-rx-1 interrupt.
>>>>>>>>       - These are related to (tp->napi[i]), but i is in 1..4, so the first
>>>>>>>>         receiving queue relates to tp->napi[1], the second relates to
>>>>>>>>         tp->napi[2], and so on. Correct?
>>>>>>>>
>>>>>>>> Suppose, tg3_rx() is called for tp->napi[2], this function most likely
>>>>>>>> calls napi_gro_receive(&tnapi->napi, skb) to further process packets in
>>>>>>>> tp->napi[2]. And, under some conditions (RSS and work_mask), it calls
>>>>>>>> napi_schedule(&tp->napi[1].napi), which schedules tp->napi[1] work
> >>>>>>>> on the current CPU, which is designated for tp->napi[2], but not for
>>>>>>>> tp->napi[1]. Correct?
>>>>>>>>
>>>>>>>> I don't understand what napi_schedule(&tp->napi[1].napi) does for the
>>>>>>>> NIC or driver, "re-filling rx BD producer ring" sounds important. I
>>>>>>>> suspect something will break badly if I simply remove it without
>>>>>>>> replacing with something more elaborate. I guess along with re-filling
>>>>>>>> rx BD producer ring it also can process incoming packets. Is it possible?
>>>>>>>>
>>>>>>>
>>>>>>> Yes, napi[1] work may be called on the napi[2]'s CPU but it generally
>>>>>>> won't process
>>>>>>> any rx packets because the producer index of napi[1] has not changed. If the
>>>>>>> producer count did change, then we get a poll from the ISR for napi[1]
>>>>>>> to process
>>>>>>> packets. So it is mostly used to re-fill rx buffers when called
>>>>>>> explicitly. However
>>>>>>> there could be a small window where the prod index is incremented but the ISR
>>>>>>> is not fired yet. It may process some small no of packets. But I don't
>>>>>>> think this
>>>>>>> should lead to a reorder problem.
>>>>>>>
>>>>>>
>>>>>> I tried to reproduce without using bridge and veth interfaces, and it seems
>>>>>> like it's not reproducible, so traffic forwarding via a bridge interface may
>>>>>> be necessary. It also does not happen if traffic load is low, but moderate
>>>>>> load is enough - e.g. two 100 Mbps streams with 130-byte packets. It's easier
>>>>>> to reproduce with a higher load.
>>>>>>
>>>>>> With about the same setup as in an original message (bridge + veth 2
>>>>>> network namespaces), irqbalance daemon stopped, if traffic flows via
>>>>>> enp2s0f0-rx-2 and enp2s0f0-rx-4, there's no reordering. enp2s0f0-rx-1
>>>>>> still gets some interrupts, but at a much lower rate compared to 2 and
>>>>>> 4.
>>>>>>
>>>>>> namespace 1:
>>>>>>       # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
>>>>>>       - - - - - - - - - - - - - - - - - - - - - - - - -
>>>>>>       [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>>>>>>       [  4]   0.00-300.00 sec  6.72 GBytes   192 Mbits/sec  0.008 ms  3805/55508325 (0.0069%)
>>>>>>       [  4] Sent 55508325 datagrams
>>>>>>
>>>>>>       iperf Done.
>>>>>>
>>>>>> namespace 2:
>>>>>>       # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
>>>>>>       - - - - - - - - - - - - - - - - - - - - - - - - -
>>>>>>       [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>>>>>>       [  4]   0.00-300.00 sec  6.83 GBytes   196 Mbits/sec  0.005 ms  3873/56414001 (0.0069%)
>>>>>>       [  4] Sent 56414001 datagrams
>>>>>>
>>>>>>       iperf Done.
>>>>>>
>>>>>>
>>>>>> With the same configuration but different IP address so that instead of
>>>>>> enp2s0f0-rx-4 enp2s0f0-rx-1 would be used, there is a reordering.
>>>>>>
>>>>>>
>>>>>> namespace 1 (client IP was changed):
>>>>>>       # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
>>>>>>       - - - - - - - - - - - - - - - - - - - - - - - - -
>>>>>>       [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>>>>>>       [  4]   0.00-300.00 sec  6.32 GBytes   181 Mbits/sec  0.007 ms  8506/52172059 (0.016%)
>>>>>>       [  4] Sent 52172059 datagrams
>>>>>>       [SUM]  0.0-300.0 sec  2452 datagrams received out-of-order
>>>>>>
>>>>>>       iperf Done.
>>>>>>
>>>>>> namespace 2:
>>>>>>       # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
>>>>>>       - - - - - - - - - - - - - - - - - - - - - - - - -
>>>>>>       [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>>>>>>       [  4]   0.00-300.00 sec  6.59 GBytes   189 Mbits/sec  0.006 ms  6302/54463973 (0.012%)
>>>>>>       [  4] Sent 54463973 datagrams
>>>>>>
>>>>>>       iperf Done.
>>>>>>
>>>>>> Swapping IP addresses in these namespaces also changes the namespace exhibiting the issue,
>>>>>> it's following the IP address.
>>>>>>
>>>>>>
>>>>>> Is there something I could check to confirm that this behavior is or is not
>>>>>> related to napi_schedule(&tp->napi[1].napi) call?
>>>>>
>>>>> in the function tg3_msi_1shot() you could store the cpu assigned to
>>>>> tnapi1 (inside the struct tg3_napi)
>>>>> and then in tg3_poll_work() you can add another check after
>>>>>            if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr)
>>>>> something like
>>>>> if (tnapi == &tp->napi[1] && tnapi->assigned_cpu == smp_processor_id())
>>>>> only then execute tg3_rx()
>>>>>
>>>>> This may stop tnapi 1 from reading rx pkts on the current CPU from
>>>>> which refill is called.
>>>>>
>>>>
>>>> Didn't work for me, perhaps I did something wrong - if tg3_rx() is not called,
>>>> there's an infinite loop, and after I added "work_done = budget;", it still doesn't
>>>> work - traffic does not flow.
>>>>
>>>
>>> I think the easiest way is to modify the tg3_rx() calling condition
>>> like below inside
>>> tg3_poll_work() :
>>>
>>> if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr) {
>>>           if (tnapi != &tp->napi[1] || (tnapi == &tp->napi[1] &&
>>> !tp->rx_refill)) {
>>>                           work_done += tg3_rx(tnapi, budget - work_done);
>>>           }
>>> }
>>>
>>> This will prevent reading rx packets when napi[1] is scheduled only for refill.
>>> Can you see if this works?
>>>
>>
>> It doesn't hang and can receive the traffic with this change, but I don't see
> >> a difference. I'm suspecting that tg3_poll_work() is called again, maybe in tg3_poll_msix(),
>> and refill happens first, and then packets are processed anyway.
>>
> 
> OK I see it now. Let me try this out myself. Will get back on this.
> However, can you see with your debug prints if there is any correlation
> between the time and number of prints where napi 1 is reading packets
> > on an unassigned CPU and the time and number of packets you received
> out of order up the stack? Do they match with each other? If not, we may be
> incorrectly suspecting napi1 here.
> 

No correlation that I can see - reordered packets are received sometimes -
10000 in 300 seconds in this test, but napi messages are logged and
rate-limited at about 100000 per second. If bandwidth is very low, then
there are no messages and no reordering. Not sure if I can isolate these
events specifically.
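
One rough idea, for the runs where the capture itself shows the inversion, is
to pull the exact times of the reorders out of a pcap and compare them against
the per-second counts of the napi prints (the file name and port below are just
placeholders for this test):

    # list capture timestamps that go backwards, i.e. a packet delivered
    # after one carrying a later timestamp
    tcpdump -r test.pcap -tt udp port 5000 2>/dev/null | \
        awk '$1 < prev { print "reorder around " $1 } { prev = $1 }'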

-- 
Thanks
Vitalii


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: tg3 RX packet re-order in queue 0 with RSS
  2021-11-01 10:17                     ` Vitaly Bursov
@ 2022-09-21 19:04                       ` Etienne Champetier
  0 siblings, 0 replies; 13+ messages in thread
From: Etienne Champetier @ 2022-09-21 19:04 UTC (permalink / raw)
  To: Vitaly Bursov
  Cc: Pavan Chebbi, Siva Reddy Kallam, Prashant Sreedharan,
	Michael Chan, Linux Netdev List

On Mon, Nov 1, 2021 at 06:17, Vitaly Bursov <vitaly@bursov.com> wrote:
>
>
>
> 01.11.2021 11:10, Pavan Chebbi wrote:
> > On Mon, Nov 1, 2021 at 1:50 PM Vitaly Bursov <vitaly@bursov.com> wrote:
> >>
> >>
> >>
> >> 01.11.2021 09:06, Pavan Chebbi wrote:
> >>> On Fri, Oct 29, 2021 at 9:15 PM Vitaly Bursov <vitaly@bursov.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> 29.10.2021 08:04, Pavan Chebbi wrote:
> >>>>> On Thu, Oct 28, 2021 at 9:11 PM Vitaly Bursov <vitaly@bursov.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>> 28.10.2021 10:33, Pavan Chebbi wrote:
> >>>>>>> On Wed, Oct 27, 2021 at 4:02 PM Vitaly Bursov <vitaly@bursov.com> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> 27.10.2021 12:30, Pavan Chebbi wrote:
> >>>>>>>>> On Wed, Sep 22, 2021 at 12:10 PM Siva Reddy Kallam
> >>>>>>>>> <siva.kallam@broadcom.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Thank you for reporting this. Pavan(cc'd) from Broadcom looking into this issue.
> >>>>>>>>>> We will provide our feedback very soon on this.
> >>>>>>>>>>
> >>>>>>>>>> On Mon, Sep 20, 2021 at 6:59 PM Vitaly Bursov <vitaly@bursov.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi,
> >>>>>>>>>>>
> >>>>>>>>>>> We found a occassional and random (sometimes happens, sometimes not)
> >>>>>>>>>>> packet re-order when NIC is involved in UDP multicast reception, which
> >>>>>>>>>>> is sensitive to a packet re-order. Network capture with tcpdump
> >>>>>>>>>>> sometimes shows the packet re-order, sometimes not (e.g. no re-order on
> >>>>>>>>>>> a host, re-order in a container at the same time). In a pcap file
> >>>>>>>>>>> re-ordered packets have a correct timestamp - delayed packet had a more
> >>>>>>>>>>> earlier timestamp compared to a previous packet:
> >>>>>>>>>>>           1.00s packet1
> >>>>>>>>>>>           1.20s packet3
> >>>>>>>>>>>           1.10s packet2
> >>>>>>>>>>>           1.30s packet4
> >>>>>>>>>>>
> >>>>>>>>>>> There's about 300Mbps of traffic on this NIC, and server is busy
> >>>>>>>>>>> (hyper-threading enabled, about 50% overall idle) with its
> >>>>>>>>>>> computational application work.
> >>>>>>>>>>>
> >>>>>>>>>>> NIC is HPE's 4-port 331i adapter - BCM5719, in a default ring and
> >>>>>>>>>>> coalescing configuration, 1 TX queue, 4 RX queues.
> >>>>>>>>>>>
> >>>>>>>>>>> After further investigation, I believe that there are two separate
> >>>>>>>>>>> issues in tg3.c driver. Issues can be reproduced with iperf3, and
> >>>>>>>>>>> unicast UDP.
> >>>>>>>>>>>
> >>>>>>>>>>> Here are the details of how I understand this behavior.
> >>>>>>>>>>>
> >>>>>>>>>>> 1. Packet re-order.
> >>>>>>>>>>>
> >>>>>>>>>>> Driver calls napi_schedule(&tnapi->napi) when handling the interrupt,
> >>>>>>>>>>> however, sometimes it calls napi_schedule(&tp->napi[1].napi), which
> >>>>>>>>>>> handles RX queue 0 too:
> >>>>>>>>>>>
> >>>>>>>>>>>           https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/broadcom/tg3.c#L6802-L7007
> >>>>>>>>>>>
> >>>>>>>>>>>           static int tg3_rx(struct tg3_napi *tnapi, int budget)
> >>>>>>>>>>>           {
> >>>>>>>>>>>                   struct tg3 *tp = tnapi->tp;
> >>>>>>>>>>>
> >>>>>>>>>>>                   ...
> >>>>>>>>>>>
> >>>>>>>>>>>                   /* Refill RX ring(s). */
> >>>>>>>>>>>                   if (!tg3_flag(tp, ENABLE_RSS)) {
> >>>>>>>>>>>                           ....
> >>>>>>>>>>>                   } else if (work_mask) {
> >>>>>>>>>>>                           ...
> >>>>>>>>>>>
> >>>>>>>>>>>                           if (tnapi != &tp->napi[1]) {
> >>>>>>>>>>>                                   tp->rx_refill = true;
> >>>>>>>>>>>                                   napi_schedule(&tp->napi[1].napi);
> >>>>>>>>>>>                           }
> >>>>>>>>>>>                   }
> >>>>>>>>>>>                   ...
> >>>>>>>>>>>           }
> >>>>>>>>>>>
> >>>>>>>>>>>       From napi_schedule() code, it should schedure RX 0 traffic handling on
> >>>>>>>>>>> a current CPU, which handles queues RX1-3 right now.
> >>>>>>>>>>>
> >>>>>>>>>>> At least two traffic flows are required - one on RX queue 0, and the
> >>>>>>>>>>> other on any other queue (1-3). Re-ordering may happend only on flow
> >>>>>>>>>>> from queue 0, the second flow will work fine.
> >>>>>>>>>>>
> >>>>>>>>>>> No idea how to fix this.
> >>>>>>>>>
> >>>>>>>>> In the case of RSS the actual rings for RX are from 1 to 4.
> >>>>>>>>> The napi of those rings are indeed processing the packets.
> >>>>>>>>> The explicit napi_schedule of napi[1] is only re-filling rx BD
> >>>>>>>>> producer ring because it is shared with return rings for 1-4.
> >>>>>>>>> I tried to repro this but I am not seeing the issue. If you are
> >>>>>>>>> receiving packets on RX 0 then the RSS must have been disabled.
> >>>>>>>>> Can you please check?
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> # ethtool -i enp2s0f0
> >>>>>>>> driver: tg3
> >>>>>>>> version: 3.137
> >>>>>>>> firmware-version: 5719-v1.46 NCSI v1.5.18.0
> >>>>>>>> expansion-rom-version:
> >>>>>>>> bus-info: 0000:02:00.0
> >>>>>>>> supports-statistics: yes
> >>>>>>>> supports-test: yes
> >>>>>>>> supports-eeprom-access: yes
> >>>>>>>> supports-register-dump: yes
> >>>>>>>> supports-priv-flags: no
> >>>>>>>>
> >>>>>>>> # ethtool -l enp2s0f0
> >>>>>>>> Channel parameters for enp2s0f0:
> >>>>>>>> Pre-set maximums:
> >>>>>>>> RX:             4
> >>>>>>>> TX:             4
> >>>>>>>> Other:          0
> >>>>>>>> Combined:       0
> >>>>>>>> Current hardware settings:
> >>>>>>>> RX:             4
> >>>>>>>> TX:             1
> >>>>>>>> Other:          0
> >>>>>>>> Combined:       0
> >>>>>>>>
> >>>>>>>> # ethtool -x enp2s0f0
> >>>>>>>> RX flow hash indirection table for enp2s0f0 with 4 RX ring(s):
> >>>>>>>>          0:      0     1     2     3     0     1     2     3
> >>>>>>>>          8:      0     1     2     3     0     1     2     3
> >>>>>>>>         16:      0     1     2     3     0     1     2     3
> >>>>>>>>         24:      0     1     2     3     0     1     2     3
> >>>>>>>>         32:      0     1     2     3     0     1     2     3
> >>>>>>>>         40:      0     1     2     3     0     1     2     3
> >>>>>>>>         48:      0     1     2     3     0     1     2     3
> >>>>>>>>         56:      0     1     2     3     0     1     2     3
> >>>>>>>>         64:      0     1     2     3     0     1     2     3
> >>>>>>>>         72:      0     1     2     3     0     1     2     3
> >>>>>>>>         80:      0     1     2     3     0     1     2     3
> >>>>>>>>         88:      0     1     2     3     0     1     2     3
> >>>>>>>>         96:      0     1     2     3     0     1     2     3
> >>>>>>>>        104:      0     1     2     3     0     1     2     3
> >>>>>>>>        112:      0     1     2     3     0     1     2     3
> >>>>>>>>        120:      0     1     2     3     0     1     2     3
> >>>>>>>> RSS hash key:
> >>>>>>>> Operation not supported
> >>>>>>>> RSS hash function:
> >>>>>>>>          toeplitz: on
> >>>>>>>>          xor: off
> >>>>>>>>          crc32: off
> >>>>>>>>
> >>>>>>>> In /proc/interrupts there are enp2s0f0-tx-0, enp2s0f0-rx-1,
> >>>>>>>> enp2s0f0-rx-2, enp2s0f0-rx-3, enp2s0f0-rx-4 interrupts, all on
> >>>>>>>> different CPU cores. Kernel also has "threadirqs" enabled in
> >>>>>>>> command line, I didn't check if this parameter affects the issue.
> >>>>>>>>
> >>>>>>>> Yes, some things start with 0, and others with 1, sorry for a confusion
> >>>>>>>> in terminology, what I meant:
> >>>>>>>>       - There are 4 RX rings/queues, I counted starting from 0, so: 0..3.
> >>>>>>>>         RX0 is the first queue/ring that actually receives the traffic.
> >>>>>>>>         RX0 is handled by enp2s0f0-rx-1 interrupt.
> >>>>>>>>       - These are related to (tp->napi[i]), but i is in 1..4, so the first
> >>>>>>>>         receiving queue relates to tp->napi[1], the second relates to
> >>>>>>>>         tp->napi[2], and so on. Correct?
> >>>>>>>>
> >>>>>>>> Suppose, tg3_rx() is called for tp->napi[2], this function most likely
> >>>>>>>> calls napi_gro_receive(&tnapi->napi, skb) to further process packets in
> >>>>>>>> tp->napi[2]. And, under some conditions (RSS and work_mask), it calls
> >>>>>>>> napi_schedule(&tp->napi[1].napi), which schedules tp->napi[1] work
> >>>>>>>> on the current CPU, which is designated for tp->napi[2], but not for
> >>>>>>>> tp->napi[1]. Correct?
> >>>>>>>>
> >>>>>>>> I don't understand what napi_schedule(&tp->napi[1].napi) does for the
> >>>>>>>> NIC or driver, "re-filling rx BD producer ring" sounds important. I
> >>>>>>>> suspect something will break badly if I simply remove it without
> >>>>>>>> replacing with something more elaborate. I guess along with re-filling
> >>>>>>>> rx BD producer ring it also can process incoming packets. Is it possible?
> >>>>>>>>
> >>>>>>>
> >>>>>>> Yes, napi[1] work may be called on the napi[2]'s CPU but it generally
> >>>>>>> won't process
> >>>>>>> any rx packets because the producer index of napi[1] has not changed. If the
> >>>>>>> producer count did change, then we get a poll from the ISR for napi[1]
> >>>>>>> to process
> >>>>>>> packets. So it is mostly used to re-fill rx buffers when called
> >>>>>>> explicitly. However
> >>>>>>> there could be a small window where the prod index is incremented but the ISR
> >>>>>>> is not fired yet. It may process some small no of packets. But I don't
> >>>>>>> think this
> >>>>>>> should lead to a reorder problem.
> >>>>>>>
> >>>>>>
> >>>>>> I tried to reproduce without using bridge and veth interfaces, and it seems
> >>>>>> like it's not reproducible, so traffic forwarding via a bridge interface may
> >>>>>> be necessary. It also does not happen if traffic load is low, but moderate
> >>>>>> load is enough - e.g. two 100 Mbps streams with 130-byte packets. It's easier
> >>>>>> to reproduce with a higher load.
> >>>>>>
> >>>>>> With about the same setup as in an original message (bridge + veth 2
> >>>>>> network namespaces), irqbalance daemon stopped, if traffic flows via
> >>>>>> enp2s0f0-rx-2 and enp2s0f0-rx-4, there's no reordering. enp2s0f0-rx-1
> >>>>>> still gets some interrupts, but at a much lower rate compared to 2 and
> >>>>>> 4.
> >>>>>>
> >>>>>> namespace 1:
> >>>>>>       # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
> >>>>>>       - - - - - - - - - - - - - - - - - - - - - - - - -
> >>>>>>       [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
> >>>>>>       [  4]   0.00-300.00 sec  6.72 GBytes   192 Mbits/sec  0.008 ms  3805/55508325 (0.0069%)
> >>>>>>       [  4] Sent 55508325 datagrams
> >>>>>>
> >>>>>>       iperf Done.
> >>>>>>
> >>>>>> namespace 2:
> >>>>>>       # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
> >>>>>>       - - - - - - - - - - - - - - - - - - - - - - - - -
> >>>>>>       [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
> >>>>>>       [  4]   0.00-300.00 sec  6.83 GBytes   196 Mbits/sec  0.005 ms  3873/56414001 (0.0069%)
> >>>>>>       [  4] Sent 56414001 datagrams
> >>>>>>
> >>>>>>       iperf Done.
> >>>>>>
> >>>>>>
> >>>>>> With the same configuration but different IP address so that instead of
> >>>>>> enp2s0f0-rx-4 enp2s0f0-rx-1 would be used, there is a reordering.
> >>>>>>
> >>>>>>
> >>>>>> namespace 1 (client IP was changed):
> >>>>>>       # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
> >>>>>>       - - - - - - - - - - - - - - - - - - - - - - - - -
> >>>>>>       [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
> >>>>>>       [  4]   0.00-300.00 sec  6.32 GBytes   181 Mbits/sec  0.007 ms  8506/52172059 (0.016%)
> >>>>>>       [  4] Sent 52172059 datagrams
> >>>>>>       [SUM]  0.0-300.0 sec  2452 datagrams received out-of-order
> >>>>>>
> >>>>>>       iperf Done.
> >>>>>>
> >>>>>> namespace 2:
> >>>>>>       # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
> >>>>>>       - - - - - - - - - - - - - - - - - - - - - - - - -
> >>>>>>       [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
> >>>>>>       [  4]   0.00-300.00 sec  6.59 GBytes   189 Mbits/sec  0.006 ms  6302/54463973 (0.012%)
> >>>>>>       [  4] Sent 54463973 datagrams
> >>>>>>
> >>>>>>       iperf Done.
> >>>>>>
> >>>>>> Swapping IP addresses in these namespaces also changes the namespace exhibiting the issue,
> >>>>>> it's following the IP address.
> >>>>>>
> >>>>>>
> >>>>>> Is there something I could check to confirm that this behavior is or is not
> >>>>>> related to napi_schedule(&tp->napi[1].napi) call?
> >>>>>
> >>>>> in the function tg3_msi_1shot() you could store the cpu assigned to
> >>>>> tnapi1 (inside the struct tg3_napi)
> >>>>> and then in tg3_poll_work() you can add another check after
> >>>>>            if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr)
> >>>>> something like
> >>>>> if (tnapi == &tp->napi[1] && tnapi->assigned_cpu == smp_processor_id())
> >>>>> only then execute tg3_rx()
> >>>>>
> >>>>> This may stop tnapi 1 from reading rx pkts on the current CPU from
> >>>>> which refill is called.
> >>>>>
> >>>>
> >>>> Didn't work for me, perhaps I did something wrong - if tg3_rx() is not called,
> >>>> there's an infinite loop, and after I added "work_done = budget;", it still doesn't
> >>>> work - traffic does not flow.
> >>>>
> >>>
> >>> I think the easiest way is to modify the tg3_rx() calling condition
> >>> like below inside
> >>> tg3_poll_work() :
> >>>
> >>> if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr) {
> >>>           if (tnapi != &tp->napi[1] || (tnapi == &tp->napi[1] &&
> >>> !tp->rx_refill)) {
> >>>                           work_done += tg3_rx(tnapi, budget - work_done);
> >>>           }
> >>> }
> >>>
> >>> This will prevent reading rx packets when napi[1] is scheduled only for refill.
> >>> Can you see if this works?
> >>>
> >>
> >> It doesn't hang and can receive the traffic with this change, but I don't see
> >> a difference. I'm suspecting that tg3_poll_work() is called again, maybe in tg3_poll_msix(),
> >> and refill happens first, and then packets are processed anyway.
> >>
> >
> > OK I see it now. Let me try this out myself. Will get back on this.
> > However, can you see with your debug prints if there is any correlation
> > between the time and number of prints where napi 1 is reading packets
> > on an unassigned CPU and the time and number of packets you received
> > out of order up the stack? Do they match with each other? If not, we may be
> > incorrectly suspecting napi1 here.
> >
>
> No correlation that I can see - reordered packets are received sometimes -
> 10000 in 300 seconds in this test, but napi messages are logged and
> rate-limited at about 100000 per second. If bandwidth is very low, then
> there are no messages and no reordering. Not sure if I can isolate these
> events specifically.

I'm facing the same issue: multicast packet reordering on traffic received by
tg3 and delivered to a macvlan. tcpdump on the NIC looks fine, while tcpdump on
the macvlan shows reordering.
I'm using Alma 8.6, and for me the only fix is to go down to 1 RX queue.
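
For reference, the workaround looks roughly like this (the interface name is
just an example taken from earlier in this thread, and which channel counts
can be changed this way depends on the driver):

    # check the current and maximum channel counts
    ethtool -l enp2s0f0
    # drop to a single RX queue
    ethtool -L enp2s0f0 rx 1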

Was there another email thread with more progress, or was there a fix
outside of tg3.c for this issue?
(Looking at the git log for tg3.c, I don't see anything relevant.)

Thanks
Etienne

>
> --
> Thanks
> Vitalii
>
>
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2022-09-21 19:05 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-20 13:29 tg3 RX packet re-order in queue 0 with RSS Vitaly Bursov
2021-09-22  6:40 ` Siva Reddy Kallam
2021-10-27  9:30   ` Pavan Chebbi
2021-10-27 10:31     ` Vitaly Bursov
2021-10-28  7:33       ` Pavan Chebbi
2021-10-28 15:41         ` Vitaly Bursov
2021-10-29  5:04           ` Pavan Chebbi
2021-10-29 15:45             ` Vitaly Bursov
2021-11-01  7:06               ` Pavan Chebbi
2021-11-01  8:20                 ` Vitaly Bursov
2021-11-01  9:10                   ` Pavan Chebbi
2021-11-01 10:17                     ` Vitaly Bursov
2022-09-21 19:04                       ` Etienne Champetier
