netdev.vger.kernel.org archive mirror
* tg3 dropping packets at high packet rates
@ 2022-05-18 16:08 David Laight
  2022-05-18 17:27 ` Paolo Abeni
  0 siblings, 1 reply; 20+ messages in thread
From: David Laight @ 2022-05-18 16:08 UTC (permalink / raw)
  To: netdev; +Cc: 'mchan@broadcom.com', David Miller

I'm trying to see why the tg3 driver is dropping a lot of
receive packets.

(This driver is making my head hurt...)

I think that the rx_packets count (the sum of rx_[umb]cast_packets)
covers all the packets, but a smaller number are actually processed
by tg3_rx().
None of the error counts get increased, though.

It is almost as if it has lost almost all the receive buffers.

If I read /sys/class/net/em2/statistics/rx_packets every second
delaying with:
  syscall(SYS_clock_nanosleep, CLOCK_MONOTONIC, TIMER_ABSTIME, &ts, NULL);
about every 43 seconds I get a zero increment.
This really doesn't help!
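
For reference, the reader is essentially the following (a trimmed
sketch; error handling omitted):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Read rx_packets once a second on an absolute timebase so the
 * sampling period doesn't drift. */
int main(void)
{
        struct timespec ts;
        long long prev = 0;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        for (;;) {
                char buf[64];
                long long now;
                int fd = open("/sys/class/net/em2/statistics/rx_packets",
                              O_RDONLY);
                ssize_t len = read(fd, buf, sizeof(buf) - 1);

                close(fd);
                buf[len > 0 ? len : 0] = '\0';
                now = atoll(buf);
                printf("%lld\n", now - prev);
                prev = now;

                ts.tv_sec += 1;
                syscall(SYS_clock_nanosleep, CLOCK_MONOTONIC,
                        TIMER_ABSTIME, &ts, NULL);
        }
}
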
I've put a count into tg3_rx() that seems to match what IP/UDP
and the application see.

The traffic flow is pretty horrid (but could be worse).
There are 8000 small UDP packets every 20ms.
These are reasonably spread through the 20ms (not back to back).
All the destination ports are different (8000 receiving sockets).
(The receiving application handles this fine (now).)
The packets come from two different systems.

Firstly RSS doesn't seem to work very well.
With the current driver I think everything hits 2 rings.
With the 3.10 RHEL driver it all ends up in one.

Anyway after a hint from Eric I enabled RPS.
This offloads the IP and UDP processing enough to stop
any of the cpus (there are only 40 of them) from reporting even 50% busy.
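
(RPS here just means writing a cpu bitmask into each rx queue's
rps_cpus file in sysfs.  A minimal sketch - the queue number and mask
below are examples, not the exact values used:)

#include <stdio.h>

/* Write a cpu bitmask to one queue's rps_cpus file.
 * Queue number and mask are illustrative only. */
int enable_rps(const char *dev, int queue, const char *mask_hex)
{
        char path[128];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/class/net/%s/queues/rx-%d/rps_cpus", dev, queue);
        f = fopen(path, "w");
        if (!f)
                return -1;
        fprintf(f, "%s\n", mask_hex);
        return fclose(f);
}

int main(void)
{
        return enable_rps("em2", 0, "fffffe") ? 1 : 0;
}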

I've also increased the rx ring size to 2047.
Changing the coalescing parameters seems to have no effect.

I think there should be 2047 receive buffers.
So 4 interrupts every 20ms (8000 packets / 2047 buffers), or about
200/sec, might be enough to receive all the frames.
The actual interrupt rate (deltas on /proc/interrupts)
is actually over 80000/sec.
So it doesn't look as though the driver is ever processing
many packets/interrupt.
If the driver were getting behind I'd expect a smaller number
of interrupts.

This would be consistent with there only being (say) 8 active
receive buffers.

The device in question identifies as:

tg3 0000:02:00.0 eth0: Tigon3 [partno(BCM95720) rev 5720000] (PCI Express) MAC address xx
tg3 0000:02:00.0 eth0: attached PHY is 5720C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
tg3 0000:02:00.0 eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
tg3 0000:02:00.0 eth0: dma_rwctrl[00000001] dma_mask[64-bit]

Any idea where to look?

Or should I just use different ethernet hardware!
(Although the interrupt coalescing parameters for igb are
also completely broken for this traffic flow.)

	David


* Re: tg3 dropping packets at high packet rates
  2022-05-18 16:08 tg3 dropping packets at high packet rates David Laight
@ 2022-05-18 17:27 ` Paolo Abeni
  2022-05-18 21:31   ` David Laight
  0 siblings, 1 reply; 20+ messages in thread
From: Paolo Abeni @ 2022-05-18 17:27 UTC (permalink / raw)
  To: David Laight, netdev; +Cc: 'mchan@broadcom.com', David Miller

On Wed, 2022-05-18 at 16:08 +0000, David Laight wrote:
> I'm trying to see why the tg3 driver is dropping a lot of
> receive packets.
> 
> (This driver is making my head hurt...)
> 
> I think that the rx_packets count (sum of rx_[umb]cast_packets)
> is all the packets, but a smaller number are actually processed
> by the tg3_rx()
> But none of the error counts get increased.
> 
> It is almost as if it has lost almost all the receive buffers.
> 
> If I read /sys/class/net/em2/statistics/rx_packets every second
> delaying with:
>   syscall(SYS_clock_nanosleep, CLOCK_MONOTONIC, TIMER_ABSTIME, &ts, NULL);
> about every 43 seconds I get a zero increment.
> This really doesn't help!

It looks like the tg3 driver fetches the H/W stats once per second. I
guess that if you fetch them with the same period and you are unlucky
you can read the same sample two consecutive times (your 1-second
period and the driver's drift past each other, so occasionally two
consecutive reads land on the same stats snapshot).

> I've put a count into tg3_rx() that seems to match what IP/UDP
> and the application see.
> 
> The traffic flow is pretty horrid (but could be worse).
> There are 8000 small UDP packets every 20ms.
> These are reasonably spread through the 20ms (not back to back).
> All the destination ports are different (8000 receiving sockets).
> (The receiving application handles this fine (now).)
> The packets come from two different systems.
> 
> Firstly RSS doesn't seem to work very well.
> With the current driver I think everything hits 2 rings.
> With the 3.10 RHEL driver it all ends up in one.
> 
> Anyway after a hint from Eric I enabled RPS.
> This offloads the IP and UDP processing enough to stop
> any of the cpu (only 40 of them) from reporting even 50% busy.
> 
> I've also increased the rx ring size to 2047.
> Changing the coalescing parameters seems to have no effect.
> 
> I think there should be 2047 receive buffers.
> So 4 interrupts every 20ms or 200/sec might be enough
> to receive all the frames.
> The actual interrupt rate (deltas on /proc/interrupts)
> is actual over 80000/sec.
> So it doesn't look as though the driver is ever processing
> many packets/interrupt.
> If the driver were getting behind I'd expect a smaller number
> of interrupts.

With RPS enabled packet processing for most packets (the ones steered
to remote CPUs) is very cheap, as the skbs are moved out of the NIC
driver into a per-CPU queue and that's it.

In theory packets could be dropped before inserting them into the RPS
queue, if the latter grows too big, but that looks unlikely. You can try
raising netdev_max_backlog, just in case.


dropwatch (or perf record -ga -e skb:kfree_skb) should point you to
exactly where the packets are dropped.

Cheers,

Paolo



* RE: tg3 dropping packets at high packet rates
  2022-05-18 17:27 ` Paolo Abeni
@ 2022-05-18 21:31   ` David Laight
  2022-05-19  0:52     ` Michael Chan
  0 siblings, 1 reply; 20+ messages in thread
From: David Laight @ 2022-05-18 21:31 UTC (permalink / raw)
  To: 'Paolo Abeni', netdev; +Cc: 'mchan@broadcom.com', David Miller

From: Paolo Abeni
> Sent: 18 May 2022 18:27
....
> > If I read /sys/class/net/em2/statistics/rx_packets every second
> > delaying with:
> >   syscall(SYS_clock_nanosleep, CLOCK_MONOTONIC, TIMER_ABSTIME, &ts, NULL);
> > about every 43 seconds I get a zero increment.
> > This really doesn't help!
> 
> It looks like the tg3 driver fetches the H/W stats once per second. I
> guess that if you fetch them with the same period and you are unlucky
> you can read the same sample 2 consecutive time.

Actually I think the hardware is writing them to kernel memory
every second.
This really isn't ideal for packet counts.

...
> With RPS enabled packet processing for most packets (the ones steered
> to remote CPUs) is very cheap, as the skbs are moved out of the NIC
> driver into a per-CPU queue and that's it.

It may be 'cheap', but at 400000 frames/sec it adds up.
The processing in tg3 is relatively light - depending on
the actual hardware.
From what I remember, the e1000 driver is worse.

> In theory packets could be dropped before inserting them into the RPS
> queue, if the latter grows too big, but that looks unlikely. You can try
> raising netdev_max_backlog, just in case.

I'm pretty sure nothing is being dropped that late on.
The cpus processing the RPS data are maxing at around 13% 'softint'
on 5.18 - quite a lot more than the 10% with 3.10.

> dropwatch (or perf record -ga -e skb:kfree_skb) should point you where
> exactly the packets are dropped.

I'm 99.9% sure the packets aren't getting into skbs.
I can just about run very selective ftrace traces.
They don't seem to show anything being dropped.
But it is very difficult to see anything at these
packet rates.

	David


* Re: tg3 dropping packets at high packet rates
  2022-05-18 21:31   ` David Laight
@ 2022-05-19  0:52     ` Michael Chan
  2022-05-19  8:44       ` David Laight
  0 siblings, 1 reply; 20+ messages in thread
From: Michael Chan @ 2022-05-19  0:52 UTC (permalink / raw)
  To: David Laight; +Cc: Paolo Abeni, netdev, mchan, David Miller, Pavan Chebbi

On Wed, May 18, 2022 at 2:31 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Paolo Abeni
> > Sent: 18 May 2022 18:27
> ....
> > > If I read /sys/class/net/em2/statistics/rx_packets every second
> > > delaying with:
> > >   syscall(SYS_clock_nanosleep, CLOCK_MONOTONIC, TIMER_ABSTIME, &ts, NULL);
> > > about every 43 seconds I get a zero increment.
> > > This really doesn't help!
> >
> > It looks like the tg3 driver fetches the H/W stats once per second. I
> > guess that if you fetch them with the same period and you are unlucky
> > you can read the same sample 2 consecutive time.
>
> Actually I think the hardware is writing them to kernel memory
> every second.

On your BCM95720 chip, statistics are gathered by tg3_timer() once a
second.  Older chips will use DMA.

Please show a snapshot of all the counters.  In particular,
rxbds_empty, rx_discards, etc will show whether the driver is keeping
up with incoming RX packets or not.


* RE: tg3 dropping packets at high packet rates
  2022-05-19  0:52     ` Michael Chan
@ 2022-05-19  8:44       ` David Laight
  2022-05-19 10:20         ` Pavan Chebbi
  2022-05-22 23:22         ` Michael Chan
  0 siblings, 2 replies; 20+ messages in thread
From: David Laight @ 2022-05-19  8:44 UTC (permalink / raw)
  To: 'Michael Chan'
  Cc: Paolo Abeni, netdev, mchan, David Miller, Pavan Chebbi

From: Michael Chan
> Sent: 19 May 2022 01:52
> 
> On Wed, May 18, 2022 at 2:31 PM David Laight <David.Laight@aculab.com> wrote:
> >
> > From: Paolo Abeni
> > > Sent: 18 May 2022 18:27
> > ....
> > > > If I read /sys/class/net/em2/statistics/rx_packets every second
> > > > delaying with:
> > > >   syscall(SYS_clock_nanosleep, CLOCK_MONOTONIC, TIMER_ABSTIME, &ts, NULL);
> > > > about every 43 seconds I get a zero increment.
> > > > This really doesn't help!
> > >
> > > It looks like the tg3 driver fetches the H/W stats once per second. I
> > > guess that if you fetch them with the same period and you are unlucky
> > > you can read the same sample 2 consecutive time.
> >
> > Actually I think the hardware is writing them to kernel memory
> > every second.
> 
> On your BCM95720 chip, statistics are gathered by tg3_timer() once a
> second.  Older chips will use DMA.

Ah, I wasn't sure which code was relevant.
FWIW the code could rotate 64-bit values by 32 bits
to convert to/from the strange word ordering the hardware uses.
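
Something like the helper below is what I mean (a kernel-style sketch
only; tg3_stat_rol32 is a made-up name, not an existing driver
function):

        /* Swap the two 32-bit halves of a 64-bit counter - converts
         * between the hardware's word ordering and the natural
         * 64-bit layout. */
        static inline u64 tg3_stat_rol32(u64 v)
        {
                return (v << 32) | (v >> 32);
        }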

> Please show a snapshot of all the counters.  In particular,
> rxbds_empty, rx_discards, etc will show whether the driver is keeping
> up with incoming RX packets or not.

After running the test for a short time, the application stats
indicate that around 40000 packets are missing.

# ethtool -S em2 | grep -v ' 0$'; for f in /sys/class/net/em2/statistics/*; do echo $f $(cat $f); done|grep -v ' 0$'
NIC statistics:
     rx_octets: 4589028558
     rx_ucast_packets: 21049866
     rx_mcast_packets: 763
     rx_bcast_packets: 746
     tx_octets: 4344
     tx_ucast_packets: 6
     tx_mcast_packets: 40
     tx_bcast_packets: 3
     rxbds_empty: 76
     rx_discards: 14
     mbuf_lwm_thresh_hit: 14
/sys/class/net/em2/statistics/multicast 763
/sys/class/net/em2/statistics/rx_bytes 4589028558
/sys/class/net/em2/statistics/rx_missed_errors 14
/sys/class/net/em2/statistics/rx_packets 21433169
/sys/class/net/em2/statistics/tx_bytes 4344
/sys/class/net/em2/statistics/tx_packets 49

I've replaced the rx_packets count with an atomic64 counter in tg3_rx().
Reading every second gives values like:

# echo_every 1 |(c=0; n0=0; while read r; do n=$(cat /sys/class/net/em2/statistics/rx_packets); echo $c $((n - n0)); c=$((c+1)); n0=$n; done)
0 397169949
1 399831
2 399883
3 399913
4 399871
5 398747
6 400035
7 399958
8 399947
9 399923
10 399978
11 399457
12 399130
13 400128
14 399808
15 399029

They should all be about 400000, with only slight variance.
But there are clearly hundreds of packets being discarded in some
1-second periods.

I don't think I can blame the network.
All the systems are plugged into the same ethernet switch on a test LAN.

	David


* Re: tg3 dropping packets at high packet rates
  2022-05-19  8:44       ` David Laight
@ 2022-05-19 10:20         ` Pavan Chebbi
  2022-05-19 13:14           ` David Laight
  2022-05-22 23:22         ` Michael Chan
  1 sibling, 1 reply; 20+ messages in thread
From: Pavan Chebbi @ 2022-05-19 10:20 UTC (permalink / raw)
  To: David Laight; +Cc: Michael Chan, Paolo Abeni, netdev, mchan, David Miller

On Thu, May 19, 2022 at 2:14 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Michael Chan
> > Sent: 19 May 2022 01:52
> >
> > On Wed, May 18, 2022 at 2:31 PM David Laight <David.Laight@aculab.com> wrote:
> > >
> > > From: Paolo Abeni
> > > > Sent: 18 May 2022 18:27
> > > ....
> > > > > If I read /sys/class/net/em2/statistics/rx_packets every second
> > > > > delaying with:
> > > > >   syscall(SYS_clock_nanosleep, CLOCK_MONOTONIC, TIMER_ABSTIME, &ts, NULL);
> > > > > about every 43 seconds I get a zero increment.
> > > > > This really doesn't help!
> > > >
> > > > It looks like the tg3 driver fetches the H/W stats once per second. I
> > > > guess that if you fetch them with the same period and you are unlucky
> > > > you can read the same sample 2 consecutive time.
> > >
> > > Actually I think the hardware is writing them to kernel memory
> > > every second.
> >
> > On your BCM95720 chip, statistics are gathered by tg3_timer() once a
> > second.  Older chips will use DMA.
>
> Ah, I wasn't sure which code was relevant.
> FWIW the code could rotate 64bit values by 32 bits
> to convert to/from the strange ordering the hardware uses.
>
> > Please show a snapshot of all the counters.  In particular,
> > rxbds_empty, rx_discards, etc will show whether the driver is keeping
> > up with incoming RX packets or not.
>
> After running the test for a short time.
> The application stats indicate that around 40000 packets are missing.
>
> # ethtool -S em2 | grep -v ' 0$'; for f in /sys/class/net/em2/statistics/*; do echo $f $(cat $f); done|grep -v ' 0$'
> NIC statistics:
>      rx_octets: 4589028558
>      rx_ucast_packets: 21049866
>      rx_mcast_packets: 763
>      rx_bcast_packets: 746
>      tx_octets: 4344
>      tx_ucast_packets: 6
>      tx_mcast_packets: 40
>      tx_bcast_packets: 3
>      rxbds_empty: 76
>      rx_discards: 14
>      mbuf_lwm_thresh_hit: 14
> /sys/class/net/em2/statistics/multicast 763
> /sys/class/net/em2/statistics/rx_bytes 4589028558
> /sys/class/net/em2/statistics/rx_missed_errors 14
> /sys/class/net/em2/statistics/rx_packets 21433169
> /sys/class/net/em2/statistics/tx_bytes 4344
> /sys/class/net/em2/statistics/tx_packets 49
>
> I've replaced the rx_packets count with an atomic64 counter in tg3_rx().
> Reading every second gives values like:
>
> # echo_every 1 |(c=0; n0=0; while read r; do n=$(cat /sys/class/net/em2/statistics/rx_packets); echo $c $((n - n0)); c=$((c+1)); n0=$n; done)
> 0 397169949
> 1 399831
> 2 399883
> 3 399913
> 4 399871
> 5 398747
> 6 400035
> 7 399958
> 8 399947
> 9 399923
> 10 399978
> 11 399457
> 12 399130
> 13 400128
> 14 399808
> 15 399029
>

I see that in a span of 15 seconds, the received packets are 4362 fewer
than what you are expecting (considering a 400000/s average).
In what time period did the application report 40000 missing packets?
Does it map to about 150 seconds of test time?
The error counters do not look suspicious at this point for the
reported problem.
Do you see this problem with any other traffic pattern?

> They should all be 400000 with slight variances.
> But there are clearly 100s of packets being discarded in some
> 1 second periods.
>
> I don't think I can blame the network.
> All the systems are plugged into the same ethernet switch on a test LAN.
>
>         David
>


* RE: tg3 dropping packets at high packet rates
  2022-05-19 10:20         ` Pavan Chebbi
@ 2022-05-19 13:14           ` David Laight
  2022-05-19 13:29             ` Paolo Abeni
  0 siblings, 1 reply; 20+ messages in thread
From: David Laight @ 2022-05-19 13:14 UTC (permalink / raw)
  To: 'Pavan Chebbi'
  Cc: Michael Chan, Paolo Abeni, netdev, mchan, David Miller

From: Pavan Chebbi
> Sent: 19 May 2022 11:21
...
> >
> > > Please show a snapshot of all the counters.  In particular,
> > > rxbds_empty, rx_discards, etc will show whether the driver is keeping
> > > up with incoming RX packets or not.
> >
> > After running the test for a short time.
> > The application stats indicate that around 40000 packets are missing.
> >
...

Some numbers taken at the same time:

Application trace - each 'gap' is one or more lost packets.
T+000004:  all gaps so far 1104
T+000005:  all gaps so far 21664
T+000006:  all gaps so far 54644
T+000007:  all gaps so far 84641
T+000008:  all gaps so far 110232
T+000009:  all gaps so far 131191
T+000010:  all gaps so far 150286
T+000011:  all gaps so far 171588
T+000012:  all gaps so far 190777
T+000013:  all gaps so far 210771

rx_packets counted by tg3_rx() and read every second.
63 344426
64 341734
65 338740
66 337995
67 339770
68 336314
69 340087
70 345084

Cumulative error counts since the driver was last loaded.
     rxbds_empty: 30983
     rx_discards: 3123
     mbuf_lwm_thresh_hit: 3123

The number of interrupts is high - about 40000/sec.
(I've not delta'd these, just removed all the zeros and prefixed the
cpu number to each non-zero value.)
86: IR-PCI-MSI 1050625-edge em2-rx-1 8:13 14:234754517
86: IR-PCI-MSI 1050625-edge em2-rx-1 8:13 14:234767945
86: IR-PCI-MSI 1050625-edge em2-rx-1 8:13 14:234802555
86: IR-PCI-MSI 1050625-edge em2-rx-1 8:13 14:234843542
86: IR-PCI-MSI 1050625-edge em2-rx-1 8:13 14:234887963
86: IR-PCI-MSI 1050625-edge em2-rx-1 8:13 14:234928204
86: IR-PCI-MSI 1050625-edge em2-rx-1 8:13 14:234966428
86: IR-PCI-MSI 1050625-edge em2-rx-1 8:13 14:235009505
86: IR-PCI-MSI 1050625-edge em2-rx-1 8:13 14:235052740
86: IR-PCI-MSI 1050625-edge em2-rx-1 8:13 14:235093254
86: IR-PCI-MSI 1050625-edge em2-rx-1 8:13 14:235133299
86: IR-PCI-MSI 1050625-edge em2-rx-1 8:13 14:235173151
86: IR-PCI-MSI 1050625-edge em2-rx-1 8:13 14:235212387
86: IR-PCI-MSI 1050625-edge em2-rx-1 8:13 14:235252403
86: IR-PCI-MSI 1050625-edge em2-rx-1 8:13 14:235317928
86: IR-PCI-MSI 1050625-edge em2-rx-1 8:13 14:235371301

RSS is enabled, but I've used ethtool -X equal 1 to
put everything through ring 0.
Cpu 14 is still 25% idle - that is the busiest cpu.

I've discovered that the 'lost packet' rate does depend on
the number of rx buffers configured with 'ethtool -G em2 rx nnnn'.
The traces above are with 1000 rx buffers.

I'm also slightly confused about the receive buffers.
As I read the code the following happens:

Ignoring jumbo buffers - which I don't have configured.
AFAICT all the rings have 2048 entries.
With RSS there are 4 pairs of rings: one of each pair contains (free)
buffers, the other receive status entries.
The receive code processes an entry from the status ring
and puts a buffer back onto the corresponding buffers ring.
Since the hardware only takes buffers from one ring, the
driver moves all the free buffers from rings 1-3 onto ring 0.

When the rings are allocated I think that buffers (default 200)
are added to all 4 rings.
As soon as the 'napi' code for ring 0 runs it collects the
other 600 buffers and puts them on its own (free) buffer ring.
This seems to make all 800 buffers available for any of the RSS
channels.

Now if I configure 'ethtool -G em2 rx 2000' a total of 8000
receive buffers are allocated.
Only 2047 will fit into ring[0] so the other 'buffer' rings
still contain buffers.
Now if I receive traffic that goes to ring[3] the free buffer
ring[3] will wrap - discarding 2048 buffers.

I'm assuming I've missed something?
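
To spell out the failure mode I think I'm describing (a toy
illustration only, not the actual tg3 structures or code):

        #define RING_SIZE       2048
        #define RING_MASK       (RING_SIZE - 1)

        struct toy_ring {
                void            *buf[RING_SIZE];
                unsigned int    prod;   /* driver posts free buffers here */
                unsigned int    cons;   /* never advances for rings 1-3 */
        };

        /* Posting with only a masked index and no (prod - cons) bound:
         * once more than RING_SIZE buffers are pushed onto a ring that
         * nobody drains, the oldest entries are silently overwritten
         * and the buffers they referenced are effectively lost. */
        static void toy_post(struct toy_ring *r, void *b)
        {
                r->buf[r->prod++ & RING_MASK] = b;
        }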

This bit of code in tg3_rx() also looks buggy:

                if (unlikely(rx_std_posted >= tp->rx_std_max_post)) {
                        tpr->rx_std_prod_idx = std_prod_idx &
                                               tp->rx_std_ring_mask;
                        tw32_rx_mbox(TG3_RX_STD_PROD_IDX_REG,
                                     tpr->rx_std_prod_idx);
                        work_mask &= ~RXD_OPAQUE_RING_STD;
                        rx_std_posted = 0;
                }

Clearing work_mask stops napi[0] being run to move
the freed buffers across.
(I don't think I have the hardware that goes through that bit.)

	David


* Re: tg3 dropping packets at high packet rates
  2022-05-19 13:14           ` David Laight
@ 2022-05-19 13:29             ` Paolo Abeni
  2022-05-19 13:54               ` Andrew Lunn
  2022-05-19 14:11               ` David Laight
  0 siblings, 2 replies; 20+ messages in thread
From: Paolo Abeni @ 2022-05-19 13:29 UTC (permalink / raw)
  To: David Laight, 'Pavan Chebbi'
  Cc: Michael Chan, netdev, mchan, David Miller

On Thu, 2022-05-19 at 13:14 +0000, David Laight wrote:
> From: Pavan Chebbi
> > Sent: 19 May 2022 11:21
> ...
> > > 
> > > > Please show a snapshot of all the counters.  In particular,
> > > > rxbds_empty, rx_discards, etc will show whether the driver is keeping
> > > > up with incoming RX packets or not.
> > > 
> > > After running the test for a short time.
> > > The application stats indicate that around 40000 packets are missing.
> > > 
> ...
> 
> Some numbers taken at the same time:
> 
> Application trace - each 'gap' is one or more lost packets.
> T+000004:  all gaps so far 1104
> T+000005:  all gaps so far 21664
> T+000006:  all gaps so far 54644
> T+000007:  all gaps so far 84641
> T+000008:  all gaps so far 110232
> T+000009:  all gaps so far 131191
> T+000010:  all gaps so far 150286
> T+000011:  all gaps so far 171588
> T+000012:  all gaps so far 190777
> T+000013:  all gaps so far 210771
> 
> rx_packets counted by tg3_rx() and read every second.
> 63 344426
> 64 341734
> 65 338740
> 66 337995
> 67 339770
> 68 336314
> 69 340087
> 70 345084
> 
> Cumulative error counts since the driver was last loaded.
>      rxbds_empty: 30983
>      rx_discards: 3123
>      mbuf_lwm_thresh_hit: 3123
> 
> The number of interrupt is high - about 40000/sec.
> (I've not deltad these, just removed all the zeros and prefixed the
> cpu number before each non-zero value.)
> 86: IR-PCI-MSI 1050625-edge em2-rx-1 8:13 14:234754517
> 86: IR-PCI-MSI 1050625-edge em2-rx-1 8:13 14:234767945
> 86: IR-PCI-MSI 1050625-edge em2-rx-1 8:13 14:234802555
> 86: IR-PCI-MSI 1050625-edge em2-rx-1 8:13 14:234843542
> 86: IR-PCI-MSI 1050625-edge em2-rx-1 8:13 14:234887963
> 86: IR-PCI-MSI 1050625-edge em2-rx-1 8:13 14:234928204
> 86: IR-PCI-MSI 1050625-edge em2-rx-1 8:13 14:234966428
> 86: IR-PCI-MSI 1050625-edge em2-rx-1 8:13 14:235009505
> 86: IR-PCI-MSI 1050625-edge em2-rx-1 8:13 14:235052740
> 86: IR-PCI-MSI 1050625-edge em2-rx-1 8:13 14:235093254
> 86: IR-PCI-MSI 1050625-edge em2-rx-1 8:13 14:235133299
> 86: IR-PCI-MSI 1050625-edge em2-rx-1 8:13 14:235173151
> 86: IR-PCI-MSI 1050625-edge em2-rx-1 8:13 14:235212387
> 86: IR-PCI-MSI 1050625-edge em2-rx-1 8:13 14:235252403
> 86: IR-PCI-MSI 1050625-edge em2-rx-1 8:13 14:235317928
> 86: IR-PCI-MSI 1050625-edge em2-rx-1 8:13 14:235371301
> 
> RSS is enabled, but I've used ethtool -X equal 1 to
> put everything through ring 0.
> Cpu 14 is still 25% idle - that is the busiest cpu.

If the packet processing is 'bursty', you can have idle time and still
hit, now and then, the 'rx ring is [almost] full' condition. If pause frames
are enabled, that will cause the peer to stop sending frames: drops can
happen in the switch, and the local NIC will not notice (unless there
are counters available for pause frames sent).

AFAICS the packet processing is bursty, because enqueuing packets to a
remote CPU is considerably faster than full network stack processing.

Side note: on not-too-obsolete H/W the kernel should be able to
process >1Mpps per cpu.

Paolo



* Re: tg3 dropping packets at high packet rates
  2022-05-19 13:29             ` Paolo Abeni
@ 2022-05-19 13:54               ` Andrew Lunn
  2022-05-19 14:11               ` David Laight
  1 sibling, 0 replies; 20+ messages in thread
From: Andrew Lunn @ 2022-05-19 13:54 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: David Laight, 'Pavan Chebbi',
	Michael Chan, netdev, mchan, David Miller

> If the packet processing is 'bursty', you can have idle time and still
> hit, now and then, the 'rx ring is [almost] full' condition. If pause frames
> are enabled, that will cause the peer to stop sending frames: drops can
> happen in the switch, and the local NIC will not notice (unless there
> are counters available for pause frames sent).

You can control pause with ethtool

       ethtool -a|--show-pause devname

       ethtool -A|--pause devname [autoneg on|off] [rx on|off] [tx on|off]

But it depends on the driver implementing these options. So you might
want to try autoneg on, rx off, tx off, for testing.

ethtool should also show you what has been negotiated for pause, if
the driver has implemented that part of the ethtool API.

    Andrew


* RE: tg3 dropping packets at high packet rates
  2022-05-19 13:29             ` Paolo Abeni
  2022-05-19 13:54               ` Andrew Lunn
@ 2022-05-19 14:11               ` David Laight
  2022-05-19 14:35                 ` Pavan Chebbi
  1 sibling, 1 reply; 20+ messages in thread
From: David Laight @ 2022-05-19 14:11 UTC (permalink / raw)
  To: 'Paolo Abeni', 'Pavan Chebbi'
  Cc: Michael Chan, netdev, mchan, David Miller

From: Paolo Abeni
> Sent: 19 May 2022 14:29
....
> If the packet processing is 'bursty', you can have idle time and still
> hit, now and then, the 'rx ring is [almost] full' condition. If pause frames
> are enabled, that will cause the peer to stop sending frames: drops can
> happen in the switch, and the local NIC will not notice (unless there
> are counters available for pause frames sent).

The test program sending the data does spread it out.
So it isn't sending 2000 packets with minimal IPG every 10ms.
(I'm sending from 2 systems.)

I don't know if pause frames are enabled (ethtool might suggest they are).
But detecting whether they are sent is another matter.

In any case sending pause frames doesn't fix anything.
They are largely useless unless you have a cable
that directly connects two computers.

> AFAICS the packet processing is bursty, because enqueuing packets to a
> remote CPU is considerably faster than full network stack processing.

I have taken restricted ftrace traces of the receiving system.
I've not often seen more than 4 frames processed in one napi callback,
and certainly didn't spot blocks of 100+ that you might expect
to see if the driver code were the bottleneck.

> Side note: on not-too-obsolete H/W the kernel should be able to
> process >1Mpps per cpu.

Yes, and, IIRC, a 33MHz 486 could saturate 10Mbit ethernet with
small packets.

In this case the cpus are almost twiddling their thumbs.
  model name      : Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
  stepping        : 2
  microcode       : 0x43
  cpu MHz         : 1300.000
cpu 14 (the one taking the interrupts) is running at full speed.

The cpu doesn't seem to be the bottleneck.
The problem seems to be the hardware not using all the buffers
it has been given.

	David


* Re: tg3 dropping packets at high packet rates
  2022-05-19 14:11               ` David Laight
@ 2022-05-19 14:35                 ` Pavan Chebbi
  2022-05-19 14:42                   ` David Laight
  0 siblings, 1 reply; 20+ messages in thread
From: Pavan Chebbi @ 2022-05-19 14:35 UTC (permalink / raw)
  To: David Laight; +Cc: Paolo Abeni, Michael Chan, netdev, mchan, David Miller

On Thu, May 19, 2022 at 7:41 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Paolo Abeni
> > Sent: 19 May 2022 14:29
> ....
> > If the packet processing is 'bursty', you can have idle time and still
> > hit now and the 'rx ring is [almost] full' condition. If pause frames
> > are enabled, that will cause the peer to stop sending frames: drop can
> > happen in the switch, and the local NIC will not notice (unless there
> > are counters avaialble for pause frames sent).
>
> The test program sending the data does spread it out.
> So it isn't sending 2000 packets with minimal IPG every 10ms.
> (I'm sending from 2 systems.)
>
> I don't know if pause frames are enabled (ethtool might suggest they are).
> But detecting whether they are sent is another matter.
>
> In any case sending pause frames doesn't fix anything.
> They are largely entirely useless unless you have a cable
> that directly connects two computers.
>
> > AFAICS the packet processing is bursty, because enqueuing packets to a
> > remote CPU in considerably faster then full network stack processing.
>
> I have taken restricted ftrace traces of the receiving system.
> Not often seen more than 4 frames processed in one napi callback
> Certainly didn't spot blocks of 100+ that you might expect
> to see if the driver code was the bottleneck.
>
> > Side note: on a not-to-obsolete H/W the kernel should be able to
> > process >1mpps per cpu.
>
> Yes, and, IIRC, a 33Mhz 486 can saturate 10MHz ethernet with
> small packets.
>
> In this case the cpu are almost twiddling their thumbs.
>   model name      : Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
>   stepping        : 2
>   microcode       : 0x43
>   cpu MHz         : 1300.000
> cpu 14 (the one taking the interrupts) is running at full speed.
>
> cpu doesn't seem to be the bottleneck.
> The problem seems to be the hardware not using all the buffers
> it has been given.
>

When this happens, can you provide the register dump that you can
obtain using ethtool -d ?

>         David
>


* RE: tg3 dropping packets at high packet rates
  2022-05-19 14:35                 ` Pavan Chebbi
@ 2022-05-19 14:42                   ` David Laight
  2022-05-20 16:08                     ` David Laight
  0 siblings, 1 reply; 20+ messages in thread
From: David Laight @ 2022-05-19 14:42 UTC (permalink / raw)
  To: 'Pavan Chebbi'
  Cc: Paolo Abeni, Michael Chan, netdev, mchan, David Miller

From: Pavan Chebbi
> Sent: 19 May 2022 15:36
...
> When this happens, can you provide the register dump that you can
> obtain using ethtool -d ?

Two dumps a few seconds apart:

# (echo x x x x; ethtool -d em2)|xargs -n8
x x x x Offset Value ------ ----------
0x0000 0x165f14e4 0x0004 0x00100406 0x0008 0x02000000 0x000c 0x00800000
0x0010 0x91a0000c 0x0018 0x91a1000c 0x0020 0x91a2000c 0x002c 0x06391028
0x0030 0xfffc0000 0x0034 0x00000048 0x003c 0x00000206 0x0044 0x8f000000
0x0048 0xc8035001 0x004c 0x64002008 0x0050 0x818c5803 0x0054 0x78000000
0x0058 0x0086a005 0x0068 0xf0000298 0x006c 0x00380081 0x0070 0x000710b0
0x0074 0xfd8044ff 0x0080 0x165f14e4 0x0084 0x00000012 0x008c 0x00000484
0x009c 0x00000071 0x00a0 0x8010ac11 0x00a4 0x00000004 0x00a8 0x00001004
0x00ac 0x00020010 0x0204 0x8f000000 0x020c 0x28000000 0x0214 0x01000000
0x021c 0x01000000 0x0224 0x01000000 0x026c 0x0000008a 0x0284 0x000004a7
0x0304 0x0000009d 0x0400 0x18e04808 0x0404 0x00400000 0x0408 0x00001000
0x040c 0x00000c80 0x0410 0x00001866 0x0414 0xdaa3bd74 0x0418 0x00001866
0x041c 0xdaa3bd74 0x0420 0x00001866 0x0424 0xdaa3bd74 0x0428 0x00001866
0x042c 0xdaa3bd74 0x0430 0x00000400 0x0438 0x00000027 0x043c 0x000005f2
0x044c 0x084a3800 0x0450 0x00000001 0x0454 0x00008000 0x045c 0x00000102
0x0460 0x00000008 0x0464 0x00002620 0x0468 0x01ff0002 0x0470 0x80020000
0x047c 0x40000000 0x0480 0x42000000 0x0484 0x7fffffff 0x0488 0x06000004
0x048c 0x7fffffff 0x0500 0x00000008 0x0504 0x00000002 0x0590 0x00e00000
0x05b4 0x00000008 0x05c0 0x8bd00c98 0x05c4 0xaf1fd203 0x0600 0xffffffff
0x0604 0x00f80011 0x060c 0x00001f04 0x0610 0xfd96fc1d 0x0614 0x03800400
0x0618 0x07c01191 0x0620 0x00001840 0x0670 0xfbe8e1be 0x0674 0x0b19b642
0x0678 0x99253094 0x067c 0x8cd32315 0x0680 0xfdf2db46 0x0684 0x1e047b78
0x0688 0xb6720331 0x068c 0xcf31fe6e 0x0690 0x6d3ec110 0x0694 0xfd6c05b3
0x06c8 0x04000000 0x0804 0xffffffff 0x0814 0xffffffff 0x0828 0xffffffff
0x0834 0xffffffff 0x0838 0xffffffff 0x083c 0xffffffff 0x0840 0xffffffff
0x0844 0xffffffff 0x0848 0xffffffff 0x084c 0xffffffff 0x0850 0xffffffff
0x0854 0xffffffff 0x0858 0xffffffff 0x085c 0xffffffff 0x0860 0xffffffff
0x0864 0xffffffff 0x0868 0xffffffff 0x0880 0x00097d82 0x0884 0x000000fc
0x088c 0x00000b27 0x0c00 0x0000000a 0x0c08 0x00000003 0x0c0c 0x00000001
0x0c1c 0x002a0000 0x0c80 0x00000001 0x1000 0x00000002 0x1400 0x00000006
0x1440 0x0000009d 0x1444 0x0000009d 0x1448 0x0000009d 0x144c 0x0000009d
0x1450 0x0000009d 0x1454 0x0000009d 0x1458 0x0000009d 0x145c 0x0000009d
0x1460 0x0000009d 0x1464 0x0000009d 0x1468 0x0000009d 0x146c 0x0000009d
0x1470 0x0000009d 0x1474 0x0000009d 0x1478 0x0000009d 0x147c 0x0000009d
0x1800 0x00000016 0x1808 0x0000009d 0x183c 0x058c8000 0x1840 0x058c8000
0x1844 0x00000001 0x1c00 0x00000002 0x2000 0x00000002 0x2004 0x00000010
0x2010 0x00000181 0x2014 0x00000001 0x2018 0x00780003 0x2100 0x000c180e
0x2104 0x000c1e11 0x2110 0x000c301a 0x2114 0x000c301a 0x2120 0x000c4223
0x2124 0x000c4826 0x2130 0x000c542c 0x2134 0x000c5a2f 0x2140 0x000c6c38
0x2144 0x000c6c38 0x2150 0x000c7e41 0x2154 0x000c8444 0x2160 0x000c964d
0x2164 0x000c9c50 0x2170 0x000ca856 0x2174 0x000cae59 0x2180 0x000cba5f
0x2184 0x000cc062 0x2190 0x000cd26b 0x2194 0x000cd86e 0x21a0 0x000ce474
0x21a4 0x000cea77 0x21b0 0x000cf67d 0x21b4 0x000cfc80 0x21c0 0x000d0886
0x21c4 0x000d0e89 0x21d0 0x000d2092 0x21d4 0x000d2695 0x21e0 0x000d329b
0x21e4 0x000d389e 0x21f0 0x000d44a4 0x21f4 0x000d4aa7 0x2200 0x00234d39
0x2250 0x000055e8 0x2400 0x00010012 0x2408 0x00200001 0x2448 0x00000002
0x244c 0x00044400 0x2450 0x00000001 0x2454 0x06970000 0x2458 0x08001800
0x245c 0x00040000 0x2474 0x000006bb 0x2800 0x00000006 0x2c00 0x00000006
0x2c0c 0x0000074d 0x2c18 0x00000040 0x2c1c 0x0000000c 0x3000 0x00000006
0x300c 0x00000752 0x3800 0x00000001 0x3c00 0x00000306 0x3c0c 0x00000048
0x3c14 0x00000035 0x3c24 0x00000005 0x3c38 0x00000001 0x3c3c 0x09a14000
0x3c44 0x00000b00 0x3c54 0x000006eb 0x3c80 0x000006f4 0x3cc0 0x0000009d
0x3cd4 0x0000000f 0x3d80 0x00000014 0x3d88 0x00000005 0x3d90 0x00000005
0x3d98 0x00000014 0x3da0 0x00000005 0x3da8 0x00000005 0x3db0 0x00000014
0x3db8 0x00000005 0x3dc0 0x00000005 0x3dc8 0x00000014 0x3dd0 0x00000005
0x3dd8 0x00000005 0x4000 0x00000002 0x4008 0x00141d03 0x400c 0x00114653
0x4400 0x00000016 0x4408 0x00010000 0x440c 0x0000a000 0x4414 0x0000002a
0x4418 0x000000a0 0x4420 0x0000003d 0x444c 0x04310c45 0x4450 0x000e0339
0x4454 0x00c800cf 0x4800 0x380303fe 0x4c00 0x200003fe 0x5000 0x00009800
0x5004 0x80004000 0x501c 0x08001f88 0x5034 0x0000001d 0x5800 0x8f000000
0x5804 0x8f000000 0x5808 0x37000000 0x5810 0x01000000 0x5818 0x01000000
0x5820 0x01000000 0x5868 0x000000c1 0x586c 0x000000c1 0x5880 0x000004d9
0x5884 0x000004d9 0x5900 0x0000009d 0x5904 0x0000009d 0x6000 0x00010082
0x6800 0x141b0034 0x6804 0x20099082 0x6808 0x01069208 0x680c 0x5421cc3e
0x6810 0x21020000 0x6814 0xffffffff 0x6830 0xffffffff 0x6834 0xffffffff
0x6844 0x00000001 0x7000 0x00000008 0x700c 0x00004a4c 0x7010 0xf8ba8764
0x7014 0x01c080f3 0x7018 0x00d70081 0x701c 0x03008200

# (echo x x x x; ethtool -d em2)|xargs -n8
x x x x Offset Value ------ ----------
0x0000 0x165f14e4 0x0004 0x00100406 0x0008 0x02000000 0x000c 0x00800000
0x0010 0x91a0000c 0x0018 0x91a1000c 0x0020 0x91a2000c 0x002c 0x06391028
0x0030 0xfffc0000 0x0034 0x00000048 0x003c 0x00000206 0x0044 0x59000000
0x0048 0xc8035001 0x004c 0x64002008 0x0050 0x818c5803 0x0054 0x78000000
0x0058 0x0086a005 0x0068 0xf0000298 0x006c 0x00380081 0x0070 0x000710b0
0x0074 0xfd8044ff 0x0080 0x165f14e4 0x0084 0x00000012 0x008c 0x000005f0
0x009c 0x000001d8 0x00a0 0x8010ac11 0x00a4 0x00000004 0x00a8 0x00001004
0x00ac 0x00020010 0x0204 0x59000000 0x020c 0x33000000 0x0214 0x01000000
0x021c 0x01000000 0x0224 0x01000000 0x026c 0x000001d8 0x0284 0x000005f0
0x0304 0x0000009f 0x0400 0x18e04808 0x0404 0x00400000 0x0408 0x00001000
0x040c 0x00000880 0x0410 0x00001866 0x0414 0xdaa3bd74 0x0418 0x00001866
0x041c 0xdaa3bd74 0x0420 0x00001866 0x0424 0xdaa3bd74 0x0428 0x00001866
0x042c 0xdaa3bd74 0x0430 0x00000400 0x0438 0x0000006d 0x043c 0x000005f2
0x044c 0x084a3800 0x0450 0x00000001 0x0454 0x00008000 0x045c 0x00000102
0x0460 0x00000008 0x0464 0x00002620 0x0468 0x01ff0002 0x0470 0x80020000
0x047c 0x40000000 0x0480 0x42000000 0x0484 0x7fffffff 0x0488 0x06000004
0x048c 0x7fffffff 0x0500 0x00000008 0x0504 0x00000002 0x0590 0x00e00000
0x05b4 0x00000008 0x05c0 0x8bd00c98 0x05c4 0xaf1fd203 0x0600 0xffffffff
0x0604 0x00f80011 0x060c 0x00001f04 0x0610 0x0130a264 0x0614 0x10003600
0x0618 0x07c02f11 0x0620 0x00001840 0x0670 0xfbe8e1be 0x0674 0x0b19b642
0x0678 0x99253094 0x067c 0x8cd32315 0x0680 0xfdf2db46 0x0684 0x1e047b78
0x0688 0xb6720331 0x068c 0xcf31fe6e 0x0690 0x6d3ec110 0x0694 0xfd6c05b3
0x06c8 0x04000000 0x0804 0xffffffff 0x0814 0xffffffff 0x0828 0xffffffff
0x0834 0xffffffff 0x0838 0xffffffff 0x083c 0xffffffff 0x0840 0xffffffff
0x0844 0xffffffff 0x0848 0xffffffff 0x084c 0xffffffff 0x0850 0xffffffff
0x0854 0xffffffff 0x0858 0xffffffff 0x085c 0xffffffff 0x0860 0xffffffff
0x0864 0xffffffff 0x0868 0xffffffff 0x0880 0x03b3bdba 0x0884 0x000003f0
0x088c 0x000458f3 0x0890 0x00000001 0x0894 0x00000001 0x0c00 0x0000000a
0x0c08 0x00000003 0x0c0c 0x00000001 0x0c1c 0x002a0000 0x0c80 0x00000002
0x1000 0x00000002 0x1400 0x00000006 0x1440 0x0000009f 0x1444 0x0000009f
0x1448 0x0000009f 0x144c 0x0000009f 0x1450 0x0000009f 0x1454 0x0000009f
0x1458 0x0000009f 0x145c 0x0000009f 0x1460 0x0000009f 0x1464 0x0000009f
0x1468 0x0000009f 0x146c 0x0000009f 0x1470 0x0000009f 0x1474 0x0000009f
0x1478 0x0000009f 0x147c 0x0000009f 0x1800 0x00000016 0x1808 0x0000009f
0x183c 0x058c8000 0x1840 0x058c8000 0x1844 0x00000001 0x1c00 0x00000002
0x2000 0x00000002 0x2004 0x00000010 0x2010 0x00000181 0x2014 0x00000001
0x2018 0x00780003 0x2100 0x000c6a37 0x2104 0x000c703a 0x2110 0x000c8243
0x2114 0x000c8243 0x2120 0x000c8243 0x2124 0x000c8846 0x2130 0x000c944c
0x2134 0x000c9a4f 0x2140 0x000ca052 0x2144 0x000ca655 0x2150 0x000cb25b
0x2154 0x000cb25b 0x2160 0x000cbe61 0x2164 0x000cc464 0x2170 0x000cd06a
0x2174 0x000cd66d 0x2180 0x000ce273 0x2184 0x000ce273 0x2190 0x000cf47c
0x2194 0x000cf47c 0x21a0 0x000d0685 0x21a4 0x000d0c88 0x21b0 0x000d128b
0x21b4 0x000d128b 0x21c0 0x000d1e91 0x21c4 0x000d2494 0x21d0 0x000d2a97
0x21d4 0x000d2a97 0x21e0 0x000d3ca0 0x21e4 0x000d42a3 0x21f0 0x000d4ea9
0x21f4 0x000d54ac 0x2200 0x00a51132 0x224c 0x0000004a 0x2250 0x000087f6
0x2400 0x00010012 0x2408 0x00200001 0x2448 0x00000002 0x244c 0x00044400
0x2450 0x00000001 0x2454 0x06970000 0x2458 0x08001800 0x245c 0x00040000
0x2474 0x000007be 0x2800 0x00000006 0x2c00 0x00000006 0x2c0c 0x00000048
0x2c18 0x00000040 0x2c1c 0x0000000c 0x3000 0x00000006 0x300c 0x0000004b
0x3800 0x00000001 0x3c00 0x00000306 0x3c0c 0x00000048 0x3c14 0x00000035
0x3c24 0x00000005 0x3c38 0x00000001 0x3c3c 0x09a14000 0x3c44 0x00000b00
0x3c48 0x00000040 0x3c54 0x000007e3 0x3c80 0x000007e9 0x3cc0 0x0000009f
0x3cd4 0x0000000f 0x3d80 0x00000014 0x3d88 0x00000005 0x3d90 0x00000005
0x3d98 0x00000014 0x3da0 0x00000005 0x3da8 0x00000005 0x3db0 0x00000014
0x3db8 0x00000005 0x3dc0 0x00000005 0x3dc8 0x00000014 0x3dd0 0x00000005
0x3dd8 0x00000005 0x4000 0x00000002 0x4008 0x00141d03 0x400c 0x00114653
0x4400 0x00000016 0x4408 0x00010000 0x440c 0x0000a000 0x4414 0x0000002a
0x4418 0x000000a0 0x4420 0x0000003d 0x444c 0x04511447 0x4450 0x0008033c
0x4454 0x00310035 0x4800 0x380303fe 0x4c00 0x200003fe 0x5000 0x00009800
0x5004 0x80004000 0x501c 0x08001f88 0x5034 0x0000001d 0x5800 0x59000000
0x5804 0x59000000 0x5808 0x33000000 0x5810 0x01000000 0x5818 0x01000000
0x5820 0x01000000 0x5868 0x000001d8 0x586c 0x000001d8 0x5880 0x000005f0
0x5884 0x000005f0 0x5900 0x0000009f 0x5904 0x0000009f 0x6000 0x00010082
0x6800 0x141b0034 0x6804 0x20099082 0x6808 0x01069208 0x680c 0x56014dc5
0x6810 0x29020000 0x6814 0xffffffff 0x6830 0xffffffff 0x6834 0xffffffff
0x6844 0x00000001 0x7000 0x00000008 0x700c 0x00004a4c 0x7010 0xf8ba8764
0x7014 0x01c080f3 0x7018 0x00d70081 0x701c 0x03008200

	David


* RE: tg3 dropping packets at high packet rates
  2022-05-19 14:42                   ` David Laight
@ 2022-05-20 16:08                     ` David Laight
  2022-05-23 16:01                       ` David Laight
  0 siblings, 1 reply; 20+ messages in thread
From: David Laight @ 2022-05-20 16:08 UTC (permalink / raw)
  To: David Laight, 'Pavan Chebbi'
  Cc: Paolo Abeni, Michael Chan, netdev, mchan, David Miller

I've hijacked 'ethtool -d' to output some instrumentation.

Basically doing:
        tp->rx_avail[(hw_idx - sw_idx) & tp->rx_ret_ring_mask]++;
in both the hardware interrupt and napi callback.
(To different arrays.)
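
Roughly like this (a sketch; rx_avail_irq[] and rx_avail_napi[] are
arrays I've added to struct tg3 for the experiment, not existing
fields):

        /* Histogram of completed-but-unprocessed return descriptors,
         * indexed by that count.  One array is updated from the MSI
         * handler, the other from tg3_rx(), and both are dumped out
         * through the hijacked ethtool -d path. */
        static inline void tg3_count_rx_backlog(struct tg3 *tp, u32 hw_idx,
                                                u32 sw_idx, u32 *hist)
        {
                hist[(hw_idx - sw_idx) & tp->rx_ret_ring_mask]++;
        }

        /* irq:     tg3_count_rx_backlog(tp, hw_idx, sw_idx, tp->rx_avail_irq);  */
        /* tg3_rx:  tg3_count_rx_backlog(tp, hw_idx, sw_idx, tp->rx_avail_napi); */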

These tests were done after setting:
ethtool -G em2 rx 2047
ethtool -C em2 rx-usecs 1
ethtool -C em2 rx-frames 1
ethtool -C em2 rx-frames-irq 1
while the system was idle.

After running for a few seconds the number of pending packets
seen in the hardware interrupt isn't too bad:
# ethtool -d em2|grep '^0.[4-7]'|xargs -n8
0x4000 0x000e6004 0x4004 0x009c794f 0x4008 0x0024ccf0 0x400c 0x00037fb1
0x4010 0x00004ddd 0x4014 0x00001a05 0x4018 0x00000d92 0x401c 0x00000603
0x4020 0x000002a4 0x4024 0x000000f0 0x4028 0x00000061 0x402c 0x00000044
0x4030 0x0000001c 0x4034 0x00000014 0x4038 0x0000000c 0x403c 0x0000000b
0x4040 0x00000005 0x4044 0x00000001 0x4048 0x00000001 0x404c 0x00000004
0x4054 0x00000001 0x4090 0x00000001 0x4094 0x00000001 0x40a8 0x00000001
0x40c4 0x00000001 0x4138 0x00000001 0x4148 0x00000001 0x4154 0x00000001
0x4160 0x00000002 0x4178 0x00000001 0x4198 0x00000001 0x41a4 0x00000001
0x41c8 0x00000001 0x41cc 0x00000001 0x41d0 0x00000001 0x41d8 0x00000001
0x41e8 0x00000001 0x41f4 0x00000001 0x4200 0x00000001 0x4204 0x00000001
0x420c 0x00000001 0x4210 0x00000001 0x4218 0x00000001 0x4220 0x00000002
0x4224 0x00000004 0x4228 0x00000003 0x422c 0x00000002 0x4230 0x00000001
0x4234 0x00000001 0x4238 0x00000002 0x423c 0x00000002 0x4240 0x00000003
0x4244 0x00000004 0x4248 0x00000008 0x424c 0x00000008 0x4250 0x00000002
0x4254 0x00000008 0x4258 0x00000003 0x425c 0x00000004 0x4260 0x00000003
0x4264 0x0000000b 0x4268 0x00000006 0x426c 0x00000004 0x4270 0x00000004
0x4274 0x00000004 0x4278 0x00000003 0x427c 0x00000005 0x4280 0x00000003
0x4284 0x00000004 0x4288 0x00000005 0x428c 0x00000004 0x4290 0x00000003
0x4294 0x00000002 0x4298 0x00000003 0x429c 0x00000003 0x42a0 0x00000001
0x42a4 0x00000001 0x42a8 0x00000001 0x42ac 0x00000004 0x42b0 0x00000001
0x42b8 0x00000001 0x42c0 0x00000001 0x42c8 0x00000002 0x42d0 0x00000001
0x42d4 0x00000001 0x42e0 0x00000001 0x4304 0x00000001 0x4314 0x00000001
0x4340 0x00000001 0x439c 0x00000001 0x43b0 0x00000001 0x43dc 0x00000001
0x4410 0x00000001 0x44a0 0x00000001

In the napi callback (tg3_rx()) it is a different story:
# ethtool -d em2|grep '^0.[0-3]'|xargs -n8
0x0000 0x010778da 0x0004 0x011537a8 0x0008 0x004229f2 0x000c 0x000a583d
0x0010 0x0001dd9f 0x0014 0x0000afa6 0x0018 0x0000606d 0x001c 0x00003b80
0x0020 0x0000273f 0x0024 0x00001b72 0x0028 0x000013f9 0x002c 0x00000f15
0x0030 0x00000ce6 0x0034 0x00000a4e 0x0038 0x000008ac 0x003c 0x000007c5
0x0040 0x00000647 0x0044 0x0000059f 0x0048 0x0000053f 0x004c 0x00000486
0x0050 0x0000043a 0x0054 0x000003ad 0x0058 0x00000364 0x005c 0x0000034d
0x0060 0x00000305 0x0064 0x00000301 0x0068 0x000002cf 0x006c 0x00000297
0x0070 0x000002a3 0x0074 0x00000295 0x0078 0x0000027f 0x007c 0x0000028f
0x0080 0x00000278 0x0084 0x00000260 0x0088 0x00000251 0x008c 0x00000253
0x0090 0x0000024d 0x0094 0x0000023a 0x0098 0x00000245 0x009c 0x00000233
0x00a0 0x0000023f 0x00a4 0x00000207 0x00a8 0x0000022e 0x00ac 0x00000203
0x00b0 0x0000022e 0x00b4 0x00000212 0x00b8 0x00000203 0x00bc 0x0000020b
...
lots of numbers that go down slowly
...
0x0d00 0x0000012e 0x0d04 0x00000127 0x0d08 0x00000138 0x0d0c 0x00000125
0x0d10 0x00000116 0x0d14 0x00000117 0x0d18 0x00000120 0x0d1c 0x0000011f
0x0d20 0x00000135 0x0d24 0x00000120 0x0d28 0x00000111 0x0d2c 0x00000131
0x0d30 0x00000131 0x0d34 0x0000010e 0x0d38 0x00000110 0x0d3c 0x00000114
0x0d40 0x00000100 0x0d44 0x0000011b 0x0d48 0x00000113 0x0d4c 0x00000108
0x0d50 0x000000fe 0x0d54 0x00000103 0x0d58 0x000000f2 0x0d5c 0x00000124
0x0d60 0x000000fb 0x0d64 0x000000fe 0x0d68 0x000000f5 0x0d6c 0x00000114
0x0d70 0x000000f5 0x0d74 0x000000da 0x0d78 0x000000d8 0x0d7c 0x000000f8
0x0d80 0x000000db 0x0d84 0x000000e6 0x0d88 0x00000100 0x0d8c 0x000000d7
0x0d90 0x000000e3 0x0d94 0x000000d6 0x0d98 0x000000cf 0x0d9c 0x000000cf
0x0da0 0x000000d0 0x0da4 0x000000df 0x0da8 0x000000cf 0x0dac 0x000000bb
0x0db0 0x000000d1 0x0db4 0x000000ba 0x0db8 0x000000cc 0x0dbc 0x000000bc
0x0dc0 0x000000b5 0x0dc4 0x000000c5 0x0dc8 0x000000c3 0x0dcc 0x000000b7
0x0dd0 0x000000bd 0x0dd4 0x000000a1 0x0dd8 0x0000009c 0x0ddc 0x000000bc
0x0de0 0x000000ba 0x0de4 0x000000c5 0x0de8 0x0000009d 0x0dec 0x000000a0
0x0df0 0x0000009f 0x0df4 0x00000092 0x0df8 0x000000be 0x0dfc 0x00000089
0x0e00 0x0000007c 0x0e04 0x00000089 0x0e08 0x00000090 0x0e0c 0x0000008a
0x0e10 0x00000081 0x0e14 0x0000007d 0x0e18 0x00000085 0x0e1c 0x00000091
0x0e20 0x00000077 0x0e24 0x0000007e 0x0e28 0x00000076 0x0e2c 0x00000071
0x0e30 0x00000067 0x0e34 0x00000063 0x0e38 0x0000005d 0x0e3c 0x00000067
0x0e40 0x00000070 0x0e44 0x00000068 0x0e48 0x00000060 0x0e4c 0x00000071
0x0e50 0x0000005b 0x0e54 0x00000055 0x0e58 0x00000068 0x0e5c 0x0000004b
0x0e60 0x0000004f 0x0e64 0x00000052 0x0e68 0x00000053 0x0e6c 0x00000041
0x0e70 0x00000049 0x0e74 0x00000055 0x0e78 0x0000005a 0x0e7c 0x00000050
0x0e80 0x0000003f 0x0e84 0x0000003c 0x0e88 0x0000004d 0x0e8c 0x0000003d
0x0e90 0x0000003d 0x0e94 0x00000035 0x0e98 0x0000003a 0x0e9c 0x00000034
0x0ea0 0x00000036 0x0ea4 0x0000003f 0x0ea8 0x00000035 0x0eac 0x0000003d
0x0eb0 0x00000033 0x0eb4 0x0000003b 0x0eb8 0x00000032 0x0ebc 0x00000032
0x0ec0 0x00000027 0x0ec4 0x0000002c 0x0ec8 0x00000030 0x0ecc 0x00000038
0x0ed0 0x0000002e 0x0ed4 0x00000027 0x0ed8 0x00000029 0x0edc 0x00000021
0x0ee0 0x00000026 0x0ee4 0x0000002a 0x0ee8 0x0000002b 0x0eec 0x00000020
0x0ef0 0x0000001d 0x0ef4 0x0000001b 0x0ef8 0x00000028 0x0efc 0x00000028
0x0f00 0x0000001d 0x0f04 0x00000018 0x0f08 0x00000025 0x0f0c 0x00000022
0x0f10 0x00000018 0x0f14 0x00000021 0x0f18 0x0000001d 0x0f1c 0x00000017
0x0f20 0x00000019 0x0f24 0x0000001d 0x0f28 0x0000001d 0x0f2c 0x00000021
0x0f30 0x0000001c 0x0f34 0x00000013 0x0f38 0x00000010 0x0f3c 0x00000025
0x0f40 0x00000016 0x0f44 0x0000001c 0x0f48 0x00000015 0x0f4c 0x00000014
0x0f50 0x0000001c 0x0f54 0x00000010 0x0f58 0x0000001a 0x0f5c 0x00000012
0x0f60 0x00000013 0x0f64 0x00000016 0x0f68 0x00000018 0x0f6c 0x0000000e
0x0f70 0x0000000a 0x0f74 0x00000013 0x0f78 0x00000010 0x0f7c 0x00000019
0x0f80 0x00000019 0x0f84 0x00000014 0x0f88 0x00000015 0x0f8c 0x00000012
0x0f90 0x00000017 0x0f94 0x00000012 0x0f98 0x00000011 0x0f9c 0x00000015
0x0fa0 0x00000010 0x0fa4 0x00000019 0x0fa8 0x00000013 0x0fac 0x00000013
0x0fb0 0x00000015 0x0fb4 0x0000000e 0x0fb8 0x0000000d 0x0fbc 0x0000000e
0x0fc0 0x00000012 0x0fc4 0x00000011 0x0fc8 0x00000014 0x0fcc 0x00000011
0x0fd0 0x00000013 0x0fd4 0x0000000f 0x0fd8 0x0000000e 0x0fdc 0x0000000a
0x0fe0 0x0000000c 0x0fe4 0x0000000f 0x0fe8 0x0000000a 0x0fec 0x00000008
0x0ff0 0x0000000b 0x0ff4 0x00000011 0x0ff8 0x00000010 0x0ffc 0x0000000b
0x1000 0x0000000e 0x1004 0x0000000c 0x1008 0x0000000d 0x100c 0x0000000e
0x1010 0x00000007 0x1014 0x0000000a 0x1018 0x00000013 0x101c 0x00000013
0x1020 0x00000005 0x1024 0x00000007 0x1028 0x00000011 0x102c 0x00000017
0x1030 0x00000010 0x1034 0x00000009 0x1038 0x0000000b 0x103c 0x0000000f
0x1040 0x00000009 0x1044 0x0000000f 0x1048 0x0000000a 0x104c 0x00000009
0x1050 0x0000000c 0x1054 0x0000000b 0x1058 0x00000009 0x105c 0x00000007
0x1060 0x0000000b 0x1064 0x0000000b 0x1068 0x00000008 0x106c 0x00000006
0x1070 0x00000006 0x1074 0x00000010 0x1078 0x00000009 0x107c 0x00000006
0x1080 0x00000009 0x1084 0x00000002 0x1088 0x0000000b 0x108c 0x00000006
0x1090 0x0000000b 0x1094 0x00000009 0x1098 0x0000000b 0x109c 0x00000006
0x10a0 0x00000008 0x10a4 0x00000007 0x10a8 0x00000009 0x10ac 0x00000008
0x10b0 0x00000008 0x10b4 0x00000010 0x10b8 0x00000007 0x10bc 0x00000006
0x10c0 0x00000005 0x10c4 0x00000008 0x10c8 0x00000003 0x10cc 0x00000004
0x10d0 0x00000006 0x10d4 0x00000009 0x10d8 0x00000007 0x10dc 0x00000005
0x10e0 0x00000007 0x10e4 0x00000006 0x10e8 0x00000007 0x10ec 0x00000008
0x10f0 0x00000007 0x10f4 0x0000000c 0x10f8 0x00000006 0x10fc 0x00000008
0x1100 0x00000004 0x1104 0x0000000d 0x1108 0x00000008 0x110c 0x00000005
0x1110 0x00000008 0x1114 0x00000002 0x1118 0x00000007 0x111c 0x00000002
0x1120 0x0000000a 0x1124 0x00000007 0x1128 0x00000001 0x112c 0x00000006
0x1130 0x00000003 0x1134 0x00000002 0x1138 0x00000004 0x113c 0x00000005
0x1140 0x00000007 0x1144 0x00000003 0x1148 0x00000003 0x114c 0x00000002
0x1150 0x00000004 0x1154 0x00000006 0x1158 0x00000005 0x115c 0x00000004
0x1160 0x00000006 0x1164 0x00000003 0x1168 0x00000007 0x116c 0x00000005
0x1170 0x00000006 0x1174 0x00000008 0x1178 0x00000004 0x117c 0x00000005
0x1180 0x00000004 0x1184 0x00000006 0x1188 0x00000003 0x118c 0x00000008
0x1190 0x00000005 0x1194 0x00000004 0x1198 0x00000003 0x119c 0x00000004
0x11a0 0x00000002 0x11a4 0x00000003 0x11a8 0x00000003 0x11ac 0x00000005
0x11b0 0x00000006 0x11b4 0x00000007 0x11b8 0x00000002 0x11bc 0x00000003
0x11c0 0x00000005 0x11c4 0x00000003 0x11c8 0x00000004 0x11cc 0x00000001
0x11d0 0x00000002 0x11d4 0x00000002 0x11d8 0x00000004 0x11dc 0x00000001
0x11e0 0x00000003 0x11e4 0x00000001 0x11e8 0x00000008 0x11ec 0x00000004
0x11f0 0x00000002 0x11f4 0x00000002 0x11f8 0x00000003 0x11fc 0x00000003
0x1200 0x00000001 0x1204 0x00000008 0x1208 0x00000005 0x120c 0x00000003
0x1210 0x00000002 0x1214 0x00000002 0x1218 0x00000002 0x121c 0x00000004
0x1220 0x00000002 0x1224 0x00000002 0x1228 0x00000004 0x122c 0x00000003
0x1230 0x00000003 0x1234 0x00000001 0x1238 0x00000005 0x123c 0x00000004
0x1240 0x00000005 0x1244 0x00000002 0x1248 0x00000001 0x124c 0x00000004
0x1250 0x00000003 0x1254 0x00000003 0x1258 0x00000001 0x125c 0x00000004
0x1260 0x00000003 0x1264 0x00000001 0x1268 0x00000004 0x126c 0x00000002
0x1270 0x00000007 0x1274 0x00000003 0x1278 0x00000001 0x127c 0x00000001
0x1280 0x00000002 0x1284 0x00000004 0x1288 0x00000003 0x128c 0x00000002
0x1290 0x00000003 0x1294 0x00000004 0x1298 0x00000004 0x12a0 0x00000002
0x12a4 0x00000002 0x12a8 0x00000002 0x12ac 0x00000005 0x12b4 0x00000002
0x12b8 0x00000004 0x12bc 0x00000001 0x12c0 0x00000003 0x12c4 0x00000001
0x12cc 0x00000001 0x12d0 0x00000001 0x12d8 0x00000001 0x12dc 0x00000004
0x12e0 0x00000005 0x12e4 0x00000001 0x12e8 0x00000003 0x12ec 0x00000001
0x12f0 0x00000001 0x12f4 0x00000006 0x12f8 0x00000002 0x12fc 0x00000002
0x1300 0x00000004 0x1304 0x00000001 0x1308 0x00000001 0x130c 0x00000001
0x1310 0x00000001 0x1314 0x00000004 0x1318 0x00000002 0x1320 0x00000003
0x1324 0x00000002 0x132c 0x00000001 0x1330 0x00000002 0x1334 0x00000003
0x1338 0x00000003 0x1340 0x00000002 0x1344 0x00000002 0x1350 0x00000001
0x1354 0x00000002 0x1358 0x00000002 0x1360 0x00000003 0x1364 0x00000002
0x1368 0x00000002 0x136c 0x00000002 0x1370 0x00000003 0x1374 0x00000001
0x137c 0x00000002 0x1380 0x00000001 0x1384 0x00000002 0x1388 0x00000003
0x138c 0x00000001 0x1390 0x00000001 0x1394 0x00000001 0x13a0 0x00000002
0x13a4 0x00000001 0x13a8 0x00000002 0x13ac 0x00000001 0x13b8 0x00000002
0x13bc 0x00000001 0x13c4 0x00000004 0x13c8 0x00000005 0x13cc 0x00000001
0x13d4 0x00000002 0x13d8 0x00000001 0x13e0 0x00000004 0x13e8 0x00000002
0x13f0 0x00000001 0x13f4 0x00000001 0x13f8 0x00000002 0x13fc 0x00000001
0x1400 0x00000002 0x1408 0x00000001 0x1410 0x00000003 0x1418 0x00000001
0x1420 0x00000001 0x1424 0x00000002 0x1428 0x00000001 0x1434 0x00000001
0x143c 0x00000001 0x1440 0x00000003 0x1444 0x00000001 0x1448 0x00000001
0x1450 0x00000001 0x1458 0x00000001 0x1464 0x00000002 0x1468 0x00000001
0x146c 0x00000004 0x1474 0x00000001 0x1478 0x00000003 0x147c 0x00000002
0x148c 0x00000001 0x1490 0x00000001 0x1494 0x00000001 0x1498 0x00000001
0x14a0 0x00000001 0x14a4 0x00000002 0x14ac 0x00000002 0x14bc 0x00000003
0x14c4 0x00000001 0x14cc 0x00000001 0x14d0 0x00000002 0x14d4 0x00000001
0x14d8 0x00000001 0x14dc 0x00000001 0x14e8 0x00000002 0x14f0 0x00000002
0x1500 0x00000001 0x150c 0x00000001 0x1510 0x00000001 0x1514 0x00000002
0x151c 0x00000001 0x1520 0x00000001 0x1524 0x00000001 0x1528 0x00000001
0x152c 0x00000001 0x1530 0x00000001 0x153c 0x00000001 0x154c 0x00000001
0x155c 0x00000001 0x1560 0x00000003 0x1568 0x00000001 0x156c 0x00000001
0x1578 0x00000001 0x1580 0x00000001 0x1594 0x00000001 0x159c 0x00000001
0x15a0 0x00000001 0x15a4 0x00000001 0x15a8 0x00000001 0x15b4 0x00000002
0x15b8 0x00000001 0x15c0 0x00000002 0x15d0 0x00000001 0x15e0 0x00000002
0x15ec 0x00000001 0x160c 0x00000001 0x1610 0x00000001 0x1614 0x00000001
0x161c 0x00000001 0x1638 0x00000001 0x163c 0x00000001 0x1654 0x00000001
0x1658 0x00000001 0x1668 0x00000001 0x1670 0x00000001 0x1678 0x00000002
0x168c 0x00000002 0x1690 0x00000001 0x16a0 0x00000001 0x16b0 0x00000001
0x16c0 0x00000001 0x16e8 0x00000001 0x16ec 0x00000002 0x16f4 0x00000001
0x1710 0x00000001 0x1718 0x00000001 0x1730 0x00000001 0x1740 0x00000002
0x1748 0x00000001 0x175c 0x00000001 0x1760 0x00000002 0x178c 0x00000001
0x17ac 0x00000001 0x17d0 0x00000001 0x17d8 0x00000001 0x17e0 0x00000001
0x1810 0x00000001 0x1820 0x00000001 0x1824 0x00000001 0x1838 0x00000001
0x1850 0x00000001 0x1854 0x00000001 0x1858 0x00000001 0x1868 0x00000001
0x188c 0x00000001 0x1894 0x00000001 0x18cc 0x00000001 0x18d8 0x00000001
0x18f8 0x00000001 0x18fc 0x00000001 0x1900 0x00000001 0x1954 0x00000003
0x196c 0x00000001 0x19d8 0x00000001 0x19e4 0x00000003 0x1a50 0x00000001
0x1a64 0x00000001 0x1ab0 0x00000001 0x1ac8 0x00000001 0x1ae4 0x00000001
0x1b20 0x00000001 0x1b60 0x00000001 0x1b64 0x00000001 0x1be4 0x00000001
0x1c5c 0x00000001 0x1ce0 0x00000001

The last entry is for 1848 pending packets.

I've done some other measurements:

tg3_rx_prodring_xfer() very rarely copies more than 64 items or
finds a 'busy' slot (i.e. err set).

Processing 4 rx packets for every napi budget doesn't make any
difference.

It seems like there can be a significant delay between the hardware
interrupt and the napi/softint.

I'm not sure how to instrument that, though.
But it is likely to affect other ethernet drivers.
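
One possibility might be to timestamp the interrupt and take the
delta at the top of the poll callback, something like the sketch
below (untried; irq_ts and irq_to_napi_us would be new fields, they
don't exist in tg3 today):

        #include <linux/ktime.h>

        /* In the MSI handler, just before napi_schedule():
         *      tnapi->irq_ts = ktime_get();
         */

        /* At the top of the napi poll callback: bucket the hard-irq
         * to softirq delay in microseconds. */
        static inline void tg3_note_irq_to_napi(struct tg3 *tp,
                                                struct tg3_napi *tnapi)
        {
                s64 us = ktime_to_us(ktime_sub(ktime_get(), tnapi->irq_ts));

                if (us < 0)
                        us = 0;
                if (us > 1023)
                        us = 1023;
                tp->irq_to_napi_us[us]++;
        }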

	David


* Re: tg3 dropping packets at high packet rates
  2022-05-19  8:44       ` David Laight
  2022-05-19 10:20         ` Pavan Chebbi
@ 2022-05-22 23:22         ` Michael Chan
  1 sibling, 0 replies; 20+ messages in thread
From: Michael Chan @ 2022-05-22 23:22 UTC (permalink / raw)
  To: David Laight; +Cc: Paolo Abeni, netdev, mchan, David Miller, Pavan Chebbi

On Thu, May 19, 2022 at 1:44 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Michael Chan
> > Sent: 19 May 2022 01:52
> > Please show a snapshot of all the counters.  In particular,
> > rxbds_empty, rx_discards, etc will show whether the driver is keeping
> > up with incoming RX packets or not.
>
> After running the test for a short time.
> The application stats indicate that around 40000 packets are missing.
>
> # ethtool -S em2 | grep -v ' 0$'; for f in /sys/class/net/em2/statistics/*; do echo $f $(cat $f); done|grep -v ' 0$'
> NIC statistics:
>      rx_octets: 4589028558
>      rx_ucast_packets: 21049866
>      rx_mcast_packets: 763
>      rx_bcast_packets: 746
>      tx_octets: 4344
>      tx_ucast_packets: 6
>      tx_mcast_packets: 40
>      tx_bcast_packets: 3
>      rxbds_empty: 76

This means that the RX ring became empty (with no RX BDs) 76 times.
It doesn't necessarily mean packets were dropped yet because the chip
will continue to buffer using on-chip buffers.

>      rx_discards: 14

But this means that 14 packets were dropped because the on-chip buffers
were full.  I don't know why that does not match the 40000 missing
packets reported by the application.

>      mbuf_lwm_thresh_hit: 14

This means that the low on-chip buffer threshold was crossed 14 times.
This should trigger pause frames to be sent if flow control is
enabled.

> /sys/class/net/em2/statistics/multicast 763
> /sys/class/net/em2/statistics/rx_bytes 4589028558
> /sys/class/net/em2/statistics/rx_missed_errors 14
> /sys/class/net/em2/statistics/rx_packets 21433169
> /sys/class/net/em2/statistics/tx_bytes 4344
> /sys/class/net/em2/statistics/tx_packets 49
>

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4209 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: tg3 dropping packets at high packet rates
  2022-05-20 16:08                     ` David Laight
@ 2022-05-23 16:01                       ` David Laight
  2022-05-23 16:14                         ` Pavan Chebbi
  0 siblings, 1 reply; 20+ messages in thread
From: David Laight @ 2022-05-23 16:01 UTC (permalink / raw)
  To: 'Pavan Chebbi'
  Cc: Paolo Abeni, Michael Chan, netdev, mchan, David Miller

Ok todays test is calling tracing_off() when the napi callback
has more than 1000 packets queued.

This is rather informative.
Just taking the trace for cpu 32 and shortening the lines somewhat
gives the following.

<idle>-0  6.594695: irq_handler_entry: irq=87 name=em2-rx-2
<idle>-0  6.594695: napi_schedule_prep <-tg3_msi_1shot
<idle>-0  6.594695: __napi_schedule <-tg3_msi_1shot
<idle>-0  6.594695: irq_handler_exit: irq=87 ret=handled
<idle>-0  6.594695: softirq_entry: vec=3 [action=NET_RX]
<idle>-0  6.594696: napi_schedule_prep <-tg3_rx
<idle>-0  6.594696: __napi_schedule <-tg3_rx
<idle>-0  6.594697: napi_poll: napi poll on napi struct 06e44eda for device em2 work 1 budget 64
<idle>-0  6.594697: softirq_exit: vec=3 [action=NET_RX]
<idle>-0  6.594697: softirq_entry: vec=3 [action=NET_RX]
<idle>-0  6.594698: napi_poll: napi poll on napi struct 909def03 for device em2 work 0 budget 64
<idle>-0  6.594698: softirq_exit: vec=3 [action=NET_RX]
<idle>-0  6.594700: irq_handler_entry: irq=87 name=em2-rx-2
<idle>-0  6.594701: napi_schedule_prep <-tg3_msi_1shot
<idle>-0  6.594701: __napi_schedule <-tg3_msi_1shot
<idle>-0  6.594701: irq_handler_exit: irq=87 ret=handled
<idle>-0  6.594701: softirq_entry: vec=3 [action=NET_RX]
<idle>-0  6.594704: napi_schedule_prep <-tg3_rx
<idle>-0  6.594704: __napi_schedule <-tg3_rx
<idle>-0  6.594705: napi_poll: napi poll on napi struct 06e44eda for device em2 work 3 budget 64
<idle>-0  6.594706: softirq_exit: vec=3 [action=NET_RX]
<idle>-0  6.594710: irq_handler_entry: irq=87 name=em2-rx-2
<idle>-0  6.594710: napi_schedule_prep <-tg3_msi_1shot
<idle>-0  6.594710: __napi_schedule <-tg3_msi_1shot
<idle>-0  6.594710: irq_handler_exit: irq=87 ret=handled
<idle>-0  6.594712: sched_switch: prev_pid=0 prev_prio=120 prev_state=R ==> next_pid=2275 next_prio=49
pid-2275  6.594720: sys_futex(uaddr: 7fbd2bfe3a88, op: 81, val: 1, utime: 1064720, uaddr2: 0, val3: 27bea)
pid-2275  6.594721: sys_futex -> 0x0
pid-2275  6.598067: sys_epoll_wait(epfd: 61, events: 7fbd2bfe3300, maxevents: 80, timeout: 0)
pid-2275  6.598747: sched_switch: prev_pid=2275 prev_prio=49 prev_state=S ==> next_pid=175 next_prio=120
pid-175   6.598759: sched_switch: prev_pid=175 prev_prio=120 prev_state=R ==> next_pid=819 next_prio=120
pid-819   6.598763: sched_switch: prev_pid=819 prev_prio=120 prev_state=I ==> next_pid=175 next_prio=120
pid-175   6.598765: softirq_entry: vec=3 [action=NET_RX]
pid-175   6.598770: __napi_schedule <-napi_complete_done
pid-175   6.598771: napi_poll: napi poll on napi struct 909def03 for device em2 work 0 budget 64

The first few interrupts look fine.
Then there is a 4ms delay between the irq_handler calling napi_schedule()
and the NET_RX function actually being called.

Except that isn't the actual delay.
There are 3 relevant napi structures.
06e44eda (horrid hash confusing things) for the rx ring of irq=87.
xxxxxxxx (not in the above trace) for the other active rx ring.
909def03 for em2-rx-1 that is also used to copy used rx descriptors
back to the 'free buffer' ring.

So the normal sequence is the hard_irq schedules the 'napi' for that ring.
Max 64 packets are processed (RPS cross schedules them).
Then the 'napi' for ring em2-rx-0 is scheduled.
It is this napi that is being reported complete.
Then the em2-rx-2 napi is called and finds 1000+ items on the status ring.

During the 4ms gap em2-rx-4 is being interrupted and the napi is
processing receive packets - but the shared napi (909def03) isn't run.
(Earlier in the trace it gets scheduled by both ISRs.)

pid 175 is ksoftirqd/32 and 819 kworker/32:2.
pid 2275 is part of our application.
It is possible that it looped in userspace for a few ms.

But the napi_schedule() at 6.594704 should have called back as
909def03 immediately.

I've got a lot more ftrace - but it is large and mostly boring.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: tg3 dropping packets at high packet rates
  2022-05-23 16:01                       ` David Laight
@ 2022-05-23 16:14                         ` Pavan Chebbi
  2022-05-23 21:23                           ` David Laight
  0 siblings, 1 reply; 20+ messages in thread
From: Pavan Chebbi @ 2022-05-23 16:14 UTC (permalink / raw)
  To: David Laight; +Cc: Paolo Abeni, Michael Chan, netdev, mchan, David Miller

[-- Attachment #1: Type: text/plain, Size: 4904 bytes --]

From the register dump we can see that your device is a 5720-A0.
There is a patch that was applied a long time ago:
4d95847381228639844c7197deb8b2211274ef22 - tg3: Workaround rx_discards stat bug

This patch could be making the rx_discards count not reflect the
actual discards.  You can see the actual discards if you monitor the
register at 0x2250, which I think would match what your application
reports.  Can you test this and confirm?
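
Something like this (untested sketch, just to show what I mean),
dropped into tg3_timer() with tp in scope and using the usual tr32()
register read helper, would let you watch that register:

	{
		/* 0x2250 is the register offset mentioned above;
		 * experimental only, not meant for upstream.
		 */
		static u32 last_discards;
		u32 discards = tr32(0x2250);

		if (discards != last_discards) {
			netdev_info(tp->dev, "raw rx discards %u (+%u)\n",
				    discards, discards - last_discards);
			last_discards = discards;
		}
	}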

We can also see from the register dump you shared earlier that the Flow
Attention Register at 0x3C48 has bit 6 set.  That means the MBUF
allocation state machine has reached the mbuf low water threshold.
In summary, at this point it looks like the host cannot keep up with
the incoming traffic.


On Mon, May 23, 2022 at 9:31 PM David Laight <David.Laight@aculab.com> wrote:
>
> Ok todays test is calling tracing_off() when the napi callback
> has more than 1000 packets queued.
>
> This is rather informative.
> Just taking the trace for cpu 32 and shortening the lines somewhat
> gives the following.
>
> <idle>-0  6.594695: irq_handler_entry: irq=87 name=em2-rx-2
> <idle>-0  6.594695: napi_schedule_prep <-tg3_msi_1shot
> <idle>-0  6.594695: __napi_schedule <-tg3_msi_1shot
> <idle>-0  6.594695: irq_handler_exit: irq=87 ret=handled
> <idle>-0  6.594695: softirq_entry: vec=3 [action=NET_RX]
> <idle>-0  6.594696: napi_schedule_prep <-tg3_rx
> <idle>-0  6.594696: __napi_schedule <-tg3_rx
> <idle>-0  6.594697: napi_poll: napi poll on napi struct 06e44eda for device em2 work 1 budget 64
> <idle>-0  6.594697: softirq_exit: vec=3 [action=NET_RX]
> <idle>-0  6.594697: softirq_entry: vec=3 [action=NET_RX]
> <idle>-0  6.594698: napi_poll: napi poll on napi struct 909def03 for device em2 work 0 budget 64
> <idle>-0  6.594698: softirq_exit: vec=3 [action=NET_RX]
> <idle>-0  6.594700: irq_handler_entry: irq=87 name=em2-rx-2
> <idle>-0  6.594701: napi_schedule_prep <-tg3_msi_1shot
> <idle>-0  6.594701: __napi_schedule <-tg3_msi_1shot
> <idle>-0  6.594701: irq_handler_exit: irq=87 ret=handled
> <idle>-0  6.594701: softirq_entry: vec=3 [action=NET_RX]
> <idle>-0  6.594704: napi_schedule_prep <-tg3_rx
> <idle>-0  6.594704: __napi_schedule <-tg3_rx
> <idle>-0  6.594705: napi_poll: napi poll on napi struct 06e44eda for device em2 work 3 budget 64
> <idle>-0  6.594706: softirq_exit: vec=3 [action=NET_RX]
> <idle>-0  6.594710: irq_handler_entry: irq=87 name=em2-rx-2
> <idle>-0  6.594710: napi_schedule_prep <-tg3_msi_1shot
> <idle>-0  6.594710: __napi_schedule <-tg3_msi_1shot
> <idle>-0  6.594710: irq_handler_exit: irq=87 ret=handled
> <idle>-0  6.594712: sched_switch: prev_pid=0 prev_prio=120 prev_state=R ==> next_pid=2275 next_prio=49
> pid-2275  6.594720: sys_futex(uaddr: 7fbd2bfe3a88, op: 81, val: 1, utime: 1064720, uaddr2: 0, val3: 27bea)
> pid-2275  6.594721: sys_futex -> 0x0
> pid-2275  6.598067: sys_epoll_wait(epfd: 61, events: 7fbd2bfe3300, maxevents: 80, timeout: 0)
> pid-2275  6.598747: sched_switch: prev_pid=2275 prev_prio=49 prev_state=S ==> next_pid=175 next_prio=120
> pid-175   6.598759: sched_switch: prev_pid=175 prev_prio=120 prev_state=R ==> next_pid=819 next_prio=120
> pid-819   6.598763: sched_switch: prev_pid=819 prev_prio=120 prev_state=I ==> next_pid=175 next_prio=120
> pid-175   6.598765: softirq_entry: vec=3 [action=NET_RX]
> pid-175   6.598770: __napi_schedule <-napi_complete_done
> pid-175   6.598771: napi_poll: napi poll on napi struct 909def03 for device em2 work 0 budget 64
>
> The first few interrupts look fine.
> Then there is a 4ms delay between the irq_handler calling napi_schedule()
> and the NET_RX function actually being called.
>
> Except that isn't the actual delay.
> There are 3 relevant napi structures.
> 06e44eda (horrid hash confusing things) for the rx ring of irq=87.
> xxxxxxxx (not in the above trace) for the other active rx ring.
> 909def03 for em2-rx-1 that is also used to copy used rx descriptors
> back to the 'free buffer' ring.
>
> So the normal sequence is the hard_irq schedules the 'napi' for that ring.
> Max 64 packets are processed (RPS cross schedules them).
> Then the 'napi' for ring em2-rx-0 is scheduled.
> It is this napi that is being reported complete.
> Then the em2-rx-2 napi is called and finds 1000+ items on the status ring.
>
> During the 4ms gap em2-rx-4 is being interrupted and the napi is
> processing receive packets - but the shared napi (909def03) isn't run.
> (Earlier in the trace it gets scheduled by both ISR.)
>
> pid 175 is ksoftirqd/32 and 819 kworker/32:2.
> pid 2275 is part of our application.
> It is possible that it looped in userspace for a few ms.
>
> But the napi_schedule() at 6.594704 should have called back as
> 909def03 immediately.
>
> I've got a lot more ftrace - but it is large and mostly boring.
>
>         David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4209 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: tg3 dropping packets at high packet rates
  2022-05-23 16:14                         ` Pavan Chebbi
@ 2022-05-23 21:23                           ` David Laight
  2022-05-25  7:28                             ` David Laight
  0 siblings, 1 reply; 20+ messages in thread
From: David Laight @ 2022-05-23 21:23 UTC (permalink / raw)
  To: 'Pavan Chebbi'
  Cc: Paolo Abeni, Michael Chan, netdev, mchan, David Miller

From: Pavan Chebbi
> Sent: 23 May 2022 17:15
> 
> From the register dump we can see that your device is 5720-A0.
> There is a patch that was applied long time ago:
> 4d95847381228639844c7197deb8b2211274ef22- tg3: Workaround rx_discards stat bug
> 
> This patch could be making the rx_discards count not report the actual discards.
> You can see the actual discards if you monitor the register at 0x2250,
> which I think matches with
> your application report. Can you test this and confirm?

I'll revert that patch and see if the error counts then make sense.

> We can also see from your register dump shared earlier that the Flow
> Attention Register at 0x3C48 is indicating
> bit 6 set. That means the MBUF allocation state machine has reached
> mbuf low water threshold.
> In summary at this point it appears like the host cannot keep up with
> incoming traffic.

As the trace below shows I think the underlying problem
is that the napi callbacks aren't being made in a timely manner.

Overall there is plenty of idle cpu time and the application workload
shouldn't really cause any issues to the kernel.
My ftrace contains all the syscall entry and exit.
The only ones are futex sleep/wakeup, epoll_wait and recvmsg.

From what I've seen before, the softint and napi callback usually
happen when the hardint returns.
In this case I think that happens, but when the napi callback for
ring 2 tries to schedule the callback for ring 0 (to move the
free buffers), something blocks all the napi callbacks (probably
all softints) on that cpu for several milliseconds.

This actually looks like a generic problem, not a tg3 one.

	David

> 
> 
> On Mon, May 23, 2022 at 9:31 PM David Laight <David.Laight@aculab.com> wrote:
> >
> > Ok todays test is calling tracing_off() when the napi callback
> > has more than 1000 packets queued.
> >
> > This is rather informative.
> > Just taking the trace for cpu 32 and shortening the lines somewhat
> > gives the following.
> >
> > <idle>-0  6.594695: irq_handler_entry: irq=87 name=em2-rx-2
> > <idle>-0  6.594695: napi_schedule_prep <-tg3_msi_1shot
> > <idle>-0  6.594695: __napi_schedule <-tg3_msi_1shot
> > <idle>-0  6.594695: irq_handler_exit: irq=87 ret=handled
> > <idle>-0  6.594695: softirq_entry: vec=3 [action=NET_RX]
> > <idle>-0  6.594696: napi_schedule_prep <-tg3_rx
> > <idle>-0  6.594696: __napi_schedule <-tg3_rx
> > <idle>-0  6.594697: napi_poll: napi poll on napi struct 06e44eda for device em2 work 1 budget 64
> > <idle>-0  6.594697: softirq_exit: vec=3 [action=NET_RX]
> > <idle>-0  6.594697: softirq_entry: vec=3 [action=NET_RX]
> > <idle>-0  6.594698: napi_poll: napi poll on napi struct 909def03 for device em2 work 0 budget 64
> > <idle>-0  6.594698: softirq_exit: vec=3 [action=NET_RX]
> > <idle>-0  6.594700: irq_handler_entry: irq=87 name=em2-rx-2
> > <idle>-0  6.594701: napi_schedule_prep <-tg3_msi_1shot
> > <idle>-0  6.594701: __napi_schedule <-tg3_msi_1shot
> > <idle>-0  6.594701: irq_handler_exit: irq=87 ret=handled
> > <idle>-0  6.594701: softirq_entry: vec=3 [action=NET_RX]
> > <idle>-0  6.594704: napi_schedule_prep <-tg3_rx
> > <idle>-0  6.594704: __napi_schedule <-tg3_rx
> > <idle>-0  6.594705: napi_poll: napi poll on napi struct 06e44eda for device em2 work 3 budget 64
> > <idle>-0  6.594706: softirq_exit: vec=3 [action=NET_RX]
> > <idle>-0  6.594710: irq_handler_entry: irq=87 name=em2-rx-2
> > <idle>-0  6.594710: napi_schedule_prep <-tg3_msi_1shot
> > <idle>-0  6.594710: __napi_schedule <-tg3_msi_1shot
> > <idle>-0  6.594710: irq_handler_exit: irq=87 ret=handled
> > <idle>-0  6.594712: sched_switch: prev_pid=0 prev_prio=120 prev_state=R ==> next_pid=2275
> next_prio=49
> > pid-2275  6.594720: sys_futex(uaddr: 7fbd2bfe3a88, op: 81, val: 1, utime: 1064720, uaddr2: 0, val3:
> 27bea)
> > pid-2275  6.594721: sys_futex -> 0x0
> > pid-2275  6.598067: sys_epoll_wait(epfd: 61, events: 7fbd2bfe3300, maxevents: 80, timeout: 0)
> > pid-2275  6.598747: sched_switch: prev_pid=2275 prev_prio=49 prev_state=S ==> next_pid=175
> next_prio=120
> > pid-175   6.598759: sched_switch: prev_pid=175 prev_prio=120 prev_state=R ==> next_pid=819
> next_prio=120
> > pid-819   6.598763: sched_switch: prev_pid=819 prev_prio=120 prev_state=I ==> next_pid=175
> next_prio=120
> > pid-175   6.598765: softirq_entry: vec=3 [action=NET_RX]
> > pid-175   6.598770: __napi_schedule <-napi_complete_done
> > pid-175   6.598771: napi_poll: napi poll on napi struct 909def03 for device em2 work 0 budget 64
> >
> > The first few interrupts look fine.
> > Then there is a 4ms delay between the irq_handler calling napi_schedule()
> > and the NET_RX function actually being called.
> >
> > Except that isn't the actual delay.
> > There are 3 relevant napi structures.
> > 06e44eda (horrid hash confusing things) for the rx ring of irq=87.
> > xxxxxxxx (not in the above trace) for the other active rx ring.
> > 909def03 for em2-rx-1 that is also used to copy used rx descriptors
> > back to the 'free buffer' ring.
> >
> > So the normal sequence is the hard_irq schedules the 'napi' for that ring.
> > Max 64 packets are processed (RPS cross schedules them).
> > Then the 'napi' for ring em2-rx-0 is scheduled.
> > It is this napi that is being reported complete.
> > Then the em2-rx-2 napi is called and finds 1000+ items on the status ring.
> >
> > During the 4ms gap em2-rx-4 is being interrupted and the napi is
> > processing receive packets - but the shared napi (909def03) isn't run.
> > (Earlier in the trace it gets scheduled by both ISR.)
> >
> > pid 175 is ksoftirqd/32 and 819 kworker/32:2.
> > pid 2275 is part of our application.
> > It is possible that it looped in userspace for a few ms.
> >
> > But the napi_schedule() at 6.594704 should have called back as
> > 909def03 immediately.
> >
> > I've got a lot more ftrace - but it is large and mostly boring.
> >
> >         David
> >
> > -
> > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> > Registration No: 1397386 (Wales)

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: tg3 dropping packets at high packet rates
  2022-05-23 21:23                           ` David Laight
@ 2022-05-25  7:28                             ` David Laight
  2022-05-25 15:56                               ` Jakub Kicinski
  0 siblings, 1 reply; 20+ messages in thread
From: David Laight @ 2022-05-25  7:28 UTC (permalink / raw)
  To: David Laight, 'Pavan Chebbi'
  Cc: Paolo Abeni, Michael Chan, netdev, mchan, David Miller

From: David Laight
> Sent: 23 May 2022 22:23
...
> As the trace below shows I think the underlying problem
> is that the napi callbacks aren't being made in a timely manner.

Further investigations have shown that this is actually
a generic problem with the way napi callbacks are called
from the softint handler.

The underlying problem is the effect of this code
in __do_softirq().

        pending = local_softirq_pending();
        if (pending) {
                if (time_before(jiffies, end) && !need_resched() &&
                    --max_restart)
                        goto restart;

                wakeup_softirqd();
        }

The napi processing can loop through here and needs to do
the 'goto restart' - not doing so will drop packets.
The need_resched() test is particularly troublesome.
I've also had to increase the limit for 'max_restart' from
its (hard coded) 10 to 1000 (100 isn't enough).
I'm not sure whether I'm hitting the jiffies limit,
but that is hard coded at 2.
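
For reference, the hack is just changing the hard coded values at the
top of kernel/softirq.c (test value only, not a proposed fix):

/* kernel/softirq.c */
#define MAX_SOFTIRQ_TIME	msecs_to_jiffies(2)	/* unchanged - the 2 jiffy limit above */
#define MAX_SOFTIRQ_RESTART	1000			/* was 10; 100 wasn't enough */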

I'm going to start another thread.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: tg3 dropping packets at high packet rates
  2022-05-25  7:28                             ` David Laight
@ 2022-05-25 15:56                               ` Jakub Kicinski
  2022-05-25 21:48                                 ` David Laight
  0 siblings, 1 reply; 20+ messages in thread
From: Jakub Kicinski @ 2022-05-25 15:56 UTC (permalink / raw)
  To: David Laight
  Cc: 'Pavan Chebbi',
	Paolo Abeni, Michael Chan, netdev, mchan, David Miller

On Wed, 25 May 2022 07:28:42 +0000 David Laight wrote:
> > As the trace below shows I think the underlying problem
> > is that the napi callbacks aren't being made in a timely manner.  
> 
> Further investigations have shown that this is actually
> a generic problem with the way napi callbacks are called
> from the softint handler.
> 
> The underlying problem is the effect of this code
> in __do_softirq().
> 
>         pending = local_softirq_pending();
>         if (pending) {
>                 if (time_before(jiffies, end) && !need_resched() &&
>                     --max_restart)
>                         goto restart;
> 
>                 wakeup_softirqd();
>         }
> 
> The napi processing can loop through here and needs to do
> the 'goto restart' - not doing so will drop packets.
> The need_resched() test is particularly troublesome.
> I've also had to increase the limit for 'max_restart' from
> its (hard coded) 10 to 1000 (100 isn't enough).
> I'm not sure whether I'm hitting the jiffies limit,
> but that is hard coded at 2.
> 
> I'm going to start another thread.

If you share the core between the application and NAPI try using prefer
busy polling (SO_PREFER_BUSY_POLL), and manage polling from user space.
If you have separate cores use threaded NAPI and isolate the core
running NAPI or give it high prio.

YMMV, but I've spent more time than I'd like to admit looking at the
softirq yielding conditions; they are hard to beat :( If you control
the app, a much better use of your time is to arrange busy polling or
to pin things.
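
The socket side is roughly this (sketch only, values illustrative;
the prefer/budget options need a recent kernel and may be missing
from older uapi headers):

#include <sys/socket.h>

#ifndef SO_PREFER_BUSY_POLL
#define SO_PREFER_BUSY_POLL	69
#endif
#ifndef SO_BUSY_POLL_BUDGET
#define SO_BUSY_POLL_BUDGET	70
#endif

static int enable_busy_poll(int fd)
{
	int one = 1;
	int usecs = 100;	/* raising this may need CAP_NET_ADMIN */
	int budget = 64;

	if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usecs, sizeof(usecs)))
		return -1;
	if (setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL, &one, sizeof(one)))
		return -1;
	return setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET, &budget,
			  sizeof(budget));
}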

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: tg3 dropping packets at high packet rates
  2022-05-25 15:56                               ` Jakub Kicinski
@ 2022-05-25 21:48                                 ` David Laight
  0 siblings, 0 replies; 20+ messages in thread
From: David Laight @ 2022-05-25 21:48 UTC (permalink / raw)
  To: 'Jakub Kicinski'
  Cc: 'Pavan Chebbi',
	Paolo Abeni, Michael Chan, netdev, mchan, David Miller

From: Jakub Kicinski
> Sent: 25 May 2022 16:57
> 
> On Wed, 25 May 2022 07:28:42 +0000 David Laight wrote:
> > > As the trace below shows I think the underlying problem
> > > is that the napi callbacks aren't being made in a timely manner.
> >
> > Further investigations have shown that this is actually
> > a generic problem with the way napi callbacks are called
> > from the softint handler.
> >
> > The underlying problem is the effect of this code
> > in __do_softirq().
> >
> >         pending = local_softirq_pending();
> >         if (pending) {
> >                 if (time_before(jiffies, end) && !need_resched() &&
> >                     --max_restart)
> >                         goto restart;
> >
> >                 wakeup_softirqd();
> >         }
> >
> > The napi processing can loop through here and needs to do
> > the 'goto restart' - not doing so will drop packets.
> > The need_resched() test is particularly troublesome.
> > I've also had to increase the limit for 'max_restart' from
> > its (hard coded) 10 to 1000 (100 isn't enough).
> > I'm not sure whether I'm hitting the jiffies limit,
> > but that is hard coded at 2.
> >
> > I'm going to start another thread.
> 
> If you share the core between the application and NAPI try using prefer
> busy polling (SO_PREFER_BUSY_POLL), and manage polling from user space.
> If you have separate cores use threaded NAPI and isolate the core
> running NAPI or give it high prio.

The application is looking at 10000 UDP sockets, each of which
typically has 1 packet every 20ms (but there might be an extra
one).
About the only way to handle this is with an array of 100 epoll
fds, each of which has 100 UDP sockets.
Every 10ms (we do our RTP in 10ms epochs) each application thread
picks the next epoll fd (using atomic_in_ov()) and then reads all
the 'ready' sockets.
The application can't afford to take a mutex in any hot path
because mutex contention can happen while the process is 'stolen'
by a hardware interrupt and/or softint.
That then stalls all the waiting application threads.
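
For reference, the receive path is shaped roughly like this
(illustrative only, not the real code):

#include <stdatomic.h>
#include <sys/epoll.h>
#include <sys/socket.h>

#define NUM_EPOLL	100	/* each epoll fd holds ~100 UDP sockets */
#define MAX_EVENTS	128

static int epfds[NUM_EPOLL];
static atomic_uint next_ep;	/* lock-free distribution of epoll fds */

static void poll_next_epoll(void)
{
	struct epoll_event evs[MAX_EVENTS];
	unsigned int idx = atomic_fetch_add(&next_ep, 1) % NUM_EPOLL;
	int n = epoll_wait(epfds[idx], evs, MAX_EVENTS, 0);
	char buf[2048];

	for (int i = 0; i < n; i++) {
		int fd = evs[i].data.fd;

		/* drain everything queued on this socket */
		while (recv(fd, buf, sizeof(buf), MSG_DONTWAIT) > 0)
			;	/* ... hand the RTP packet on ... */
	}
}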

Even then I've got 35 application threads that call epoll_wait()
and recvfrom() and run at about 50% cpu.

The ethernet napi code is using about 50% of two cpu.
I'm using RPS to move the IP/UDP processing to other cpu.
(Manually avoiding the ones taking the hardware interrupts.)

> YMMV but I've spent more time than I'd like to admit looking at the
> softirq yielding conditions, they are hard to beat :(

I've spent a long time discovering it is one reason I'm losing
a lot of packets on a system with a reasonable amount of idle time.

> If you control
> the app much better use of your time to arrange busy poll or pin things.

Pinning things gets annoying.
I've been running the 'important' application threads under the
RT scheduler. This makes their cpu assignment very sticky.
So it is nearly as good as pinning, but the scheduler decides where they go.
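
Roughly this at thread start-up (the priority value is illustrative):

#include <pthread.h>
#include <sched.h>

/* Make the calling thread SCHED_RR; needs CAP_SYS_NICE or root. */
static int make_me_rt(int prio)
{
	struct sched_param sp = { .sched_priority = prio };

	return pthread_setschedparam(pthread_self(), SCHED_RR, &sp);
}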

This afternoon I tried using threaded napi for the ethernet interface.
(Suggested by Eric.)
This can only really work if the threads run under the RT scheduler.
I can't see an easy way to do this, apart from:
  (cd /proc; for pid in [1-9]*; do
	comm=$(cat $pid/comm);
	[ "${comm#napi/}" != "$comm" ] && chrt --pid 50 $pid;
  done;)
Since I was only running 35 RT application threads, the extra 5
napi ones make it exactly one per cpu, and AFAICT they all run
on separate cpus.

With threaded napi (and RPS) I'm only seeing 250 (or so)
busy ethernet ring entries in the napi code - not the 1000+
I was getting with the default __do_softirq() code.
That is similar to the effect of stopping the softint code from
falling back to its thread.
I'm still losing packets though (sometimes over 100/sec);
I'm not sure whether the hardware has run out of free ring buffers
or whether the switch is dropping some of them.

Apart from python (which needs pinning to a single cpu), I'm not
sure how much effect pinning a normal-priority process to a
cpu has - unless you also pin more general processes away from
that cpu.
Running under the RT scheduler (provided you don't create too
many RT processes) sort of gives each one its own cpu while
allowing other processes to use the cpu when it is idle.
Straightforward pinning of processes doesn't do that.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2022-05-25 21:48 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-18 16:08 tg3 dropping packets at high packet rates David Laight
2022-05-18 17:27 ` Paolo Abeni
2022-05-18 21:31   ` David Laight
2022-05-19  0:52     ` Michael Chan
2022-05-19  8:44       ` David Laight
2022-05-19 10:20         ` Pavan Chebbi
2022-05-19 13:14           ` David Laight
2022-05-19 13:29             ` Paolo Abeni
2022-05-19 13:54               ` Andrew Lunn
2022-05-19 14:11               ` David Laight
2022-05-19 14:35                 ` Pavan Chebbi
2022-05-19 14:42                   ` David Laight
2022-05-20 16:08                     ` David Laight
2022-05-23 16:01                       ` David Laight
2022-05-23 16:14                         ` Pavan Chebbi
2022-05-23 21:23                           ` David Laight
2022-05-25  7:28                             ` David Laight
2022-05-25 15:56                               ` Jakub Kicinski
2022-05-25 21:48                                 ` David Laight
2022-05-22 23:22         ` Michael Chan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).