From: David Laight <David.Laight@ACULAB.COM>
To: 'Pavan Chebbi' <pavan.chebbi@broadcom.com>
Cc: Paolo Abeni <pabeni@redhat.com>,
	Michael Chan <michael.chan@broadcom.com>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"mchan@broadcom.com" <mchan@broadcom.com>,
	David Miller <davem@davemloft.net>
Subject: RE: tg3 dropping packets at high packet rates
Date: Mon, 23 May 2022 21:23:10 +0000
Message-ID: <71de7bfbb0854449bce509d67e9cf58c@AcuMS.aculab.com>
In-Reply-To: <CALs4sv13Y7CoMvrYm2c58vP6FKyK+_qrSp2UBCv0MURTAkv8hg@mail.gmail.com>

From: Pavan Chebbi
> Sent: 23 May 2022 17:15
> 
> From the register dump we can see that your device is 5720-A0.
> There is a patch that was applied long time ago:
> 4d95847381228639844c7197deb8b2211274ef22 ("tg3: Workaround rx_discards stat bug")
> 
> This patch could be making the rx_discards count not report the actual discards.
> You can see the actual discards if you monitor the register at 0x2250,
> which I think matches with
> your application report. Can you test this and confirm?

I'll revert that patch and see if the error counts then make sense.
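
I can also poll the raw registers while the test runs.  A minimal
userspace sketch - the PCI address and the assumption that the
register block is BAR0 are mine, 0x2250 and 0x3c48 are the offsets
you mention:

	/* Poll two tg3 registers by mmap()ing the device's first PCI BAR.
	 * Needs root; adjust the PCI address for the slot em2 sits in.
	 */
	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		const char *bar = "/sys/bus/pci/devices/0000:02:00.1/resource0";
		int fd = open(bar, O_RDONLY);
		if (fd < 0) { perror(bar); return 1; }

		volatile uint32_t *regs = mmap(NULL, 0x10000, PROT_READ,
					       MAP_SHARED, fd, 0);
		if (regs == MAP_FAILED) { perror("mmap"); return 1; }

		for (;;) {
			printf("rx_discards %u  flow_attn %08x\n",
			       regs[0x2250 / 4], regs[0x3c48 / 4]);
			sleep(1);
		}
	}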

> We can also see from your register dump shared earlier that the Flow
> Attention Register at 0x3C48 is indicating
> bit 6 set. That means the MBUF allocation state machine has reached
> mbuf low water threshold.
> In summary at this point it appears like the host cannot keep up with
> incoming traffic.

As the trace below shows, I think the underlying problem
is that the napi callbacks aren't being made in a timely manner.

Overall there is plenty of idle cpu time and the application workload
shouldn't really cause the kernel any problems.
My ftrace captures all the syscall entries and exits;
the only syscalls present are futex sleep/wakeup, epoll_wait and recvmsg.
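
For completeness, the trace setup was roughly the following - a
sketch written against the usual tracefs mount point, with the event
list reconstructed from what actually shows up in the trace:

	/* Enable the trace events and the function-tracer filter that
	 * produce the kind of output quoted below.
	 */
	#include <stdio.h>

	static void trace_write(const char *file, const char *val)
	{
		char path[128];
		snprintf(path, sizeof(path), "/sys/kernel/tracing/%s", file);
		FILE *f = fopen(path, "w");
		if (!f) { perror(path); return; }
		fputs(val, f);
		fclose(f);
	}

	int main(void)
	{
		trace_write("events/syscalls/enable", "1");
		trace_write("events/irq/enable", "1");
		trace_write("events/napi/enable", "1");
		trace_write("events/sched/sched_switch/enable", "1");
		/* function tracer limited to the napi scheduling paths */
		trace_write("set_ftrace_filter",
			    "napi_schedule_prep __napi_schedule");
		trace_write("current_tracer", "function");
		trace_write("tracing_on", "1");
		return 0;
	}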

From what I've seen before, the softirq and napi callback usually
run as soon as the hard interrupt handler returns.
In this case I think that happens, but when the napi callback for
ring 2 tries to schedule the callback for ring 0 (to move the
free buffers) something blocks all the napi callbacks (probably
all softirqs) on that cpu for several milliseconds.
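
To make the cross-scheduling concrete, the driver code path I mean
has roughly this shape (a paraphrase of the tg3 structure with
simplified names, not the literal source):

	/* Each rx ring has its own napi context, but only ring 0's poll
	 * moves consumed buffers back onto the shared 'free buffer'
	 * producer ring, so the other rings' polls have to kick it.
	 */
	static int tg3ish_rx_poll(struct napi_struct *napi, int budget)
	{
		struct rx_ring *r = container_of(napi, struct rx_ring, napi);
		int work = process_rx(r, budget);  /* up to 'budget' packets */

		if (r->index != 0 && r->bufs_to_return)
			napi_schedule(&r->tp->rings[0].napi);  /* refill kick */

		if (work < budget)
			napi_complete_done(napi, work);
		return work;
	}

If that refill napi_schedule() is deferred, the free-buffer ring
drains and the chip starts discarding, which would fit the discard
counts.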

This actually looks like a generic problem, not a tg3 one.

	David

> 
> 
> On Mon, May 23, 2022 at 9:31 PM David Laight <David.Laight@aculab.com> wrote:
> >
> > Ok, today's test is calling tracing_off() when the napi callback
> > has more than 1000 packets queued.
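
(The instrumentation there was just a one-line hook in the driver's
napi poll routine - paraphrased from memory rather than the actual
patch:

	/* Freeze the ftrace ring buffer as soon as one poll finds a
	 * huge backlog, so the interesting window survives.
	 */
	if (work_done > 1000)
		tracing_off();

tracing_off() being the in-kernel ftrace switch.)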
> >
> > This is rather informative.
> > Just taking the trace for cpu 32 and shortening the lines somewhat
> > gives the following.
> >
> > <idle>-0  6.594695: irq_handler_entry: irq=87 name=em2-rx-2
> > <idle>-0  6.594695: napi_schedule_prep <-tg3_msi_1shot
> > <idle>-0  6.594695: __napi_schedule <-tg3_msi_1shot
> > <idle>-0  6.594695: irq_handler_exit: irq=87 ret=handled
> > <idle>-0  6.594695: softirq_entry: vec=3 [action=NET_RX]
> > <idle>-0  6.594696: napi_schedule_prep <-tg3_rx
> > <idle>-0  6.594696: __napi_schedule <-tg3_rx
> > <idle>-0  6.594697: napi_poll: napi poll on napi struct 06e44eda for device em2 work 1 budget 64
> > <idle>-0  6.594697: softirq_exit: vec=3 [action=NET_RX]
> > <idle>-0  6.594697: softirq_entry: vec=3 [action=NET_RX]
> > <idle>-0  6.594698: napi_poll: napi poll on napi struct 909def03 for device em2 work 0 budget 64
> > <idle>-0  6.594698: softirq_exit: vec=3 [action=NET_RX]
> > <idle>-0  6.594700: irq_handler_entry: irq=87 name=em2-rx-2
> > <idle>-0  6.594701: napi_schedule_prep <-tg3_msi_1shot
> > <idle>-0  6.594701: __napi_schedule <-tg3_msi_1shot
> > <idle>-0  6.594701: irq_handler_exit: irq=87 ret=handled
> > <idle>-0  6.594701: softirq_entry: vec=3 [action=NET_RX]
> > <idle>-0  6.594704: napi_schedule_prep <-tg3_rx
> > <idle>-0  6.594704: __napi_schedule <-tg3_rx
> > <idle>-0  6.594705: napi_poll: napi poll on napi struct 06e44eda for device em2 work 3 budget 64
> > <idle>-0  6.594706: softirq_exit: vec=3 [action=NET_RX]
> > <idle>-0  6.594710: irq_handler_entry: irq=87 name=em2-rx-2
> > <idle>-0  6.594710: napi_schedule_prep <-tg3_msi_1shot
> > <idle>-0  6.594710: __napi_schedule <-tg3_msi_1shot
> > <idle>-0  6.594710: irq_handler_exit: irq=87 ret=handled
> > <idle>-0  6.594712: sched_switch: prev_pid=0 prev_prio=120 prev_state=R ==> next_pid=2275 next_prio=49
> > pid-2275  6.594720: sys_futex(uaddr: 7fbd2bfe3a88, op: 81, val: 1, utime: 1064720, uaddr2: 0, val3: 27bea)
> > pid-2275  6.594721: sys_futex -> 0x0
> > pid-2275  6.598067: sys_epoll_wait(epfd: 61, events: 7fbd2bfe3300, maxevents: 80, timeout: 0)
> > pid-2275  6.598747: sched_switch: prev_pid=2275 prev_prio=49 prev_state=S ==> next_pid=175 next_prio=120
> > pid-175   6.598759: sched_switch: prev_pid=175 prev_prio=120 prev_state=R ==> next_pid=819 next_prio=120
> > pid-819   6.598763: sched_switch: prev_pid=819 prev_prio=120 prev_state=I ==> next_pid=175 next_prio=120
> > pid-175   6.598765: softirq_entry: vec=3 [action=NET_RX]
> > pid-175   6.598770: __napi_schedule <-napi_complete_done
> > pid-175   6.598771: napi_poll: napi poll on napi struct 909def03 for device em2 work 0 budget 64
> >
> > The first few interrupts look fine.
> > Then there is a 4ms delay between the irq_handler calling napi_schedule()
> > and the NET_RX softirq function actually being called.
> >
> > Except that isn't the actual delay.
> > There are 3 relevant napi structures.
> > 06e44eda (a horrid hash, which confuses things) for the rx ring of irq=87.
> > xxxxxxxx (not in the above trace) for the other active rx ring.
> > 909def03 for em2-rx-1, which is also used to copy used rx descriptors
> > back to the 'free buffer' ring.
> >
> > So the normal sequence is the hard_irq schedules the 'napi' for that ring.
> > Max 64 packets are processed (RPS cross-schedules them).
> > Then the 'napi' for ring em2-rx-0 is scheduled.
> > It is this napi that is being reported complete.
> > Then the em2-rx-2 napi is called and finds 1000+ items on the status ring.
> >
> > During the 4ms gap em2-rx-4 is being interrupted and the napi is
> > processing receive packets - but the shared napi (909def03) isn't run.
> > (Earlier in the trace it gets scheduled by both ISRs.)
> >
> > pid 175 is ksoftirqd/32 and pid 819 is kworker/32:2.
> > pid 2275 is part of our application.
> > It is possible that it looped in userspace for a few ms.
> >
> > But the napi_schedule() at 6.594704 should have caused the
> > 909def03 napi poll to run immediately.
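
For anyone following along, napi_schedule() boils down to roughly
this (paraphrasing net/core/dev.c from memory):

	/* Queue the napi context on this cpu's poll list and raise
	 * NET_RX_SOFTIRQ; nothing runs until the softirq is serviced,
	 * which is exactly the gap visible in the trace above.
	 */
	static inline void ____napi_schedule(struct softnet_data *sd,
					     struct napi_struct *napi)
	{
		list_add_tail(&napi->poll_list, &sd->poll_list);
		__raise_softirq_irqoff(NET_RX_SOFTIRQ);
	}

So once the softirq has been raised, the delay has to come from
softirq processing being deferred (e.g. to ksoftirqd, pid 175 above)
rather than from anything in the driver.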
> >
> > I've got a lot more ftrace - but it is large and mostly boring.
> >
> >         David
> >

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

