From: David Laight <David.Laight@ACULAB.COM>
To: 'Jakub Kicinski' <kuba@kernel.org>
Cc: 'Pavan Chebbi' <pavan.chebbi@broadcom.com>,
	Paolo Abeni <pabeni@redhat.com>,
	Michael Chan <michael.chan@broadcom.com>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"mchan@broadcom.com" <mchan@broadcom.com>,
	David Miller <davem@davemloft.net>
Subject: RE: tg3 dropping packets at high packet rates
Date: Wed, 25 May 2022 21:48:15 +0000	[thread overview]
Message-ID: <4ff290b53a3945669f2057eddd8441b2@AcuMS.aculab.com> (raw)
In-Reply-To: <20220525085647.6dfb7ed0@kernel.org>

From: Jakub Kicinski
> Sent: 25 May 2022 16:57
> 
> On Wed, 25 May 2022 07:28:42 +0000 David Laight wrote:
> > > As the trace below shows I think the underlying problem
> > > is that the napi callbacks aren't being made in a timely manner.
> >
> > Further investigations have shown that this is actually
> > a generic problem with the way napi callbacks are called
> > from the softint handler.
> >
> > The underlying problem is the effect of this code
> > in __do_softirq().
> >
> >         pending = local_softirq_pending();
> >         if (pending) {
> >                 if (time_before(jiffies, end) && !need_resched() &&
> >                     --max_restart)
> >                         goto restart;
> >
> >                 wakeup_softirqd();
> >         }
> >
> > The napi processing can loop through here and needs to do
> > the 'goto restart' - not doing so will drop packets.
> > The need_resched() test is particularly troublesome.
> > I've also had to increase the limit for 'max_restart' from
> > its (hard coded) 10 to 1000 (100 isn't enough).
> > I'm not sure whether I'm hitting the jiffies limit,
> > but that is hard coded at 2.
> >
> > I'm going to start another thread.
> 
> If you share the core between the application and NAPI try using prefer
> busy polling (SO_PREFER_BUSY_POLL), and manage polling from user space.
> If you have separate cores use threaded NAPI and isolate the core
> running NAPI or give it high prio.

The application is looking at 10000 UDP sockets, each of which
typically has 1 packet every 20ms (but there might be an extra
one).
About the only way to handle this is with an array of 100 epoll
fds, each of which has 100 UDP sockets.
Every 10ms (we do our RTP in 10ms epochs) each application thread
picks the next epoll fd (using atomic_in_ov()) and then reads all
the 'ready' sockets.
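
Roughly, in code (a minimal sketch - socket setup and error
handling are elided, and a plain C11 atomic_fetch_add() stands in
for our atomic_in_ov()):

  #include <stdatomic.h>
  #include <sys/epoll.h>
  #include <sys/socket.h>

  #define NUM_EPFDS  100	/* 100 epoll fds ... */
  #define MAX_EVENTS 128

  static int epfds[NUM_EPFDS];	/* ... each holding 100 UDP sockets */
  static atomic_uint next_epfd;	/* shared round-robin counter */

  /* Called by an application thread once per 10ms epoch. */
  static void poll_next_epfd(void)
  {
	struct epoll_event ev[MAX_EVENTS];
	unsigned char buf[2048];
	int epfd, n, i;

	/* Atomically claim the next epoll fd; unsigned wrap is safe. */
	epfd = epfds[atomic_fetch_add(&next_epfd, 1) % NUM_EPFDS];

	/* Timeout 0: just collect whatever is ready right now. */
	n = epoll_wait(epfd, ev, MAX_EVENTS, 0);
	for (i = 0; i < n; i++) {
		/* Drain the socket - usually one datagram, sometimes two. */
		while (recvfrom(ev[i].data.fd, buf, sizeof(buf),
				MSG_DONTWAIT, NULL, NULL) > 0)
			/* process the RTP packet */;
	}
  }
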
The application can't afford to take a mutex in any hot path
because mutex contention can happen while the owning process is
'stolen' by a hardware interrupt and/or softint.
That then stalls all the waiting application threads.

Even then I've got 35 application threads that call epoll_wait()
and recvfrom() and run at about 50% cpu.

The ethernet napi code is using about 50% of two cpus.
I'm using RPS to move the IP/UDP processing to other cpus
(manually avoiding the ones taking the hardware interrupts).
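
For reference, RPS is just a per-rx-queue cpumask in sysfs - a
sketch (the interface name, queue number and cpu mask below are
examples only, not what I'm actually running):

  #include <stdio.h>

  /* Write an RPS cpumask for one rx queue, e.g.
   * set_rps_mask("eth0", 0, "3c") steers IP/UDP receive processing
   * for eth0 rx-0 onto cpus 2-5.  All the values are examples. */
  static int set_rps_mask(const char *dev, int queue, const char *mask)
  {
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/class/net/%s/queues/rx-%d/rps_cpus", dev, queue);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%s\n", mask);
	return fclose(f);
  }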

> YMMV but I've spent more time than I'd like to admit looking at the
> softirq yielding conditions, they are hard to beat :(

I've spent a long time discovering that it is one reason I'm losing
a lot of packets on a system with a reasonable amount of idle time.

> If you control
> the app much better use of your time to arrange busy poll or pin things.

Pinning things gets annoying.
I've been running the 'important' application threads under the
RT scheduler. This makes their cpu assignment very sticky.
So it is nearly as good as pinning, but the scheduler decides
where they go.
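
For the application threads that's a one-liner at thread start
(sketch; the priority value is just an example):

  #include <pthread.h>
  #include <sched.h>

  /* Put the calling thread under SCHED_RR.  The scheduler then
   * keeps it on the same cpu unless it really has to move it -
   * the 'sticky' behaviour described above. */
  static int make_thread_rt(int prio)
  {
	struct sched_param sp = { .sched_priority = prio };

	return pthread_setschedparam(pthread_self(), SCHED_RR, &sp);
  }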

This afternoon I tried using threaded napi for the ethernet interface.
(Suggested by Eric; it gets enabled by writing 1 to
/sys/class/net/<dev>/threaded.)
This can only really work if the napi threads run under the RT
scheduler. I can't see an easy way to do that, apart from:
  # Give every napi/<dev>-<n> kernel thread SCHED_RR priority 50.
  (cd /proc; for pid in [1-9]*; do
	comm=$(cat "$pid/comm" 2>/dev/null);
	[ "${comm#napi/}" != "$comm" ] && chrt --pid 50 "$pid";
  done)
Since I was only running 35 RT application threads, the extra 5
napi threads make it exactly one RT thread for each cpu, and
AFAICT they all run on separate cpus.

With threaded napi (and RPS) I'm only seeing 250 (or so) busy
ethernet ring entries in the napi code - not the 1000+ I was
getting with the default __do_softirq() code.
That is similar to what I got by stopping the softint code from
falling back to its thread.
I'm still losing packets though (sometimes over 100/sec); I'm not
sure whether the hardware has run out of free ring buffers or
whether the switch is dropping some of them.

Apart from Python (which needs pinning to a single cpu) I'm not
sure how much effect pinning a normal-priority process to a
cpu has - unless you also pin more general processes away from
that cpu.
Running under the RT scheduler (provided you don't create too
many RT processes) sort of gives each one its own cpu while
allowing other processes to use the cpu when it is idle.
Straightforward pinning of processes doesn't do that.
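
For comparison, hard pinning is just an affinity mask (sketch):

  #define _GNU_SOURCE
  #include <sched.h>

  /* Pin the calling thread to a single cpu.  Unlike SCHED_RR this
   * does nothing to keep other processes off that cpu - for that
   * you also have to pin everything else away (cpusets, isolcpus). */
  static int pin_to_cpu(int cpu)
  {
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(0, sizeof(set), &set);	/* 0 = this thread */
  }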

	David

