Re: [PATCH 0/3] softirq: uncontroversial change

From: Paolo Abeni <pabeni@redhat.com>
To: Jakub Kicinski <kuba@kernel.org>,
	peterz@infradead.org, tglx@linutronix.de
Cc: jstultz@google.com, edumazet@google.com, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 0/3] softirq: uncontroversial change
Date: Thu, 20 Apr 2023 19:24:02 +0200	[thread overview]
Message-ID: <305d7742212cbe98621b16be782b0562f1012cb6.camel@redhat.com> (raw)
In-Reply-To: <20221222221244.1290833-1-kuba@kernel.org>

Hi all,
On Thu, 2022-12-22 at 14:12 -0800, Jakub Kicinski wrote:
> Catching up on LWN I run across the article about softirq
> changes, and then I noticed fresh patches in Peter's tree.
> So probably wise for me to throw these out there.
> 
> My (can I say Meta's?) problem is the opposite to what the RT
> sensitive people complain about. In the current scheme once
> ksoftirqd is woken no network processing happens until it runs.
> 
> When networking gets overloaded - that's probably fair, the problem
> is that we confuse latency tweaks with overload protection. We have
> a needs_resched() in the loop condition (which is a latency tweak)
> Most often we defer to ksoftirqd because we're trying to be nice
> and let user space respond quickly, not because there is an
> overload. But the user space may not be nice, and sit on the CPU
> for 10ms+. Also the sirq's "work allowance" is 2ms, which is
> uncomfortably close to the timer tick, but that's another story.
> 
> We have a sirq latency tracker in our prod kernel which catches
> 8ms+ stalls of net Tx (packets queued to the NIC but there is
> no NAPI cleanup within 8ms) and with these patches applied
> on 5.19 fully loaded web machine sees a drop in stalls from
> 1.8 stalls/sec to 0.16/sec. I also see a 50% drop in outgoing
> TCP retransmissions and ~10% drop in non-TLP incoming ones.
> This is not a network-heavy workload so most of the rtx are
> due to scheduling artifacts.
> 
> The network latency in a datacenter is somewhere around neat
> 1000x lower than scheduling granularity (around 10us).
> 
> These patches (patch 2 is "the meat") change what we recognize
> as overload. Instead of just checking if "ksoftirqd is woken"
> it also caps how long we consider ourselves to be in overload,
> a time limit which is different based on whether we yield due
> to real resource exhaustion vs just hitting that needs_resched().
> 
> I hope the core concept is not entirely idiotic. It'd be great
> if we could get this in or fold an equivalent concept into ongoing
> work from others, because due to various "scheduler improvements"
> every time we upgrade the production kernel this problem is getting
> worse :(

Please allow me to revive this old thread.

My understanding is that we want to avoid adding more heuristics here,
preferring a consistent refactor.

I would like to propose a revert of:

4cd13c21b207 softirq: Let ksoftirqd do its job

the its follow-ups:

3c53776e29f8 Mark HI and TASKLET softirq synchronous
0f50524789fc softirq: Don't skip softirq execution when softirq thread is parking

The problem originally addressed by 4cd13c21b207 can now be tackled
with the threaded napi, available since:

29863d41bb6e net: implement threaded-able napi poll loop support

Reverting the mentioned commit should address the latency issues
mentioned by Jakub - I verified it solves a somewhat related problem in
my setup - and reduces the layering of heuristics in this area.

A refactor introducing uniform overload detection and proper resource
control will be better, but I admit it's beyond me and anyway it could
still land afterwards.

Any opinion more then welcome!

Thanks,

Paolo