From: Long Li <longli@microsoft.com>
To: Sagi Grimberg <sagi@grimberg.me>, Ming Lei <ming.lei@redhat.com>
Cc: Jens Axboe <axboe@fb.com>, Hannes Reinecke <hare@suse.com>,
John Garry <john.garry@huawei.com>,
Bart Van Assche <bvanassche@acm.org>,
"linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>,
Peter Zijlstra <peterz@infradead.org>,
Daniel Lezcano <daniel.lezcano@linaro.org>,
LKML <linux-kernel@vger.kernel.org>,
"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
Keith Busch <keith.busch@intel.com>,
Ingo Molnar <mingo@redhat.com>,
Thomas Gleixner <tglx@linutronix.de>,
Christoph Hellwig <hch@lst.de>
Subject: RE: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
Date: Tue, 24 Sep 2019 00:57:39 +0000 [thread overview]
Message-ID: <CY4PR21MB0741BC9992A7A945A0D4A62CCE840@CY4PR21MB0741.namprd21.prod.outlook.com> (raw)
In-Reply-To: <100d001a-1dda-32ff-fa5e-c18b121444d9@grimberg.me>
>Thanks for the clarification.
>
>The problem with what Ming is proposing, in my mind (and it's a problem
>that exists today), is that nvme takes precedence over anything else until
>it absolutely cannot hog the cpu in hardirq.
>
>In the thread, Ming referenced a case where today, if the cpu core has net
>softirq activity, it cannot make forward progress. So with Ming's
>suggestion, net softirq will eventually make progress, but it creates an
>inherent fairness issue. Who said that nvme completions should come faster
>than the net rx/tx or another I/O device (or hrtimers or sched events...)?
>
>As much as I'd like nvme to complete as soon as possible, I might have
>other activities in the system that are as important, if not more so. So I
>don't think we can solve this with something that is not cooperative or
>fair with the rest of the system.
>
>>> If we are context switching too much, it means the soft-irq operation
>>> is not efficient, not necessarily that the completion path is running
>>> in soft-irq.
>>>
>>> Is your kernel compiled with full preemption or voluntary preemption?
>>
>> The tests are based on the Ubuntu 18.04 kernel configuration. Here are
>> the parameters:
>>
>> # CONFIG_PREEMPT_NONE is not set
>> CONFIG_PREEMPT_VOLUNTARY=y
>> # CONFIG_PREEMPT is not set
>
>I see, so it seems that irq_poll_softirq is still not efficient in reaping
>completions. Reaping the completions is pretty much the same in hard and
>soft irq, so it's really the scheduling part that is creating the overhead
>(which does not exist in hard irq).
>
>Question:
>when you test without the patch (completions are coming in hard-irq), do
>the fio threads that run on the cpu cores handling interrupts get
>substantially lower throughput than the rest of the fio threads? I would
>expect the fio threads running on the first 32 cores to get very low iops
>(overpowered by the nvme interrupts) and the rest to do much more, given
>that nvme has almost no limit on how much time it can spend on processing
>completions.
>
>If need_resched() is causing us to context switch too aggressively, does
>changing that to local_softirq_pending() make things better?
>--
>diff --git a/lib/irq_poll.c b/lib/irq_poll.c
>index d8eab563fa77..05d524fcaf04 100644
>--- a/lib/irq_poll.c
>+++ b/lib/irq_poll.c
>@@ -116,7 +116,7 @@ static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
> 		/*
> 		 * If softirq window is exhausted then punt.
> 		 */
>-		if (need_resched())
>+		if (local_softirq_pending())
> 			break;
> 	}
>--
>
>Although, this can potentially prevent other threads from making forward
>progress... If it is better, perhaps we also need a time limit.
Thanks for this patch. The IOPS was about the same (it tends to fluctuate
more, but stays within 3% variation).

I captured the following from one of the CPUs; all CPUs tend to have similar
numbers. The numbers were captured over 5 seconds and averaged:
                           Context switches/s   Process migrations/s
Without any patch                           5                    0.6
With the previous patch                   640                    104
With this patch                           522                    121
>
>Perhaps we should add statistics/tracing on how many completions we are
>reaping per invocation...
I'll look a bit more into the completions. From the numbers, I think the
increased context switches/migrations are what hurt performance the most.
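
As a first step, an untested sketch like the one below in irq_poll_softirq()
could show how many completions each invocation reaps; iop->poll() already
returns the amount of work done, and trace_printk() is only for quick
measurement:

	unsigned int reaped = 0;
	...
	work = iop->poll(iop, weight);	/* existing poll call */
	reaped += work;
	...
	trace_printk("irq_poll_softirq: reaped %u completions\n", reaped);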
Thanks
Long