From: John Garry <john.garry@huawei.com>
To: Ming Lei <tom.leiming@gmail.com>, <longli@linuxonhyperv.com>
Cc: Ingo Molnar <mingo@redhat.com>, Peter Zijlstra <peterz@infradead.org>,
	Keith Busch <keith.busch@intel.com>, Jens Axboe <axboe@fb.com>,
	Christoph Hellwig <hch@lst.de>, Sagi Grimberg <sagi@grimberg.me>,
	linux-nvme <linux-nvme@lists.infradead.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Long Li <longli@microsoft.com>, Thomas Gleixner <tglx@linutronix.de>,
	chenxiang <chenxiang66@hisilicon.com>
Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe
Date: Tue, 20 Aug 2019 09:59:32 +0100
Message-ID: <fd7d6101-37f4-2d34-f2f7-cfeade610278@huawei.com>
In-Reply-To: <CACVXFVPCiTU0mtXKS0fyMccPXN6hAdZNHv6y-f8-tz=FE=BV=g@mail.gmail.com>

On 20/08/2019 09:25, Ming Lei wrote:
> On Tue, Aug 20, 2019 at 2:14 PM <longli@linuxonhyperv.com> wrote:
>>
>> From: Long Li <longli@microsoft.com>
>>
>> This patch set tries to fix interrupt swamp in NVMe devices.
>>
>> On large systems with many CPUs, a number of CPUs may share one NVMe
>> hardware queue. Several CPUs may then be issuing I/Os while all the
>> completions are returned on the CPU to which the hardware queue is
>> bound. This can leave that CPU swamped by interrupts, stuck in
>> interrupt mode for an extended time while other CPUs continue to
>> issue I/O. This can trigger watchdog and RCU timeouts and make the
>> system unresponsive.
>>
>> This patch set addresses this by enforcing scheduling and throttling
>> I/O when a CPU is starved in this situation.
>>
>> Long Li (3):
>>   sched: define a function to report the number of context switches on
>>     a CPU
>>   sched: export idle_cpu()
>>   nvme: complete request in work queue on CPU with flooded interrupts
>>
>>  drivers/nvme/host/core.c | 57 +++++++++++++++++++++++++++++++++++++++-
>>  drivers/nvme/host/nvme.h |  1 +
>>  include/linux/sched.h    |  2 ++
>>  kernel/sched/core.c      |  7 +++++
>>  4 files changed, 66 insertions(+), 1 deletion(-)
>
> Another simpler solution may be to complete the request in a threaded
> interrupt handler for this case. Meanwhile, allow the scheduler to run
> the interrupt thread handler on CPUs specified by the irq affinity
> mask, which was discussed in the following link:
>
> https://lore.kernel.org/lkml/e0e9478e-62a5-ca24-3b12-58f7d056383e@huawei.com/
>
> Could you try the above solution and see if the lockup can be avoided?
> John Garry should have a workable patch.

Yeah, so we experimented with changing the interrupt handling in the
SCSI driver I maintain to use a threaded IRQ handler plus the patch
below, and saw a significant throughput boost:

--->8

Subject: [PATCH] genirq: Add support to allow thread to use hard irq affinity

Currently the cpu allowed mask for the threaded part of a threaded irq
handler will be set to the effective affinity of the hard irq.

Typically the effective affinity of the hard irq will be for a single
cpu. As such, the threaded handler would always run on the same cpu as
the hard irq.

We have seen scenarios in high data-rate throughput testing where the
cpu handling the interrupt can be totally saturated handling both the
hard interrupt and threaded handler parts, limiting throughput.

Add an IRQF_IRQ_AFFINITY flag to allow the driver requesting the
threaded interrupt to decide on the policy of which cpu the threaded
handler may run.
Signed-off-by: John Garry <john.garry@huawei.com>

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 5b8328a99b2a..48e8b955989a 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -61,6 +61,9 @@
  *                interrupt handler after suspending interrupts. For system
  *                wakeup devices users need to implement wakeup detection in
  *                their interrupt handlers.
+ * IRQF_IRQ_AFFINITY - Use the hard interrupt affinity for setting the cpu
+ *                allowed mask for the threaded handler of a threaded interrupt
+ *                handler, rather than the effective hard irq affinity.
  */
 #define IRQF_SHARED		0x00000080
 #define IRQF_PROBE_SHARED	0x00000100
@@ -74,6 +77,7 @@
 #define IRQF_NO_THREAD		0x00010000
 #define IRQF_EARLY_RESUME	0x00020000
 #define IRQF_COND_SUSPEND	0x00040000
+#define IRQF_IRQ_AFFINITY	0x00080000
 
 #define IRQF_TIMER		(__IRQF_TIMER | IRQF_NO_SUSPEND | IRQF_NO_THREAD)
 
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index e8f7f179bf77..cb483a055512 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -966,9 +966,13 @@ irq_thread_check_affinity(struct irq_desc *desc, struct irqaction *action)
 	 * mask pointer. For CPU_MASK_OFFSTACK=n this is optimized out.
 	 */
 	if (cpumask_available(desc->irq_common_data.affinity)) {
+		struct irq_data *irq_data = &desc->irq_data;
 		const struct cpumask *m;
 
-		m = irq_data_get_effective_affinity_mask(&desc->irq_data);
+		if (action->flags & IRQF_IRQ_AFFINITY)
+			m = desc->irq_common_data.affinity;
+		else
+			m = irq_data_get_effective_affinity_mask(irq_data);
 		cpumask_copy(mask, m);
 	} else {
 		valid = false;
-- 
2.17.1

As Ming mentioned in that same thread, we could even make this policy
for managed interrupts.

Cheers,
John

>
> Thanks,
> Ming Lei
>
> .
>
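P.S. For anyone who wants to try this out, a driver would opt in to the
new flag along roughly these lines. This is only a sketch against the
patch above, not code from any posted series, and the my_* names are
made up for illustration:

#include <linux/interrupt.h>

/*
 * Hard irq handler: do the bare minimum in hard irq context (e.g. ack
 * the hardware) and defer the heavy lifting to the thread.
 */
static irqreturn_t my_hard_handler(int irq, void *dev_id)
{
	return IRQ_WAKE_THREAD;
}

/*
 * Threaded handler: completion processing runs here in a kernel thread.
 * With IRQF_IRQ_AFFINITY from the patch above, the scheduler may place
 * this thread on any CPU in the irq affinity mask, rather than pinning
 * it to the hard irq's effective CPU.
 */
static irqreturn_t my_thread_fn(int irq, void *dev_id)
{
	return IRQ_HANDLED;
}

static int my_setup_irq(unsigned int irq, void *dev_id)
{
	return request_threaded_irq(irq, my_hard_handler, my_thread_fn,
				    IRQF_IRQ_AFFINITY, "my-dev", dev_id);
}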