* [PATCH 0/3] fix interrupt swamp in NVMe
@ 2019-08-20  6:14 longli
  2019-08-20  6:14 ` [PATCH 1/3] sched: define a function to report the number of context switches on a CPU longli
                   ` (3 more replies)
  0 siblings, 4 replies; 30+ messages in thread
From: longli @ 2019-08-20  6:14 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, linux-kernel
  Cc: Long Li

From: Long Li <longli@microsoft.com>

This patch set tries to fix interrupt swamp in NVMe devices.

On large systems with many CPUs, a number of CPUs may share one NVMe hardware
queue. A situation can arise where several CPUs are issuing I/Os and all the
I/Os are returned on the CPU to which the hardware queue is bound. That CPU
can be swamped by interrupts and stay in interrupt mode for an extended time
while the other CPUs continue to issue I/O. This can trigger the watchdog and
RCU stall timeouts, and make the system unresponsive.

This patch set addresses the problem by enforcing scheduling and throttling
I/O when a CPU is starved in this situation.

Long Li (3):
  sched: define a function to report the number of context switches on a
    CPU
  sched: export idle_cpu()
  nvme: complete request in work queue on CPU with flooded interrupts

 drivers/nvme/host/core.c | 57 +++++++++++++++++++++++++++++++++++++++-
 drivers/nvme/host/nvme.h |  1 +
 include/linux/sched.h    |  2 ++
 kernel/sched/core.c      |  7 +++++
 4 files changed, 66 insertions(+), 1 deletion(-)

-- 
2.17.1



* [PATCH 1/3] sched: define a function to report the number of context switches on a CPU
  2019-08-20  6:14 [PATCH 0/3] fix interrupt swamp in NVMe longli
@ 2019-08-20  6:14 ` longli
  2019-08-20  9:38   ` Peter Zijlstra
  2019-08-20  9:39   ` Peter Zijlstra
  2019-08-20  6:14 ` [PATCH 2/3] sched: export idle_cpu() longli
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 30+ messages in thread
From: longli @ 2019-08-20  6:14 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, linux-kernel
  Cc: Long Li

From: Long Li <longli@microsoft.com>

The number of context switches on a CPU is useful for determining how busy
that CPU is processing IRQs. Export this information so it can be used by
device drivers.

Signed-off-by: Long Li <longli@microsoft.com>
---
 include/linux/sched.h | 1 +
 kernel/sched/core.c   | 6 ++++++
 2 files changed, 7 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9b35aff09f70..575f1ef7b159 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1500,6 +1500,7 @@ current_restore_flags(unsigned long orig_flags, unsigned long flags)
 
 extern int cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
 extern int task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allowed);
+extern u64 get_cpu_rq_switches(int cpu);
 #ifdef CONFIG_SMP
 extern void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask);
 extern int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4a8e7207cafa..1a76f0e97c2d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1143,6 +1143,12 @@ int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
 }
 EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
 
+u64 get_cpu_rq_switches(int cpu)
+{
+	return cpu_rq(cpu)->nr_switches;
+}
+EXPORT_SYMBOL_GPL(get_cpu_rq_switches);
+
 void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 {
 #ifdef CONFIG_SCHED_DEBUG
-- 
2.17.1



* [PATCH 2/3] sched: export idle_cpu()
  2019-08-20  6:14 [PATCH 0/3] fix interrupt swamp in NVMe longli
  2019-08-20  6:14 ` [PATCH 1/3] sched: define a function to report the number of context switches on a CPU longli
@ 2019-08-20  6:14 ` longli
  2019-08-20  6:14 ` [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts longli
  2019-08-20  8:25 ` [PATCH 0/3] fix interrupt swamp in NVMe Ming Lei
  3 siblings, 0 replies; 30+ messages in thread
From: longli @ 2019-08-20  6:14 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, linux-kernel
  Cc: Long Li

From: Long Li <longli@microsoft.com>

This function is useful for device drivers to check if this CPU has work to
do in process context.

Signed-off-by: Long Li <longli@microsoft.com>
---
 include/linux/sched.h | 1 +
 kernel/sched/core.c   | 1 +
 2 files changed, 2 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 575f1ef7b159..a209941c1770 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1501,6 +1501,7 @@ current_restore_flags(unsigned long orig_flags, unsigned long flags)
 extern int cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
 extern int task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allowed);
 extern u64 get_cpu_rq_switches(int cpu);
+extern int idle_cpu(int cpu);
 #ifdef CONFIG_SMP
 extern void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask);
 extern int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1a76f0e97c2d..d1cedfb38174 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4023,6 +4023,7 @@ int idle_cpu(int cpu)
 
 	return 1;
 }
+EXPORT_SYMBOL_GPL(idle_cpu);
 
 /**
  * available_idle_cpu - is a given CPU idle for enqueuing work.
-- 
2.17.1



* [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts
  2019-08-20  6:14 [PATCH 0/3] fix interrupt swamp in NVMe longli
  2019-08-20  6:14 ` [PATCH 1/3] sched: define a function to report the number of context switches on a CPU longli
  2019-08-20  6:14 ` [PATCH 2/3] sched: export idle_cpu() longli
@ 2019-08-20  6:14 ` longli
  2019-08-20  9:52   ` Peter Zijlstra
  2019-08-20 17:33   ` Sagi Grimberg
  2019-08-20  8:25 ` [PATCH 0/3] fix interrupt swamp in NVMe Ming Lei
  3 siblings, 2 replies; 30+ messages in thread
From: longli @ 2019-08-20  6:14 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, linux-nvme, linux-kernel
  Cc: Long Li

From: Long Li <longli@microsoft.com>

When an NVMe hardware queue is mapped to several CPU queues, it is possible
that the CPU this hardware queue is bound to is flooded by I/O returned for
other CPUs.

For example, consider the following scenario:
1. CPU 0, 1, 2 and 3 share the same hardware queue
2. the hardware queue interrupts CPU 0 for I/O response
3. processes from CPU 1, 2 and 3 keep sending I/Os

CPU 0 may be flooded with interrupts from the NVMe device that are I/O
responses for CPUs 1, 2 and 3. Under heavy I/O load, it is possible that CPU 0
spends all its time serving NVMe and other system interrupts and never gets a
chance to run in process context.

To fix this, CPU 0 can schedule a work item to complete the I/O request when
it detects that the scheduler is not making progress. This serves multiple
purposes:

1. This CPU has to be scheduled to complete the request. The other CPUs can't
issue more I/Os until some previous I/Os are completed. This helps this CPU
get out of NVMe interrupts.

2. This acts as a throttling mechanism for NVMe devices, in that they cannot
starve a CPU while it services I/Os from other CPUs.

3. This CPU can make progress on RCU and other work items on its queue.

Signed-off-by: Long Li <longli@microsoft.com>
---
 drivers/nvme/host/core.c | 57 +++++++++++++++++++++++++++++++++++++++-
 drivers/nvme/host/nvme.h |  1 +
 2 files changed, 57 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 6a9dd68c0f4f..576bb6fce293 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -28,6 +28,7 @@
 #include <linux/t10-pi.h>
 #include <linux/pm_qos.h>
 #include <asm/unaligned.h>
+#include <linux/sched.h>
 
 #define CREATE_TRACE_POINTS
 #include "trace.h"
@@ -97,6 +98,15 @@ static dev_t nvme_chr_devt;
 static struct class *nvme_class;
 static struct class *nvme_subsys_class;
 
+/*
+ * The following are for detecting if this CPU is flooded with interrupts.
+ * The time of the last context switch is recorded. If that was at least
+ * MAX_SCHED_TIMEOUT ago, try to recover from the interrupt flood.
+ */
+static DEFINE_PER_CPU(u64, last_switch);
+static DEFINE_PER_CPU(u64, last_clock);
+#define MAX_SCHED_TIMEOUT 2000000000	// 2 seconds in ns
+
 static int nvme_revalidate_disk(struct gendisk *disk);
 static void nvme_put_subsystem(struct nvme_subsystem *subsys);
 static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
@@ -260,9 +270,54 @@ static void nvme_retry_req(struct request *req)
 	blk_mq_delay_kick_requeue_list(req->q, delay);
 }
 
+static void nvme_complete_rq_work(struct work_struct *work)
+{
+	struct nvme_request *nvme_rq =
+		container_of(work, struct nvme_request, work);
+	struct request *req = blk_mq_rq_from_pdu(nvme_rq);
+
+	nvme_complete_rq(req);
+}
+
+
 void nvme_complete_rq(struct request *req)
 {
-	blk_status_t status = nvme_error_status(req);
+	blk_status_t status;
+	int cpu;
+	u64 switches;
+	struct nvme_request *nvme_rq;
+
+	if (!in_interrupt())
+		goto skip_check;
+
+	nvme_rq = nvme_req(req);
+	cpu = smp_processor_id();
+	if (idle_cpu(cpu))
+		goto skip_check;
+
+	/* Check if this CPU is flooded with interrupts */
+	switches = get_cpu_rq_switches(cpu);
+	if (this_cpu_read(last_switch) == switches) {
+		/*
+		 * If this CPU hasn't made a context switch in
+		 * MAX_SCHED_TIMEOUT ns (and it's not idle), schedule a work to
+		 * complete this I/O. This forces this CPU to run non-interrupt
+		 * code and throttles the other CPUs issuing the I/O.
+		 */
+		if (sched_clock() - this_cpu_read(last_clock)
+				> MAX_SCHED_TIMEOUT) {
+			INIT_WORK(&nvme_rq->work, nvme_complete_rq_work);
+			schedule_work_on(cpu, &nvme_rq->work);
+			return;
+		}
+
+	} else {
+		this_cpu_write(last_switch, switches);
+		this_cpu_write(last_clock, sched_clock());
+	}
+
+skip_check:
+	status = nvme_error_status(req);
 
 	trace_nvme_complete_rq(req);
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 0a4a7f5e0de7..a8876e69e476 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -113,6 +113,7 @@ struct nvme_request {
 	u8			flags;
 	u16			status;
 	struct nvme_ctrl	*ctrl;
+	struct work_struct	work;
 };
 
 /*
-- 
2.17.1



* Re: [PATCH 0/3] fix interrupt swamp in NVMe
  2019-08-20  6:14 [PATCH 0/3] fix interrupt swamp in NVMe longli
                   ` (2 preceding siblings ...)
  2019-08-20  6:14 ` [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts longli
@ 2019-08-20  8:25 ` Ming Lei
  2019-08-20  8:59   ` John Garry
  3 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2019-08-20  8:25 UTC (permalink / raw)
  To: longli
  Cc: Ingo Molnar, Peter Zijlstra, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, linux-nvme,
	Linux Kernel Mailing List, Long Li, John Garry, Thomas Gleixner

On Tue, Aug 20, 2019 at 2:14 PM <longli@linuxonhyperv.com> wrote:
>
> From: Long Li <longli@microsoft.com>
>
> This patch set tries to fix interrupt swamp in NVMe devices.
>
> On large systems with many CPUs, a number of CPUs may share one NVMe hardware
> queue. It may have this situation where several CPUs are issuing I/Os, and
> all the I/Os are returned on the CPU where the hardware queue is bound to.
> This may result in that CPU swamped by interrupts and stay in interrupt mode
> for extended time while other CPUs continue to issue I/O. This can trigger
> Watchdog and RCU timeout, and make the system unresponsive.
>
> This patch set addresses this by enforcing scheduling and throttling I/O when
> CPU is starved in this situation.
>
> Long Li (3):
>   sched: define a function to report the number of context switches on a
>     CPU
>   sched: export idle_cpu()
>   nvme: complete request in work queue on CPU with flooded interrupts
>
>  drivers/nvme/host/core.c | 57 +++++++++++++++++++++++++++++++++++++++-
>  drivers/nvme/host/nvme.h |  1 +
>  include/linux/sched.h    |  2 ++
>  kernel/sched/core.c      |  7 +++++
>  4 files changed, 66 insertions(+), 1 deletion(-)

Another, simpler solution may be to complete the request in a threaded
interrupt handler for this case. Meanwhile, allow the scheduler to run the
interrupt thread handler on the CPUs specified by the irq affinity mask, which
was discussed in the following link:

https://lore.kernel.org/lkml/e0e9478e-62a5-ca24-3b12-58f7d056383e@huawei.com/

Could you try the above solution and see if the lockup can be avoided?
John Garry should have a workable patch.

Thanks,
Ming Lei
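
For reference, a minimal sketch of the threaded-handler approach described
above, assuming the existing nvme_irq() path were split into a hard handler
that only checks for pending CQEs and a thread function that reaps them. The
function names and the registration helper below are illustrative only, not
the actual pci.c code:

/* Hard handler: runs in hard-irq context and only defers the work. */
static irqreturn_t nvme_irq_check(int irq, void *data)
{
	struct nvme_queue *nvmeq = data;

	if (nvme_cqe_pending(nvmeq))
		return IRQ_WAKE_THREAD;		/* wake the irq thread */
	return IRQ_NONE;
}

/*
 * Thread function: runs in process context, so it can be preempted and
 * (with the genirq change discussed below) migrated within the irq's
 * affinity mask instead of monopolizing one CPU.
 */
static irqreturn_t nvme_irq_thread(int irq, void *data)
{
	struct nvme_queue *nvmeq = data;
	u16 start, end;

	nvme_process_cq(nvmeq, &start, &end, -1);
	nvme_complete_cqes(nvmeq, start, end);
	return IRQ_HANDLED;
}

/* Registration from queue setup: supplying thread_fn moves completion
 * processing out of hard-irq context. */
static int nvme_setup_threaded_irq(struct pci_dev *pdev, struct nvme_queue *nvmeq)
{
	return request_threaded_irq(pci_irq_vector(pdev, nvmeq->cq_vector),
				    nvme_irq_check, nvme_irq_thread,
				    IRQF_SHARED, "nvme", nvmeq);
}

The in-tree driver already exposes a use_threaded_interrupts module parameter
that takes a similar path via pci_request_irq(), which is presumably how the
threaded-interrupt numbers later in this thread were gathered.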


* Re: [PATCH 0/3] fix interrupt swamp in NVMe
  2019-08-20  8:25 ` [PATCH 0/3] fix interrupt swamp in NVMe Ming Lei
@ 2019-08-20  8:59   ` John Garry
  2019-08-20 15:05     ` Keith Busch
  2019-08-21  7:47     ` Long Li
  0 siblings, 2 replies; 30+ messages in thread
From: John Garry @ 2019-08-20  8:59 UTC (permalink / raw)
  To: Ming Lei, longli
  Cc: Ingo Molnar, Peter Zijlstra, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, linux-nvme,
	Linux Kernel Mailing List, Long Li, Thomas Gleixner, chenxiang

On 20/08/2019 09:25, Ming Lei wrote:
> On Tue, Aug 20, 2019 at 2:14 PM <longli@linuxonhyperv.com> wrote:
>>
>> From: Long Li <longli@microsoft.com>
>>
>> This patch set tries to fix interrupt swamp in NVMe devices.
>>
>> On large systems with many CPUs, a number of CPUs may share one NVMe hardware
>> queue. It may have this situation where several CPUs are issuing I/Os, and
>> all the I/Os are returned on the CPU where the hardware queue is bound to.
>> This may result in that CPU swamped by interrupts and stay in interrupt mode
>> for extended time while other CPUs continue to issue I/O. This can trigger
>> Watchdog and RCU timeout, and make the system unresponsive.
>>
>> This patch set addresses this by enforcing scheduling and throttling I/O when
>> CPU is starved in this situation.
>>
>> Long Li (3):
>>   sched: define a function to report the number of context switches on a
>>     CPU
>>   sched: export idle_cpu()
>>   nvme: complete request in work queue on CPU with flooded interrupts
>>
>>  drivers/nvme/host/core.c | 57 +++++++++++++++++++++++++++++++++++++++-
>>  drivers/nvme/host/nvme.h |  1 +
>>  include/linux/sched.h    |  2 ++
>>  kernel/sched/core.c      |  7 +++++
>>  4 files changed, 66 insertions(+), 1 deletion(-)
>
> Another simpler solution may be to complete request in threaded interrupt
> handler for this case. Meantime allow scheduler to run the interrupt thread
> handler on CPUs specified by the irq affinity mask, which was discussed by
> the following link:
>
> https://lore.kernel.org/lkml/e0e9478e-62a5-ca24-3b12-58f7d056383e@huawei.com/
>
> Could you try the above solution and see if the lockup can be avoided?
> John Garry
> should have workable patch.

Yeah, so we experimented with changing the interrupt handling in the
SCSI driver I maintain to use a threaded IRQ handler plus the patch
below, and saw a significant throughput boost:

--->8

Subject: [PATCH] genirq: Add support to allow thread to use hard irq 
affinity

Currently the cpu allowed mask for the threaded part of a threaded irq
handler will be set to the effective affinity of the hard irq.

Typically the effective affinity of the hard irq will be for a single
cpu. As such, the threaded handler would always run on the same cpu as
the hard irq.

We have seen scenarios in high data-rate throughput testing where the
cpu handling the interrupt can be totally saturated handling both the
hard interrupt and threaded handler parts, limiting throughput.

Add IRQF_IRQ_AFFINITY flag to allow the driver requesting the threaded
interrupt to decide on the policy of which cpu the threaded handler
may run.

Signed-off-by: John Garry <john.garry@huawei.com>

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 5b8328a99b2a..48e8b955989a 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -61,6 +61,9 @@
   *                interrupt handler after suspending interrupts. For system
   *                wakeup devices users need to implement wakeup detection in
   *                their interrupt handlers.
+ * IRQF_IRQ_AFFINITY - Use the hard interrupt affinity for setting the cpu
+ *                allowed mask for the threaded handler of a threaded interrupt
+ *                handler, rather than the effective hard irq affinity.
   */
  #define IRQF_SHARED		0x00000080
  #define IRQF_PROBE_SHARED	0x00000100
@@ -74,6 +77,7 @@
  #define IRQF_NO_THREAD		0x00010000
  #define IRQF_EARLY_RESUME	0x00020000
  #define IRQF_COND_SUSPEND	0x00040000
+#define IRQF_IRQ_AFFINITY	0x00080000

  #define IRQF_TIMER		(__IRQF_TIMER | IRQF_NO_SUSPEND | IRQF_NO_THREAD)

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index e8f7f179bf77..cb483a055512 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -966,9 +966,13 @@ irq_thread_check_affinity(struct irq_desc *desc, struct irqaction *action)
  	 * mask pointer. For CPU_MASK_OFFSTACK=n this is optimized out.
  	 */
  	if (cpumask_available(desc->irq_common_data.affinity)) {
+		struct irq_data *irq_data = &desc->irq_data;
  		const struct cpumask *m;

-		m = irq_data_get_effective_affinity_mask(&desc->irq_data);
+		if (action->flags & IRQF_IRQ_AFFINITY)
+			m = desc->irq_common_data.affinity;
+		else
+			m = irq_data_get_effective_affinity_mask(irq_data);
  		cpumask_copy(mask, m);
  	} else {
  		valid = false;
-- 
2.17.1

As Ming mentioned in that same thread, we could even make this policy 
for managed interrupts.

Cheers,
John

>
> Thanks,
> Ming Lei
>
> .
>
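
As a usage illustration only (the device, handler and function names here are
placeholders, and the flag exists only with the patch above applied), a driver
opting in would pass IRQF_IRQ_AFFINITY when requesting its threaded interrupt:

/* Hypothetical per-driver opt-in: with IRQF_IRQ_AFFINITY set, the threaded
 * handler may be scheduled on any CPU in the hard irq's affinity mask,
 * rather than being pinned to its effective (usually single-CPU) affinity. */
static int my_setup_irq(struct my_dev *mdev, int irq)
{
	return request_threaded_irq(irq, my_hard_handler, my_thread_fn,
				    IRQF_SHARED | IRQF_IRQ_AFFINITY,
				    "my-dev", mdev);
}

Long Li's reply further down takes the broader route of setting the flag for
all PCI devices in pci_request_irq() rather than a per-driver opt-in.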




* Re: [PATCH 1/3] sched: define a function to report the number of context switches on a CPU
  2019-08-20  6:14 ` [PATCH 1/3] sched: define a function to report the number of context switches on a CPU longli
@ 2019-08-20  9:38   ` Peter Zijlstra
  2019-08-21  8:20     ` Long Li
  2019-08-20  9:39   ` Peter Zijlstra
  1 sibling, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2019-08-20  9:38 UTC (permalink / raw)
  To: longli
  Cc: Ingo Molnar, Keith Busch, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, linux-nvme, linux-kernel, Long Li

On Mon, Aug 19, 2019 at 11:14:27PM -0700, longli@linuxonhyperv.com wrote:
> From: Long Li <longli@microsoft.com>
> 
> The number of context switches on a CPU is useful to determine how busy this
> CPU is on processing IRQs. Export this information so it can be used by device
> drivers.

Please do explain that; because I'm not seeing how number of switches
relates to processing IRQs _at_all_!


* Re: [PATCH 1/3] sched: define a function to report the number of context switches on a CPU
  2019-08-20  6:14 ` [PATCH 1/3] sched: define a function to report the number of context switches on a CPU longli
  2019-08-20  9:38   ` Peter Zijlstra
@ 2019-08-20  9:39   ` Peter Zijlstra
  1 sibling, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2019-08-20  9:39 UTC (permalink / raw)
  To: longli
  Cc: Ingo Molnar, Keith Busch, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, linux-nvme, linux-kernel, Long Li

On Mon, Aug 19, 2019 at 11:14:27PM -0700, longli@linuxonhyperv.com wrote:

> +u64 get_cpu_rq_switches(int cpu)
> +{
> +	return cpu_rq(cpu)->nr_switches;
> +}
> +EXPORT_SYMBOL_GPL(get_cpu_rq_switches);

Also, that is broken on 32bit.


* Re: [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts
  2019-08-20  6:14 ` [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts longli
@ 2019-08-20  9:52   ` Peter Zijlstra
  2019-08-21  8:37     ` Long Li
  2019-08-20 17:33   ` Sagi Grimberg
  1 sibling, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2019-08-20  9:52 UTC (permalink / raw)
  To: longli
  Cc: Ingo Molnar, Keith Busch, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, linux-nvme, linux-kernel, Long Li

On Mon, Aug 19, 2019 at 11:14:29PM -0700, longli@linuxonhyperv.com wrote:
> From: Long Li <longli@microsoft.com>
> 
> When a NVMe hardware queue is mapped to several CPU queues, it is possible
> that the CPU this hardware queue is bound to is flooded by returning I/O for
> other CPUs.
> 
> For example, consider the following scenario:
> 1. CPU 0, 1, 2 and 3 share the same hardware queue
> 2. the hardware queue interrupts CPU 0 for I/O response
> 3. processes from CPU 1, 2 and 3 keep sending I/Os
> 
> CPU 0 may be flooded with interrupts from NVMe device that are I/O responses
> for CPU 1, 2 and 3. Under heavy I/O load, it is possible that CPU 0 spends
> all the time serving NVMe and other system interrupts, but doesn't have a
> chance to run in process context.

Ideally -- and there is some code to affect this, the load-balancer will
move tasks away from this CPU.

> To fix this, CPU 0 can schedule a work to complete the I/O request when it
> detects the scheduler is not making progress. This serves multiple purposes:

Suppose the task waiting for the IO completion is a RT task, and you've
just queued it to a regular work. This is an instant priority inversion.

> 1. This CPU has to be scheduled to complete the request. The other CPUs can't
> issue more I/Os until some previous I/Os are completed. This helps this CPU
> get out of NVMe interrupts.
> 
> 2. This acts a throttling mechanisum for NVMe devices, in that it can not
> starve a CPU while servicing I/Os from other CPUs.
> 
> 3. This CPU can make progress on RCU and other work items on its queue.
> 
> Signed-off-by: Long Li <longli@microsoft.com>
> ---
>  drivers/nvme/host/core.c | 57 +++++++++++++++++++++++++++++++++++++++-
>  drivers/nvme/host/nvme.h |  1 +
>  2 files changed, 57 insertions(+), 1 deletion(-)

WTH does this live in the NVME driver? Surely something like this should
be in the block layer. I'm thinking there's fiber channel connected
storage that should be able to trigger much the same issues.

> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 6a9dd68c0f4f..576bb6fce293 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c

> @@ -260,9 +270,54 @@ static void nvme_retry_req(struct request *req)
>  	blk_mq_delay_kick_requeue_list(req->q, delay);
>  }
>  
> +static void nvme_complete_rq_work(struct work_struct *work)
> +{
> +	struct nvme_request *nvme_rq =
> +		container_of(work, struct nvme_request, work);
> +	struct request *req = blk_mq_rq_from_pdu(nvme_rq);
> +
> +	nvme_complete_rq(req);
> +}
> +
> +
>  void nvme_complete_rq(struct request *req)
>  {
> -	blk_status_t status = nvme_error_status(req);
> +	blk_status_t status;
> +	int cpu;
> +	u64 switches;
> +	struct nvme_request *nvme_rq;
> +
> +	if (!in_interrupt())
> +		goto skip_check;
> +
> +	nvme_rq = nvme_req(req);
> +	cpu = smp_processor_id();
> +	if (idle_cpu(cpu))
> +		goto skip_check;
> +
> +	/* Check if this CPU is flooded with interrupts */
> +	switches = get_cpu_rq_switches(cpu);
> +	if (this_cpu_read(last_switch) == switches) {
> +		/*
> +		 * If this CPU hasn't made a context switch in
> +		 * MAX_SCHED_TIMEOUT ns (and it's not idle), schedule a work to
> +		 * complete this I/O. This forces this CPU run non-interrupt
> +		 * code and throttle the other CPU issuing the I/O
> +		 */

What if there was only a single task on that CPU? Then we'd never
need/want to context switch in the first place.

AFAICT all this is just a whole bunch of gruesome hacks piled on top one
another.

> +		if (sched_clock() - this_cpu_read(last_clock)
> +				> MAX_SCHED_TIMEOUT) {
> +			INIT_WORK(&nvme_rq->work, nvme_complete_rq_work);
> +			schedule_work_on(cpu, &nvme_rq->work);
> +			return;
> +		}
> +
> +	} else {
> +		this_cpu_write(last_switch, switches);
> +		this_cpu_write(last_clock, sched_clock());
> +	}
> +
> +skip_check:

Aside from everything else; this is just sodding poor coding style. What
is wrong with something like:

	if (nvme_complete_throttle(...))
		return;

> +	status = nvme_error_status(req);
>  
>  	trace_nvme_complete_rq(req);
>  
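
A minimal sketch of the refactoring suggested here, with the flood check from
the patch above pulled out of nvme_complete_rq() into a helper (the helper
name is taken from the suggestion; the body just repackages the logic already
posted, so it carries the same caveats raised in this review):

/*
 * Sketch only: return true if completion was deferred to a work item because
 * this CPU appears to be stuck in interrupt context.
 */
static bool nvme_complete_throttle(struct request *req)
{
	struct nvme_request *nvme_rq = nvme_req(req);
	u64 switches;
	int cpu;

	if (!in_interrupt())
		return false;

	cpu = smp_processor_id();
	if (idle_cpu(cpu))
		return false;

	switches = get_cpu_rq_switches(cpu);
	if (this_cpu_read(last_switch) != switches) {
		/* Progress was made; remember when we last saw a switch. */
		this_cpu_write(last_switch, switches);
		this_cpu_write(last_clock, sched_clock());
		return false;
	}

	if (sched_clock() - this_cpu_read(last_clock) <= MAX_SCHED_TIMEOUT)
		return false;

	/* No context switch for MAX_SCHED_TIMEOUT ns: defer to process context. */
	INIT_WORK(&nvme_rq->work, nvme_complete_rq_work);
	schedule_work_on(cpu, &nvme_rq->work);
	return true;
}

void nvme_complete_rq(struct request *req)
{
	blk_status_t status;

	if (nvme_complete_throttle(req))
		return;

	status = nvme_error_status(req);
	/* ... unchanged from here on ... */
}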


* Re: [PATCH 0/3] fix interrupt swamp in NVMe
  2019-08-20  8:59   ` John Garry
@ 2019-08-20 15:05     ` Keith Busch
  2019-08-21  7:47     ` Long Li
  1 sibling, 0 replies; 30+ messages in thread
From: Keith Busch @ 2019-08-20 15:05 UTC (permalink / raw)
  To: John Garry
  Cc: Ming Lei, longli, Ingo Molnar, Peter Zijlstra, Busch, Keith,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme,
	Linux Kernel Mailing List, Long Li, Thomas Gleixner, chenxiang

On Tue, Aug 20, 2019 at 01:59:32AM -0700, John Garry wrote:
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index e8f7f179bf77..cb483a055512 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -966,9 +966,13 @@ irq_thread_check_affinity(struct irq_desc *desc, 
> struct irqaction *action)
>   	 * mask pointer. For CPU_MASK_OFFSTACK=n this is optimized out.
>   	 */
>   	if (cpumask_available(desc->irq_common_data.affinity)) {
> +		struct irq_data *irq_data = &desc->irq_data;
>   		const struct cpumask *m;
> 
> -		m = irq_data_get_effective_affinity_mask(&desc->irq_data);
> +		if (action->flags & IRQF_IRQ_AFFINITY)
> +			m = desc->irq_common_data.affinity;
> +		else
> +			m = irq_data_get_effective_affinity_mask(irq_data);
>   		cpumask_copy(mask, m);
>   	} else {
>   		valid = false;
> -- 
> 2.17.1
> 
> As Ming mentioned in that same thread, we could even make this policy 
> for managed interrupts.

Ack, I really like this option!


* Re: [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts
  2019-08-20  6:14 ` [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts longli
  2019-08-20  9:52   ` Peter Zijlstra
@ 2019-08-20 17:33   ` Sagi Grimberg
  2019-08-21  8:39     ` Long Li
  2019-08-23  3:21     ` Ming Lei
  1 sibling, 2 replies; 30+ messages in thread
From: Sagi Grimberg @ 2019-08-20 17:33 UTC (permalink / raw)
  To: longli, Ingo Molnar, Peter Zijlstra, Keith Busch, Jens Axboe,
	Christoph Hellwig, linux-nvme, linux-kernel
  Cc: Long Li


> From: Long Li <longli@microsoft.com>
> 
> When a NVMe hardware queue is mapped to several CPU queues, it is possible
> that the CPU this hardware queue is bound to is flooded by returning I/O for
> other CPUs.
> 
> For example, consider the following scenario:
> 1. CPU 0, 1, 2 and 3 share the same hardware queue
> 2. the hardware queue interrupts CPU 0 for I/O response
> 3. processes from CPU 1, 2 and 3 keep sending I/Os
> 
> CPU 0 may be flooded with interrupts from NVMe device that are I/O responses
> for CPU 1, 2 and 3. Under heavy I/O load, it is possible that CPU 0 spends
> all the time serving NVMe and other system interrupts, but doesn't have a
> chance to run in process context.
> 
> To fix this, CPU 0 can schedule a work to complete the I/O request when it
> detects the scheduler is not making progress. This serves multiple purposes:
> 
> 1. This CPU has to be scheduled to complete the request. The other CPUs can't
> issue more I/Os until some previous I/Os are completed. This helps this CPU
> get out of NVMe interrupts.
> 
> 2. This acts a throttling mechanisum for NVMe devices, in that it can not
> starve a CPU while servicing I/Os from other CPUs.
> 
> 3. This CPU can make progress on RCU and other work items on its queue.

The problem is indeed real, but this is the wrong approach in my mind.

We already have irqpoll, which takes care of properly budgeting polling
cycles and not hogging the cpu.

I've sent an rfc for this particular problem before [1]. At the time, IIRC,
Christoph suggested that we poll the first batch directly from the irq
context and reap the rest in the irqpoll handler.

[1]: 
http://lists.infradead.org/pipermail/linux-nvme/2016-October/006497.html

How about something like this instead:
--
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 71127a366d3c..84bf16d75109 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -24,6 +24,7 @@
  #include <linux/io-64-nonatomic-lo-hi.h>
  #include <linux/sed-opal.h>
  #include <linux/pci-p2pdma.h>
+#include <linux/irq_poll.h>

  #include "trace.h"
  #include "nvme.h"
@@ -32,6 +33,7 @@
  #define CQ_SIZE(q)     ((q)->q_depth * sizeof(struct nvme_completion))

  #define SGES_PER_PAGE  (PAGE_SIZE / sizeof(struct nvme_sgl_desc))
+#define NVME_POLL_BUDGET_IRQ   256

  /*
   * These can be higher, but we need to ensure that any command doesn't
@@ -189,6 +191,7 @@ struct nvme_queue {
         u32 *dbbuf_cq_db;
         u32 *dbbuf_sq_ei;
         u32 *dbbuf_cq_ei;
+       struct irq_poll iop;
         struct completion delete_done;
  };

@@ -1015,6 +1018,23 @@ static inline int nvme_process_cq(struct nvme_queue *nvmeq, u16 *start,
         return found;
  }

+static int nvme_irqpoll_handler(struct irq_poll *iop, int budget)
+{
+       struct nvme_queue *nvmeq = container_of(iop, struct nvme_queue, iop);
+       struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
+       u16 start, end;
+       int completed;
+
+       completed = nvme_process_cq(nvmeq, &start, &end, budget);
+       nvme_complete_cqes(nvmeq, start, end);
+       if (completed < budget) {
+               irq_poll_complete(&nvmeq->iop);
+               enable_irq(pci_irq_vector(pdev, nvmeq->cq_vector));
+       }
+
+       return completed;
+}
+
  static irqreturn_t nvme_irq(int irq, void *data)
  {
         struct nvme_queue *nvmeq = data;
@@ -1028,12 +1048,16 @@ static irqreturn_t nvme_irq(int irq, void *data)
         rmb();
         if (nvmeq->cq_head != nvmeq->last_cq_head)
                 ret = IRQ_HANDLED;
-       nvme_process_cq(nvmeq, &start, &end, -1);
+       nvme_process_cq(nvmeq, &start, &end, NVME_POLL_BUDGET_IRQ);
         nvmeq->last_cq_head = nvmeq->cq_head;
         wmb();

         if (start != end) {
                 nvme_complete_cqes(nvmeq, start, end);
+               if (nvme_cqe_pending(nvmeq)) {
+                       disable_irq_nosync(irq);
+                       irq_poll_sched(&nvmeq->iop);
+               }
                 return IRQ_HANDLED;
         }

@@ -1347,6 +1371,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)

  static void nvme_free_queue(struct nvme_queue *nvmeq)
  {
+       irq_poll_disable(&nvmeq->iop);
         dma_free_coherent(nvmeq->dev->dev, CQ_SIZE(nvmeq),
                                 (void *)nvmeq->cqes, nvmeq->cq_dma_addr);
         if (!nvmeq->sq_cmds)
@@ -1481,6 +1506,7 @@ static int nvme_alloc_queue(struct nvme_dev *dev, int qid, int depth)
         nvmeq->dev = dev;
         spin_lock_init(&nvmeq->sq_lock);
         spin_lock_init(&nvmeq->cq_poll_lock);
+       irq_poll_init(&nvmeq->iop, NVME_POLL_BUDGET_IRQ, nvme_irqpoll_handler);
         nvmeq->cq_head = 0;
         nvmeq->cq_phase = 1;
         nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
--


* RE: [PATCH 0/3] fix interrupt swamp in NVMe
  2019-08-20  8:59   ` John Garry
  2019-08-20 15:05     ` Keith Busch
@ 2019-08-21  7:47     ` Long Li
  2019-08-21  9:44       ` Ming Lei
  1 sibling, 1 reply; 30+ messages in thread
From: Long Li @ 2019-08-21  7:47 UTC (permalink / raw)
  To: John Garry, Ming Lei, longli
  Cc: Ingo Molnar, Peter Zijlstra, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, linux-nvme,
	Linux Kernel Mailing List, Thomas Gleixner, chenxiang

>>>Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe
>>>
>>>On 20/08/2019 09:25, Ming Lei wrote:
>>>> On Tue, Aug 20, 2019 at 2:14 PM <longli@linuxonhyperv.com> wrote:
>>>>>
>>>>> From: Long Li <longli@microsoft.com>
>>>>>
>>>>> This patch set tries to fix interrupt swamp in NVMe devices.
>>>>>
>>>>> On large systems with many CPUs, a number of CPUs may share one
>>>NVMe
>>>>> hardware queue. It may have this situation where several CPUs are
>>>>> issuing I/Os, and all the I/Os are returned on the CPU where the
>>>hardware queue is bound to.
>>>>> This may result in that CPU swamped by interrupts and stay in
>>>>> interrupt mode for extended time while other CPUs continue to issue
>>>>> I/O. This can trigger Watchdog and RCU timeout, and make the system
>>>unresponsive.
>>>>>
>>>>> This patch set addresses this by enforcing scheduling and throttling
>>>>> I/O when CPU is starved in this situation.
>>>>>
>>>>> Long Li (3):
>>>>>   sched: define a function to report the number of context switches on a
>>>>>     CPU
>>>>>   sched: export idle_cpu()
>>>>>   nvme: complete request in work queue on CPU with flooded interrupts
>>>>>
>>>>>  drivers/nvme/host/core.c | 57
>>>>> +++++++++++++++++++++++++++++++++++++++-
>>>>>  drivers/nvme/host/nvme.h |  1 +
>>>>>  include/linux/sched.h    |  2 ++
>>>>>  kernel/sched/core.c      |  7 +++++
>>>>>  4 files changed, 66 insertions(+), 1 deletion(-)
>>>>
>>>> Another simpler solution may be to complete request in threaded
>>>> interrupt handler for this case. Meantime allow scheduler to run the
>>>> interrupt thread handler on CPUs specified by the irq affinity mask,
>>>> which was discussed by the following link:
>>>>
>>>>
>>>> https://lore.kernel.org/lkml/e0e9478e-62a5-ca24-3b12-58f7d056383e@huawei.com/
>>>>
>>>> Could you try the above solution and see if the lockup can be avoided?
>>>> John Garry
>>>> should have workable patch.
>>>
>>>Yeah, so we experimented with changing the interrupt handling in the SCSI
>>>driver I maintain to use a threaded handler IRQ handler plus patch below,
>>>and saw a significant throughput boost:
>>>
>>>--->8
>>>
>>>Subject: [PATCH] genirq: Add support to allow thread to use hard irq affinity
>>>
>>>Currently the cpu allowed mask for the threaded part of a threaded irq
>>>handler will be set to the effective affinity of the hard irq.
>>>
>>>Typically the effective affinity of the hard irq will be for a single cpu. As such,
>>>the threaded handler would always run on the same cpu as the hard irq.
>>>
>>>We have seen scenarios in high data-rate throughput testing that the cpu
>>>handling the interrupt can be totally saturated handling both the hard
>>>interrupt and threaded handler parts, limiting throughput.
>>>
>>>Add IRQF_IRQ_AFFINITY flag to allow the driver requesting the threaded
>>>interrupt to decide on the policy of which cpu the threaded handler may run.
>>>
>>>Signed-off-by: John Garry <john.garry@huawei.com>

Thanks for pointing me to this patch. This fixed the interrupt swamp and made the system stable.

However I'm seeing reduced performance when using threaded interrupts.

Here are the test results on a system with 80 CPUs and 10 NVMe disks (32 hardware queues for each disk).
The benchmark tool is FIO; I/O pattern: 4k random reads on all NVMe disks, with queue depth = 64, number of jobs = 80, direct=1.

With threaded interrupts: 1320k IOPS
With just interrupts: 3720k IOPS
With just interrupts and my patch: 3700k IOPS

At peak IOPS, the overall CPU usage is around 98-99%. I think the cost of doing a wakeup and context switch for the NVMe threaded IRQ handler takes some CPU away.

In this test, I made the following change to make use of IRQF_IRQ_AFFINITY for NVMe:

diff --git a/drivers/pci/irq.c b/drivers/pci/irq.c
index a1de501a2729..3fb30d16464e 100644
--- a/drivers/pci/irq.c
+++ b/drivers/pci/irq.c
@@ -86,7 +86,7 @@ int pci_request_irq(struct pci_dev *dev, unsigned int nr, irq_handler_t handler,
        va_list ap;
        int ret;
        char *devname;
-       unsigned long irqflags = IRQF_SHARED;
+       unsigned long irqflags = IRQF_SHARED | IRQF_IRQ_AFFINITY;

        if (!handler)
                irqflags |= IRQF_ONESHOT;

Thanks

Long

>>>
>>>diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h index
>>>5b8328a99b2a..48e8b955989a 100644
>>>--- a/include/linux/interrupt.h
>>>+++ b/include/linux/interrupt.h
>>>@@ -61,6 +61,9 @@
>>>   *                interrupt handler after suspending interrupts. For
>>>system
>>>   *                wakeup devices users need to implement wakeup
>>>detection in
>>>   *                their interrupt handlers.
>>>+ * IRQF_IRQ_AFFINITY - Use the hard interrupt affinity for setting the cpu
>>>+ *                allowed mask for the threaded handler of a threaded
>>>interrupt
>>>+ *                handler, rather than the effective hard irq affinity.
>>>   */
>>>  #define IRQF_SHARED		0x00000080
>>>  #define IRQF_PROBE_SHARED	0x00000100
>>>@@ -74,6 +77,7 @@
>>>  #define IRQF_NO_THREAD		0x00010000
>>>  #define IRQF_EARLY_RESUME	0x00020000
>>>  #define IRQF_COND_SUSPEND	0x00040000
>>>+#define IRQF_IRQ_AFFINITY	0x00080000
>>>
>>>  #define IRQF_TIMER		(__IRQF_TIMER | IRQF_NO_SUSPEND |
>>>IRQF_NO_THREAD)
>>>
>>>diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index
>>>e8f7f179bf77..cb483a055512 100644
>>>--- a/kernel/irq/manage.c
>>>+++ b/kernel/irq/manage.c
>>>@@ -966,9 +966,13 @@ irq_thread_check_affinity(struct irq_desc *desc,
>>>struct irqaction *action)
>>>  	 * mask pointer. For CPU_MASK_OFFSTACK=n this is optimized out.
>>>  	 */
>>>  	if (cpumask_available(desc->irq_common_data.affinity)) {
>>>+		struct irq_data *irq_data = &desc->irq_data;
>>>  		const struct cpumask *m;
>>>
>>>-		m = irq_data_get_effective_affinity_mask(&desc-
>>>>irq_data);
>>>+		if (action->flags & IRQF_IRQ_AFFINITY)
>>>+			m = desc->irq_common_data.affinity;
>>>+		else
>>>+			m = irq_data_get_effective_affinity_mask(irq_data);
>>>  		cpumask_copy(mask, m);
>>>  	} else {
>>>  		valid = false;
>>>--
>>>2.17.1
>>>
>>>As Ming mentioned in that same thread, we could even make this policy for
>>>managed interrupts.
>>>
>>>Cheers,
>>>John
>>>
>>>>
>>>> Thanks,
>>>> Ming Lei
>>>>
>>>> .
>>>>
>>>



* RE: [PATCH 1/3] sched: define a function to report the number of context switches on a CPU
  2019-08-20  9:38   ` Peter Zijlstra
@ 2019-08-21  8:20     ` Long Li
  2019-08-21 10:34       ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Long Li @ 2019-08-21  8:20 UTC (permalink / raw)
  To: Peter Zijlstra, longli
  Cc: Ingo Molnar, Keith Busch, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, linux-nvme, linux-kernel

>>>Subject: Re: [PATCH 1/3] sched: define a function to report the number of
>>>context switches on a CPU
>>>
>>>On Mon, Aug 19, 2019 at 11:14:27PM -0700, longli@linuxonhyperv.com
>>>wrote:
>>>> From: Long Li <longli@microsoft.com>
>>>>
>>>> The number of context switches on a CPU is useful to determine how
>>>> busy this CPU is on processing IRQs. Export this information so it can
>>>> be used by device drivers.
>>>
>>>Please do explain that; because I'm not seeing how number of switches
>>>relates to processing IRQs _at_all_!

Some kernel components rely on context switches to make progress, for example the watchdog and RCU. On a CPU with a reasonable interrupt load, context switches keep happening, normally a number of switches per second.

While observing a CPU under heavy interrupt load, I see that it spends all its time in IRQ and softIRQ, and does not get a chance to do a switch (calling __schedule()) for a long time. This makes the system unresponsive at times. The purpose is to find out whether this CPU is in this state, and to implement some throttling mechanism to help reduce the number of interrupts. I think the number of switches is not the most precise way of detecting this condition, but maybe it's good enough.

I agree this may not be the best way. If you have another idea for detecting that a CPU is swamped by interrupts, please point me to where to look.

Thanks

Long




* RE: [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts
  2019-08-20  9:52   ` Peter Zijlstra
@ 2019-08-21  8:37     ` Long Li
  2019-08-21 10:35       ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Long Li @ 2019-08-21  8:37 UTC (permalink / raw)
  To: Peter Zijlstra, longli
  Cc: Ingo Molnar, Keith Busch, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, linux-nvme, linux-kernel

>>>Subject: Re: [PATCH 3/3] nvme: complete request in work queue on CPU
>>>with flooded interrupts
>>>
>>>On Mon, Aug 19, 2019 at 11:14:29PM -0700, longli@linuxonhyperv.com
>>>wrote:
>>>> From: Long Li <longli@microsoft.com>
>>>>
>>>> When a NVMe hardware queue is mapped to several CPU queues, it is
>>>> possible that the CPU this hardware queue is bound to is flooded by
>>>> returning I/O for other CPUs.
>>>>
>>>> For example, consider the following scenario:
>>>> 1. CPU 0, 1, 2 and 3 share the same hardware queue 2. the hardware
>>>> queue interrupts CPU 0 for I/O response 3. processes from CPU 1, 2 and
>>>> 3 keep sending I/Os
>>>>
>>>> CPU 0 may be flooded with interrupts from NVMe device that are I/O
>>>> responses for CPU 1, 2 and 3. Under heavy I/O load, it is possible
>>>> that CPU 0 spends all the time serving NVMe and other system
>>>> interrupts, but doesn't have a chance to run in process context.
>>>
>>>Ideally -- and there is some code to affect this, the load-balancer will move
>>>tasks away from this CPU.
>>>
>>>> To fix this, CPU 0 can schedule a work to complete the I/O request
>>>> when it detects the scheduler is not making progress. This serves multiple
>>>purposes:
>>>
>>>Suppose the task waiting for the IO completion is a RT task, and you've just
>>>queued it to a regular work. This is an instant priority inversion.

This is a choice. We can either avoid "locking up" the CPU, or finish the I/O on time from the IRQ handler. I think throttling only happens in extreme conditions, which are rare. The purpose is to make the whole system responsive and happy.

>>>
>>>> 1. This CPU has to be scheduled to complete the request. The other
>>>> CPUs can't issue more I/Os until some previous I/Os are completed.
>>>> This helps this CPU get out of NVMe interrupts.
>>>>
>>>> 2. This acts a throttling mechanisum for NVMe devices, in that it can
>>>> not starve a CPU while servicing I/Os from other CPUs.
>>>>
>>>> 3. This CPU can make progress on RCU and other work items on its queue.
>>>>
>>>> Signed-off-by: Long Li <longli@microsoft.com>
>>>> ---
>>>>  drivers/nvme/host/core.c | 57
>>>> +++++++++++++++++++++++++++++++++++++++-
>>>>  drivers/nvme/host/nvme.h |  1 +
>>>>  2 files changed, 57 insertions(+), 1 deletion(-)
>>>
>>>WTH does this live in the NVME driver? Surely something like this should be
>>>in the block layer. I'm thinking there's fiber channel connected storage that
>>>should be able to trigger much the same issues.

Yes, this can be done in the block layer. I'm not sure of the best way to accomplish this, so I implemented an NVMe patch to help test. The test results are promising in that we are getting 99.5% of the performance while avoiding CPU lockup. The challenge is to find a way to throttle a fast storage device.

>>>
>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index
>>>> 6a9dd68c0f4f..576bb6fce293 100644
>>>> --- a/drivers/nvme/host/core.c
>>>> +++ b/drivers/nvme/host/core.c
>>>
>>>> @@ -260,9 +270,54 @@ static void nvme_retry_req(struct request *req)
>>>>  	blk_mq_delay_kick_requeue_list(req->q, delay);  }
>>>>
>>>> +static void nvme_complete_rq_work(struct work_struct *work) {
>>>> +	struct nvme_request *nvme_rq =
>>>> +		container_of(work, struct nvme_request, work);
>>>> +	struct request *req = blk_mq_rq_from_pdu(nvme_rq);
>>>> +
>>>> +	nvme_complete_rq(req);
>>>> +}
>>>> +
>>>> +
>>>>  void nvme_complete_rq(struct request *req)  {
>>>> -	blk_status_t status = nvme_error_status(req);
>>>> +	blk_status_t status;
>>>> +	int cpu;
>>>> +	u64 switches;
>>>> +	struct nvme_request *nvme_rq;
>>>> +
>>>> +	if (!in_interrupt())
>>>> +		goto skip_check;
>>>> +
>>>> +	nvme_rq = nvme_req(req);
>>>> +	cpu = smp_processor_id();
>>>> +	if (idle_cpu(cpu))
>>>> +		goto skip_check;
>>>> +
>>>> +	/* Check if this CPU is flooded with interrupts */
>>>> +	switches = get_cpu_rq_switches(cpu);
>>>> +	if (this_cpu_read(last_switch) == switches) {
>>>> +		/*
>>>> +		 * If this CPU hasn't made a context switch in
>>>> +		 * MAX_SCHED_TIMEOUT ns (and it's not idle), schedule a
>>>work to
>>>> +		 * complete this I/O. This forces this CPU run non-interrupt
>>>> +		 * code and throttle the other CPU issuing the I/O
>>>> +		 */
>>>
>>>What if there was only a single task on that CPU? Then we'd never
>>>need/want to context switch in the first place.
>>>
>>>AFAICT all this is just a whole bunch of gruesome hacks piled on top one
>>>another.
>>>
>>>> +		if (sched_clock() - this_cpu_read(last_clock)
>>>> +				> MAX_SCHED_TIMEOUT) {
>>>> +			INIT_WORK(&nvme_rq->work,
>>>nvme_complete_rq_work);
>>>> +			schedule_work_on(cpu, &nvme_rq->work);
>>>> +			return;
>>>> +		}
>>>> +
>>>> +	} else {
>>>> +		this_cpu_write(last_switch, switches);
>>>> +		this_cpu_write(last_clock, sched_clock());
>>>> +	}
>>>> +
>>>> +skip_check:
>>>
>>>Aside from everything else; this is just sodding poor coding style. What is
>>>wrong with something like:
>>>
>>>	if (nvme_complete_throttle(...))
>>>		return;
>>>
>>>> +	status = nvme_error_status(req);
>>>>
>>>>  	trace_nvme_complete_rq(req);
>>>>


* RE: [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts
  2019-08-20 17:33   ` Sagi Grimberg
@ 2019-08-21  8:39     ` Long Li
  2019-08-21 17:36       ` Long Li
  2019-08-23  3:21     ` Ming Lei
  1 sibling, 1 reply; 30+ messages in thread
From: Long Li @ 2019-08-21  8:39 UTC (permalink / raw)
  To: Sagi Grimberg, longli, Ingo Molnar, Peter Zijlstra, Keith Busch,
	Jens Axboe, Christoph Hellwig, linux-nvme, linux-kernel

>>>Subject: Re: [PATCH 3/3] nvme: complete request in work queue on CPU
>>>with flooded interrupts
>>>
>>>
>>>> From: Long Li <longli@microsoft.com>
>>>>
>>>> When a NVMe hardware queue is mapped to several CPU queues, it is
>>>> possible that the CPU this hardware queue is bound to is flooded by
>>>> returning I/O for other CPUs.
>>>>
>>>> For example, consider the following scenario:
>>>> 1. CPU 0, 1, 2 and 3 share the same hardware queue 2. the hardware
>>>> queue interrupts CPU 0 for I/O response 3. processes from CPU 1, 2 and
>>>> 3 keep sending I/Os
>>>>
>>>> CPU 0 may be flooded with interrupts from NVMe device that are I/O
>>>> responses for CPU 1, 2 and 3. Under heavy I/O load, it is possible
>>>> that CPU 0 spends all the time serving NVMe and other system
>>>> interrupts, but doesn't have a chance to run in process context.
>>>>
>>>> To fix this, CPU 0 can schedule a work to complete the I/O request
>>>> when it detects the scheduler is not making progress. This serves multiple
>>>purposes:
>>>>
>>>> 1. This CPU has to be scheduled to complete the request. The other
>>>> CPUs can't issue more I/Os until some previous I/Os are completed.
>>>> This helps this CPU get out of NVMe interrupts.
>>>>
>>>> 2. This acts a throttling mechanisum for NVMe devices, in that it can
>>>> not starve a CPU while servicing I/Os from other CPUs.
>>>>
>>>> 3. This CPU can make progress on RCU and other work items on its queue.
>>>
>>>The problem is indeed real, but this is the wrong approach in my mind.
>>>
>>>We already have irqpoll which takes care proper budgeting polling cycles
>>>and not hogging the cpu.
>>>
>>>I've sent rfc for this particular problem before [1]. At the time IIRC,
>>>Christoph suggested that we will poll the first batch directly from the irq
>>>context and reap the rest in irqpoll handler.

Thanks for the pointer. I will test and report back.

>>>
>>>[1]:
>>>http://lists.infradead.org/pipermail/linux-nvme/2016-October/006497.html
>>>
>>>How about something like this instead:
>>>--
>>>diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index
>>>71127a366d3c..84bf16d75109 100644
>>>--- a/drivers/nvme/host/pci.c
>>>+++ b/drivers/nvme/host/pci.c
>>>@@ -24,6 +24,7 @@
>>>  #include <linux/io-64-nonatomic-lo-hi.h>
>>>  #include <linux/sed-opal.h>
>>>  #include <linux/pci-p2pdma.h>
>>>+#include <linux/irq_poll.h>
>>>
>>>  #include "trace.h"
>>>  #include "nvme.h"
>>>@@ -32,6 +33,7 @@
>>>  #define CQ_SIZE(q)     ((q)->q_depth * sizeof(struct nvme_completion))
>>>
>>>  #define SGES_PER_PAGE  (PAGE_SIZE / sizeof(struct nvme_sgl_desc))
>>>+#define NVME_POLL_BUDGET_IRQ   256
>>>
>>>  /*
>>>   * These can be higher, but we need to ensure that any command doesn't
>>>@@ -189,6 +191,7 @@ struct nvme_queue {
>>>         u32 *dbbuf_cq_db;
>>>         u32 *dbbuf_sq_ei;
>>>         u32 *dbbuf_cq_ei;
>>>+       struct irq_poll iop;
>>>         struct completion delete_done;
>>>  };
>>>
>>>@@ -1015,6 +1018,23 @@ static inline int nvme_process_cq(struct
>>>nvme_queue *nvmeq, u16 *start,
>>>         return found;
>>>  }
>>>
>>>+static int nvme_irqpoll_handler(struct irq_poll *iop, int budget) {
>>>+       struct nvme_queue *nvmeq = container_of(iop, struct nvme_queue,
>>>iop);
>>>+       struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
>>>+       u16 start, end;
>>>+       int completed;
>>>+
>>>+       completed = nvme_process_cq(nvmeq, &start, &end, budget);
>>>+       nvme_complete_cqes(nvmeq, start, end);
>>>+       if (completed < budget) {
>>>+               irq_poll_complete(&nvmeq->iop);
>>>+               enable_irq(pci_irq_vector(pdev, nvmeq->cq_vector));
>>>+       }
>>>+
>>>+       return completed;
>>>+}
>>>+
>>>  static irqreturn_t nvme_irq(int irq, void *data)
>>>  {
>>>         struct nvme_queue *nvmeq = data; @@ -1028,12 +1048,16 @@ static
>>>irqreturn_t nvme_irq(int irq, void *data)
>>>         rmb();
>>>         if (nvmeq->cq_head != nvmeq->last_cq_head)
>>>                 ret = IRQ_HANDLED;
>>>-       nvme_process_cq(nvmeq, &start, &end, -1);
>>>+       nvme_process_cq(nvmeq, &start, &end, NVME_POLL_BUDGET_IRQ);
>>>         nvmeq->last_cq_head = nvmeq->cq_head;
>>>         wmb();
>>>
>>>         if (start != end) {
>>>                 nvme_complete_cqes(nvmeq, start, end);
>>>+               if (nvme_cqe_pending(nvmeq)) {
>>>+                       disable_irq_nosync(irq);
>>>+                       irq_poll_sched(&nvmeq->iop);
>>>+               }
>>>                 return IRQ_HANDLED;
>>>         }
>>>
>>>@@ -1347,6 +1371,7 @@ static enum blk_eh_timer_return
>>>nvme_timeout(struct request *req, bool reserved)
>>>
>>>  static void nvme_free_queue(struct nvme_queue *nvmeq)
>>>  {
>>>+       irq_poll_disable(&nvmeq->iop);
>>>         dma_free_coherent(nvmeq->dev->dev, CQ_SIZE(nvmeq),
>>>                                 (void *)nvmeq->cqes, nvmeq->cq_dma_addr);
>>>         if (!nvmeq->sq_cmds)
>>>@@ -1481,6 +1506,7 @@ static int nvme_alloc_queue(struct nvme_dev
>>>*dev, int qid, int depth)
>>>         nvmeq->dev = dev;
>>>         spin_lock_init(&nvmeq->sq_lock);
>>>         spin_lock_init(&nvmeq->cq_poll_lock);
>>>+       irq_poll_init(&nvmeq->iop, NVME_POLL_BUDGET_IRQ,
>>>nvme_irqpoll_handler);
>>>         nvmeq->cq_head = 0;
>>>         nvmeq->cq_phase = 1;
>>>         nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
>>>--


* Re: [PATCH 0/3] fix interrupt swamp in NVMe
  2019-08-21  7:47     ` Long Li
@ 2019-08-21  9:44       ` Ming Lei
  2019-08-21 10:03         ` John Garry
  2019-08-21 16:27         ` Long Li
  0 siblings, 2 replies; 30+ messages in thread
From: Ming Lei @ 2019-08-21  9:44 UTC (permalink / raw)
  To: Long Li
  Cc: John Garry, Ming Lei, longli, Jens Axboe, Sagi Grimberg,
	chenxiang, Peter Zijlstra, Linux Kernel Mailing List, linux-nvme,
	Keith Busch, Ingo Molnar, Thomas Gleixner, Christoph Hellwig

On Wed, Aug 21, 2019 at 07:47:44AM +0000, Long Li wrote:
> >>>Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe
> >>>
> >>>On 20/08/2019 09:25, Ming Lei wrote:
> >>>> On Tue, Aug 20, 2019 at 2:14 PM <longli@linuxonhyperv.com> wrote:
> >>>>>
> >>>>> From: Long Li <longli@microsoft.com>
> >>>>>
> >>>>> This patch set tries to fix interrupt swamp in NVMe devices.
> >>>>>
> >>>>> On large systems with many CPUs, a number of CPUs may share one
> >>>NVMe
> >>>>> hardware queue. It may have this situation where several CPUs are
> >>>>> issuing I/Os, and all the I/Os are returned on the CPU where the
> >>>hardware queue is bound to.
> >>>>> This may result in that CPU swamped by interrupts and stay in
> >>>>> interrupt mode for extended time while other CPUs continue to issue
> >>>>> I/O. This can trigger Watchdog and RCU timeout, and make the system
> >>>unresponsive.
> >>>>>
> >>>>> This patch set addresses this by enforcing scheduling and throttling
> >>>>> I/O when CPU is starved in this situation.
> >>>>>
> >>>>> Long Li (3):
> >>>>>   sched: define a function to report the number of context switches on a
> >>>>>     CPU
> >>>>>   sched: export idle_cpu()
> >>>>>   nvme: complete request in work queue on CPU with flooded interrupts
> >>>>>
> >>>>>  drivers/nvme/host/core.c | 57
> >>>>> +++++++++++++++++++++++++++++++++++++++-
> >>>>>  drivers/nvme/host/nvme.h |  1 +
> >>>>>  include/linux/sched.h    |  2 ++
> >>>>>  kernel/sched/core.c      |  7 +++++
> >>>>>  4 files changed, 66 insertions(+), 1 deletion(-)
> >>>>
> >>>> Another simpler solution may be to complete request in threaded
> >>>> interrupt handler for this case. Meantime allow scheduler to run the
> >>>> interrupt thread handler on CPUs specified by the irq affinity mask,
> >>>> which was discussed by the following link:
> >>>>
> >>>>
> >>>> https://lore.kernel.org/lkml/e0e9478e-62a5-ca24-3b12-58f7d056383e@huawei.com/
> >>>>
> >>>> Could you try the above solution and see if the lockup can be avoided?
> >>>> John Garry
> >>>> should have workable patch.
> >>>
> >>>Yeah, so we experimented with changing the interrupt handling in the SCSI
> >>>driver I maintain to use a threaded handler IRQ handler plus patch below,
> >>>and saw a significant throughput boost:
> >>>
> >>>--->8
> >>>
> >>>Subject: [PATCH] genirq: Add support to allow thread to use hard irq affinity
> >>>
> >>>Currently the cpu allowed mask for the threaded part of a threaded irq
> >>>handler will be set to the effective affinity of the hard irq.
> >>>
> >>>Typically the effective affinity of the hard irq will be for a single cpu. As such,
> >>>the threaded handler would always run on the same cpu as the hard irq.
> >>>
> >>>We have seen scenarios in high data-rate throughput testing that the cpu
> >>>handling the interrupt can be totally saturated handling both the hard
> >>>interrupt and threaded handler parts, limiting throughput.
> >>>
> >>>Add IRQF_IRQ_AFFINITY flag to allow the driver requesting the threaded
> >>>interrupt to decide on the policy of which cpu the threaded handler may run.
> >>>
> >>>Signed-off-by: John Garry <john.garry@huawei.com>
> 
> Thanks for pointing me to this patch. This fixed the interrupt swamp and make the system stable.
> 
> However I'm seeing reduced performance when using threaded interrupts.
> 
> Here are the test results on a system with 80 CPUs and 10 NVMe disks (32 hardware queues for each disk)
> Benchmark tool is FIO, I/O pattern: 4k random reads on all NVMe disks, with queue depth = 64, num of jobs = 80, direct=1
> 
> With threaded interrupts: 1320k IOPS
> With just interrupts: 3720k IOPS
> With just interrupts and my patch: 3700k IOPS

This gap looks too big wrt. threaded interrupts vs. interrupts.

> 
> At the peak IOPS, the overall CPU usage is at around 98-99%. I think the cost of doing wake up and context switch for NVMe threaded IRQ handler takes some CPU away.
> 

In theory, it shouldn't be so, because most of the time the thread should be
running on the CPUs of this hctx, and the wakeup cost shouldn't be that big.
Maybe there is a performance problem somewhere wrt. threaded interrupts.

Could you share us your test script and environment? I will see if I can
reproduce it in my environment.

> In this test, I made the following change to make use of IRQF_IRQ_AFFINITY for NVMe:
> 
> diff --git a/drivers/pci/irq.c b/drivers/pci/irq.c
> index a1de501a2729..3fb30d16464e 100644
> --- a/drivers/pci/irq.c
> +++ b/drivers/pci/irq.c
> @@ -86,7 +86,7 @@ int pci_request_irq(struct pci_dev *dev, unsigned int nr, irq_handler_t handler,
>         va_list ap;
>         int ret;
>         char *devname;
> -       unsigned long irqflags = IRQF_SHARED;
> +       unsigned long irqflags = IRQF_SHARED | IRQF_IRQ_AFFINITY;
> 
>         if (!handler)
>                 irqflags |= IRQF_ONESHOT;
> 

I don't see why IRQF_IRQ_AFFINITY is needed.

John, could you explain it a bit why you need changes on IRQF_IRQ_AFFINITY? 

The following patch should be enough:

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index e8f7f179bf77..1e7cffc1c20c 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -968,7 +968,11 @@ irq_thread_check_affinity(struct irq_desc *desc, struct irqaction *action)
 	if (cpumask_available(desc->irq_common_data.affinity)) {
 		const struct cpumask *m;
 
-		m = irq_data_get_effective_affinity_mask(&desc->irq_data);
+		if (irqd_affinity_is_managed(&desc->irq_data))
+			m = desc->irq_common_data.affinity;
+		else
+			m = irq_data_get_effective_affinity_mask(
+					&desc->irq_data);
 		cpumask_copy(mask, m);
 	} else {
 		valid = false;


Thanks,
Ming

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/3] fix interrupt swamp in NVMe
  2019-08-21  9:44       ` Ming Lei
@ 2019-08-21 10:03         ` John Garry
  2019-08-21 16:27         ` Long Li
  1 sibling, 0 replies; 30+ messages in thread
From: John Garry @ 2019-08-21 10:03 UTC (permalink / raw)
  To: Ming Lei, Long Li
  Cc: Ming Lei, longli, Jens Axboe, Sagi Grimberg, chenxiang,
	Peter Zijlstra, Linux Kernel Mailing List, linux-nvme,
	Keith Busch, Ingo Molnar, Thomas Gleixner, Christoph Hellwig

On 21/08/2019 10:44, Ming Lei wrote:
> On Wed, Aug 21, 2019 at 07:47:44AM +0000, Long Li wrote:
>>>>> Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe
>>>>>
>>>>> On 20/08/2019 09:25, Ming Lei wrote:
>>>>>> On Tue, Aug 20, 2019 at 2:14 PM <longli@linuxonhyperv.com> wrote:
>>>>>>>
>>>>>>> From: Long Li <longli@microsoft.com>
>>>>>>>
>>>>>>> This patch set tries to fix interrupt swamp in NVMe devices.
>>>>>>>
>>>>>>> On large systems with many CPUs, a number of CPUs may share one
>>>>> NVMe
>>>>>>> hardware queue. It may have this situation where several CPUs are
>>>>>>> issuing I/Os, and all the I/Os are returned on the CPU where the
>>>>> hardware queue is bound to.
>>>>>>> This may result in that CPU swamped by interrupts and stay in
>>>>>>> interrupt mode for extended time while other CPUs continue to issue
>>>>>>> I/O. This can trigger Watchdog and RCU timeout, and make the system
>>>>> unresponsive.
>>>>>>>
>>>>>>> This patch set addresses this by enforcing scheduling and throttling
>>>>>>> I/O when CPU is starved in this situation.
>>>>>>>
>>>>>>> Long Li (3):
>>>>>>>   sched: define a function to report the number of context switches on a
>>>>>>>     CPU
>>>>>>>   sched: export idle_cpu()
>>>>>>>   nvme: complete request in work queue on CPU with flooded interrupts
>>>>>>>
>>>>>>>  drivers/nvme/host/core.c | 57
>>>>>>> +++++++++++++++++++++++++++++++++++++++-
>>>>>>>  drivers/nvme/host/nvme.h |  1 +
>>>>>>>  include/linux/sched.h    |  2 ++
>>>>>>>  kernel/sched/core.c      |  7 +++++
>>>>>>>  4 files changed, 66 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> Another simpler solution may be to complete request in threaded
>>>>>> interrupt handler for this case. Meantime allow scheduler to run the
>>>>>> interrupt thread handler on CPUs specified by the irq affinity mask,
>>>>>> which was discussed by the following link:
>>>>>>
>>>>>>
>>>>>> https://lore.kernel.org/lkml/e0e9478e-62a5-ca24-3b12-58f7d056383e@huawei.com/
>>>>>>
>>>>>> Could you try the above solution and see if the lockup can be avoided?
>>>>>> John Garry
>>>>>> should have workable patch.
>>>>>
>>>>> Yeah, so we experimented with changing the interrupt handling in the SCSI
>>>>> driver I maintain to use a threaded handler IRQ handler plus patch below,
>>>>> and saw a significant throughput boost:
>>>>>
>>>>> --->8
>>>>>
>>>>> Subject: [PATCH] genirq: Add support to allow thread to use hard irq affinity
>>>>>
>>>>> Currently the cpu allowed mask for the threaded part of a threaded irq
>>>>> handler will be set to the effective affinity of the hard irq.
>>>>>
>>>>> Typically the effective affinity of the hard irq will be for a single cpu. As such,
>>>>> the threaded handler would always run on the same cpu as the hard irq.
>>>>>
>>>>> We have seen scenarios in high data-rate throughput testing that the cpu
>>>>> handling the interrupt can be totally saturated handling both the hard
>>>>> interrupt and threaded handler parts, limiting throughput.
>>>>>
>>>>> Add IRQF_IRQ_AFFINITY flag to allow the driver requesting the threaded
>>>>> interrupt to decide on the policy of which cpu the threaded handler may run.
>>>>>
>>>>> Signed-off-by: John Garry <john.garry@huawei.com>
>>
>> Thanks for pointing me to this patch. This fixed the interrupt swamp and make the system stable.
>>
>> However I'm seeing reduced performance when using threaded interrupts.
>>
>> Here are the test results on a system with 80 CPUs and 10 NVMe disks (32 hardware queues for each disk)
>> Benchmark tool is FIO, I/O pattern: 4k random reads on all NVMe disks, with queue depth = 64, num of jobs = 80, direct=1
>>
>> With threaded interrupts: 1320k IOPS
>> With just interrupts: 3720k IOPS
>> With just interrupts and my patch: 3700k IOPS
>
> This gap looks too big wrt. threaded interrupts vs. interrupts.
>
>>
>> At the peak IOPS, the overall CPU usage is at around 98-99%. I think the cost of doing wake up and context switch for NVMe threaded IRQ handler takes some CPU away.
>>
>
> In theory, it shouldn't be so because most of times the thread should be running
> on CPUs of this hctx, and the wakeup cost shouldn't be so big. Maybe there is
> performance problem somewhere wrt. threaded interrupt.
>
> Could you share us your test script and environment? I will see if I can
> reproduce it in my environment.
>
>> In this test, I made the following change to make use of IRQF_IRQ_AFFINITY for NVMe:
>>
>> diff --git a/drivers/pci/irq.c b/drivers/pci/irq.c
>> index a1de501a2729..3fb30d16464e 100644
>> --- a/drivers/pci/irq.c
>> +++ b/drivers/pci/irq.c
>> @@ -86,7 +86,7 @@ int pci_request_irq(struct pci_dev *dev, unsigned int nr, irq_handler_t handler,
>>         va_list ap;
>>         int ret;
>>         char *devname;
>> -       unsigned long irqflags = IRQF_SHARED;
>> +       unsigned long irqflags = IRQF_SHARED | IRQF_IRQ_AFFINITY;
>>
>>         if (!handler)
>>                 irqflags |= IRQF_ONESHOT;
>>
>
> I don't see why IRQF_IRQ_AFFINITY is needed.
>
> John, could you explain it a bit why you need changes on IRQF_IRQ_AFFINITY?

Hi Ming,

The patch I shared was my original solution, based on the driver setting the
IRQF_IRQ_AFFINITY flag to request that the threaded handler use the irq
affinity mask as the handler's cpu allowed mask.
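
Just to illustrate, here is a minimal sketch of how a driver could opt in with
that original flag-based approach (IRQF_IRQ_AFFINITY is the flag from the patch
quoted above; struct my_dev, my_hard_handler() and my_thread_fn() are
placeholders, not real driver code):

#include <linux/interrupt.h>

/* Hypothetical driver setup; handler implementations omitted. */
static int my_setup_irq(struct my_dev *md, int irq)
{
	/*
	 * IRQF_IRQ_AFFINITY (from the patch above) asks the irq core to
	 * let the threaded handler run on any CPU in the irq affinity
	 * mask, instead of following the hard irq's effective (single
	 * CPU) affinity.
	 */
	return request_threaded_irq(irq, my_hard_handler, my_thread_fn,
				    IRQF_SHARED | IRQF_IRQ_AFFINITY,
				    "my-dev", md);
}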

If we want to make this decision based only on whether the irq is 
managed or not, then we can drop the flag and just make the change as 
you suggest, below.

Thanks,
John

>
> The following patch should be enough:
>
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index e8f7f179bf77..1e7cffc1c20c 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -968,7 +968,11 @@ irq_thread_check_affinity(struct irq_desc *desc, struct irqaction *action)
>  	if (cpumask_available(desc->irq_common_data.affinity)) {
>  		const struct cpumask *m;
>
> -		m = irq_data_get_effective_affinity_mask(&desc->irq_data);
> +		if (irqd_affinity_is_managed(&desc->irq_data))
> +			m = desc->irq_common_data.affinity;
> +		else
> +			m = irq_data_get_effective_affinity_mask(
> +					&desc->irq_data);
>  		cpumask_copy(mask, m);
>  	} else {
>  		valid = false;
>
>
> Thanks,
> Ming
>
> .
>



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/3] sched: define a function to report the number of context switches on a CPU
  2019-08-21  8:20     ` Long Li
@ 2019-08-21 10:34       ` Peter Zijlstra
  0 siblings, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2019-08-21 10:34 UTC (permalink / raw)
  To: Long Li
  Cc: longli, Ingo Molnar, Keith Busch, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, linux-nvme, linux-kernel

On Wed, Aug 21, 2019 at 08:20:48AM +0000, Long Li wrote:
> >>>Subject: Re: [PATCH 1/3] sched: define a function to report the number of
> >>>context switches on a CPU
> >>>
> >>>On Mon, Aug 19, 2019 at 11:14:27PM -0700, longli@linuxonhyperv.com
> >>>wrote:
> >>>> From: Long Li <longli@microsoft.com>
> >>>>
> >>>> The number of context switches on a CPU is useful to determine how
> >>>> busy this CPU is on processing IRQs. Export this information so it can
> >>>> be used by device drivers.
> >>>
> >>>Please do explain that; because I'm not seeing how number of switches
> >>>relates to processing IRQs _at_all_!
> 
> Some kernel components rely on context switch to progress, for example
> watchdog and RCU. On a CPU with reasonable interrupt load, it
> continues to make context switches, normally a number of switches per
> seconds. 

That isn't true; RCU is perfectly fine with a single task always running
and not making context switches, and so is the watchdog.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts
  2019-08-21  8:37     ` Long Li
@ 2019-08-21 10:35       ` Peter Zijlstra
  0 siblings, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2019-08-21 10:35 UTC (permalink / raw)
  To: Long Li
  Cc: longli, Ingo Molnar, Keith Busch, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, linux-nvme, linux-kernel

On Wed, Aug 21, 2019 at 08:37:55AM +0000, Long Li wrote:
> >>>Subject: Re: [PATCH 3/3] nvme: complete request in work queue on CPU
> >>>with flooded interrupts
> >>>
> >>>On Mon, Aug 19, 2019 at 11:14:29PM -0700, longli@linuxonhyperv.com
> >>>wrote:
> >>>> From: Long Li <longli@microsoft.com>
> >>>>
> >>>> When a NVMe hardware queue is mapped to several CPU queues, it is
> >>>> possible that the CPU this hardware queue is bound to is flooded by
> >>>> returning I/O for other CPUs.
> >>>>
> >>>> For example, consider the following scenario:
> >>>> 1. CPU 0, 1, 2 and 3 share the same hardware queue 2. the hardware
> >>>> queue interrupts CPU 0 for I/O response 3. processes from CPU 1, 2 and
> >>>> 3 keep sending I/Os
> >>>>
> >>>> CPU 0 may be flooded with interrupts from NVMe device that are I/O
> >>>> responses for CPU 1, 2 and 3. Under heavy I/O load, it is possible
> >>>> that CPU 0 spends all the time serving NVMe and other system
> >>>> interrupts, but doesn't have a chance to run in process context.
> >>>
> >>>Ideally -- and there is some code to affect this, the load-balancer will move
> >>>tasks away from this CPU.
> >>>
> >>>> To fix this, CPU 0 can schedule a work to complete the I/O request
> >>>> when it detects the scheduler is not making progress. This serves multiple
> >>>purposes:
> >>>
> >>>Suppose the task waiting for the IO completion is a RT task, and you've just
> >>>queued it to a regular work. This is an instant priority inversion.
> 
> This is a choice. We can either not "lock up" the CPU, or finish the I/O on time from IRQ handler. I think throttling only happens in extreme conditions, which is rare. The purpose is to make the whole system responsive and happy.

Can you please use a sane MUA.. this is unreadable garbage.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH 0/3] fix interrupt swamp in NVMe
  2019-08-21  9:44       ` Ming Lei
  2019-08-21 10:03         ` John Garry
@ 2019-08-21 16:27         ` Long Li
  2019-08-22  1:33           ` Ming Lei
  1 sibling, 1 reply; 30+ messages in thread
From: Long Li @ 2019-08-21 16:27 UTC (permalink / raw)
  To: Ming Lei
  Cc: John Garry, Ming Lei, longli, Jens Axboe, Sagi Grimberg,
	chenxiang, Peter Zijlstra, Linux Kernel Mailing List, linux-nvme,
	Keith Busch, Ingo Molnar, Thomas Gleixner, Christoph Hellwig

>>>Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe
>>>
>>>On Wed, Aug 21, 2019 at 07:47:44AM +0000, Long Li wrote:
>>>> >>>Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe
>>>> >>>
>>>> >>>On 20/08/2019 09:25, Ming Lei wrote:
>>>> >>>> On Tue, Aug 20, 2019 at 2:14 PM <longli@linuxonhyperv.com> wrote:
>>>> >>>>>
>>>> >>>>> From: Long Li <longli@microsoft.com>
>>>> >>>>>
>>>> >>>>> This patch set tries to fix interrupt swamp in NVMe devices.
>>>> >>>>>
>>>> >>>>> On large systems with many CPUs, a number of CPUs may share
>>>one
>>>> >>>NVMe
>>>> >>>>> hardware queue. It may have this situation where several CPUs
>>>> >>>>> are issuing I/Os, and all the I/Os are returned on the CPU where
>>>> >>>>> the
>>>> >>>hardware queue is bound to.
>>>> >>>>> This may result in that CPU swamped by interrupts and stay in
>>>> >>>>> interrupt mode for extended time while other CPUs continue to
>>>> >>>>> issue I/O. This can trigger Watchdog and RCU timeout, and make
>>>> >>>>> the system
>>>> >>>unresponsive.
>>>> >>>>>
>>>> >>>>> This patch set addresses this by enforcing scheduling and
>>>> >>>>> throttling I/O when CPU is starved in this situation.
>>>> >>>>>
>>>> >>>>> Long Li (3):
>>>> >>>>>   sched: define a function to report the number of context switches
>>>on a
>>>> >>>>>     CPU
>>>> >>>>>   sched: export idle_cpu()
>>>> >>>>>   nvme: complete request in work queue on CPU with flooded
>>>> >>>>> interrupts
>>>> >>>>>
>>>> >>>>>  drivers/nvme/host/core.c | 57
>>>> >>>>> +++++++++++++++++++++++++++++++++++++++-
>>>> >>>>>  drivers/nvme/host/nvme.h |  1 +
>>>> >>>>>  include/linux/sched.h    |  2 ++
>>>> >>>>>  kernel/sched/core.c      |  7 +++++
>>>> >>>>>  4 files changed, 66 insertions(+), 1 deletion(-)
>>>> >>>>
>>>> >>>> Another simpler solution may be to complete request in threaded
>>>> >>>> interrupt handler for this case. Meantime allow scheduler to run
>>>> >>>> the interrupt thread handler on CPUs specified by the irq
>>>> >>>> affinity mask, which was discussed by the following link:
>>>> >>>>
>>>> >>>>
>>>> >>>> https://lore.kernel.org/lkml/e0e9478e-62a5-ca24-3b12-58f7d056383e@huawei.com/
>>>> >>>>
>>>> >>>> Could you try the above solution and see if the lockup can be
>>>avoided?
>>>> >>>> John Garry
>>>> >>>> should have workable patch.
>>>> >>>
>>>> >>>Yeah, so we experimented with changing the interrupt handling in
>>>> >>>the SCSI driver I maintain to use a threaded handler IRQ handler
>>>> >>>plus patch below, and saw a significant throughput boost:
>>>> >>>
>>>> >>>--->8
>>>> >>>
>>>> >>>Subject: [PATCH] genirq: Add support to allow thread to use hard
>>>> >>>irq affinity
>>>> >>>
>>>> >>>Currently the cpu allowed mask for the threaded part of a threaded
>>>> >>>irq handler will be set to the effective affinity of the hard irq.
>>>> >>>
>>>> >>>Typically the effective affinity of the hard irq will be for a
>>>> >>>single cpu. As such, the threaded handler would always run on the
>>>same cpu as the hard irq.
>>>> >>>
>>>> >>>We have seen scenarios in high data-rate throughput testing that
>>>> >>>the cpu handling the interrupt can be totally saturated handling
>>>> >>>both the hard interrupt and threaded handler parts, limiting
>>>throughput.
>>>> >>>
>>>> >>>Add IRQF_IRQ_AFFINITY flag to allow the driver requesting the
>>>> >>>threaded interrupt to decide on the policy of which cpu the threaded
>>>handler may run.
>>>> >>>
>>>> >>>Signed-off-by: John Garry <john.garry@huawei.com>
>>>>
>>>> Thanks for pointing me to this patch. This fixed the interrupt swamp and
>>>make the system stable.
>>>>
>>>> However I'm seeing reduced performance when using threaded
>>>interrupts.
>>>>
>>>> Here are the test results on a system with 80 CPUs and 10 NVMe disks
>>>> (32 hardware queues for each disk) Benchmark tool is FIO, I/O pattern:
>>>> 4k random reads on all NVMe disks, with queue depth = 64, num of jobs
>>>> = 80, direct=1
>>>>
>>>> With threaded interrupts: 1320k IOPS
>>>> With just interrupts: 3720k IOPS
>>>> With just interrupts and my patch: 3700k IOPS
>>>
>>>This gap looks too big wrt. threaded interrupts vs. interrupts.
>>>
>>>>
>>>> At the peak IOPS, the overall CPU usage is at around 98-99%. I think the
>>>cost of doing wake up and context switch for NVMe threaded IRQ handler
>>>takes some CPU away.
>>>>
>>>
>>>In theory, it shouldn't be so because most of times the thread should be
>>>running on CPUs of this hctx, and the wakeup cost shouldn't be so big.
>>>Maybe there is performance problem somewhere wrt. threaded interrupt.
>>>
>>>Could you share us your test script and environment? I will see if I can
>>>reproduce it in my environment.

Ming, do you have access to L80s_v2 in Azure? This test needs to run on that VM size.

Here is the command to benchmark it:

fio --bs=4k --ioengine=libaio --iodepth=128 --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1 --direct=1 --runtime=120 --numjobs=80 --rw=randread --name=test --group_reporting --gtod_reduce=1

Thanks

Long

>>>
>>>> In this test, I made the following change to make use of
>>>IRQF_IRQ_AFFINITY for NVMe:
>>>>
>>>> diff --git a/drivers/pci/irq.c b/drivers/pci/irq.c index
>>>> a1de501a2729..3fb30d16464e 100644
>>>> --- a/drivers/pci/irq.c
>>>> +++ b/drivers/pci/irq.c
>>>> @@ -86,7 +86,7 @@ int pci_request_irq(struct pci_dev *dev, unsigned int
>>>nr, irq_handler_t handler,
>>>>         va_list ap;
>>>>         int ret;
>>>>         char *devname;
>>>> -       unsigned long irqflags = IRQF_SHARED;
>>>> +       unsigned long irqflags = IRQF_SHARED | IRQF_IRQ_AFFINITY;
>>>>
>>>>         if (!handler)
>>>>                 irqflags |= IRQF_ONESHOT;
>>>>
>>>
>>>I don't see why IRQF_IRQ_AFFINITY is needed.
>>>
>>>John, could you explain it a bit why you need changes on
>>>IRQF_IRQ_AFFINITY?
>>>
>>>The following patch should be enough:
>>>
>>>diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index
>>>e8f7f179bf77..1e7cffc1c20c 100644
>>>--- a/kernel/irq/manage.c
>>>+++ b/kernel/irq/manage.c
>>>@@ -968,7 +968,11 @@ irq_thread_check_affinity(struct irq_desc *desc,
>>>struct irqaction *action)
>>> 	if (cpumask_available(desc->irq_common_data.affinity)) {
>>> 		const struct cpumask *m;
>>>
>>>-		m = irq_data_get_effective_affinity_mask(&desc-
>>>>irq_data);
>>>+		if (irqd_affinity_is_managed(&desc->irq_data))
>>>+			m = desc->irq_common_data.affinity;
>>>+		else
>>>+			m = irq_data_get_effective_affinity_mask(
>>>+					&desc->irq_data);
>>> 		cpumask_copy(mask, m);
>>> 	} else {
>>> 		valid = false;
>>>
>>>
>>>Thanks,
>>>Ming

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts
  2019-08-21  8:39     ` Long Li
@ 2019-08-21 17:36       ` Long Li
  2019-08-21 21:54         ` Sagi Grimberg
  0 siblings, 1 reply; 30+ messages in thread
From: Long Li @ 2019-08-21 17:36 UTC (permalink / raw)
  To: Sagi Grimberg, longli, Ingo Molnar, Peter Zijlstra, Keith Busch,
	Jens Axboe, Christoph Hellwig, linux-nvme, linux-kernel

>>>Subject: RE: [PATCH 3/3] nvme: complete request in work queue on CPU
>>>with flooded interrupts
>>>
>>>>>>Subject: Re: [PATCH 3/3] nvme: complete request in work queue on
>>>CPU
>>>>>>with flooded interrupts
>>>>>>
>>>>>>
>>>>>>> From: Long Li <longli@microsoft.com>
>>>>>>>
>>>>>>> When a NVMe hardware queue is mapped to several CPU queues, it is
>>>>>>> possible that the CPU this hardware queue is bound to is flooded by
>>>>>>> returning I/O for other CPUs.
>>>>>>>
>>>>>>> For example, consider the following scenario:
>>>>>>> 1. CPU 0, 1, 2 and 3 share the same hardware queue 2. the hardware
>>>>>>> queue interrupts CPU 0 for I/O response 3. processes from CPU 1, 2
>>>>>>> and
>>>>>>> 3 keep sending I/Os
>>>>>>>
>>>>>>> CPU 0 may be flooded with interrupts from NVMe device that are I/O
>>>>>>> responses for CPU 1, 2 and 3. Under heavy I/O load, it is possible
>>>>>>> that CPU 0 spends all the time serving NVMe and other system
>>>>>>> interrupts, but doesn't have a chance to run in process context.
>>>>>>>
>>>>>>> To fix this, CPU 0 can schedule a work to complete the I/O request
>>>>>>> when it detects the scheduler is not making progress. This serves
>>>>>>> multiple
>>>>>>purposes:
>>>>>>>
>>>>>>> 1. This CPU has to be scheduled to complete the request. The other
>>>>>>> CPUs can't issue more I/Os until some previous I/Os are completed.
>>>>>>> This helps this CPU get out of NVMe interrupts.
>>>>>>>
>>>>>>> 2. This acts a throttling mechanisum for NVMe devices, in that it
>>>>>>> can not starve a CPU while servicing I/Os from other CPUs.
>>>>>>>
>>>>>>> 3. This CPU can make progress on RCU and other work items on its
>>>queue.
>>>>>>
>>>>>>The problem is indeed real, but this is the wrong approach in my mind.
>>>>>>
>>>>>>We already have irqpoll which takes care proper budgeting polling
>>>>>>cycles and not hogging the cpu.
>>>>>>
>>>>>>I've sent rfc for this particular problem before [1]. At the time
>>>>>>IIRC, Christoph suggested that we will poll the first batch directly
>>>>>>from the irq context and reap the rest in irqpoll handler.
>>>
>>>Thanks for the pointer. I will test and report back.

Sagi,

Here are the test results.

Benchmark command:
fio --bs=4k --ioengine=libaio --iodepth=64 --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1 --direct=1 --runtime=90 --numjobs=80 --rw=randread --name=test --group_reporting --gtod_reduce=1

With your patch: 1720k IOPS
With threaded interrupts: 1320k IOPS
With just interrupts: 3720k IOPS

Interrupts are the fastest but we need to find a way to throttle it.

Thanks

Long


>>>
>>>>>>
>>>>>>[1]:
>>>>>>http://lists.infradead.org/pipermail/linux-nvme/2016-October/006497.html
>>>>>>
>>>>>>How about something like this instead:
>>>>>>--
>>>>>>diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index
>>>>>>71127a366d3c..84bf16d75109 100644
>>>>>>--- a/drivers/nvme/host/pci.c
>>>>>>+++ b/drivers/nvme/host/pci.c
>>>>>>@@ -24,6 +24,7 @@
>>>>>>  #include <linux/io-64-nonatomic-lo-hi.h>
>>>>>>  #include <linux/sed-opal.h>
>>>>>>  #include <linux/pci-p2pdma.h>
>>>>>>+#include <linux/irq_poll.h>
>>>>>>
>>>>>>  #include "trace.h"
>>>>>>  #include "nvme.h"
>>>>>>@@ -32,6 +33,7 @@
>>>>>>  #define CQ_SIZE(q)     ((q)->q_depth * sizeof(struct nvme_completion))
>>>>>>
>>>>>>  #define SGES_PER_PAGE  (PAGE_SIZE / sizeof(struct nvme_sgl_desc))
>>>>>>+#define NVME_POLL_BUDGET_IRQ   256
>>>>>>
>>>>>>  /*
>>>>>>   * These can be higher, but we need to ensure that any command
>>>>>>doesn't @@ -189,6 +191,7 @@ struct nvme_queue {
>>>>>>         u32 *dbbuf_cq_db;
>>>>>>         u32 *dbbuf_sq_ei;
>>>>>>         u32 *dbbuf_cq_ei;
>>>>>>+       struct irq_poll iop;
>>>>>>         struct completion delete_done;  };
>>>>>>
>>>>>>@@ -1015,6 +1018,23 @@ static inline int nvme_process_cq(struct
>>>>>>nvme_queue *nvmeq, u16 *start,
>>>>>>         return found;
>>>>>>  }
>>>>>>
>>>>>>+static int nvme_irqpoll_handler(struct irq_poll *iop, int budget) {
>>>>>>+       struct nvme_queue *nvmeq = container_of(iop, struct
>>>>>>+nvme_queue,
>>>>>>iop);
>>>>>>+       struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
>>>>>>+       u16 start, end;
>>>>>>+       int completed;
>>>>>>+
>>>>>>+       completed = nvme_process_cq(nvmeq, &start, &end, budget);
>>>>>>+       nvme_complete_cqes(nvmeq, start, end);
>>>>>>+       if (completed < budget) {
>>>>>>+               irq_poll_complete(&nvmeq->iop);
>>>>>>+               enable_irq(pci_irq_vector(pdev, nvmeq->cq_vector));
>>>>>>+       }
>>>>>>+
>>>>>>+       return completed;
>>>>>>+}
>>>>>>+
>>>>>>  static irqreturn_t nvme_irq(int irq, void *data)
>>>>>>  {
>>>>>>         struct nvme_queue *nvmeq = data; @@ -1028,12 +1048,16 @@
>>>>>>static irqreturn_t nvme_irq(int irq, void *data)
>>>>>>         rmb();
>>>>>>         if (nvmeq->cq_head != nvmeq->last_cq_head)
>>>>>>                 ret = IRQ_HANDLED;
>>>>>>-       nvme_process_cq(nvmeq, &start, &end, -1);
>>>>>>+       nvme_process_cq(nvmeq, &start, &end,
>>>NVME_POLL_BUDGET_IRQ);
>>>>>>         nvmeq->last_cq_head = nvmeq->cq_head;
>>>>>>         wmb();
>>>>>>
>>>>>>         if (start != end) {
>>>>>>                 nvme_complete_cqes(nvmeq, start, end);
>>>>>>+               if (nvme_cqe_pending(nvmeq)) {
>>>>>>+                       disable_irq_nosync(irq);
>>>>>>+                       irq_poll_sched(&nvmeq->iop);
>>>>>>+               }
>>>>>>                 return IRQ_HANDLED;
>>>>>>         }
>>>>>>
>>>>>>@@ -1347,6 +1371,7 @@ static enum blk_eh_timer_return
>>>>>>nvme_timeout(struct request *req, bool reserved)
>>>>>>
>>>>>>  static void nvme_free_queue(struct nvme_queue *nvmeq)  {
>>>>>>+       irq_poll_disable(&nvmeq->iop);
>>>>>>         dma_free_coherent(nvmeq->dev->dev, CQ_SIZE(nvmeq),
>>>>>>                                 (void *)nvmeq->cqes, nvmeq->cq_dma_addr);
>>>>>>         if (!nvmeq->sq_cmds)
>>>>>>@@ -1481,6 +1506,7 @@ static int nvme_alloc_queue(struct nvme_dev
>>>>>>*dev, int qid, int depth)
>>>>>>         nvmeq->dev = dev;
>>>>>>         spin_lock_init(&nvmeq->sq_lock);
>>>>>>         spin_lock_init(&nvmeq->cq_poll_lock);
>>>>>>+       irq_poll_init(&nvmeq->iop, NVME_POLL_BUDGET_IRQ,
>>>>>>nvme_irqpoll_handler);
>>>>>>         nvmeq->cq_head = 0;
>>>>>>         nvmeq->cq_phase = 1;
>>>>>>         nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
>>>>>>--

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts
  2019-08-21 17:36       ` Long Li
@ 2019-08-21 21:54         ` Sagi Grimberg
  2019-08-24  0:13           ` Long Li
  0 siblings, 1 reply; 30+ messages in thread
From: Sagi Grimberg @ 2019-08-21 21:54 UTC (permalink / raw)
  To: Long Li, longli, Ingo Molnar, Peter Zijlstra, Keith Busch,
	Jens Axboe, Christoph Hellwig, linux-nvme, linux-kernel


> Sagi,
> 
> Here are the test results.
> 
> Benchmark command:
> fio --bs=4k --ioengine=libaio --iodepth=64 --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1 --direct=1 --runtime=90 --numjobs=80 --rw=randread --name=test --group_reporting --gtod_reduce=1
> 
> With your patch: 1720k IOPS
> With threaded interrupts: 1320k IOPS
> With just interrupts: 3720k IOPS
> 
> Interrupts are the fastest but we need to find a way to throttle it.

This is the workload that generates the flood?
If so, I did not expect the perf difference to be this big..

If completions keep pounding on the cpu, I would expect irqpoll
to simply keep running forever and poll the cqs. There is no
fundamental reason why polling would be faster in an interrupt;
what matters could be:
1. we reschedule more than we need to
2. we don't reap enough completions in every poll round, which
will trigger rearming the interrupt and then when it fires reschedule
another softirq...

Maybe we need to take care of some irq_poll optimizations?

Does this (untested) patch make any difference?
--
diff --git a/lib/irq_poll.c b/lib/irq_poll.c
index 2f17b488d58e..0e94183eba15 100644
--- a/lib/irq_poll.c
+++ b/lib/irq_poll.c
@@ -12,7 +12,8 @@
  #include <linux/irq_poll.h>
  #include <linux/delay.h>

-static unsigned int irq_poll_budget __read_mostly = 256;
+static unsigned int irq_poll_budget __read_mostly = 3000;
+unsigned int __read_mostly irqpoll_budget_usecs = 2000;

  static DEFINE_PER_CPU(struct list_head, blk_cpu_iopoll);

@@ -77,32 +78,26 @@ EXPORT_SYMBOL(irq_poll_complete);

  static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
  {
-       struct list_head *list = this_cpu_ptr(&blk_cpu_iopoll);
-       int rearm = 0, budget = irq_poll_budget;
-       unsigned long start_time = jiffies;
+       struct list_head *irqpoll_list = this_cpu_ptr(&blk_cpu_iopoll);
+       unsigned int budget = irq_poll_budget;
+       unsigned long time_limit =
+               jiffies + usecs_to_jiffies(irqpoll_budget_usecs);
+       LIST_HEAD(list);

         local_irq_disable();
+       list_splice_init(irqpoll_list, &list);
+       local_irq_enable();

-       while (!list_empty(list)) {
+       while (!list_empty(&list)) {
                 struct irq_poll *iop;
                 int work, weight;

-               /*
-                * If softirq window is exhausted then punt.
-                */
-               if (budget <= 0 || time_after(jiffies, start_time)) {
-                       rearm = 1;
-                       break;
-               }
-
-               local_irq_enable();
-
                 /* Even though interrupts have been re-enabled, this
                  * access is safe because interrupts can only add new
                  * entries to the tail of this list, and only ->poll()
                  * calls can remove this head entry from the list.
                  */
-               iop = list_entry(list->next, struct irq_poll, list);
+               iop = list_first_entry(&list, struct irq_poll, list);

                 weight = iop->weight;
                 work = 0;
@@ -111,8 +106,6 @@ static void __latent_entropy irq_poll_softirq(struct 
softirq_action *h)

                 budget -= work;

-               local_irq_disable();
-
                 /*
                  * Drivers must not modify the iopoll state, if they
                  * consume their assigned weight (or more, some drivers 
can't
@@ -125,11 +118,21 @@ static void __latent_entropy 
irq_poll_softirq(struct softirq_action *h)
                         if (test_bit(IRQ_POLL_F_DISABLE, &iop->state))
                                 __irq_poll_complete(iop);
                         else
-                               list_move_tail(&iop->list, list);
+                               list_move_tail(&iop->list, &list);
                 }
+
+               /*
+                * If softirq window is exhausted then punt.
+                */
+               if (budget <= 0 || time_after_eq(jiffies, time_limit))
+                       break;
         }

-       if (rearm)
+       local_irq_disable();
+
+       list_splice_tail_init(irqpoll_list, &list);
+       list_splice(&list, irqpoll_list);
+       if (!list_empty(irqpoll_list))
                 __raise_softirq_irqoff(IRQ_POLL_SOFTIRQ);

         local_irq_enable();
--

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/3] fix interrupt swamp in NVMe
  2019-08-21 16:27         ` Long Li
@ 2019-08-22  1:33           ` Ming Lei
  2019-08-22  2:00             ` Keith Busch
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2019-08-22  1:33 UTC (permalink / raw)
  To: Long Li
  Cc: Keith Busch, Sagi Grimberg, chenxiang, Peter Zijlstra, Ming Lei,
	John Garry, Linux Kernel Mailing List, linux-nvme, Jens Axboe,
	Ingo Molnar, Thomas Gleixner, Christoph Hellwig, longli

On Wed, Aug 21, 2019 at 04:27:00PM +0000, Long Li wrote:
> >>>Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe
> >>>
> >>>On Wed, Aug 21, 2019 at 07:47:44AM +0000, Long Li wrote:
> >>>> >>>Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe
> >>>> >>>
> >>>> >>>On 20/08/2019 09:25, Ming Lei wrote:
> >>>> >>>> On Tue, Aug 20, 2019 at 2:14 PM <longli@linuxonhyperv.com> wrote:
> >>>> >>>>>
> >>>> >>>>> From: Long Li <longli@microsoft.com>
> >>>> >>>>>
> >>>> >>>>> This patch set tries to fix interrupt swamp in NVMe devices.
> >>>> >>>>>
> >>>> >>>>> On large systems with many CPUs, a number of CPUs may share
> >>>one
> >>>> >>>NVMe
> >>>> >>>>> hardware queue. It may have this situation where several CPUs
> >>>> >>>>> are issuing I/Os, and all the I/Os are returned on the CPU where
> >>>> >>>>> the
> >>>> >>>hardware queue is bound to.
> >>>> >>>>> This may result in that CPU swamped by interrupts and stay in
> >>>> >>>>> interrupt mode for extended time while other CPUs continue to
> >>>> >>>>> issue I/O. This can trigger Watchdog and RCU timeout, and make
> >>>> >>>>> the system
> >>>> >>>unresponsive.
> >>>> >>>>>
> >>>> >>>>> This patch set addresses this by enforcing scheduling and
> >>>> >>>>> throttling I/O when CPU is starved in this situation.
> >>>> >>>>>
> >>>> >>>>> Long Li (3):
> >>>> >>>>>   sched: define a function to report the number of context switches
> >>>on a
> >>>> >>>>>     CPU
> >>>> >>>>>   sched: export idle_cpu()
> >>>> >>>>>   nvme: complete request in work queue on CPU with flooded
> >>>> >>>>> interrupts
> >>>> >>>>>
> >>>> >>>>>  drivers/nvme/host/core.c | 57
> >>>> >>>>> +++++++++++++++++++++++++++++++++++++++-
> >>>> >>>>>  drivers/nvme/host/nvme.h |  1 +
> >>>> >>>>>  include/linux/sched.h    |  2 ++
> >>>> >>>>>  kernel/sched/core.c      |  7 +++++
> >>>> >>>>>  4 files changed, 66 insertions(+), 1 deletion(-)
> >>>> >>>>
> >>>> >>>> Another simpler solution may be to complete request in threaded
> >>>> >>>> interrupt handler for this case. Meantime allow scheduler to run
> >>>> >>>> the interrupt thread handler on CPUs specified by the irq
> >>>> >>>> affinity mask, which was discussed by the following link:
> >>>> >>>>
> >>>> >>>>
> >>>> >>>> https://lore.kernel.org/lkml/e0e9478e-62a5-ca24-3b12-58f7d056383e@huawei.com/
> >>>> >>>>
> >>>> >>>> Could you try the above solution and see if the lockup can be
> >>>avoided?
> >>>> >>>> John Garry
> >>>> >>>> should have workable patch.
> >>>> >>>
> >>>> >>>Yeah, so we experimented with changing the interrupt handling in
> >>>> >>>the SCSI driver I maintain to use a threaded handler IRQ handler
> >>>> >>>plus patch below, and saw a significant throughput boost:
> >>>> >>>
> >>>> >>>--->8
> >>>> >>>
> >>>> >>>Subject: [PATCH] genirq: Add support to allow thread to use hard
> >>>> >>>irq affinity
> >>>> >>>
> >>>> >>>Currently the cpu allowed mask for the threaded part of a threaded
> >>>> >>>irq handler will be set to the effective affinity of the hard irq.
> >>>> >>>
> >>>> >>>Typically the effective affinity of the hard irq will be for a
> >>>> >>>single cpu. As such, the threaded handler would always run on the
> >>>same cpu as the hard irq.
> >>>> >>>
> >>>> >>>We have seen scenarios in high data-rate throughput testing that
> >>>> >>>the cpu handling the interrupt can be totally saturated handling
> >>>> >>>both the hard interrupt and threaded handler parts, limiting
> >>>throughput.
> >>>> >>>
> >>>> >>>Add IRQF_IRQ_AFFINITY flag to allow the driver requesting the
> >>>> >>>threaded interrupt to decide on the policy of which cpu the threaded
> >>>handler may run.
> >>>> >>>
> >>>> >>>Signed-off-by: John Garry <john.garry@huawei.com>
> >>>>
> >>>> Thanks for pointing me to this patch. This fixed the interrupt swamp and
> >>>make the system stable.
> >>>>
> >>>> However I'm seeing reduced performance when using threaded
> >>>interrupts.
> >>>>
> >>>> Here are the test results on a system with 80 CPUs and 10 NVMe disks
> >>>> (32 hardware queues for each disk) Benchmark tool is FIO, I/O pattern:
> >>>> 4k random reads on all NVMe disks, with queue depth = 64, num of jobs
> >>>> = 80, direct=1
> >>>>
> >>>> With threaded interrupts: 1320k IOPS
> >>>> With just interrupts: 3720k IOPS
> >>>> With just interrupts and my patch: 3700k IOPS
> >>>
> >>>This gap looks too big wrt. threaded interrupts vs. interrupts.
> >>>
> >>>>
> >>>> At the peak IOPS, the overall CPU usage is at around 98-99%. I think the
> >>>cost of doing wake up and context switch for NVMe threaded IRQ handler
> >>>takes some CPU away.
> >>>>
> >>>
> >>>In theory, it shouldn't be so because most of times the thread should be
> >>>running on CPUs of this hctx, and the wakeup cost shouldn't be so big.
> >>>Maybe there is performance problem somewhere wrt. threaded interrupt.
> >>>
> >>>Could you share us your test script and environment? I will see if I can
> >>>reproduce it in my environment.
> 
> Ming, do you have access to L80s_v2 in Azure? This test needs to run on that VM size.
> 
> Here is the command to benchmark it:
> 
> fio --bs=4k --ioengine=libaio --iodepth=128 --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1 --direct=1 --runtime=120 --numjobs=80 --rw=randread --name=test --group_reporting --gtod_reduce=1
> 

I can reproduce the issue on one machine (96 cores) with 4 NVMes (32 queues), so
each queue is served on 3 CPUs.

IOPS drops > 20% when 'use_threaded_interrupts' is enabled. From the fio log,
CPU context switches increase a lot.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/3] fix interrupt swamp in NVMe
  2019-08-22  1:33           ` Ming Lei
@ 2019-08-22  2:00             ` Keith Busch
  2019-08-22  2:23               ` Ming Lei
  2019-08-22  9:48               ` Thomas Gleixner
  0 siblings, 2 replies; 30+ messages in thread
From: Keith Busch @ 2019-08-22  2:00 UTC (permalink / raw)
  To: Ming Lei
  Cc: Long Li, Jens Axboe, Sagi Grimberg, chenxiang, Peter Zijlstra,
	Ming Lei, John Garry, Linux Kernel Mailing List, linux-nvme,
	Keith Busch, Ingo Molnar, Thomas Gleixner, Christoph Hellwig,
	longli

On Wed, Aug 21, 2019 at 7:34 PM Ming Lei <ming.lei@redhat.com> wrote:
> On Wed, Aug 21, 2019 at 04:27:00PM +0000, Long Li wrote:
> > Here is the command to benchmark it:
> >
> > fio --bs=4k --ioengine=libaio --iodepth=128 --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1 --direct=1 --runtime=120 --numjobs=80 --rw=randread --name=test --group_reporting --gtod_reduce=1
> >
>
> I can reproduce the issue on one machine(96 cores) with 4 NVMes(32 queues), so
> each queue is served on 3 CPUs.
>
> IOPS drops > 20% when 'use_threaded_interrupts' is enabled. From fio log, CPU
> context switch is increased a lot.

Interestingly, use_threaded_interrupts shows a marginal improvement on
my machine with the same fio profile. It was only 5 NVMes, but they have
one queue per cpu on 112 cores.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/3] fix interrupt swamp in NVMe
  2019-08-22  2:00             ` Keith Busch
@ 2019-08-22  2:23               ` Ming Lei
  2019-08-22  9:48               ` Thomas Gleixner
  1 sibling, 0 replies; 30+ messages in thread
From: Ming Lei @ 2019-08-22  2:23 UTC (permalink / raw)
  To: Keith Busch
  Cc: Ming Lei, Long Li, Jens Axboe, Sagi Grimberg, chenxiang,
	Peter Zijlstra, John Garry, Linux Kernel Mailing List,
	linux-nvme, Keith Busch, Ingo Molnar, Thomas Gleixner,
	Christoph Hellwig, longli

[-- Attachment #1: Type: text/plain, Size: 1452 bytes --]

On Thu, Aug 22, 2019 at 10:00 AM Keith Busch <keith.busch@gmail.com> wrote:
>
> On Wed, Aug 21, 2019 at 7:34 PM Ming Lei <ming.lei@redhat.com> wrote:
> > On Wed, Aug 21, 2019 at 04:27:00PM +0000, Long Li wrote:
> > > Here is the command to benchmark it:
> > >
> > > fio --bs=4k --ioengine=libaio --iodepth=128 --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1 --direct=1 --runtime=120 --numjobs=80 --rw=randread --name=test --group_reporting --gtod_reduce=1
> > >
> >
> > I can reproduce the issue on one machine(96 cores) with 4 NVMes(32 queues), so
> > each queue is served on 3 CPUs.
> >
> > IOPS drops > 20% when 'use_threaded_interrupts' is enabled. From fio log, CPU
> > context switch is increased a lot.
>
> Interestingly use_threaded_interrupts shows a marginal improvement on
> my machine with the same fio profile. It was only 5 NVMes, but they've
> one queue per-cpu on 112 cores.

Haven't investigated it yet.

BTW, my fio test is only done on a single hw queue via 'taskset -c
$cpu_list_of_the_queue', without applying the threaded interrupt affinity
patch. The NVMe is Optane.

The same issue can be reproduced after I force 1:1 mapping by passing
'possible_cpus=32' on the kernel cmd line.

It may be related to kernel options, so I attached the one I used; basically
it is a subset of the RHEL8 kernel config.

Thanks,
Ming Lei

[-- Attachment #2: config.tar.gz --]
[-- Type: application/gzip, Size: 31530 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/3] fix interrupt swamp in NVMe
  2019-08-22  2:00             ` Keith Busch
  2019-08-22  2:23               ` Ming Lei
@ 2019-08-22  9:48               ` Thomas Gleixner
  1 sibling, 0 replies; 30+ messages in thread
From: Thomas Gleixner @ 2019-08-22  9:48 UTC (permalink / raw)
  To: Keith Busch
  Cc: Ming Lei, Long Li, Jens Axboe, Sagi Grimberg, chenxiang,
	Peter Zijlstra, Ming Lei, John Garry, Linux Kernel Mailing List,
	linux-nvme, Keith Busch, Ingo Molnar, Christoph Hellwig, longli

On Wed, 21 Aug 2019, Keith Busch wrote:
> On Wed, Aug 21, 2019 at 7:34 PM Ming Lei <ming.lei@redhat.com> wrote:
> > On Wed, Aug 21, 2019 at 04:27:00PM +0000, Long Li wrote:
> > > Here is the command to benchmark it:
> > >
> > > fio --bs=4k --ioengine=libaio --iodepth=128 --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1 --direct=1 --runtime=120 --numjobs=80 --rw=randread --name=test --group_reporting --gtod_reduce=1
> > >
> >
> > I can reproduce the issue on one machine(96 cores) with 4 NVMes(32 queues), so
> > each queue is served on 3 CPUs.
> >
> > IOPS drops > 20% when 'use_threaded_interrupts' is enabled. From fio log, CPU
> > context switch is increased a lot.
> 
> Interestingly use_threaded_interrupts shows a marginal improvement on
> my machine with the same fio profile. It was only 5 NVMes, but they've
> one queue per-cpu on 112 cores.

Which is not surprising because the thread and the hard interrupt are on
the same CPU and there is just that little overhead of the context switch.

The thing is that this really depends on how the scheduler decides to place
the interrupt thread.

If you have a queue for several CPUs, then depending on the load situation
allowing a multi-cpu affinity for the thread can cause lots of task
migration.

But restricting the irq thread to the CPU on which the interrupt is affine
can also starve that CPU. There is no universal rule for that.

Tracing should tell.

Thanks,

	tglx




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts
  2019-08-20 17:33   ` Sagi Grimberg
  2019-08-21  8:39     ` Long Li
@ 2019-08-23  3:21     ` Ming Lei
  2019-08-24  0:27       ` Long Li
  1 sibling, 1 reply; 30+ messages in thread
From: Ming Lei @ 2019-08-23  3:21 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: longli, Ingo Molnar, Peter Zijlstra, Keith Busch, Jens Axboe,
	Christoph Hellwig, linux-nvme, linux-kernel, Long Li,
	Hannes Reinecke, linux-scsi, linux-block

On Tue, Aug 20, 2019 at 10:33:38AM -0700, Sagi Grimberg wrote:
> 
> > From: Long Li <longli@microsoft.com>
> > 
> > When a NVMe hardware queue is mapped to several CPU queues, it is possible
> > that the CPU this hardware queue is bound to is flooded by returning I/O for
> > other CPUs.
> > 
> > For example, consider the following scenario:
> > 1. CPU 0, 1, 2 and 3 share the same hardware queue
> > 2. the hardware queue interrupts CPU 0 for I/O response
> > 3. processes from CPU 1, 2 and 3 keep sending I/Os
> > 
> > CPU 0 may be flooded with interrupts from NVMe device that are I/O responses
> > for CPU 1, 2 and 3. Under heavy I/O load, it is possible that CPU 0 spends
> > all the time serving NVMe and other system interrupts, but doesn't have a
> > chance to run in process context.
> > 
> > To fix this, CPU 0 can schedule a work to complete the I/O request when it
> > detects the scheduler is not making progress. This serves multiple purposes:
> > 
> > 1. This CPU has to be scheduled to complete the request. The other CPUs can't
> > issue more I/Os until some previous I/Os are completed. This helps this CPU
> > get out of NVMe interrupts.
> > 
> > 2. This acts a throttling mechanisum for NVMe devices, in that it can not
> > starve a CPU while servicing I/Os from other CPUs.
> > 
> > 3. This CPU can make progress on RCU and other work items on its queue.
> 
> The problem is indeed real, but this is the wrong approach in my mind.
> 
> We already have irqpoll which takes care proper budgeting polling
> cycles and not hogging the cpu.

The issue isn't unique to NVMe; it can happen with any fast device that
interrupts the CPU too frequently. Meanwhile the interrupt/softirq handler may
take quite a bit of time, so the CPU can easily be locked up by the
interrupt/softirq handler, especially in the case of multiple submission CPUs
vs. a single completion CPU.

Some SCSI devices have the same problem too.

Could we consider to add one generic mechanism to cover this kind of
problem?

One approach I thought of is to allocate one backup thread for handling
such interrupts, which can be marked as IRQF_BACKUP_THREAD by drivers.

Inside do_IRQ(), irq time is already accounted. Before calling
action->handler(), check whether this CPU has spent too long handling IRQ
work (interrupt or softirq) and could be locked up. If so, wake up the backup
thread to handle the interrupt, to avoid locking up this CPU.
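
Roughly something like the below, as an illustration only (IRQF_BACKUP_THREAD
and cpu_irq_overloaded() are made-up names, and the hook point is simplified):

	/* Hypothetical predicate, e.g. based on accounted irq time. */
	static bool cpu_irq_overloaded(void);

	/* In handle_irq_event_percpu(), before running the primary handler: */
	if ((action->flags & IRQF_BACKUP_THREAD) && cpu_irq_overloaded()) {
		/* Punt to the backup thread instead of running in hard irq. */
		__irq_wake_thread(desc, action);
		res = IRQ_HANDLED;
	} else {
		res = action->handler(irq, action->dev_id);
	}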

The threaded interrupt framework is already there, so this way could be easier
to implement. Meantime, most of the time the handler still runs in interrupt
context, so we may avoid the performance loss when the CPU isn't busy enough.

Any comment on this approach?

Thanks,
Ming

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts
  2019-08-21 21:54         ` Sagi Grimberg
@ 2019-08-24  0:13           ` Long Li
  0 siblings, 0 replies; 30+ messages in thread
From: Long Li @ 2019-08-24  0:13 UTC (permalink / raw)
  To: Sagi Grimberg, longli, Ingo Molnar, Peter Zijlstra, Keith Busch,
	Jens Axboe, Christoph Hellwig, linux-nvme, linux-kernel

>>>Subject: Re: [PATCH 3/3] nvme: complete request in work queue on CPU
>>>with flooded interrupts
>>>
>>>
>>>> Sagi,
>>>>
>>>> Here are the test results.
>>>>
>>>> Benchmark command:
>>>> fio --bs=4k --ioengine=libaio --iodepth=64
>>>> --
>>>filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/d
>>>ev/nv
>>>>
>>>me4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev
>>>/nvme9n1
>>>> --direct=1 --runtime=90 --numjobs=80 --rw=randread --name=test
>>>> --group_reporting --gtod_reduce=1
>>>>
>>>> With your patch: 1720k IOPS
>>>> With threaded interrupts: 1320k IOPS
>>>> With just interrupts: 3720k IOPS
>>>>
>>>> Interrupts are the fastest but we need to find a way to throttle it.
>>>
>>>This is the workload that generates the flood?
>>>If so I did not expect that this would be the perf difference..
>>>
>>>If completions keep pounding on the cpu, I would expect irqpoll to simply
>>>keep running forever and poll the cqs. There is no fundamental reason why
>>>polling would be faster in an interrupt, what matters could be:
>>>1. we reschedule more than we need to
>>>2. we don't reap enough completions in every poll round, which will trigger
>>>rearming the interrupt and then when it fires reschedule another softirq...
>>>

Yes, I think it's the rescheduling that takes some of the CPU. With the patch there is a lot of ksoftirqd activity (compared to nearly none without the patch).
A 90-second FIO run shows a big difference in context switches across all CPUs:
With patch: 5755849
Without patch: 1462931

>>>Maybe we need to take care of some irq_poll optimizations?
>>>
>>>Does this (untested) patch make any difference?
>>>--
>>>diff --git a/lib/irq_poll.c b/lib/irq_poll.c index 2f17b488d58e..0e94183eba15
>>>100644
>>>--- a/lib/irq_poll.c
>>>+++ b/lib/irq_poll.c
>>>@@ -12,7 +12,8 @@
>>>  #include <linux/irq_poll.h>
>>>  #include <linux/delay.h>
>>>
>>>-static unsigned int irq_poll_budget __read_mostly = 256;
>>>+static unsigned int irq_poll_budget __read_mostly = 3000; unsigned int
>>>+__read_mostly irqpoll_budget_usecs = 2000;
>>>
>>>  static DEFINE_PER_CPU(struct list_head, blk_cpu_iopoll);
>>>
>>>@@ -77,32 +78,26 @@ EXPORT_SYMBOL(irq_poll_complete);
>>>
>>>  static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
>>>  {
>>>-       struct list_head *list = this_cpu_ptr(&blk_cpu_iopoll);
>>>-       int rearm = 0, budget = irq_poll_budget;
>>>-       unsigned long start_time = jiffies;
>>>+       struct list_head *irqpoll_list = this_cpu_ptr(&blk_cpu_iopoll);
>>>+       unsigned int budget = irq_poll_budget;
>>>+       unsigned long time_limit =
>>>+               jiffies + usecs_to_jiffies(irqpoll_budget_usecs);
>>>+       LIST_HEAD(list);
>>>
>>>         local_irq_disable();
>>>+       list_splice_init(irqpoll_list, &list);
>>>+       local_irq_enable();
>>>
>>>-       while (!list_empty(list)) {
>>>+       while (!list_empty(&list)) {
>>>                 struct irq_poll *iop;
>>>                 int work, weight;
>>>
>>>-               /*
>>>-                * If softirq window is exhausted then punt.
>>>-                */
>>>-               if (budget <= 0 || time_after(jiffies, start_time)) {
>>>-                       rearm = 1;
>>>-                       break;
>>>-               }
>>>-
>>>-               local_irq_enable();
>>>-
>>>                 /* Even though interrupts have been re-enabled, this
>>>                  * access is safe because interrupts can only add new
>>>                  * entries to the tail of this list, and only ->poll()
>>>                  * calls can remove this head entry from the list.
>>>                  */
>>>-               iop = list_entry(list->next, struct irq_poll, list);
>>>+               iop = list_first_entry(&list, struct irq_poll, list);
>>>
>>>                 weight = iop->weight;
>>>                 work = 0;
>>>@@ -111,8 +106,6 @@ static void __latent_entropy irq_poll_softirq(struct
>>>softirq_action *h)
>>>
>>>                 budget -= work;
>>>
>>>-               local_irq_disable();
>>>-
>>>                 /*
>>>                  * Drivers must not modify the iopoll state, if they
>>>                  * consume their assigned weight (or more, some drivers can't @@
>>>-125,11 +118,21 @@ static void __latent_entropy irq_poll_softirq(struct
>>>softirq_action *h)
>>>                         if (test_bit(IRQ_POLL_F_DISABLE, &iop->state))
>>>                                 __irq_poll_complete(iop);
>>>                         else
>>>-                               list_move_tail(&iop->list, list);
>>>+                               list_move_tail(&iop->list, &list);
>>>                 }
>>>+
>>>+               /*
>>>+                * If softirq window is exhausted then punt.
>>>+                */
>>>+               if (budget <= 0 || time_after_eq(jiffies, time_limit))
>>>+                       break;
>>>         }
>>>
>>>-       if (rearm)
>>>+       local_irq_disable();
>>>+
>>>+       list_splice_tail_init(irqpoll_list, &list);
>>>+       list_splice(&list, irqpoll_list);
>>>+       if (!list_empty(irqpoll_list))
>>>                 __raise_softirq_irqoff(IRQ_POLL_SOFTIRQ);
>>>
>>>         local_irq_enable();
>>>--

It got slightly better IOPS. With this 2nd patch, the number of context switches is 5445863 (a ~5% improvement over the 1st patch).

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts
  2019-08-23  3:21     ` Ming Lei
@ 2019-08-24  0:27       ` Long Li
  2019-08-24 12:55         ` Ming Lei
  0 siblings, 1 reply; 30+ messages in thread
From: Long Li @ 2019-08-24  0:27 UTC (permalink / raw)
  To: Ming Lei, Sagi Grimberg
  Cc: longli, Ingo Molnar, Peter Zijlstra, Keith Busch, Jens Axboe,
	Christoph Hellwig, linux-nvme, linux-kernel, Hannes Reinecke,
	linux-scsi, linux-block

>>>Subject: Re: [PATCH 3/3] nvme: complete request in work queue on CPU
>>>with flooded interrupts
>>>
>>>On Tue, Aug 20, 2019 at 10:33:38AM -0700, Sagi Grimberg wrote:
>>>>
>>>> > From: Long Li <longli@microsoft.com>
>>>> >
>>>> > When a NVMe hardware queue is mapped to several CPU queues, it is
>>>> > possible that the CPU this hardware queue is bound to is flooded by
>>>> > returning I/O for other CPUs.
>>>> >
>>>> > For example, consider the following scenario:
>>>> > 1. CPU 0, 1, 2 and 3 share the same hardware queue 2. the hardware
>>>> > queue interrupts CPU 0 for I/O response 3. processes from CPU 1, 2
>>>> > and 3 keep sending I/Os
>>>> >
>>>> > CPU 0 may be flooded with interrupts from NVMe device that are I/O
>>>> > responses for CPU 1, 2 and 3. Under heavy I/O load, it is possible
>>>> > that CPU 0 spends all the time serving NVMe and other system
>>>> > interrupts, but doesn't have a chance to run in process context.
>>>> >
>>>> > To fix this, CPU 0 can schedule a work to complete the I/O request
>>>> > when it detects the scheduler is not making progress. This serves
>>>multiple purposes:
>>>> >
>>>> > 1. This CPU has to be scheduled to complete the request. The other
>>>> > CPUs can't issue more I/Os until some previous I/Os are completed.
>>>> > This helps this CPU get out of NVMe interrupts.
>>>> >
>>>> > 2. This acts a throttling mechanisum for NVMe devices, in that it
>>>> > can not starve a CPU while servicing I/Os from other CPUs.
>>>> >
>>>> > 3. This CPU can make progress on RCU and other work items on its
>>>queue.
>>>>
>>>> The problem is indeed real, but this is the wrong approach in my mind.
>>>>
>>>> We already have irqpoll which takes care proper budgeting polling
>>>> cycles and not hogging the cpu.
>>>
>>>The issue isn't unique to NVMe, and can be any fast devices which
>>>interrupts CPU too frequently, meantime the interrupt/softirq handler may
>>>take a bit much time, then CPU is easy to be lockup by the interrupt/sofirq
>>>handler, especially in case that multiple submission CPUs vs. single
>>>completion CPU.
>>>
>>>Some SCSI devices has the same problem too.
>>>
>>>Could we consider to add one generic mechanism to cover this kind of
>>>problem?
>>>
>>>One approach I thought of is to allocate one backup thread for handling such
>>>interrupt, which can be marked as IRQF_BACKUP_THREAD by drivers.
>>>
>>>Inside do_IRQ(), irqtime is accounted, before calling action->handler(),
>>>check if this CPU has taken too long time for handling IRQ(interrupt or
>>>softirq) and see if this CPU could be lock up. If yes, wakeup the backup

How do you know whether this CPU is spending all its time in do_IRQ()?

Is it something like the following (a rough sketch follows the quoted text below)?

	if (IRQ_time / elapsed_time > threshold)
		wake up the backup thread

>>>thread to handle the interrupt, so as to avoid locking up this CPU.
>>>
>>>The threaded interrupt framework is already there, so this way could be easier
>>>to implement. Meanwhile, most of the time the handler still runs in interrupt
>>>context, and we may avoid the performance loss when the CPU isn't busy enough.
>>>
>>>Any comment on this approach?
>>>
>>>Thanks,
>>>Ming
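
To make the question above concrete, here is a rough sketch of such a ratio
check. This is only an illustration of the idea, not an existing interface:
irq_time_on(), IRQ_FLOOD_PERCENT and cpu_irq_flooded() are hypothetical names.

/*
 * Rough sketch of the IRQ-time ratio check discussed above.
 * irq_time_on() is a hypothetical per-CPU irq-time accessor and
 * IRQ_FLOOD_PERCENT an assumed threshold; neither exists in the kernel.
 */
#include <linux/types.h>
#include <linux/percpu.h>
#include <linux/sched/clock.h>

#define IRQ_FLOOD_PERCENT	90	/* assumed threshold */

struct irq_ratio_sample {
	u64 last_clock;		/* sched_clock() at the last sample */
	u64 last_irq_time;	/* accumulated irq time at the last sample */
};

static DEFINE_PER_CPU(struct irq_ratio_sample, irq_ratio_samples);

/*
 * Return true if this CPU spent more than IRQ_FLOOD_PERCENT of the time
 * elapsed since the previous sample in hard/soft irq context.
 */
static bool cpu_irq_flooded(int cpu)
{
	struct irq_ratio_sample *s = per_cpu_ptr(&irq_ratio_samples, cpu);
	u64 now = sched_clock();
	u64 irq = irq_time_on(cpu);		/* hypothetical accessor */
	u64 elapsed = now - s->last_clock;
	u64 irq_delta = irq - s->last_irq_time;
	bool flooded = elapsed &&
		       irq_delta * 100 > elapsed * IRQ_FLOOD_PERCENT;

	s->last_clock = now;
	s->last_irq_time = irq;

	return flooded;
}

In such a scheme, do_IRQ() could sample cpu_irq_flooded(smp_processor_id())
before invoking action->handler() and wake the backup thread when it returns
true.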

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts
  2019-08-24  0:27       ` Long Li
@ 2019-08-24 12:55         ` Ming Lei
  0 siblings, 0 replies; 30+ messages in thread
From: Ming Lei @ 2019-08-24 12:55 UTC (permalink / raw)
  To: Long Li
  Cc: Sagi Grimberg, longli, Ingo Molnar, Peter Zijlstra, Keith Busch,
	Jens Axboe, Christoph Hellwig, linux-nvme, linux-kernel,
	Hannes Reinecke, linux-scsi, linux-block

On Sat, Aug 24, 2019 at 12:27:18AM +0000, Long Li wrote:
> >>>Subject: Re: [PATCH 3/3] nvme: complete request in work queue on CPU
> >>>with flooded interrupts
> >>>
> >>>On Tue, Aug 20, 2019 at 10:33:38AM -0700, Sagi Grimberg wrote:
> >>>>
> >>>> > From: Long Li <longli@microsoft.com>
> >>>> >
> >>>> > When a NVMe hardware queue is mapped to several CPU queues, it is
> >>>> > possible that the CPU this hardware queue is bound to is flooded by
> >>>> > returning I/O for other CPUs.
> >>>> >
> >>>> > For example, consider the following scenario:
> >>>> > 1. CPU 0, 1, 2 and 3 share the same hardware queue
> >>>> > 2. the hardware queue interrupts CPU 0 for I/O response
> >>>> > 3. processes from CPU 1, 2 and 3 keep sending I/Os
> >>>> >
> >>>> > CPU 0 may be flooded with interrupts from NVMe device that are I/O
> >>>> > responses for CPU 1, 2 and 3. Under heavy I/O load, it is possible
> >>>> > that CPU 0 spends all the time serving NVMe and other system
> >>>> > interrupts, but doesn't have a chance to run in process context.
> >>>> >
> >>>> > To fix this, CPU 0 can schedule a work to complete the I/O request
> >>>> > when it detects the scheduler is not making progress. This serves
> >>>multiple purposes:
> >>>> >
> >>>> > 1. This CPU has to be scheduled to complete the request. The other
> >>>> > CPUs can't issue more I/Os until some previous I/Os are completed.
> >>>> > This helps this CPU get out of NVMe interrupts.
> >>>> >
> >>>> > 2. This acts as a throttling mechanism for NVMe devices, in that it
> >>>> > cannot starve a CPU while servicing I/Os from other CPUs.
> >>>> >
> >>>> > 3. This CPU can make progress on RCU and other work items on its
> >>>queue.
> >>>>
> >>>> The problem is indeed real, but this is the wrong approach in my mind.
> >>>>
> >>>> We already have irqpoll which takes care proper budgeting polling
> >>>> cycles and not hogging the cpu.
> >>>
> >>>The issue isn't unique to NVMe; it can happen with any fast device that
> >>>interrupts a CPU too frequently. Meanwhile, the interrupt/softirq handler may
> >>>take quite a bit of time, so the CPU can easily be locked up by the
> >>>interrupt/softirq handler, especially in the case of multiple submission CPUs
> >>>vs. a single completion CPU.
> >>>
> >>>Some SCSI devices have the same problem too.
> >>>
> >>>Could we consider adding one generic mechanism to cover this kind of
> >>>problem?
> >>>
> >>>One approach I thought of is to allocate one backup thread for handling such
> >>>interrupts, which can be marked as IRQF_BACKUP_THREAD by drivers.
> >>>
> >>>Inside do_IRQ(), irqtime is accounted; before calling action->handler(),
> >>>check whether this CPU has spent too long handling IRQs (interrupt or
> >>>softirq) and whether this CPU could lock up. If yes, wake up the backup
> 
> How do you know whether this CPU is spending all its time in do_IRQ()?
> 
> Is it something like:
> 	if (IRQ_time / elapsed_time > threshold)
> 		wake up the backup thread

Yeah, the above could work in theory.

Another approach I thought of is to monitor average irq gap time on each
CPU.

We could use an EWMA (Exponentially Weighted Moving Average) to do it simply,
such as:

	curr_irq_gap(cpu) = current start time of do_IRQ() on 'cpu' -
			end time of last do_IRQ() on 'cpu'
	avg_irq_gap(cpu) = weight_prev * avg_irq_gap(cpu) + weight_curr * curr_irq_gap(cpu) 

	note:
		weight_prev + weight_curr = 1

When avg_irq_gap(cpu) drops below a small enough threshold, we consider an irq
flood to be detected.

'weight_prev' could be chosen as a big enough value so that short-lived bursts
don't trigger detection.
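
To make this concrete, below is a minimal sketch of the EWMA bookkeeping, under
the assumption of weight_prev = 3/4, weight_curr = 1/4 and a fixed nanosecond
threshold. All names here (irq_flood_check(), irq_flood_note_end(),
IRQ_FLOOD_GAP_NS) are illustrative only and not existing kernel APIs.

/*
 * Minimal sketch of EWMA-based irq-gap tracking as described above.
 * Assumes weight_prev = 3/4 and weight_curr = 1/4.
 */
#include <linux/types.h>
#include <linux/percpu.h>
#include <linux/sched/clock.h>

/* Average gap below which the CPU is considered flooded (assumption). */
#define IRQ_FLOOD_GAP_NS	2000

struct irq_gap_stat {
	u64 last_irq_end;	/* end time of the last do_IRQ() on this CPU */
	u64 avg_irq_gap;	/* EWMA of curr_irq_gap */
};

static DEFINE_PER_CPU(struct irq_gap_stat, irq_gap_stats);

/* Called at the start of do_IRQ(); returns true when a flood is detected. */
static bool irq_flood_check(void)
{
	struct irq_gap_stat *s = this_cpu_ptr(&irq_gap_stats);
	u64 gap = sched_clock() - s->last_irq_end;

	/* avg = 3/4 * avg + 1/4 * curr */
	s->avg_irq_gap = (3 * s->avg_irq_gap + gap) >> 2;

	return s->last_irq_end && s->avg_irq_gap < IRQ_FLOOD_GAP_NS;
}

/* Called when do_IRQ() returns, to record the end time of this interrupt. */
static void irq_flood_note_end(void)
{
	this_cpu_ptr(&irq_gap_stats)->last_irq_end = sched_clock();
}

do_IRQ() could call irq_flood_check() on entry and irq_flood_note_end() on
exit; a true return from irq_flood_check() would be the point to fall back to
the backup thread.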


Thanks,
Ming

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2019-08-24 12:56 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-20  6:14 [PATCH 0/3] fix interrupt swamp in NVMe longli
2019-08-20  6:14 ` [PATCH 1/3] sched: define a function to report the number of context switches on a CPU longli
2019-08-20  9:38   ` Peter Zijlstra
2019-08-21  8:20     ` Long Li
2019-08-21 10:34       ` Peter Zijlstra
2019-08-20  9:39   ` Peter Zijlstra
2019-08-20  6:14 ` [PATCH 2/3] sched: export idle_cpu() longli
2019-08-20  6:14 ` [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts longli
2019-08-20  9:52   ` Peter Zijlstra
2019-08-21  8:37     ` Long Li
2019-08-21 10:35       ` Peter Zijlstra
2019-08-20 17:33   ` Sagi Grimberg
2019-08-21  8:39     ` Long Li
2019-08-21 17:36       ` Long Li
2019-08-21 21:54         ` Sagi Grimberg
2019-08-24  0:13           ` Long Li
2019-08-23  3:21     ` Ming Lei
2019-08-24  0:27       ` Long Li
2019-08-24 12:55         ` Ming Lei
2019-08-20  8:25 ` [PATCH 0/3] fix interrupt swamp in NVMe Ming Lei
2019-08-20  8:59   ` John Garry
2019-08-20 15:05     ` Keith Busch
2019-08-21  7:47     ` Long Li
2019-08-21  9:44       ` Ming Lei
2019-08-21 10:03         ` John Garry
2019-08-21 16:27         ` Long Li
2019-08-22  1:33           ` Ming Lei
2019-08-22  2:00             ` Keith Busch
2019-08-22  2:23               ` Ming Lei
2019-08-22  9:48               ` Thomas Gleixner
