Linux-SCSI Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH 0/4] genirq/nvme: add IRQF_RESCUE_THREAD for avoiding IRQ flood
@ 2019-08-27  8:53 Ming Lei
  2019-08-27  8:53 ` [PATCH 1/4] softirq: implement IRQ flood detection mechanism Ming Lei
                   ` (3 more replies)
  0 siblings, 4 replies; 64+ messages in thread
From: Ming Lei @ 2019-08-27  8:53 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Ming Lei, Long Li, Ingo Molnar, Peter Zijlstra,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	John Garry, Hannes Reinecke, linux-nvme, linux-scsi

Hi Guys,

The 1st patch implements one simple EWMA based mechanism for detecting IRQ flood.

The 2nd patch adds IRQF_RESCUE_THREAD, and interrupts will be handled in
the created rescue thread in case that IRQ flood comes.

The 3rd patch applies the flag of IRQF_RESCURE_THREAD for NVMe.

The last patch uses irq's affinity in case of IRQF_RESCUE_THREAD.

Please review & comment!

Long, please test and see if your issue can be fixed.

Ming Lei (4):
  softirq: implement IRQ flood detection mechanism
  genirq: add IRQF_RESCUE_THREAD
  nvme: pci: pass IRQF_RESCURE_THREAD to request_threaded_irq
  genirq: use irq's affinity for threaded irq with IRQF_RESCUE_THREAD

 drivers/base/cpu.c        | 25 +++++++++++++++++++++
 drivers/nvme/host/pci.c   | 17 +++++++++++++--
 include/linux/hardirq.h   |  2 ++
 include/linux/interrupt.h |  6 +++++
 kernel/irq/handle.c       |  6 ++++-
 kernel/irq/manage.c       | 25 ++++++++++++++++++++-
 kernel/softirq.c          | 46 +++++++++++++++++++++++++++++++++++++++
 7 files changed, 123 insertions(+), 4 deletions(-)

Cc: Long Li <longli@microsoft.com>
Cc: Ingo Molnar <mingo@redhat.com>,
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: John Garry <john.garry@huawei.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: linux-nvme@lists.infradead.org
Cc: linux-scsi@vger.kernel.org


-- 
2.20.1


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-08-27  8:53 [PATCH 0/4] genirq/nvme: add IRQF_RESCUE_THREAD for avoiding IRQ flood Ming Lei
@ 2019-08-27  8:53 ` Ming Lei
  2019-08-27 14:42   ` Thomas Gleixner
  2019-08-29  6:15   ` Long Li
  2019-08-27  8:53 ` [PATCH 2/4] genirq: add IRQF_RESCUE_THREAD Ming Lei
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 64+ messages in thread
From: Ming Lei @ 2019-08-27  8:53 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Ming Lei, Long Li, Ingo Molnar, Peter Zijlstra,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	John Garry, Hannes Reinecke, linux-nvme, linux-scsi

For some high performance IO devices, interrupt may come very frequently,
meantime IO request completion may take a bit time. Especially on some
devices(SCSI or NVMe), IO requests can be submitted concurrently from
multiple CPU cores, however IO completion is only done on one of
these submission CPU cores.

Then IRQ flood can be easily triggered, and CPU lockup.

Implement one simple generic CPU IRQ flood detection mechanism. This
mechanism uses the CPU average interrupt interval to evaluate if IRQ flood
is triggered. The Exponential Weighted Moving Average(EWMA) is used to
compute CPU average interrupt interval.

Cc: Long Li <longli@microsoft.com>
Cc: Ingo Molnar <mingo@redhat.com>,
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: John Garry <john.garry@huawei.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: linux-nvme@lists.infradead.org
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/base/cpu.c      | 25 ++++++++++++++++++++++
 include/linux/hardirq.h |  2 ++
 kernel/softirq.c        | 46 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 73 insertions(+)

diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index cc37511de866..7277d1aa0906 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -20,6 +20,7 @@
 #include <linux/tick.h>
 #include <linux/pm_qos.h>
 #include <linux/sched/isolation.h>
+#include <linux/hardirq.h>
 
 #include "base.h"
 
@@ -183,10 +184,33 @@ static struct attribute_group crash_note_cpu_attr_group = {
 };
 #endif
 
+static ssize_t show_irq_interval(struct device *dev,
+				 struct device_attribute *attr, char *buf)
+{
+	struct cpu *cpu = container_of(dev, struct cpu, dev);
+	ssize_t rc;
+	int cpunum;
+
+	cpunum = cpu->dev.id;
+
+	rc = sprintf(buf, "%llu\n", irq_get_avg_interval(cpunum));
+	return rc;
+}
+
+static DEVICE_ATTR(irq_interval, 0400, show_irq_interval, NULL);
+static struct attribute *irq_interval_cpu_attrs[] = {
+	&dev_attr_irq_interval.attr,
+	NULL
+};
+static struct attribute_group irq_interval_cpu_attr_group = {
+	.attrs = irq_interval_cpu_attrs,
+};
+
 static const struct attribute_group *common_cpu_attr_groups[] = {
 #ifdef CONFIG_KEXEC
 	&crash_note_cpu_attr_group,
 #endif
+	&irq_interval_cpu_attr_group,
 	NULL
 };
 
@@ -194,6 +218,7 @@ static const struct attribute_group *hotplugable_cpu_attr_groups[] = {
 #ifdef CONFIG_KEXEC
 	&crash_note_cpu_attr_group,
 #endif
+	&irq_interval_cpu_attr_group,
 	NULL
 };
 
diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index da0af631ded5..fd394060ddb3 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -8,6 +8,8 @@
 #include <linux/vtime.h>
 #include <asm/hardirq.h>
 
+extern u64 irq_get_avg_interval(int cpu);
+extern bool irq_flood_detected(void);
 
 extern void synchronize_irq(unsigned int irq);
 extern bool synchronize_hardirq(unsigned int irq);
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 0427a86743a4..96e01669a2e0 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -25,6 +25,7 @@
 #include <linux/smpboot.h>
 #include <linux/tick.h>
 #include <linux/irq.h>
+#include <linux/sched/clock.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/irq.h>
@@ -52,6 +53,12 @@ DEFINE_PER_CPU_ALIGNED(irq_cpustat_t, irq_stat);
 EXPORT_PER_CPU_SYMBOL(irq_stat);
 #endif
 
+struct irq_interval {
+	u64                     last_irq_end;
+	u64                     avg;
+};
+DEFINE_PER_CPU(struct irq_interval, avg_irq_interval);
+
 static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp;
 
 DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
@@ -339,6 +346,41 @@ asmlinkage __visible void do_softirq(void)
 	local_irq_restore(flags);
 }
 
+/*
+ * Update average irq interval with the Exponential Weighted Moving
+ * Average(EWMA)
+ */
+static void irq_update_interval(void)
+{
+#define IRQ_INTERVAL_EWMA_WEIGHT	128
+#define IRQ_INTERVAL_EWMA_PREV_FACTOR	127
+#define IRQ_INTERVAL_EWMA_CURR_FACTOR	(IRQ_INTERVAL_EWMA_WEIGHT - \
+		IRQ_INTERVAL_EWMA_PREV_FACTOR)
+
+	int cpu = raw_smp_processor_id();
+	struct irq_interval *inter = per_cpu_ptr(&avg_irq_interval, cpu);
+	u64 delta = sched_clock_cpu(cpu) - inter->last_irq_end;
+
+	inter->avg = (inter->avg * IRQ_INTERVAL_EWMA_PREV_FACTOR +
+		delta * IRQ_INTERVAL_EWMA_CURR_FACTOR) /
+		IRQ_INTERVAL_EWMA_WEIGHT;
+}
+
+u64 irq_get_avg_interval(int cpu)
+{
+	return per_cpu_ptr(&avg_irq_interval, cpu)->avg;
+}
+
+/*
+ * If the average CPU irq interval is less than 8us, we think interrupt
+ * flood is detected on this CPU
+ */
+bool irq_flood_detected(void)
+{
+#define  IRQ_FLOOD_THRESHOLD_NS	8000
+	return raw_cpu_ptr(&avg_irq_interval)->avg <= IRQ_FLOOD_THRESHOLD_NS;
+}
+
 /*
  * Enter an interrupt context.
  */
@@ -356,6 +398,7 @@ void irq_enter(void)
 	}
 
 	__irq_enter();
+	irq_update_interval();
 }
 
 static inline void invoke_softirq(void)
@@ -402,6 +445,8 @@ static inline void tick_irq_exit(void)
  */
 void irq_exit(void)
 {
+	struct irq_interval *inter = raw_cpu_ptr(&avg_irq_interval);
+
 #ifndef __ARCH_IRQ_EXIT_IRQS_DISABLED
 	local_irq_disable();
 #else
@@ -413,6 +458,7 @@ void irq_exit(void)
 		invoke_softirq();
 
 	tick_irq_exit();
+	inter->last_irq_end = sched_clock_cpu(smp_processor_id());
 	rcu_irq_exit();
 	trace_hardirq_exit(); /* must be last! */
 }
-- 
2.20.1


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 2/4] genirq: add IRQF_RESCUE_THREAD
  2019-08-27  8:53 [PATCH 0/4] genirq/nvme: add IRQF_RESCUE_THREAD for avoiding IRQ flood Ming Lei
  2019-08-27  8:53 ` [PATCH 1/4] softirq: implement IRQ flood detection mechanism Ming Lei
@ 2019-08-27  8:53 ` Ming Lei
  2019-08-27  8:53 ` [PATCH 3/4] nvme: pci: pass IRQF_RESCURE_THREAD to request_threaded_irq Ming Lei
  2019-08-27  8:53 ` [PATCH 4/4] genirq: use irq's affinity for threaded irq with IRQF_RESCUE_THREAD Ming Lei
  3 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2019-08-27  8:53 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Ming Lei, Long Li, Ingo Molnar, Peter Zijlstra,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	John Garry, Hannes Reinecke, linux-nvme, linux-scsi

For some high performance IO devices, interrupt may come very frequently,
meantime IO request completion may take a bit time. Especially on some
devices(SCSI or NVMe), IO requests can be submitted concurrently from
multiple CPU cores, however IO completion is only done on one of
these submission CPU cores.

Then IRQ flood can be easily triggered, and CPU lockup.

However, we can't simply use threaded IRQ for avoiding IRQ flood and
CPU lockup, because thread wakup introduces extra latency, which does
affect IO latency and throughput a lot.

Add IRQF_RESCUE_THREAD to create one interrupt thread handler for
avoiding irq flood, and the thread will be waken up for handling
interrupt in case that IRQ flood is going to be triggered

Cc: Long Li <longli@microsoft.com>
Cc: Ingo Molnar <mingo@redhat.com>,
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: John Garry <john.garry@huawei.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: linux-nvme@lists.infradead.org
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 include/linux/interrupt.h |  6 ++++++
 kernel/irq/handle.c       |  6 +++++-
 kernel/irq/manage.c       | 12 ++++++++++++
 3 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 5b8328a99b2a..19fdab18e679 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -61,6 +61,11 @@
  *                interrupt handler after suspending interrupts. For system
  *                wakeup devices users need to implement wakeup detection in
  *                their interrupt handlers.
+ * IRQF_RESCUE_THREAD - Interrupt will be handled in thread context in case
+ *                that irq flood is triggered. Can't afford to always handle
+ *                irq in thread context becasue IO performance drops much by
+ *                the extra wakeup latency. So one backup thread is created
+ *                for avoiding irq flood which may cause CPU lockup.
  */
 #define IRQF_SHARED		0x00000080
 #define IRQF_PROBE_SHARED	0x00000100
@@ -74,6 +79,7 @@
 #define IRQF_NO_THREAD		0x00010000
 #define IRQF_EARLY_RESUME	0x00020000
 #define IRQF_COND_SUSPEND	0x00040000
+#define IRQF_RESCUE_THREAD	0x00080000
 
 #define IRQF_TIMER		(__IRQF_TIMER | IRQF_NO_SUSPEND | IRQF_NO_THREAD)
 
diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c
index a4ace611f47f..8e5312c35483 100644
--- a/kernel/irq/handle.c
+++ b/kernel/irq/handle.c
@@ -146,7 +146,11 @@ irqreturn_t __handle_irq_event_percpu(struct irq_desc *desc, unsigned int *flags
 		irqreturn_t res;
 
 		trace_irq_handler_entry(irq, action);
-		res = action->handler(irq, action->dev_id);
+		if ((action->flags & IRQF_RESCUE_THREAD) &&
+				irq_flood_detected())
+			res = IRQ_WAKE_THREAD;
+		else
+			res = action->handler(irq, action->dev_id);
 		trace_irq_handler_exit(irq, action, res);
 
 		if (WARN_ONCE(!irqs_disabled(),"irq %u handler %pS enabled interrupts\n",
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index e8f7f179bf77..1566abbf50e8 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1082,6 +1082,8 @@ static int irq_thread(void *data)
 	if (force_irqthreads && test_bit(IRQTF_FORCED_THREAD,
 					&action->thread_flags))
 		handler_fn = irq_forced_thread_fn;
+	else if (action->flags & IRQF_RESCUE_THREAD)
+		handler_fn = irq_forced_thread_fn;
 	else
 		handler_fn = irq_thread_fn;
 
@@ -2011,6 +2013,16 @@ int request_threaded_irq(unsigned int irq, irq_handler_t handler,
 		handler = irq_default_primary_handler;
 	}
 
+	if (irqflags & IRQF_RESCUE_THREAD) {
+		if (irqflags & IRQF_NO_THREAD)
+			return -EINVAL;
+		if (thread_fn)
+			return -EINVAL;
+		if (handler == irq_default_primary_handler)
+			return -EINVAL;
+		thread_fn = handler;
+	}
+
 	action = kzalloc(sizeof(struct irqaction), GFP_KERNEL);
 	if (!action)
 		return -ENOMEM;
-- 
2.20.1


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 3/4] nvme: pci: pass IRQF_RESCURE_THREAD to request_threaded_irq
  2019-08-27  8:53 [PATCH 0/4] genirq/nvme: add IRQF_RESCUE_THREAD for avoiding IRQ flood Ming Lei
  2019-08-27  8:53 ` [PATCH 1/4] softirq: implement IRQ flood detection mechanism Ming Lei
  2019-08-27  8:53 ` [PATCH 2/4] genirq: add IRQF_RESCUE_THREAD Ming Lei
@ 2019-08-27  8:53 ` Ming Lei
  2019-08-27  9:06   ` Johannes Thumshirn
  2019-08-27 15:10   ` Bart Van Assche
  2019-08-27  8:53 ` [PATCH 4/4] genirq: use irq's affinity for threaded irq with IRQF_RESCUE_THREAD Ming Lei
  3 siblings, 2 replies; 64+ messages in thread
From: Ming Lei @ 2019-08-27  8:53 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Ming Lei, Long Li, Ingo Molnar, Peter Zijlstra,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	John Garry, Hannes Reinecke, linux-nvme, linux-scsi

If one vector is spread on several CPUs, usually the interrupt is only
handled on one of these CPUs. Meantime, IO can be issued to the single
hw queue from different CPUs concurrently, this way is easy to cause
IRQ flood and CPU lockup.

Pass IRQF_RESCURE_THREAD in above case for asking genirq to handle
interrupt in the rescurd thread when irq flood is detected.

Cc: Long Li <longli@microsoft.com>
Cc: Ingo Molnar <mingo@redhat.com>,
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: John Garry <john.garry@huawei.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: linux-nvme@lists.infradead.org
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 45a80b708ef4..0b8d49470230 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1501,8 +1501,21 @@ static int queue_request_irq(struct nvme_queue *nvmeq)
 		return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq_check,
 				nvme_irq, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
 	} else {
-		return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq,
-				NULL, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
+		char *devname;
+		const struct cpumask *mask;
+		unsigned long irqflags = IRQF_SHARED;
+		int vector = pci_irq_vector(pdev, nvmeq->cq_vector);
+
+		devname = kasprintf(GFP_KERNEL, "nvme%dq%d", nr, nvmeq->qid);
+		if (!devname)
+			return -ENOMEM;
+
+		mask = pci_irq_get_affinity(pdev, nvmeq->cq_vector);
+		if (mask && cpumask_weight(mask) > 1)
+			irqflags |= IRQF_RESCUE_THREAD;
+
+		return request_threaded_irq(vector, nvme_irq, NULL, irqflags,
+				devname, nvmeq);
 	}
 }
 
-- 
2.20.1


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 4/4] genirq: use irq's affinity for threaded irq with IRQF_RESCUE_THREAD
  2019-08-27  8:53 [PATCH 0/4] genirq/nvme: add IRQF_RESCUE_THREAD for avoiding IRQ flood Ming Lei
                   ` (2 preceding siblings ...)
  2019-08-27  8:53 ` [PATCH 3/4] nvme: pci: pass IRQF_RESCURE_THREAD to request_threaded_irq Ming Lei
@ 2019-08-27  8:53 ` Ming Lei
  2019-08-27 14:35   ` Keith Busch
  2019-09-06  8:50   ` John Garry
  3 siblings, 2 replies; 64+ messages in thread
From: Ming Lei @ 2019-08-27  8:53 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Ming Lei, Long Li, Ingo Molnar, Peter Zijlstra,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	John Garry, Hannes Reinecke, linux-nvme, linux-scsi

In case of IRQF_RESCUE_THREAD, the threaded handler is only used to
handle interrupt when IRQ flood comes, use irq's affinity for this thread
so that scheduler may select other not too busy CPUs for handling the
interrupt.

Cc: Long Li <longli@microsoft.com>
Cc: Ingo Molnar <mingo@redhat.com>,
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: John Garry <john.garry@huawei.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: linux-nvme@lists.infradead.org
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 kernel/irq/manage.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 1566abbf50e8..03bc041348b7 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -968,7 +968,18 @@ irq_thread_check_affinity(struct irq_desc *desc, struct irqaction *action)
 	if (cpumask_available(desc->irq_common_data.affinity)) {
 		const struct cpumask *m;
 
-		m = irq_data_get_effective_affinity_mask(&desc->irq_data);
+		/*
+		 * Managed IRQ's affinity is setup gracefull on MUNA locality,
+		 * also if IRQF_RESCUE_THREAD is set, interrupt flood has been
+		 * triggered, so ask scheduler to run the thread on CPUs
+		 * specified by this interrupt's affinity.
+		 */
+		if ((action->flags & IRQF_RESCUE_THREAD) &&
+				irqd_affinity_is_managed(&desc->irq_data))
+			m = desc->irq_common_data.affinity;
+		else
+			m = irq_data_get_effective_affinity_mask(
+					&desc->irq_data);
 		cpumask_copy(mask, m);
 	} else {
 		valid = false;
-- 
2.20.1


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 3/4] nvme: pci: pass IRQF_RESCURE_THREAD to request_threaded_irq
  2019-08-27  8:53 ` [PATCH 3/4] nvme: pci: pass IRQF_RESCURE_THREAD to request_threaded_irq Ming Lei
@ 2019-08-27  9:06   ` Johannes Thumshirn
  2019-08-27  9:09     ` Ming Lei
  2019-08-27 15:10   ` Bart Van Assche
  1 sibling, 1 reply; 64+ messages in thread
From: Johannes Thumshirn @ 2019-08-27  9:06 UTC (permalink / raw)
  To: Ming Lei, Thomas Gleixner
  Cc: linux-kernel, Long Li, Ingo Molnar, Peter Zijlstra, Keith Busch,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, John Garry,
	Hannes Reinecke, linux-nvme, linux-scsi

On 27/08/2019 10:53, Ming Lei wrote:
[...]
> +		char *devname;
> +		const struct cpumask *mask;
> +		unsigned long irqflags = IRQF_SHARED;
> +		int vector = pci_irq_vector(pdev, nvmeq->cq_vector);
> +
> +		devname = kasprintf(GFP_KERNEL, "nvme%dq%d", nr, nvmeq->qid);
> +		if (!devname)
> +			return -ENOMEM;
> +
> +		mask = pci_irq_get_affinity(pdev, nvmeq->cq_vector);
> +		if (mask && cpumask_weight(mask) > 1)
> +			irqflags |= IRQF_RESCUE_THREAD;
> +
> +		return request_threaded_irq(vector, nvme_irq, NULL, irqflags,
> +				devname, nvmeq);

This will leak 'devname' which gets allocated by kasprintf() a few lines
above.

-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
(HRB 247165, AG München)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 3/4] nvme: pci: pass IRQF_RESCURE_THREAD to request_threaded_irq
  2019-08-27  9:06   ` Johannes Thumshirn
@ 2019-08-27  9:09     ` Ming Lei
  2019-08-27  9:12       ` Johannes Thumshirn
  2019-08-27 14:34       ` Keith Busch
  0 siblings, 2 replies; 64+ messages in thread
From: Ming Lei @ 2019-08-27  9:09 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Thomas Gleixner, linux-kernel, Long Li, Ingo Molnar,
	Peter Zijlstra, Keith Busch, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, John Garry, Hannes Reinecke, linux-nvme,
	linux-scsi

On Tue, Aug 27, 2019 at 11:06:20AM +0200, Johannes Thumshirn wrote:
> On 27/08/2019 10:53, Ming Lei wrote:
> [...]
> > +		char *devname;
> > +		const struct cpumask *mask;
> > +		unsigned long irqflags = IRQF_SHARED;
> > +		int vector = pci_irq_vector(pdev, nvmeq->cq_vector);
> > +
> > +		devname = kasprintf(GFP_KERNEL, "nvme%dq%d", nr, nvmeq->qid);
> > +		if (!devname)
> > +			return -ENOMEM;
> > +
> > +		mask = pci_irq_get_affinity(pdev, nvmeq->cq_vector);
> > +		if (mask && cpumask_weight(mask) > 1)
> > +			irqflags |= IRQF_RESCUE_THREAD;
> > +
> > +		return request_threaded_irq(vector, nvme_irq, NULL, irqflags,
> > +				devname, nvmeq);
> 
> This will leak 'devname' which gets allocated by kasprintf() a few lines
> above.

It won't, please see pci_free_irq() in which free_irq() returns the
'devname' passed in.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 3/4] nvme: pci: pass IRQF_RESCURE_THREAD to request_threaded_irq
  2019-08-27  9:09     ` Ming Lei
@ 2019-08-27  9:12       ` Johannes Thumshirn
  2019-08-27 14:34       ` Keith Busch
  1 sibling, 0 replies; 64+ messages in thread
From: Johannes Thumshirn @ 2019-08-27  9:12 UTC (permalink / raw)
  To: Ming Lei
  Cc: Thomas Gleixner, linux-kernel, Long Li, Ingo Molnar,
	Peter Zijlstra, Keith Busch, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, John Garry, Hannes Reinecke, linux-nvme,
	linux-scsi

On 27/08/2019 11:09, Ming Lei wrote:
[...]
> 
> It won't, please see pci_free_irq() in which free_irq() returns the
> 'devname' passed in.

Ah ok, missed that out.

-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
(HRB 247165, AG München)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 3/4] nvme: pci: pass IRQF_RESCURE_THREAD to request_threaded_irq
  2019-08-27  9:09     ` Ming Lei
  2019-08-27  9:12       ` Johannes Thumshirn
@ 2019-08-27 14:34       ` Keith Busch
  2019-08-27 14:44         ` Keith Busch
  1 sibling, 1 reply; 64+ messages in thread
From: Keith Busch @ 2019-08-27 14:34 UTC (permalink / raw)
  To: Ming Lei
  Cc: Johannes Thumshirn, Thomas Gleixner, linux-kernel, Long Li,
	Ingo Molnar, Peter Zijlstra, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, John Garry, Hannes Reinecke,
	linux-nvme, linux-scsi

On Tue, Aug 27, 2019 at 05:09:27PM +0800, Ming Lei wrote:
> On Tue, Aug 27, 2019 at 11:06:20AM +0200, Johannes Thumshirn wrote:
> > On 27/08/2019 10:53, Ming Lei wrote:
> > [...]
> > > +		char *devname;
> > > +		const struct cpumask *mask;
> > > +		unsigned long irqflags = IRQF_SHARED;
> > > +		int vector = pci_irq_vector(pdev, nvmeq->cq_vector);
> > > +
> > > +		devname = kasprintf(GFP_KERNEL, "nvme%dq%d", nr, nvmeq->qid);
> > > +		if (!devname)
> > > +			return -ENOMEM;
> > > +
> > > +		mask = pci_irq_get_affinity(pdev, nvmeq->cq_vector);
> > > +		if (mask && cpumask_weight(mask) > 1)
> > > +			irqflags |= IRQF_RESCUE_THREAD;
> > > +
> > > +		return request_threaded_irq(vector, nvme_irq, NULL, irqflags,
> > > +				devname, nvmeq);
> > 
> > This will leak 'devname' which gets allocated by kasprintf() a few lines
> > above.
> 
> It won't, please see pci_free_irq() in which free_irq() returns the
> 'devname' passed in.

Only if request_threaded_irq() doesn't return an error.

I think you should probably just have pci_irq_get_affinity() take a flags
argument, or make a new function like __pci_irq_get_affinity() that
pci_irq_get_affinity() can call with a default flag.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 4/4] genirq: use irq's affinity for threaded irq with IRQF_RESCUE_THREAD
  2019-08-27  8:53 ` [PATCH 4/4] genirq: use irq's affinity for threaded irq with IRQF_RESCUE_THREAD Ming Lei
@ 2019-08-27 14:35   ` Keith Busch
  2019-09-06  8:50   ` John Garry
  1 sibling, 0 replies; 64+ messages in thread
From: Keith Busch @ 2019-08-27 14:35 UTC (permalink / raw)
  To: Ming Lei
  Cc: Thomas Gleixner, linux-kernel, Long Li, Ingo Molnar,
	Peter Zijlstra, Keith Busch, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, John Garry, Hannes Reinecke, linux-nvme,
	linux-scsi

On Tue, Aug 27, 2019 at 04:53:44PM +0800, Ming Lei wrote:
> In case of IRQF_RESCUE_THREAD, the threaded handler is only used to
> handle interrupt when IRQ flood comes, use irq's affinity for this thread
> so that scheduler may select other not too busy CPUs for handling the
> interrupt.
> 
> Cc: Long Li <longli@microsoft.com>
> Cc: Ingo Molnar <mingo@redhat.com>,
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Keith Busch <keith.busch@intel.com>
> Cc: Jens Axboe <axboe@fb.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Sagi Grimberg <sagi@grimberg.me>
> Cc: John Garry <john.garry@huawei.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: linux-nvme@lists.infradead.org
> Cc: linux-scsi@vger.kernel.org
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  kernel/irq/manage.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index 1566abbf50e8..03bc041348b7 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -968,7 +968,18 @@ irq_thread_check_affinity(struct irq_desc *desc, struct irqaction *action)
>  	if (cpumask_available(desc->irq_common_data.affinity)) {
>  		const struct cpumask *m;
>  
> -		m = irq_data_get_effective_affinity_mask(&desc->irq_data);
> +		/*
> +		 * Managed IRQ's affinity is setup gracefull on MUNA locality,

s/MUNA/NUMA

> +		 * also if IRQF_RESCUE_THREAD is set, interrupt flood has been
> +		 * triggered, so ask scheduler to run the thread on CPUs
> +		 * specified by this interrupt's affinity.
> +		 */
> +		if ((action->flags & IRQF_RESCUE_THREAD) &&
> +				irqd_affinity_is_managed(&desc->irq_data))
> +			m = desc->irq_common_data.affinity;
> +		else
> +			m = irq_data_get_effective_affinity_mask(
> +					&desc->irq_data);
>  		cpumask_copy(mask, m);
>  	} else {
>  		valid = false;
> -- 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-08-27  8:53 ` [PATCH 1/4] softirq: implement IRQ flood detection mechanism Ming Lei
@ 2019-08-27 14:42   ` Thomas Gleixner
  2019-08-27 16:19     ` Thomas Gleixner
  2019-08-27 22:58     ` Ming Lei
  2019-08-29  6:15   ` Long Li
  1 sibling, 2 replies; 64+ messages in thread
From: Thomas Gleixner @ 2019-08-27 14:42 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-kernel, Long Li, Ingo Molnar, Peter Zijlstra, Keith Busch,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, John Garry,
	Hannes Reinecke, linux-nvme, linux-scsi

On Tue, 27 Aug 2019, Ming Lei wrote:
> +/*
> + * Update average irq interval with the Exponential Weighted Moving
> + * Average(EWMA)
> + */
> +static void irq_update_interval(void)
> +{
> +#define IRQ_INTERVAL_EWMA_WEIGHT	128
> +#define IRQ_INTERVAL_EWMA_PREV_FACTOR	127
> +#define IRQ_INTERVAL_EWMA_CURR_FACTOR	(IRQ_INTERVAL_EWMA_WEIGHT - \
> +		IRQ_INTERVAL_EWMA_PREV_FACTOR)

Please do not stick defines into a function body. That's horrible.

> +
> +	int cpu = raw_smp_processor_id();
> +	struct irq_interval *inter = per_cpu_ptr(&avg_irq_interval, cpu);
> +	u64 delta = sched_clock_cpu(cpu) - inter->last_irq_end;

Why are you doing that raw_smp_processor_id() dance? The call site has
interrupts and preemption disabled.

Also how is that supposed to work when sched_clock is jiffies based?

> +	inter->avg = (inter->avg * IRQ_INTERVAL_EWMA_PREV_FACTOR +
> +		delta * IRQ_INTERVAL_EWMA_CURR_FACTOR) /
> +		IRQ_INTERVAL_EWMA_WEIGHT;

We definitely are not going to have a 64bit multiplication and division on
every interrupt. Asided of that this breaks 32bit builds all over the place.

Thanks,

	tglx





^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 3/4] nvme: pci: pass IRQF_RESCURE_THREAD to request_threaded_irq
  2019-08-27 14:34       ` Keith Busch
@ 2019-08-27 14:44         ` Keith Busch
  0 siblings, 0 replies; 64+ messages in thread
From: Keith Busch @ 2019-08-27 14:44 UTC (permalink / raw)
  To: Ming Lei
  Cc: Johannes Thumshirn, Thomas Gleixner, linux-kernel, Long Li,
	Ingo Molnar, Peter Zijlstra, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, John Garry, Hannes Reinecke,
	linux-nvme, linux-scsi

On Tue, Aug 27, 2019 at 08:34:21AM -0600, Keith Busch wrote:
> I think you should probably just have pci_irq_get_affinity() take a flags
> argument, or make a new function like __pci_irq_get_affinity() that
> pci_irq_get_affinity() can call with a default flag.

Sorry, copied the wrong function name; I meant pci_request_irq(),
not pci_irq_get_affinity().

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 3/4] nvme: pci: pass IRQF_RESCURE_THREAD to request_threaded_irq
  2019-08-27  8:53 ` [PATCH 3/4] nvme: pci: pass IRQF_RESCURE_THREAD to request_threaded_irq Ming Lei
  2019-08-27  9:06   ` Johannes Thumshirn
@ 2019-08-27 15:10   ` Bart Van Assche
  2019-08-28  1:45     ` Ming Lei
  1 sibling, 1 reply; 64+ messages in thread
From: Bart Van Assche @ 2019-08-27 15:10 UTC (permalink / raw)
  To: Ming Lei, Thomas Gleixner
  Cc: Jens Axboe, Hannes Reinecke, Sagi Grimberg, linux-scsi,
	Peter Zijlstra, Long Li, John Garry, linux-kernel, linux-nvme,
	Keith Busch, Ingo Molnar, Christoph Hellwig

On 8/27/19 1:53 AM, Ming Lei wrote:
> If one vector is spread on several CPUs, usually the interrupt is only
> handled on one of these CPUs.

Is that perhaps a limitation of x86 interrupt handling hardware? See 
also the description of physical and logical destination mode of the 
local APIC in the Intel documentation.

Does that limitation also apply to other platforms than x86?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-08-27 14:42   ` Thomas Gleixner
@ 2019-08-27 16:19     ` Thomas Gleixner
  2019-08-27 23:04       ` Ming Lei
  2019-08-27 22:58     ` Ming Lei
  1 sibling, 1 reply; 64+ messages in thread
From: Thomas Gleixner @ 2019-08-27 16:19 UTC (permalink / raw)
  To: Ming Lei
  Cc: LKML, Long Li, Ingo Molnar, Peter Zijlstra, Keith Busch,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, John Garry,
	Hannes Reinecke, linux-nvme, linux-scsi, Daniel Lezcano

On Tue, 27 Aug 2019, Thomas Gleixner wrote:
> On Tue, 27 Aug 2019, Ming Lei wrote:
> > +/*
> > + * Update average irq interval with the Exponential Weighted Moving
> > + * Average(EWMA)
> > + */
> > +static void irq_update_interval(void)
> > +{
> > +#define IRQ_INTERVAL_EWMA_WEIGHT	128
> > +#define IRQ_INTERVAL_EWMA_PREV_FACTOR	127
> > +#define IRQ_INTERVAL_EWMA_CURR_FACTOR	(IRQ_INTERVAL_EWMA_WEIGHT - \
> > +		IRQ_INTERVAL_EWMA_PREV_FACTOR)
> 
> Please do not stick defines into a function body. That's horrible.
> 
> > +
> > +	int cpu = raw_smp_processor_id();
> > +	struct irq_interval *inter = per_cpu_ptr(&avg_irq_interval, cpu);
> > +	u64 delta = sched_clock_cpu(cpu) - inter->last_irq_end;
> 
> Why are you doing that raw_smp_processor_id() dance? The call site has
> interrupts and preemption disabled.
> 
> Also how is that supposed to work when sched_clock is jiffies based?
> 
> > +	inter->avg = (inter->avg * IRQ_INTERVAL_EWMA_PREV_FACTOR +
> > +		delta * IRQ_INTERVAL_EWMA_CURR_FACTOR) /
> > +		IRQ_INTERVAL_EWMA_WEIGHT;
> 
> We definitely are not going to have a 64bit multiplication and division on
> every interrupt. Asided of that this breaks 32bit builds all over the place.

That said, we already have infrastructure for something like this in the
core interrupt code which is optimized to be lightweight in the fast path.

kernel/irq/timings.c

Talk to Daniel Lezcano (Cc'ed) how you can (ab)use that.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-08-27 14:42   ` Thomas Gleixner
  2019-08-27 16:19     ` Thomas Gleixner
@ 2019-08-27 22:58     ` Ming Lei
  2019-08-27 23:09       ` Thomas Gleixner
  1 sibling, 1 reply; 64+ messages in thread
From: Ming Lei @ 2019-08-27 22:58 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Long Li, Ingo Molnar, Peter Zijlstra, Keith Busch,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, John Garry,
	Hannes Reinecke, linux-nvme, linux-scsi

On Tue, Aug 27, 2019 at 04:42:02PM +0200, Thomas Gleixner wrote:
> On Tue, 27 Aug 2019, Ming Lei wrote:
> > +/*
> > + * Update average irq interval with the Exponential Weighted Moving
> > + * Average(EWMA)
> > + */
> > +static void irq_update_interval(void)
> > +{
> > +#define IRQ_INTERVAL_EWMA_WEIGHT	128
> > +#define IRQ_INTERVAL_EWMA_PREV_FACTOR	127
> > +#define IRQ_INTERVAL_EWMA_CURR_FACTOR	(IRQ_INTERVAL_EWMA_WEIGHT - \
> > +		IRQ_INTERVAL_EWMA_PREV_FACTOR)
> 
> Please do not stick defines into a function body. That's horrible.

OK.

> 
> > +
> > +	int cpu = raw_smp_processor_id();
> > +	struct irq_interval *inter = per_cpu_ptr(&avg_irq_interval, cpu);
> > +	u64 delta = sched_clock_cpu(cpu) - inter->last_irq_end;
> 
> Why are you doing that raw_smp_processor_id() dance? The call site has
> interrupts and preemption disabled.

OK, will change to __smp_processor_id().

> 
> Also how is that supposed to work when sched_clock is jiffies based?

Good catch, looks ktime_get_ns() is needed.

> 
> > +	inter->avg = (inter->avg * IRQ_INTERVAL_EWMA_PREV_FACTOR +
> > +		delta * IRQ_INTERVAL_EWMA_CURR_FACTOR) /
> > +		IRQ_INTERVAL_EWMA_WEIGHT;
> 
> We definitely are not going to have a 64bit multiplication and division on
> every interrupt. Asided of that this breaks 32bit builds all over the place.

I will convert the above into add/subtract/shift only.

thanks,
Ming

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-08-27 16:19     ` Thomas Gleixner
@ 2019-08-27 23:04       ` Ming Lei
  2019-08-27 23:12         ` Thomas Gleixner
  0 siblings, 1 reply; 64+ messages in thread
From: Ming Lei @ 2019-08-27 23:04 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Long Li, Ingo Molnar, Peter Zijlstra, Keith Busch,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, John Garry,
	Hannes Reinecke, linux-nvme, linux-scsi, Daniel Lezcano

On Tue, Aug 27, 2019 at 06:19:00PM +0200, Thomas Gleixner wrote:
> On Tue, 27 Aug 2019, Thomas Gleixner wrote:
> > On Tue, 27 Aug 2019, Ming Lei wrote:
> > > +/*
> > > + * Update average irq interval with the Exponential Weighted Moving
> > > + * Average(EWMA)
> > > + */
> > > +static void irq_update_interval(void)
> > > +{
> > > +#define IRQ_INTERVAL_EWMA_WEIGHT	128
> > > +#define IRQ_INTERVAL_EWMA_PREV_FACTOR	127
> > > +#define IRQ_INTERVAL_EWMA_CURR_FACTOR	(IRQ_INTERVAL_EWMA_WEIGHT - \
> > > +		IRQ_INTERVAL_EWMA_PREV_FACTOR)
> > 
> > Please do not stick defines into a function body. That's horrible.
> > 
> > > +
> > > +	int cpu = raw_smp_processor_id();
> > > +	struct irq_interval *inter = per_cpu_ptr(&avg_irq_interval, cpu);
> > > +	u64 delta = sched_clock_cpu(cpu) - inter->last_irq_end;
> > 
> > Why are you doing that raw_smp_processor_id() dance? The call site has
> > interrupts and preemption disabled.
> > 
> > Also how is that supposed to work when sched_clock is jiffies based?
> > 
> > > +	inter->avg = (inter->avg * IRQ_INTERVAL_EWMA_PREV_FACTOR +
> > > +		delta * IRQ_INTERVAL_EWMA_CURR_FACTOR) /
> > > +		IRQ_INTERVAL_EWMA_WEIGHT;
> > 
> > We definitely are not going to have a 64bit multiplication and division on
> > every interrupt. Asided of that this breaks 32bit builds all over the place.
> 
> That said, we already have infrastructure for something like this in the
> core interrupt code which is optimized to be lightweight in the fast path.
> 
> kernel/irq/timings.c

irq/timings.c is much more complicated, especially it applies EWMA to
compute each irq's average interval. However, we only need it for
do_IRQ() to cover all interrupt & softirq handler.

Also CONFIG_IRQ_TIMINGS is usually disabled on distributions.

thanks,
Ming

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-08-27 22:58     ` Ming Lei
@ 2019-08-27 23:09       ` Thomas Gleixner
  2019-08-28 11:06         ` Ming Lei
  0 siblings, 1 reply; 64+ messages in thread
From: Thomas Gleixner @ 2019-08-27 23:09 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-kernel, Long Li, Ingo Molnar, Peter Zijlstra, Keith Busch,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, John Garry,
	Hannes Reinecke, linux-nvme, linux-scsi

On Wed, 28 Aug 2019, Ming Lei wrote:
> On Tue, Aug 27, 2019 at 04:42:02PM +0200, Thomas Gleixner wrote:
> > On Tue, 27 Aug 2019, Ming Lei wrote:
> > > +
> > > +	int cpu = raw_smp_processor_id();
> > > +	struct irq_interval *inter = per_cpu_ptr(&avg_irq_interval, cpu);
> > > +	u64 delta = sched_clock_cpu(cpu) - inter->last_irq_end;
> > 
> > Why are you doing that raw_smp_processor_id() dance? The call site has
> > interrupts and preemption disabled.
> 
> OK, will change to __smp_processor_id().

You do not need smp_processor_id() as all.

> > Also how is that supposed to work when sched_clock is jiffies based?
> 
> Good catch, looks ktime_get_ns() is needed.

And what is ktime_get_ns() returning when the only available clocksource is
jiffies?

> > 
> > > +	inter->avg = (inter->avg * IRQ_INTERVAL_EWMA_PREV_FACTOR +
> > > +		delta * IRQ_INTERVAL_EWMA_CURR_FACTOR) /
> > > +		IRQ_INTERVAL_EWMA_WEIGHT;
> > 
> > We definitely are not going to have a 64bit multiplication and division on
> > every interrupt. Asided of that this breaks 32bit builds all over the place.
> 
> I will convert the above into add/subtract/shift only.

No. Talk to Daniel. There is not going to be yet another ad hoc thingy just
to make block happy.

And aside of that why does block not have a "NAPI" mode which was
explicitely designed to avoid all that?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-08-27 23:04       ` Ming Lei
@ 2019-08-27 23:12         ` Thomas Gleixner
  0 siblings, 0 replies; 64+ messages in thread
From: Thomas Gleixner @ 2019-08-27 23:12 UTC (permalink / raw)
  To: Ming Lei
  Cc: LKML, Long Li, Ingo Molnar, Peter Zijlstra, Keith Busch,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, John Garry,
	Hannes Reinecke, linux-nvme, linux-scsi, Daniel Lezcano

On Wed, 28 Aug 2019, Ming Lei wrote:
> On Tue, Aug 27, 2019 at 06:19:00PM +0200, Thomas Gleixner wrote:
> > > We definitely are not going to have a 64bit multiplication and division on
> > > every interrupt. Asided of that this breaks 32bit builds all over the place.
> > 
> > That said, we already have infrastructure for something like this in the
> > core interrupt code which is optimized to be lightweight in the fast path.
> > 
> > kernel/irq/timings.c
> 
> irq/timings.c is much more complicated, especially it applies EWMA to
> compute each irq's average interval. However, we only need it for
> do_IRQ() to cover all interrupt & softirq handler.

That does not justify yet another ad hoc thingy.

> Also CONFIG_IRQ_TIMINGS is usually disabled on distributions.

That's not an argument at all.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 3/4] nvme: pci: pass IRQF_RESCURE_THREAD to request_threaded_irq
  2019-08-27 15:10   ` Bart Van Assche
@ 2019-08-28  1:45     ` Ming Lei
  0 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2019-08-28  1:45 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Thomas Gleixner, Jens Axboe, Hannes Reinecke, Sagi Grimberg,
	linux-scsi, Peter Zijlstra, Long Li, John Garry, linux-kernel,
	linux-nvme, Keith Busch, Ingo Molnar, Christoph Hellwig

On Tue, Aug 27, 2019 at 08:10:42AM -0700, Bart Van Assche wrote:
> On 8/27/19 1:53 AM, Ming Lei wrote:
> > If one vector is spread on several CPUs, usually the interrupt is only
> > handled on one of these CPUs.
> 
> Is that perhaps a limitation of x86 interrupt handling hardware? See also
> the description of physical and logical destination mode of the local APIC
> in the Intel documentation.
> 
> Does that limitation also apply to other platforms than x86?

Please see the following excellent explanation from Thomas.

	https://lkml.org/lkml/2018/4/4/734

Especially the following words:

	So at some point we ripped out the multi target support on X86 and moved
	everything to single target delivery mode.
	
	Other architectures never supported multi target delivery either due to
	hardware restrictions or for similar reasons why X86 dropped it. There
	might be a few architectures which support it, but I have no overview at
	the moment.


thanks,
Ming

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-08-27 23:09       ` Thomas Gleixner
@ 2019-08-28 11:06         ` Ming Lei
  2019-08-28 11:23           ` Thomas Gleixner
  0 siblings, 1 reply; 64+ messages in thread
From: Ming Lei @ 2019-08-28 11:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Long Li, Ingo Molnar, Peter Zijlstra, Keith Busch,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, John Garry,
	Hannes Reinecke, linux-nvme, linux-scsi

On Wed, Aug 28, 2019 at 01:09:44AM +0200, Thomas Gleixner wrote:
> On Wed, 28 Aug 2019, Ming Lei wrote:
> > On Tue, Aug 27, 2019 at 04:42:02PM +0200, Thomas Gleixner wrote:
> > > On Tue, 27 Aug 2019, Ming Lei wrote:
> > > > +
> > > > +	int cpu = raw_smp_processor_id();
> > > > +	struct irq_interval *inter = per_cpu_ptr(&avg_irq_interval, cpu);
> > > > +	u64 delta = sched_clock_cpu(cpu) - inter->last_irq_end;
> > > 
> > > Why are you doing that raw_smp_processor_id() dance? The call site has
> > > interrupts and preemption disabled.
> > 
> > OK, will change to __smp_processor_id().
> 
> You do not need smp_processor_id() as all.

OK.

> 
> > > Also how is that supposed to work when sched_clock is jiffies based?
> > 
> > Good catch, looks ktime_get_ns() is needed.
> 
> And what is ktime_get_ns() returning when the only available clocksource is
> jiffies?

IMO, it isn't one issue. If the only clocksource is jiffies, we needn't to
expect high IO performance. Then it is fine to always handle the irq in
interrupt context or thread context.

However, if it can be recognized runtime, irq_flood_detected() can
always return true or false.

> 
> > > 
> > > > +	inter->avg = (inter->avg * IRQ_INTERVAL_EWMA_PREV_FACTOR +
> > > > +		delta * IRQ_INTERVAL_EWMA_CURR_FACTOR) /
> > > > +		IRQ_INTERVAL_EWMA_WEIGHT;
> > > 
> > > We definitely are not going to have a 64bit multiplication and division on
> > > every interrupt. Asided of that this breaks 32bit builds all over the place.
> > 
> > I will convert the above into add/subtract/shift only.
> 
> No. Talk to Daniel. There is not going to be yet another ad hoc thingy just
> to make block happy.

I just take a first glance at the code of 'interrupt timing', and its
motivation is to predicate of the next occurrence of interested interrupt
for use cases of PM, such as predicate wakeup time.

And predication could be one much more difficult thing, and its implementation
requires to record more info: such as irq number, recent multiple irq timestamps,
that means its overhead is a bit high. Meantime IRQS_TIMINGS should only be set
on interested interrupt(s).

Yeah, irq timing uses the Exponential Weighted Moving Average(EWMA) for computing
average irq interval for each interrupt.

So either motivation or approaches taken are totally different between
irq timing and irq flood detection.

Daniel, correct me if I am wrong.

For the case of detecting IRQ flood, we only need to sum the average
interval of any do_IRQ() on each CPU, and the overhead is quite low since
just two read & write on one percpu varible is required. We don't
care what the exact irq number is. However, we have to account time
taken by softirq handler, which can't be covered by interrupt timing.

Also it is quite simple to use EWMA to compute average interrupt
interval, however we can't reuse irq timing's result which is done on
each irq. IRQ flood detection simply computes the average interval
of any do_IRQ() on each CPU for covering handling interrupt and softirq.

> 
> And aside of that why does block not have a "NAPI" mode which was
> explicitely designed to avoid all that?

Block layer knows nothing about interrupt, even don't know which context
is run for completing IO request, which is decided by driver, could
be interrupt context, softirq context, or process context.

Also it is hard for block layer to figure out when IRQ flood is triggered.
CPU is shared resource, all kinds of interrupt sources may contribute
some for IRQ flood. That is why this patch implements the detection
mechanism in genirq/softirq code.

In theory, this patch provides one simple generic mechanism for avoiding
IRQ flood/CPU lockup, which could be used for any devices, not only high
performance storage.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-08-28 11:06         ` Ming Lei
@ 2019-08-28 11:23           ` Thomas Gleixner
  2019-08-28 13:50             ` Ming Lei
  0 siblings, 1 reply; 64+ messages in thread
From: Thomas Gleixner @ 2019-08-28 11:23 UTC (permalink / raw)
  To: Ming Lei
  Cc: LKML, Long Li, Ingo Molnar, Peter Zijlstra, Keith Busch,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, John Garry,
	Hannes Reinecke, linux-nvme, linux-scsi, Daniel Lezcano

On Wed, 28 Aug 2019, Ming Lei wrote:
> On Wed, Aug 28, 2019 at 01:09:44AM +0200, Thomas Gleixner wrote:
> > > > Also how is that supposed to work when sched_clock is jiffies based?
> > > 
> > > Good catch, looks ktime_get_ns() is needed.
> > 
> > And what is ktime_get_ns() returning when the only available clocksource is
> > jiffies?
> 
> IMO, it isn't one issue. If the only clocksource is jiffies, we needn't to
> expect high IO performance. Then it is fine to always handle the irq in
> interrupt context or thread context.
> 
> However, if it can be recognized runtime, irq_flood_detected() can
> always return true or false.

Right. The clocksource is determined at runtime. And if there is no high
resolution clocksource then that function will return crap.
 
> > No. Talk to Daniel. There is not going to be yet another ad hoc thingy just
> > to make block happy.
> 
> I just take a first glance at the code of 'interrupt timing', and its
> motivation is to predicate of the next occurrence of interested interrupt
> for use cases of PM, such as predicate wakeup time.
> 
> And predication could be one much more difficult thing, and its implementation
> requires to record more info: such as irq number, recent multiple irq timestamps,
> that means its overhead is a bit high. Meantime IRQS_TIMINGS should only be set
> on interested interrupt(s).

Well, yes. But it's trivial enough to utilize parts of it for your
purposes.

> Daniel, correct me if I am wrong.

Daniel could have replied, if you'd put him on Cc ....

> In theory, this patch provides one simple generic mechanism for avoiding
> IRQ flood/CPU lockup, which could be used for any devices, not only high
> performance storage.

Right, we can add a lot of things to the kernel which provide simple
generic mechanisms for special purposes and if all of them are enabled then
we are doing the same thing over and over.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-08-28 11:23           ` Thomas Gleixner
@ 2019-08-28 13:50             ` Ming Lei
  2019-08-28 14:07               ` Thomas Gleixner
  0 siblings, 1 reply; 64+ messages in thread
From: Ming Lei @ 2019-08-28 13:50 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Long Li, Ingo Molnar, Peter Zijlstra, Keith Busch,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, John Garry,
	Hannes Reinecke, linux-nvme, linux-scsi, Daniel Lezcano

On Wed, Aug 28, 2019 at 01:23:06PM +0200, Thomas Gleixner wrote:
> On Wed, 28 Aug 2019, Ming Lei wrote:
> > On Wed, Aug 28, 2019 at 01:09:44AM +0200, Thomas Gleixner wrote:
> > > > > Also how is that supposed to work when sched_clock is jiffies based?
> > > > 
> > > > Good catch, looks ktime_get_ns() is needed.
> > > 
> > > And what is ktime_get_ns() returning when the only available clocksource is
> > > jiffies?
> > 
> > IMO, it isn't one issue. If the only clocksource is jiffies, we needn't to
> > expect high IO performance. Then it is fine to always handle the irq in
> > interrupt context or thread context.
> > 
> > However, if it can be recognized runtime, irq_flood_detected() can
> > always return true or false.
> 
> Right. The clocksource is determined at runtime. And if there is no high
> resolution clocksource then that function will return crap.

This patch still works even though the only clocksource is jiffies.

>  
> > > No. Talk to Daniel. There is not going to be yet another ad hoc thingy just
> > > to make block happy.
> > 
> > I just take a first glance at the code of 'interrupt timing', and its
> > motivation is to predicate of the next occurrence of interested interrupt
> > for use cases of PM, such as predicate wakeup time.
> > 
> > And predication could be one much more difficult thing, and its implementation
> > requires to record more info: such as irq number, recent multiple irq timestamps,
> > that means its overhead is a bit high. Meantime IRQS_TIMINGS should only be set
> > on interested interrupt(s).
> 
> Well, yes. But it's trivial enough to utilize parts of it for your
> purposes.

From the code of kernel/irq/timing.c:

1) record_irq_time() only records the start time of one irq, and not
consider the time taken in interrupt handler, so we can't figure out
the real interval between two do_IRQ() on one CPU

2) irq/timing doesn't cover softirq

Daniel, could you take a look and see if irq flood detection can be
implemented easily by irq/timing.c?


thanks,
Ming

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-08-28 13:50             ` Ming Lei
@ 2019-08-28 14:07               ` Thomas Gleixner
  2019-09-03  3:30                 ` Ming Lei
  0 siblings, 1 reply; 64+ messages in thread
From: Thomas Gleixner @ 2019-08-28 14:07 UTC (permalink / raw)
  To: Ming Lei
  Cc: LKML, Long Li, Ingo Molnar, Peter Zijlstra, Keith Busch,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, John Garry,
	Hannes Reinecke, linux-nvme, linux-scsi, Daniel Lezcano

On Wed, 28 Aug 2019, Ming Lei wrote:
> On Wed, Aug 28, 2019 at 01:23:06PM +0200, Thomas Gleixner wrote:
> > On Wed, 28 Aug 2019, Ming Lei wrote:
> > > On Wed, Aug 28, 2019 at 01:09:44AM +0200, Thomas Gleixner wrote:
> > > > > > Also how is that supposed to work when sched_clock is jiffies based?
> > > > > 
> > > > > Good catch, looks ktime_get_ns() is needed.
> > > > 
> > > > And what is ktime_get_ns() returning when the only available clocksource is
> > > > jiffies?
> > > 
> > > IMO, it isn't one issue. If the only clocksource is jiffies, we needn't to
> > > expect high IO performance. Then it is fine to always handle the irq in
> > > interrupt context or thread context.
> > > 
> > > However, if it can be recognized runtime, irq_flood_detected() can
> > > always return true or false.
> > 
> > Right. The clocksource is determined at runtime. And if there is no high
> > resolution clocksource then that function will return crap.
> 
> This patch still works even though the only clocksource is jiffies.

Works by some definition of works, right?

> > Well, yes. But it's trivial enough to utilize parts of it for your
> > purposes.
> 
> >From the code of kernel/irq/timing.c:
> 
> 1) record_irq_time() only records the start time of one irq, and not
> consider the time taken in interrupt handler, so we can't figure out
> the real interval between two do_IRQ() on one CPU

I said utilize and that means that the infrastructure can be used and
extended. I did not say that it solves your problem, right?

> 2) irq/timing doesn't cover softirq

That's solvable, right?
 
> Daniel, could you take a look and see if irq flood detection can be
> implemented easily by irq/timing.c?

I assume you can take a look as well, right?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-08-27  8:53 ` [PATCH 1/4] softirq: implement IRQ flood detection mechanism Ming Lei
  2019-08-27 14:42   ` Thomas Gleixner
@ 2019-08-29  6:15   ` Long Li
  2019-08-30  0:55     ` Ming Lei
  1 sibling, 1 reply; 64+ messages in thread
From: Long Li @ 2019-08-29  6:15 UTC (permalink / raw)
  To: Ming Lei, Thomas Gleixner
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Keith Busch,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, John Garry,
	Hannes Reinecke, linux-nvme, linux-scsi

>>>For some high performance IO devices, interrupt may come very frequently,
>>>meantime IO request completion may take a bit time. Especially on some
>>>devices(SCSI or NVMe), IO requests can be submitted concurrently from
>>>multiple CPU cores, however IO completion is only done on one of these
>>>submission CPU cores.
>>>
>>>Then IRQ flood can be easily triggered, and CPU lockup.
>>>
>>>Implement one simple generic CPU IRQ flood detection mechanism. This
>>>mechanism uses the CPU average interrupt interval to evaluate if IRQ flood
>>>is triggered. The Exponential Weighted Moving Average(EWMA) is used to
>>>compute CPU average interrupt interval.
>>>
>>>Cc: Long Li <longli@microsoft.com>
>>>Cc: Ingo Molnar <mingo@redhat.com>,
>>>Cc: Peter Zijlstra <peterz@infradead.org>
>>>Cc: Keith Busch <keith.busch@intel.com>
>>>Cc: Jens Axboe <axboe@fb.com>
>>>Cc: Christoph Hellwig <hch@lst.de>
>>>Cc: Sagi Grimberg <sagi@grimberg.me>
>>>Cc: John Garry <john.garry@huawei.com>
>>>Cc: Thomas Gleixner <tglx@linutronix.de>
>>>Cc: Hannes Reinecke <hare@suse.com>
>>>Cc: linux-nvme@lists.infradead.org
>>>Cc: linux-scsi@vger.kernel.org
>>>Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>>---
>>> drivers/base/cpu.c      | 25 ++++++++++++++++++++++
>>> include/linux/hardirq.h |  2 ++
>>> kernel/softirq.c        | 46
>>>+++++++++++++++++++++++++++++++++++++++++
>>> 3 files changed, 73 insertions(+)
>>>
>>>diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c index
>>>cc37511de866..7277d1aa0906 100644
>>>--- a/drivers/base/cpu.c
>>>+++ b/drivers/base/cpu.c
>>>@@ -20,6 +20,7 @@
>>> #include <linux/tick.h>
>>> #include <linux/pm_qos.h>
>>> #include <linux/sched/isolation.h>
>>>+#include <linux/hardirq.h>
>>>
>>> #include "base.h"
>>>
>>>@@ -183,10 +184,33 @@ static struct attribute_group
>>>crash_note_cpu_attr_group = {  };  #endif
>>>
>>>+static ssize_t show_irq_interval(struct device *dev,
>>>+				 struct device_attribute *attr, char *buf) {
>>>+	struct cpu *cpu = container_of(dev, struct cpu, dev);
>>>+	ssize_t rc;
>>>+	int cpunum;
>>>+
>>>+	cpunum = cpu->dev.id;
>>>+
>>>+	rc = sprintf(buf, "%llu\n", irq_get_avg_interval(cpunum));
>>>+	return rc;
>>>+}
>>>+
>>>+static DEVICE_ATTR(irq_interval, 0400, show_irq_interval, NULL); static
>>>+struct attribute *irq_interval_cpu_attrs[] = {
>>>+	&dev_attr_irq_interval.attr,
>>>+	NULL
>>>+};
>>>+static struct attribute_group irq_interval_cpu_attr_group = {
>>>+	.attrs = irq_interval_cpu_attrs,
>>>+};
>>>+
>>> static const struct attribute_group *common_cpu_attr_groups[] = {  #ifdef
>>>CONFIG_KEXEC
>>> 	&crash_note_cpu_attr_group,
>>> #endif
>>>+	&irq_interval_cpu_attr_group,
>>> 	NULL
>>> };
>>>
>>>@@ -194,6 +218,7 @@ static const struct attribute_group
>>>*hotplugable_cpu_attr_groups[] = {  #ifdef CONFIG_KEXEC
>>> 	&crash_note_cpu_attr_group,
>>> #endif
>>>+	&irq_interval_cpu_attr_group,
>>> 	NULL
>>> };
>>>
>>>diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h index
>>>da0af631ded5..fd394060ddb3 100644
>>>--- a/include/linux/hardirq.h
>>>+++ b/include/linux/hardirq.h
>>>@@ -8,6 +8,8 @@
>>> #include <linux/vtime.h>
>>> #include <asm/hardirq.h>
>>>
>>>+extern u64 irq_get_avg_interval(int cpu); extern bool
>>>+irq_flood_detected(void);
>>>
>>> extern void synchronize_irq(unsigned int irq);  extern bool
>>>synchronize_hardirq(unsigned int irq); diff --git a/kernel/softirq.c
>>>b/kernel/softirq.c index 0427a86743a4..96e01669a2e0 100644
>>>--- a/kernel/softirq.c
>>>+++ b/kernel/softirq.c
>>>@@ -25,6 +25,7 @@
>>> #include <linux/smpboot.h>
>>> #include <linux/tick.h>
>>> #include <linux/irq.h>
>>>+#include <linux/sched/clock.h>
>>>
>>> #define CREATE_TRACE_POINTS
>>> #include <trace/events/irq.h>
>>>@@ -52,6 +53,12 @@ DEFINE_PER_CPU_ALIGNED(irq_cpustat_t, irq_stat);
>>>EXPORT_PER_CPU_SYMBOL(irq_stat);  #endif
>>>
>>>+struct irq_interval {
>>>+	u64                     last_irq_end;
>>>+	u64                     avg;
>>>+};
>>>+DEFINE_PER_CPU(struct irq_interval, avg_irq_interval);
>>>+
>>> static struct softirq_action softirq_vec[NR_SOFTIRQS]
>>>__cacheline_aligned_in_smp;
>>>
>>> DEFINE_PER_CPU(struct task_struct *, ksoftirqd); @@ -339,6 +346,41 @@
>>>asmlinkage __visible void do_softirq(void)
>>> 	local_irq_restore(flags);
>>> }
>>>
>>>+/*
>>>+ * Update average irq interval with the Exponential Weighted Moving
>>>+ * Average(EWMA)
>>>+ */
>>>+static void irq_update_interval(void)
>>>+{
>>>+#define IRQ_INTERVAL_EWMA_WEIGHT	128
>>>+#define IRQ_INTERVAL_EWMA_PREV_FACTOR	127
>>>+#define IRQ_INTERVAL_EWMA_CURR_FACTOR
>>>	(IRQ_INTERVAL_EWMA_WEIGHT - \
>>>+		IRQ_INTERVAL_EWMA_PREV_FACTOR)
>>>+
>>>+	int cpu = raw_smp_processor_id();
>>>+	struct irq_interval *inter = per_cpu_ptr(&avg_irq_interval, cpu);
>>>+	u64 delta = sched_clock_cpu(cpu) - inter->last_irq_end;
>>>+
>>>+	inter->avg = (inter->avg * IRQ_INTERVAL_EWMA_PREV_FACTOR +

inter->avg will start with 0? maybe use a bigger value like IRQ_FLOOD_THRESHOLD_NS

>>>+		delta * IRQ_INTERVAL_EWMA_CURR_FACTOR) /
>>>+		IRQ_INTERVAL_EWMA_WEIGHT;
>>>+}
>>>+
>>>+u64 irq_get_avg_interval(int cpu)
>>>+{
>>>+	return per_cpu_ptr(&avg_irq_interval, cpu)->avg; }
>>>+
>>>+/*
>>>+ * If the average CPU irq interval is less than 8us, we think interrupt
>>>+ * flood is detected on this CPU
>>>+ */
>>>+bool irq_flood_detected(void)
>>>+{
>>>+#define  IRQ_FLOOD_THRESHOLD_NS	8000
>>>+	return raw_cpu_ptr(&avg_irq_interval)->avg <=
>>>IRQ_FLOOD_THRESHOLD_NS;
>>>+}
>>>+
>>> /*
>>>  * Enter an interrupt context.
>>>  */
>>>@@ -356,6 +398,7 @@ void irq_enter(void)
>>> 	}
>>>
>>> 	__irq_enter();
>>>+	irq_update_interval();
>>> }
>>>
>>> static inline void invoke_softirq(void) @@ -402,6 +445,8 @@ static inline
>>>void tick_irq_exit(void)
>>>  */
>>> void irq_exit(void)
>>> {
>>>+	struct irq_interval *inter = raw_cpu_ptr(&avg_irq_interval);
>>>+
>>> #ifndef __ARCH_IRQ_EXIT_IRQS_DISABLED
>>> 	local_irq_disable();
>>> #else
>>>@@ -413,6 +458,7 @@ void irq_exit(void)
>>> 		invoke_softirq();
>>>
>>> 	tick_irq_exit();
>>>+	inter->last_irq_end = sched_clock_cpu(smp_processor_id());
>>> 	rcu_irq_exit();
>>> 	trace_hardirq_exit(); /* must be last! */  }
>>>--
>>>2.20.1


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-08-29  6:15   ` Long Li
@ 2019-08-30  0:55     ` Ming Lei
  0 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2019-08-30  0:55 UTC (permalink / raw)
  To: Long Li
  Cc: Thomas Gleixner, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	John Garry, Hannes Reinecke, linux-nvme, linux-scsi

On Thu, Aug 29, 2019 at 06:15:00AM +0000, Long Li wrote:
> >>>For some high performance IO devices, interrupt may come very frequently,
> >>>meantime IO request completion may take a bit time. Especially on some
> >>>devices(SCSI or NVMe), IO requests can be submitted concurrently from
> >>>multiple CPU cores, however IO completion is only done on one of these
> >>>submission CPU cores.
> >>>
> >>>Then IRQ flood can be easily triggered, and CPU lockup.
> >>>
> >>>Implement one simple generic CPU IRQ flood detection mechanism. This
> >>>mechanism uses the CPU average interrupt interval to evaluate if IRQ flood
> >>>is triggered. The Exponential Weighted Moving Average(EWMA) is used to
> >>>compute CPU average interrupt interval.
> >>>
> >>>Cc: Long Li <longli@microsoft.com>
> >>>Cc: Ingo Molnar <mingo@redhat.com>,
> >>>Cc: Peter Zijlstra <peterz@infradead.org>
> >>>Cc: Keith Busch <keith.busch@intel.com>
> >>>Cc: Jens Axboe <axboe@fb.com>
> >>>Cc: Christoph Hellwig <hch@lst.de>
> >>>Cc: Sagi Grimberg <sagi@grimberg.me>
> >>>Cc: John Garry <john.garry@huawei.com>
> >>>Cc: Thomas Gleixner <tglx@linutronix.de>
> >>>Cc: Hannes Reinecke <hare@suse.com>
> >>>Cc: linux-nvme@lists.infradead.org
> >>>Cc: linux-scsi@vger.kernel.org
> >>>Signed-off-by: Ming Lei <ming.lei@redhat.com>
> >>>---
> >>> drivers/base/cpu.c      | 25 ++++++++++++++++++++++
> >>> include/linux/hardirq.h |  2 ++
> >>> kernel/softirq.c        | 46
> >>>+++++++++++++++++++++++++++++++++++++++++
> >>> 3 files changed, 73 insertions(+)
> >>>
> >>>diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c index
> >>>cc37511de866..7277d1aa0906 100644
> >>>--- a/drivers/base/cpu.c
> >>>+++ b/drivers/base/cpu.c
> >>>@@ -20,6 +20,7 @@
> >>> #include <linux/tick.h>
> >>> #include <linux/pm_qos.h>
> >>> #include <linux/sched/isolation.h>
> >>>+#include <linux/hardirq.h>
> >>>
> >>> #include "base.h"
> >>>
> >>>@@ -183,10 +184,33 @@ static struct attribute_group
> >>>crash_note_cpu_attr_group = {  };  #endif
> >>>
> >>>+static ssize_t show_irq_interval(struct device *dev,
> >>>+				 struct device_attribute *attr, char *buf) {
> >>>+	struct cpu *cpu = container_of(dev, struct cpu, dev);
> >>>+	ssize_t rc;
> >>>+	int cpunum;
> >>>+
> >>>+	cpunum = cpu->dev.id;
> >>>+
> >>>+	rc = sprintf(buf, "%llu\n", irq_get_avg_interval(cpunum));
> >>>+	return rc;
> >>>+}
> >>>+
> >>>+static DEVICE_ATTR(irq_interval, 0400, show_irq_interval, NULL); static
> >>>+struct attribute *irq_interval_cpu_attrs[] = {
> >>>+	&dev_attr_irq_interval.attr,
> >>>+	NULL
> >>>+};
> >>>+static struct attribute_group irq_interval_cpu_attr_group = {
> >>>+	.attrs = irq_interval_cpu_attrs,
> >>>+};
> >>>+
> >>> static const struct attribute_group *common_cpu_attr_groups[] = {  #ifdef
> >>>CONFIG_KEXEC
> >>> 	&crash_note_cpu_attr_group,
> >>> #endif
> >>>+	&irq_interval_cpu_attr_group,
> >>> 	NULL
> >>> };
> >>>
> >>>@@ -194,6 +218,7 @@ static const struct attribute_group
> >>>*hotplugable_cpu_attr_groups[] = {  #ifdef CONFIG_KEXEC
> >>> 	&crash_note_cpu_attr_group,
> >>> #endif
> >>>+	&irq_interval_cpu_attr_group,
> >>> 	NULL
> >>> };
> >>>
> >>>diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h index
> >>>da0af631ded5..fd394060ddb3 100644
> >>>--- a/include/linux/hardirq.h
> >>>+++ b/include/linux/hardirq.h
> >>>@@ -8,6 +8,8 @@
> >>> #include <linux/vtime.h>
> >>> #include <asm/hardirq.h>
> >>>
> >>>+extern u64 irq_get_avg_interval(int cpu); extern bool
> >>>+irq_flood_detected(void);
> >>>
> >>> extern void synchronize_irq(unsigned int irq);  extern bool
> >>>synchronize_hardirq(unsigned int irq); diff --git a/kernel/softirq.c
> >>>b/kernel/softirq.c index 0427a86743a4..96e01669a2e0 100644
> >>>--- a/kernel/softirq.c
> >>>+++ b/kernel/softirq.c
> >>>@@ -25,6 +25,7 @@
> >>> #include <linux/smpboot.h>
> >>> #include <linux/tick.h>
> >>> #include <linux/irq.h>
> >>>+#include <linux/sched/clock.h>
> >>>
> >>> #define CREATE_TRACE_POINTS
> >>> #include <trace/events/irq.h>
> >>>@@ -52,6 +53,12 @@ DEFINE_PER_CPU_ALIGNED(irq_cpustat_t, irq_stat);
> >>>EXPORT_PER_CPU_SYMBOL(irq_stat);  #endif
> >>>
> >>>+struct irq_interval {
> >>>+	u64                     last_irq_end;
> >>>+	u64                     avg;
> >>>+};
> >>>+DEFINE_PER_CPU(struct irq_interval, avg_irq_interval);
> >>>+
> >>> static struct softirq_action softirq_vec[NR_SOFTIRQS]
> >>>__cacheline_aligned_in_smp;
> >>>
> >>> DEFINE_PER_CPU(struct task_struct *, ksoftirqd); @@ -339,6 +346,41 @@
> >>>asmlinkage __visible void do_softirq(void)
> >>> 	local_irq_restore(flags);
> >>> }
> >>>
> >>>+/*
> >>>+ * Update average irq interval with the Exponential Weighted Moving
> >>>+ * Average(EWMA)
> >>>+ */
> >>>+static void irq_update_interval(void)
> >>>+{
> >>>+#define IRQ_INTERVAL_EWMA_WEIGHT	128
> >>>+#define IRQ_INTERVAL_EWMA_PREV_FACTOR	127
> >>>+#define IRQ_INTERVAL_EWMA_CURR_FACTOR
> >>>	(IRQ_INTERVAL_EWMA_WEIGHT - \
> >>>+		IRQ_INTERVAL_EWMA_PREV_FACTOR)
> >>>+
> >>>+	int cpu = raw_smp_processor_id();
> >>>+	struct irq_interval *inter = per_cpu_ptr(&avg_irq_interval, cpu);
> >>>+	u64 delta = sched_clock_cpu(cpu) - inter->last_irq_end;
> >>>+
> >>>+	inter->avg = (inter->avg * IRQ_INTERVAL_EWMA_PREV_FACTOR +
> 
> inter->avg will start with 0? maybe use a bigger value like IRQ_FLOOD_THRESHOLD_NS

I won't be a big deal, any initial value should be fine since it is Exponential
Weighted Moving Average.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-08-28 14:07               ` Thomas Gleixner
@ 2019-09-03  3:30                 ` Ming Lei
  2019-09-03  5:59                   ` Daniel Lezcano
  0 siblings, 1 reply; 64+ messages in thread
From: Ming Lei @ 2019-09-03  3:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Long Li, Ingo Molnar, Peter Zijlstra, Keith Busch,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, John Garry,
	Hannes Reinecke, linux-nvme, linux-scsi, Daniel Lezcano

On Wed, Aug 28, 2019 at 04:07:19PM +0200, Thomas Gleixner wrote:
> On Wed, 28 Aug 2019, Ming Lei wrote:
> > On Wed, Aug 28, 2019 at 01:23:06PM +0200, Thomas Gleixner wrote:
> > > On Wed, 28 Aug 2019, Ming Lei wrote:
> > > > On Wed, Aug 28, 2019 at 01:09:44AM +0200, Thomas Gleixner wrote:
> > > > > > > Also how is that supposed to work when sched_clock is jiffies based?
> > > > > > 
> > > > > > Good catch, looks ktime_get_ns() is needed.
> > > > > 
> > > > > And what is ktime_get_ns() returning when the only available clocksource is
> > > > > jiffies?
> > > > 
> > > > IMO, it isn't one issue. If the only clocksource is jiffies, we needn't to
> > > > expect high IO performance. Then it is fine to always handle the irq in
> > > > interrupt context or thread context.
> > > > 
> > > > However, if it can be recognized runtime, irq_flood_detected() can
> > > > always return true or false.
> > > 
> > > Right. The clocksource is determined at runtime. And if there is no high
> > > resolution clocksource then that function will return crap.
> > 
> > This patch still works even though the only clocksource is jiffies.
> 
> Works by some definition of works, right?

I am not sure there is such system which doesn't provide any high resolution
clocksource, meantime there is one high performance storage device
attached, and expect top IO performance can be reached.

Suppose there is such system: I mean that irq_flood_detected() returns either
true or false, then the actual IO performance should be accepted on
system without high resolution clocksource from user view.

> 
> > > Well, yes. But it's trivial enough to utilize parts of it for your
> > > purposes.
> > 
> > >From the code of kernel/irq/timing.c:
> > 
> > 1) record_irq_time() only records the start time of one irq, and not
> > consider the time taken in interrupt handler, so we can't figure out
> > the real interval between two do_IRQ() on one CPU
> 
> I said utilize and that means that the infrastructure can be used and
> extended. I did not say that it solves your problem, right?

The infrastructure is for predicating when the next interrupt comes,
which is used in PM cases(usually for mobile phone or power sensitive
cases). However, IRQ flood is used in high performance system(usually
enterprise case). The two use cases are actually orthogonal, also:

1) if the irq timing infrastructure is used, we have to apply the
management code on irq flood detection, for example, we have to
build the irq timing code in kernel and enable it. Then performance
regression might be caused for enterprise application.

2) irq timing's runtime overload is much higher, irq_timings_push()
touches much more memory footprint, since it records recent 32
irq's timestamp. That isn't what IRQ flood detection wants, also
not enough for flood detection.

3) irq flood detection itself is very simple, just one EWMA
calculation, see the following code:

irq_update_interval() (called from irq_enter())
        int cpu = raw_smp_processor_id();
        struct irq_interval *inter = per_cpu_ptr(&avg_irq_interval, cpu);
        u64 delta = sched_clock_cpu(cpu) - inter->last_irq_end;

        inter->avg = (inter->avg * IRQ_INTERVAL_EWMA_PREV_FACTOR +
                delta * IRQ_INTERVAL_EWMA_CURR_FACTOR) /
                IRQ_INTERVAL_EWMA_WEIGHT;

bool irq_flood_detected(void) (called from __handle_irq_event_percpu())
{
        return raw_cpu_ptr(&avg_irq_interval)->avg <= IRQ_FLOOD_THRESHOLD_NS;
}

irq_exit()
	inter->last_irq_end = sched_clock_cpu(smp_processor_id());

So there is basically nothing shared between the two, only one percpu
variable is needed for detecting irq flood.

> 
> > 2) irq/timing doesn't cover softirq
> 
> That's solvable, right?

Yeah, we can extend irq/timing, but ugly for irq/timing, since irq/timing
focuses on hardirq predication, and softirq isn't involved in that
purpose.

>  
> > Daniel, could you take a look and see if irq flood detection can be
> > implemented easily by irq/timing.c?
> 
> I assume you can take a look as well, right?

Yeah, I have looked at the code for a while, but I think that irq/timing
could become complicated unnecessarily for covering irq flood detection,
meantime it is much less efficient for detecting IRQ flood.

thanks,
Ming

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-03  3:30                 ` Ming Lei
@ 2019-09-03  5:59                   ` Daniel Lezcano
  2019-09-03  6:31                     ` Ming Lei
  0 siblings, 1 reply; 64+ messages in thread
From: Daniel Lezcano @ 2019-09-03  5:59 UTC (permalink / raw)
  To: Ming Lei, Thomas Gleixner
  Cc: LKML, Long Li, Ingo Molnar, Peter Zijlstra, Keith Busch,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, John Garry,
	Hannes Reinecke, linux-nvme, linux-scsi


Hi Ming Lei,

On 03/09/2019 05:30, Ming Lei wrote:

[ ... ]


>>> 2) irq/timing doesn't cover softirq
>>
>> That's solvable, right?
> 
> Yeah, we can extend irq/timing, but ugly for irq/timing, since irq/timing
> focuses on hardirq predication, and softirq isn't involved in that
> purpose.
> 
>>  
>>> Daniel, could you take a look and see if irq flood detection can be
>>> implemented easily by irq/timing.c?
>>
>> I assume you can take a look as well, right?
> 
> Yeah, I have looked at the code for a while, but I think that irq/timing
> could become complicated unnecessarily for covering irq flood detection,
> meantime it is much less efficient for detecting IRQ flood.

In the series, there is nothing describing rigorously the problem (I can
only guess) and why the proposed solution solves it.

What is your definition of an 'irq flood'? A high irq load? An irq
arriving while we are processing the previous one in the bottom halves?

The patch 2/4 description says "however IO completion is only done on
one of these submission CPU cores". That describes the bottleneck and
then the patch says "Add IRQF_RESCUE_THREAD to create one interrupt
thread handler", what is the rational between the bottleneck (problem)
and the irqf_rescue_thread (solution)?

Is it really the solution to track the irq timings to detect a flood?



-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-03  5:59                   ` Daniel Lezcano
@ 2019-09-03  6:31                     ` Ming Lei
  2019-09-03  6:40                       ` Daniel Lezcano
  0 siblings, 1 reply; 64+ messages in thread
From: Ming Lei @ 2019-09-03  6:31 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Thomas Gleixner, LKML, Long Li, Ingo Molnar, Peter Zijlstra,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	John Garry, Hannes Reinecke, linux-nvme, linux-scsi

Hi Daniel,

On Tue, Sep 03, 2019 at 07:59:39AM +0200, Daniel Lezcano wrote:
> 
> Hi Ming Lei,
> 
> On 03/09/2019 05:30, Ming Lei wrote:
> 
> [ ... ]
> 
> 
> >>> 2) irq/timing doesn't cover softirq
> >>
> >> That's solvable, right?
> > 
> > Yeah, we can extend irq/timing, but ugly for irq/timing, since irq/timing
> > focuses on hardirq predication, and softirq isn't involved in that
> > purpose.
> > 
> >>  
> >>> Daniel, could you take a look and see if irq flood detection can be
> >>> implemented easily by irq/timing.c?
> >>
> >> I assume you can take a look as well, right?
> > 
> > Yeah, I have looked at the code for a while, but I think that irq/timing
> > could become complicated unnecessarily for covering irq flood detection,
> > meantime it is much less efficient for detecting IRQ flood.
> 
> In the series, there is nothing describing rigorously the problem (I can
> only guess) and why the proposed solution solves it.
> 
> What is your definition of an 'irq flood'? A high irq load? An irq
> arriving while we are processing the previous one in the bottom halves?

So far, it means that handling interrupt & softirq takes all utilization
of one CPU, then processes can't be run on this CPU basically, usually
sort of CPU lockup warning will be triggered.

> 
> The patch 2/4 description says "however IO completion is only done on
> one of these submission CPU cores". That describes the bottleneck and
> then the patch says "Add IRQF_RESCUE_THREAD to create one interrupt
> thread handler", what is the rational between the bottleneck (problem)
> and the irqf_rescue_thread (solution)?

The solution is to switch to handle this interrupt on the created rescue
irq thread context when irq flood is detected, and 'this interrupt' means
the interrupt requested with IRQF_RESCUE_THREAD.

> 
> Is it really the solution to track the irq timings to detect a flood?

The solution tracks the time taken on running do_IRQ() for each CPU.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-03  6:31                     ` Ming Lei
@ 2019-09-03  6:40                       ` Daniel Lezcano
  2019-09-03  7:28                         ` Ming Lei
  0 siblings, 1 reply; 64+ messages in thread
From: Daniel Lezcano @ 2019-09-03  6:40 UTC (permalink / raw)
  To: Ming Lei
  Cc: Thomas Gleixner, LKML, Long Li, Ingo Molnar, Peter Zijlstra,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	John Garry, Hannes Reinecke, linux-nvme, linux-scsi

On 03/09/2019 08:31, Ming Lei wrote:
> Hi Daniel,
> 
> On Tue, Sep 03, 2019 at 07:59:39AM +0200, Daniel Lezcano wrote:
>>
>> Hi Ming Lei,
>>
>> On 03/09/2019 05:30, Ming Lei wrote:
>>
>> [ ... ]
>>
>>
>>>>> 2) irq/timing doesn't cover softirq
>>>>
>>>> That's solvable, right?
>>>
>>> Yeah, we can extend irq/timing, but ugly for irq/timing, since irq/timing
>>> focuses on hardirq predication, and softirq isn't involved in that
>>> purpose.
>>>
>>>>  
>>>>> Daniel, could you take a look and see if irq flood detection can be
>>>>> implemented easily by irq/timing.c?
>>>>
>>>> I assume you can take a look as well, right?
>>>
>>> Yeah, I have looked at the code for a while, but I think that irq/timing
>>> could become complicated unnecessarily for covering irq flood detection,
>>> meantime it is much less efficient for detecting IRQ flood.
>>
>> In the series, there is nothing describing rigorously the problem (I can
>> only guess) and why the proposed solution solves it.
>>
>> What is your definition of an 'irq flood'? A high irq load? An irq
>> arriving while we are processing the previous one in the bottom halves?
> 
> So far, it means that handling interrupt & softirq takes all utilization
> of one CPU, then processes can't be run on this CPU basically, usually
> sort of CPU lockup warning will be triggered.

It is a scheduler problem then ?

>> The patch 2/4 description says "however IO completion is only done on
>> one of these submission CPU cores". That describes the bottleneck and
>> then the patch says "Add IRQF_RESCUE_THREAD to create one interrupt
>> thread handler", what is the rational between the bottleneck (problem)
>> and the irqf_rescue_thread (solution)?
> 
> The solution is to switch to handle this interrupt on the created rescue
> irq thread context when irq flood is detected, and 'this interrupt' means
> the interrupt requested with IRQF_RESCUE_THREAD.
> 
>>
>> Is it really the solution to track the irq timings to detect a flood?
> 
> The solution tracks the time taken on running do_IRQ() for each CPU.




-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-03  6:40                       ` Daniel Lezcano
@ 2019-09-03  7:28                         ` Ming Lei
  2019-09-03  7:50                           ` Daniel Lezcano
  2019-09-03  8:09                           ` Thomas Gleixner
  0 siblings, 2 replies; 64+ messages in thread
From: Ming Lei @ 2019-09-03  7:28 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Thomas Gleixner, LKML, Long Li, Ingo Molnar, Peter Zijlstra,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	John Garry, Hannes Reinecke, linux-nvme, linux-scsi

On Tue, Sep 03, 2019 at 08:40:35AM +0200, Daniel Lezcano wrote:
> On 03/09/2019 08:31, Ming Lei wrote:
> > Hi Daniel,
> > 
> > On Tue, Sep 03, 2019 at 07:59:39AM +0200, Daniel Lezcano wrote:
> >>
> >> Hi Ming Lei,
> >>
> >> On 03/09/2019 05:30, Ming Lei wrote:
> >>
> >> [ ... ]
> >>
> >>
> >>>>> 2) irq/timing doesn't cover softirq
> >>>>
> >>>> That's solvable, right?
> >>>
> >>> Yeah, we can extend irq/timing, but ugly for irq/timing, since irq/timing
> >>> focuses on hardirq predication, and softirq isn't involved in that
> >>> purpose.
> >>>
> >>>>  
> >>>>> Daniel, could you take a look and see if irq flood detection can be
> >>>>> implemented easily by irq/timing.c?
> >>>>
> >>>> I assume you can take a look as well, right?
> >>>
> >>> Yeah, I have looked at the code for a while, but I think that irq/timing
> >>> could become complicated unnecessarily for covering irq flood detection,
> >>> meantime it is much less efficient for detecting IRQ flood.
> >>
> >> In the series, there is nothing describing rigorously the problem (I can
> >> only guess) and why the proposed solution solves it.
> >>
> >> What is your definition of an 'irq flood'? A high irq load? An irq
> >> arriving while we are processing the previous one in the bottom halves?
> > 
> > So far, it means that handling interrupt & softirq takes all utilization
> > of one CPU, then processes can't be run on this CPU basically, usually
> > sort of CPU lockup warning will be triggered.
> 
> It is a scheduler problem then ?

Scheduler can do nothing if the CPU is taken completely by handling
interrupt & softirq, so seems not a scheduler problem, IMO.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-03  7:28                         ` Ming Lei
@ 2019-09-03  7:50                           ` Daniel Lezcano
  2019-09-03  9:30                             ` Ming Lei
  2019-09-04 17:07                             ` Bart Van Assche
  2019-09-03  8:09                           ` Thomas Gleixner
  1 sibling, 2 replies; 64+ messages in thread
From: Daniel Lezcano @ 2019-09-03  7:50 UTC (permalink / raw)
  To: Ming Lei
  Cc: Thomas Gleixner, LKML, Long Li, Ingo Molnar, Peter Zijlstra,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	John Garry, Hannes Reinecke, linux-nvme, linux-scsi

On 03/09/2019 09:28, Ming Lei wrote:
> On Tue, Sep 03, 2019 at 08:40:35AM +0200, Daniel Lezcano wrote:
>> On 03/09/2019 08:31, Ming Lei wrote:
>>> Hi Daniel,
>>>
>>> On Tue, Sep 03, 2019 at 07:59:39AM +0200, Daniel Lezcano wrote:
>>>>
>>>> Hi Ming Lei,
>>>>
>>>> On 03/09/2019 05:30, Ming Lei wrote:
>>>>
>>>> [ ... ]
>>>>
>>>>
>>>>>>> 2) irq/timing doesn't cover softirq
>>>>>>
>>>>>> That's solvable, right?
>>>>>
>>>>> Yeah, we can extend irq/timing, but ugly for irq/timing, since irq/timing
>>>>> focuses on hardirq predication, and softirq isn't involved in that
>>>>> purpose.
>>>>>
>>>>>>  
>>>>>>> Daniel, could you take a look and see if irq flood detection can be
>>>>>>> implemented easily by irq/timing.c?
>>>>>>
>>>>>> I assume you can take a look as well, right?
>>>>>
>>>>> Yeah, I have looked at the code for a while, but I think that irq/timing
>>>>> could become complicated unnecessarily for covering irq flood detection,
>>>>> meantime it is much less efficient for detecting IRQ flood.
>>>>
>>>> In the series, there is nothing describing rigorously the problem (I can
>>>> only guess) and why the proposed solution solves it.
>>>>
>>>> What is your definition of an 'irq flood'? A high irq load? An irq
>>>> arriving while we are processing the previous one in the bottom halves?
>>>
>>> So far, it means that handling interrupt & softirq takes all utilization
>>> of one CPU, then processes can't be run on this CPU basically, usually
>>> sort of CPU lockup warning will be triggered.
>>
>> It is a scheduler problem then ?
> 
> Scheduler can do nothing if the CPU is taken completely by handling
> interrupt & softirq, so seems not a scheduler problem, IMO.

Why? If there is a irq pressure on one CPU reducing its capacity, the
scheduler will balance the tasks on another CPU, no?



-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-03  7:28                         ` Ming Lei
  2019-09-03  7:50                           ` Daniel Lezcano
@ 2019-09-03  8:09                           ` Thomas Gleixner
  2019-09-03  9:24                             ` Ming Lei
  1 sibling, 1 reply; 64+ messages in thread
From: Thomas Gleixner @ 2019-09-03  8:09 UTC (permalink / raw)
  To: Ming Lei
  Cc: Daniel Lezcano, LKML, Long Li, Ingo Molnar, Peter Zijlstra,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	John Garry, Hannes Reinecke, linux-nvme, linux-scsi

On Tue, 3 Sep 2019, Ming Lei wrote:
> Scheduler can do nothing if the CPU is taken completely by handling
> interrupt & softirq, so seems not a scheduler problem, IMO.

Well, but thinking more about it, the solution you are proposing is more a
bandaid than anything else.

If you look at the networking NAPI mechanism. It handles that situation
gracefully by:

  - Disabling the interrupt at the device level

  - Polling the device in softirq context until empty and then reenabling
    interrupts

  - In case the softirq handles more packets than a defined budget it
    forces the softirq into the softirqd thread context which also
    allows rescheduling once the budget is completed.

With your adhoc workaround you handle one specific case. But it does not
work at all when an overload situation occurs in a case where the queues
are truly per cpu simply. Because then the interrupt and the thread
affinity are the same and single CPU targets and you replace the interrupt
with a threaded handler which runs by default with RT priority.

So instead of hacking something half baken into the hard/softirq code, why
can't block do a budget limitation and once that is reached switch to
something NAPI like as a general solution?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-03  8:09                           ` Thomas Gleixner
@ 2019-09-03  9:24                             ` Ming Lei
  0 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2019-09-03  9:24 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Daniel Lezcano, LKML, Long Li, Ingo Molnar, Peter Zijlstra,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	John Garry, Hannes Reinecke, linux-nvme, linux-scsi

On Tue, Sep 03, 2019 at 10:09:57AM +0200, Thomas Gleixner wrote:
> On Tue, 3 Sep 2019, Ming Lei wrote:
> > Scheduler can do nothing if the CPU is taken completely by handling
> > interrupt & softirq, so seems not a scheduler problem, IMO.
> 
> Well, but thinking more about it, the solution you are proposing is more a
> bandaid than anything else.
> 
> If you look at the networking NAPI mechanism. It handles that situation
> gracefully by:
> 
>   - Disabling the interrupt at the device level

I guess you mean we disable the interrupt in the softirq context.

IO performance could be affected by the extra action of disabling/enabling
interrupt every time.

IOPS for the discussed device is several millions.

> 
>   - Polling the device in softirq context until empty and then reenabling
>     interrupts

blk-mq switches to complete req in interrupt context for avoiding extra
performance loss, so switching back to softirq context every time may
cause performance regression.

> 
>   - In case the softirq handles more packets than a defined budget it
>     forces the softirq into the softirqd thread context which also
>     allows rescheduling once the budget is completed.


It can be hard to figure out one perfect defined budget.

In the patchset of V2[1], IRQF_ONESHOT is applied on the irq thread,
and interrupt isn't enabled until the interrupt has been handled in
the irq thread context.

[1] https://github.com/ming1/linux/commits/v5.3-genirq-for-5.4

The approach in this patchset is actually very similar with the above
NAPI based way. The difference is that softirq is avoided, and interrupt
is always handled in interrupt context in case that CPU won't be stalled,
so performance won't be affected. And we only switch to handle interrupt
in thread context if CPU stall is going to happen.

> 
> With your adhoc workaround you handle one specific case. But it does not
> work at all when an overload situation occurs in a case where the queues
> are truly per cpu simply.

There isn't such CPU stall issue in case of single submission vs. single
completion, because submission side and completion side share same single
CPU, and the submission side will slow down if completion side takes all
the CPU. 

> Because then the interrupt and the thread
> affinity are the same and single CPU targets and you replace the interrupt
> with a threaded handler which runs by default with RT priority.

Even though the threaded handler is RT priority and the thread is run
on same CPU with the interrupt, CPU/rcu stall still can be avoided.

Also we can switch to use irq affinity for the irq thread instead of effective
affinity.

> 
> So instead of hacking something half baken into the hard/softirq code, why
> can't block do a budget limitation and once that is reached switch to
> something NAPI like as a general solution?

Another big reason is that multiple submission vs. single completion isn't
common case, I knew that there are only small number of such device,
so re-inventing NAPI based approach may takes lots of effort, meantime
only small number of devices can get the benefit, not sure if block
community would like to consider that.

IMO, it might be the simplest generic way to solve the problem from genirq.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-03  7:50                           ` Daniel Lezcano
@ 2019-09-03  9:30                             ` Ming Lei
  2019-09-04 17:07                             ` Bart Van Assche
  1 sibling, 0 replies; 64+ messages in thread
From: Ming Lei @ 2019-09-03  9:30 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Jens Axboe, Hannes Reinecke, Sagi Grimberg, linux-scsi,
	Peter Zijlstra, Long Li, John Garry, LKML, linux-nvme,
	Keith Busch, Ingo Molnar, Thomas Gleixner, Christoph Hellwig

On Tue, Sep 03, 2019 at 09:50:06AM +0200, Daniel Lezcano wrote:
> On 03/09/2019 09:28, Ming Lei wrote:
> > On Tue, Sep 03, 2019 at 08:40:35AM +0200, Daniel Lezcano wrote:
> >> On 03/09/2019 08:31, Ming Lei wrote:
> >>> Hi Daniel,
> >>>
> >>> On Tue, Sep 03, 2019 at 07:59:39AM +0200, Daniel Lezcano wrote:
> >>>>
> >>>> Hi Ming Lei,
> >>>>
> >>>> On 03/09/2019 05:30, Ming Lei wrote:
> >>>>
> >>>> [ ... ]
> >>>>
> >>>>
> >>>>>>> 2) irq/timing doesn't cover softirq
> >>>>>>
> >>>>>> That's solvable, right?
> >>>>>
> >>>>> Yeah, we can extend irq/timing, but ugly for irq/timing, since irq/timing
> >>>>> focuses on hardirq predication, and softirq isn't involved in that
> >>>>> purpose.
> >>>>>
> >>>>>>  
> >>>>>>> Daniel, could you take a look and see if irq flood detection can be
> >>>>>>> implemented easily by irq/timing.c?
> >>>>>>
> >>>>>> I assume you can take a look as well, right?
> >>>>>
> >>>>> Yeah, I have looked at the code for a while, but I think that irq/timing
> >>>>> could become complicated unnecessarily for covering irq flood detection,
> >>>>> meantime it is much less efficient for detecting IRQ flood.
> >>>>
> >>>> In the series, there is nothing describing rigorously the problem (I can
> >>>> only guess) and why the proposed solution solves it.
> >>>>
> >>>> What is your definition of an 'irq flood'? A high irq load? An irq
> >>>> arriving while we are processing the previous one in the bottom halves?
> >>>
> >>> So far, it means that handling interrupt & softirq takes all utilization
> >>> of one CPU, then processes can't be run on this CPU basically, usually
> >>> sort of CPU lockup warning will be triggered.
> >>
> >> It is a scheduler problem then ?
> > 
> > Scheduler can do nothing if the CPU is taken completely by handling
> > interrupt & softirq, so seems not a scheduler problem, IMO.
> 
> Why? If there is a irq pressure on one CPU reducing its capacity, the
> scheduler will balance the tasks on another CPU, no?

For example, the following two kinds[1][2] of warning can be triggered
easily when heavy fio test is run on NVMe. The 1st one could be that
network irq is affected by the nvme irq flood. The 2nd one is rcu_sched stall.

[1]
[  186.531069] NETDEV WATCHDOG: enP48661s1 (mlx4_core): transmit queue 12 timed out
[  186.545259] WARNING: CPU: 9 PID: 0 at net/sched/sch_generic.c:443 dev_watchdog+0x252/0x260
[  186.546222] Modules linked in: xt_owner ext4 mbcache jbd2 ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter dm_mirror dm_region_hash dm_log dm_mod crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel glue_helper crypto_simd cryptd mlx4_en mlx4_core nvme nvme_core hv_utils sg pcspkr i2c_piix4 ptp hv_balloon pci_hyperv pps_core joydev ip_tables xfs libcrc32c sd_mod ata_generic pata_acpi hv_storvsc hv_netvsc hyperv_keyboard hid_hyperv scsi_transport_fc ata_piix hyperv_fb crc32c_intel libata hv_vmbus floppy serio_raw
[  186.546222] CPU: 9 PID: 0 Comm: swapper/9 Not tainted 5.3.0-rc6+ #17
[  186.546222] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007  05/18/2018
[  186.546222] RIP: 0010:dev_watchdog+0x252/0x260
[  186.546222] Code: c0 75 e4 eb 98 4c 89 ef c6 05 38 ca c7 00 01 e8 04 93 fb ff 89 d9 48 89 c2 4c 89 ee 48 c7 c7 f0 9e 54 be 31 c0 e8 4e 14 95 ff <0f> 0b e9 75 ff ff ff 0f 1f 80 00 00 00 00 0f 1f 44 00 00 41 57 ba
[  186.546222] RSP: 0018:ffffae5518ee0e88 EFLAGS: 00010282
[  186.546222] RAX: 0000000000000000 RBX: 000000000000000c RCX: 0000000000000000
[  186.546222] RDX: 0000000000000001 RSI: ffff9d6b5f8577b8 RDI: ffff9d6b5f8577b8
[  186.546222] RBP: ffff9d5b4c300440 R08: 0000000000000363 R09: 0000000000000363
[  186.546222] R10: 0000000000000001 R11: 0000000000aaaaaa R12: 0000000000000100
[  186.546222] R13: ffff9d5b4c300000 R14: ffff9d5b4c30041c R15: 0000000000000009
[  186.546222] FS:  0000000000000000(0000) GS:ffff9d6b5f840000(0000) knlGS:0000000000000000
[  186.546222] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  186.546222] CR2: 00007f6324aa19b8 CR3: 000000308981c000 CR4: 00000000003406e0
[  186.546222] Call Trace:
[  186.546222]  <IRQ>
[  186.546222]  ? pfifo_fast_enqueue+0x110/0x110
[  186.546222]  call_timer_fn+0x2f/0x140
[  186.546222]  run_timer_softirq+0x1f6/0x460
[  186.546222]  ? timerqueue_add+0x54/0x80
[  186.546222]  ? enqueue_hrtimer+0x38/0x90
[  186.546222]  __do_softirq+0xda/0x2a8
[  186.546222]  irq_exit+0xc5/0xd0
[  186.546222]  hv_stimer0_vector_handler+0x5a/0x70
[  186.546222]  hv_stimer0_callback_vector+0xf/0x20
[  186.546222]  </IRQ>
[  186.546222] RIP: 0010:native_safe_halt+0xe/0x10
[  186.546222] Code: 01 00 f0 80 48 02 20 48 8b 00 a8 08 0f 84 7a ff ff ff eb bc 90 90 90 90 90 90 90 90 e9 07 00 00 00 0f 00 2d a6 74 59 00 fb f4 <c3> 90 e9 07 00 00 00 0f 00 2d 96 74 59 00 f4 c3 90 90 0f 1f 44 00
[  186.546222] RSP: 0018:ffffae5518993eb0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff12
[  186.546222] RAX: ffffffffbdc72970 RBX: ffff9d5ba5b79700 RCX: 0000000000000000
[  186.546222] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000002af36665fe
[  186.546222] RBP: 0000000000000009 R08: 0000000000000000 R09: fffffffe56427f06
[  186.546222] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
[  186.546222] R13: 0000000000000000 R14: ffff9d5ba5b79700 R15: ffff9d5ba5b79700
[  186.546222]  ? __cpuidle_text_start+0x8/0x8
[  186.546222]  default_idle+0x1a/0x140
[  186.546222]  do_idle+0x1a6/0x290
[  186.546222]  cpu_startup_entry+0x19/0x20
[  186.546222]  start_secondary+0x166/0x1c0
[  186.546222]  secondary_startup_64+0xa4/0xb0
[  186.546222] ---[ end trace 8d7ef273033e5d7a ]---

[2]
[  247.296117] rcu: INFO: rcu_sched self-detected stall on CPU
[  247.441146] rcu: 	64-....: (1 GPs behind) idle=1fe/1/0x4000000000000002 softirq=498/499 fqs=13379 
[  247.588155] 	(t=60291 jiffies g=6261 q=16750)
[  247.726156] NMI backtrace for cpu 64
[  247.862150] CPU: 64 PID: 7772 Comm: fio Tainted: G        W         5.3.0-rc6+ #17
[  248.004148] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007  05/18/2018
[  248.150158] Call Trace:
[  248.279195]  <IRQ>
[  248.401151]  dump_stack+0x5a/0x73
[  248.522126]  nmi_cpu_backtrace+0x90/0xa0
[  248.640147]  ? lapic_can_unplug_cpu+0xa0/0xa0
[  248.757151]  nmi_trigger_cpumask_backtrace+0x6c/0x110
[  248.871145]  rcu_dump_cpu_stacks+0x8a/0xb8
[  248.980152]  rcu_sched_clock_irq+0x55a/0x7a0
[  249.089152]  ? smp_call_function_single_async+0x1f/0x40
[  249.197151]  ? blk_mq_complete_request+0x8e/0xf0
[  249.301157]  ? tick_sched_do_timer+0x80/0x80
[  249.404177]  update_process_times+0x28/0x50
[  249.505152]  tick_sched_handle+0x25/0x60
[  249.605151]  tick_sched_timer+0x37/0x70
[  249.705149]  __hrtimer_run_queues+0xfb/0x270
[  249.804160]  hrtimer_interrupt+0x122/0x270
[  249.904166]  hv_stimer0_vector_handler+0x33/0x70
[  250.004148]  hv_stimer0_callback_vector+0xf/0x20
[  250.102144]  </IRQ>
[  250.195164] RIP: 0010:nvme_queue_rq+0x42/0x710 [nvme]
[  250.294147] Code: 68 48 8b a8 c8 00 00 00 48 8b 97 b8 00 00 00 48 8b 1e 65 48 8b 0c 25 28 00 00 00 48 89 4c 24 60 31 c9 48 8b 7a 70 4c 8b 6d 00 <c7> 83 3c 01 00 00 00 00 00 00 c7 83 40 01 00 00 ff ff ff ff c7 83
[  250.518141] RSP: 0018:ffffae551c7ef8f8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff12
[  250.630150] RAX: ffff9d5b49fb6c00 RBX: ffff9d5b5aaf4500 RCX: 0000000000000000
[  250.743121] RDX: ffff9d5b4a719020 RSI: ffffae551c7ef9a8 RDI: ffff9d5b5e8e2e80
[  250.856148] RBP: ffff9d5b4ba20100 R08: 0000000000000001 R09: ffff9d4bc51ef080
[  250.970154] R10: 000000000000007f R11: ffff9d5b4a719020 R12: ffffae551c7ef9a8
[  251.084177] R13: ffff9d5b5e430000 R14: ffffae551c7efab0 R15: ffffce44ffa0bbc0
[  251.200157]  ? __sbitmap_get_word+0x2a/0x80
[  251.311146]  ? sbitmap_get+0x61/0x130
[  251.421145]  __blk_mq_try_issue_directly+0x143/0x1f0
[  251.533144]  blk_mq_request_issue_directly+0x4d/0xa0
[  251.646148]  blk_mq_try_issue_list_directly+0x51/0xc0
[  251.758155]  blk_mq_sched_insert_requests+0xbb/0x110
[  251.865149]  blk_mq_flush_plug_list+0x191/0x2c0
[  251.971149]  blk_flush_plug_list+0xcc/0xf0
[  252.074146]  blk_finish_plug+0x27/0x40
[  252.177154]  blkdev_direct_IO+0x410/0x4a0
[  252.280155]  generic_file_read_iter+0xb1/0xc70
[  252.384166]  ? common_interrupt+0xa/0xf
[  252.486149]  ? _cond_resched+0x15/0x30
[  252.588168]  aio_read+0xf6/0x150
[  252.688145]  ? common_interrupt+0xa/0xf
[  252.789147]  ? __switch_to_asm+0x40/0x70
[  252.892151]  ? __switch_to_asm+0x34/0x70
[  252.994125]  ? __check_object_size+0x75/0x1b0
[  253.097148]  ? _copy_to_user+0x22/0x30
[  253.198147]  ? aio_read_events+0x279/0x300
[  253.298149]  ? kmem_cache_alloc+0x15c/0x200
[  253.396155]  io_submit_one+0x199/0xb50
[  253.492163]  ? read_events+0x5c/0x160
[  253.588152]  ? __switch_to_asm+0x40/0x70
[  253.683146]  ? __switch_to_asm+0x34/0x70
[  253.773145]  ? __switch_to_asm+0x40/0x70
[  253.863148]  ? __switch_to_asm+0x34/0x70
[  253.950431]  ? __switch_to_asm+0x40/0x70
[  254.037146]  __x64_sys_io_submit+0xb3/0x1a0
[  254.125153]  ? __audit_syscall_exit+0x1d9/0x280
[  254.214149]  do_syscall_64+0x5b/0x1d0
[  254.300154]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  254.390145] RIP: 0033:0x7f63580f1697
[  254.478137] Code: 49 83 38 00 75 eb 49 83 78 08 00 75 e4 8b 47 0c 39 47 08 75 dc 31 c0 c3 66 2e 0f 1f 84 00 00 00 00 00 90 b8 d1 00 00 00 0f 05 <c3> 0f 1f 84 00 00 00 00 00 b8 d2 00 00 00 0f 05 c3 0f 1f 84 00 00
[  254.682145] RSP: 002b:00007ffe5fee3d18 EFLAGS: 00000293 ORIG_RAX: 00000000000000d1
[  254.784142] RAX: ffffffffffffffda RBX: 0000000001defb90 RCX: 00007f63580f1697
[  254.885146] RDX: 0000000001defdf8 RSI: 0000000000000001 RDI: 00007f636c142000
[  254.989153] RBP: 0000000000000000 R08: 0000000000000010 R09: 0000000001dee3e0
[  255.092126] R10: 0000000001dde000 R11: 0000000000000293 R12: 0000000000000001
[  255.194148] R13: 0000000000000018 R14: 00007f632cb45980 R15: 0000000001defe70
[  299.228276] nvme nvme7: I/O 220 QID 2 timeout, aborting
[  299.298199] nvme nvme7: Abort status: 0x0
[  300.425912] mlx4_en: enP48661s1: Steering Mode 2
[  438.694592] rcu: INFO: rcu_sched self-detected stall on CPU
[  438.783121] rcu: 	77-....: (1 GPs behind) idle=c1e/1/0x4000000000000004 softirq=795/796 fqs=12978 
[  438.929120] 	(t=60235 jiffies g=6541 q=3725)
[  439.029124] NMI backtrace for cpu 77
[  439.130120] CPU: 77 PID: 7718 Comm: fio Tainted: G        W         5.3.0-rc6+ #17
[  439.245123] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007  05/18/2018
[  439.357125] Call Trace:
[  439.467157]  <IRQ>
[  439.522121]  dump_stack+0x5a/0x73
[  439.670123]  nmi_cpu_backtrace+0x90/0xa0
[  439.767124]  ? lapic_can_unplug_cpu+0xa0/0xa0
[  439.877119]  nmi_trigger_cpumask_backtrace+0x6c/0x110
[  439.987122]  rcu_dump_cpu_stacks+0x8a/0xb8
[  440.087121]  rcu_sched_clock_irq+0x55a/0x7a0
[  440.186126]  ? smp_call_function_single_async+0x1f/0x40
[  440.293125]  ? blk_mq_complete_request+0x8e/0xf0
[  440.399122]  ? tick_sched_do_timer+0x80/0x80
[  440.505118]  update_process_times+0x28/0x50
[  440.607119]  tick_sched_handle+0x25/0x60
[  440.709122]  tick_sched_timer+0x37/0x70
[  440.814119]  __hrtimer_run_queues+0xfb/0x270
[  440.919122]  hrtimer_interrupt+0x122/0x270
[  441.024116]  hv_stimer0_vector_handler+0x33/0x70
[  441.108126]  hv_stimer0_callback_vector+0xf/0x20
[  441.220121] RIP: 0010:__do_softirq+0x77/0x2a8
[  441.319130] Code: 0f b7 ca 89 4c 24 04 c7 44 24 20 0a 00 00 00 48 89 44 24 08 48 89 44 24 18 65 66 c7 05 d0 a5 02 42 00 00 fb 66 0f 1f 44 00 00 <b8> ff ff ff ff 48 c7 c5 00 51 60 be 0f bc 44 24 04 83 c0 01 89 04
[  441.552119] RSP: 0018:ffffae5519cc0f78 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff12
[  441.698119] RAX: ffff9d7b5ac7dc00 RBX: 0000000000000000 RCX: 0000000000000002
[  441.826119] RDX: ffff9ddba50d0002 RSI: 0000000000000000 RDI: 0000000000000000
[  441.943122] RBP: 0000000000000000 R08: 0000000000000000 R09: 01484a5406289ee5
[  442.052121] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  442.180121] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  442.307119]  ? hv_stimer0_callback_vector+0xa/0x20
[  442.423123]  ? clockevents_program_event+0x79/0xf0
[  442.543117]  irq_exit+0xc5/0xd0
[  442.655124]  hv_stimer0_vector_handler+0x5a/0x70
[  442.783119]  hv_stimer0_callback_vector+0xf/0x20
[  442.896120]  </IRQ>
[  443.004119] RIP: 0010:__blk_mq_try_issue_directly+0x157/0x1f0


thanks,
Ming

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-03  7:50                           ` Daniel Lezcano
  2019-09-03  9:30                             ` Ming Lei
@ 2019-09-04 17:07                             ` Bart Van Assche
  2019-09-04 17:31                               ` Daniel Lezcano
  1 sibling, 1 reply; 64+ messages in thread
From: Bart Van Assche @ 2019-09-04 17:07 UTC (permalink / raw)
  To: Daniel Lezcano, Ming Lei
  Cc: Jens Axboe, Hannes Reinecke, Sagi Grimberg, linux-scsi,
	Peter Zijlstra, Long Li, John Garry, LKML, linux-nvme,
	Keith Busch, Ingo Molnar, Thomas Gleixner, Christoph Hellwig

On 9/3/19 12:50 AM, Daniel Lezcano wrote:
> On 03/09/2019 09:28, Ming Lei wrote:
>> On Tue, Sep 03, 2019 at 08:40:35AM +0200, Daniel Lezcano wrote:
>>> It is a scheduler problem then ?
>>
>> Scheduler can do nothing if the CPU is taken completely by handling
>> interrupt & softirq, so seems not a scheduler problem, IMO.
> 
> Why? If there is a irq pressure on one CPU reducing its capacity, the
> scheduler will balance the tasks on another CPU, no?

Only if CONFIG_IRQ_TIME_ACCOUNTING has been enabled. However, I don't 
know any Linux distro that enables that option. That's probably because 
that option introduces two rdtsc() calls in each interrupt. Given the 
overhead introduced by this option, I don't think this is the solution 
Ming is looking for.

See also irqtime_account_irq() in kernel/sched/cputime.c.

Bart.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-04 17:07                             ` Bart Van Assche
@ 2019-09-04 17:31                               ` Daniel Lezcano
  2019-09-04 17:38                                 ` Bart Van Assche
  2019-09-05  9:06                                 ` Ming Lei
  0 siblings, 2 replies; 64+ messages in thread
From: Daniel Lezcano @ 2019-09-04 17:31 UTC (permalink / raw)
  To: Bart Van Assche, Ming Lei
  Cc: Jens Axboe, Hannes Reinecke, Sagi Grimberg, linux-scsi,
	Peter Zijlstra, Long Li, John Garry, LKML, linux-nvme,
	Keith Busch, Ingo Molnar, Thomas Gleixner, Christoph Hellwig

Hi,

On 04/09/2019 19:07, Bart Van Assche wrote:
> On 9/3/19 12:50 AM, Daniel Lezcano wrote:
>> On 03/09/2019 09:28, Ming Lei wrote:
>>> On Tue, Sep 03, 2019 at 08:40:35AM +0200, Daniel Lezcano wrote:
>>>> It is a scheduler problem then ?
>>>
>>> Scheduler can do nothing if the CPU is taken completely by handling
>>> interrupt & softirq, so seems not a scheduler problem, IMO.
>>
>> Why? If there is a irq pressure on one CPU reducing its capacity, the
>> scheduler will balance the tasks on another CPU, no?
> 
> Only if CONFIG_IRQ_TIME_ACCOUNTING has been enabled. However, I don't
> know any Linux distro that enables that option. That's probably because
> that option introduces two rdtsc() calls in each interrupt. Given the
> overhead introduced by this option, I don't think this is the solution
> Ming is looking for.

Was this overhead reported somewhere ?

> See also irqtime_account_irq() in kernel/sched/cputime.c.

From my POV, this framework could be interesting to detect this situation.


-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-04 17:31                               ` Daniel Lezcano
@ 2019-09-04 17:38                                 ` Bart Van Assche
  2019-09-04 18:02                                   ` Peter Zijlstra
  2019-09-05  9:06                                 ` Ming Lei
  1 sibling, 1 reply; 64+ messages in thread
From: Bart Van Assche @ 2019-09-04 17:38 UTC (permalink / raw)
  To: Daniel Lezcano, Ming Lei
  Cc: Keith Busch, Hannes Reinecke, Sagi Grimberg, linux-scsi,
	Peter Zijlstra, Long Li, John Garry, LKML, linux-nvme,
	Jens Axboe, Ingo Molnar, Thomas Gleixner, Christoph Hellwig

On 9/4/19 10:31 AM, Daniel Lezcano wrote:
> On 04/09/2019 19:07, Bart Van Assche wrote:
>> Only if CONFIG_IRQ_TIME_ACCOUNTING has been enabled. However, I don't
>> know any Linux distro that enables that option. That's probably because
>> that option introduces two rdtsc() calls in each interrupt. Given the
>> overhead introduced by this option, I don't think this is the solution
>> Ming is looking for.
> 
> Was this overhead reported somewhere ?
  I think it is widely known that rdtsc is a relatively slow x86 
instruction. So I expect that using that instruction will cause a 
measurable overhead if it is called frequently enough. I'm not aware of 
any publicly available measurement data however.

Bart.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-04 17:38                                 ` Bart Van Assche
@ 2019-09-04 18:02                                   ` Peter Zijlstra
  2019-09-04 19:47                                     ` Bart Van Assche
  0 siblings, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2019-09-04 18:02 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Daniel Lezcano, Ming Lei, Keith Busch, Hannes Reinecke,
	Sagi Grimberg, linux-scsi, Long Li, John Garry, LKML, linux-nvme,
	Jens Axboe, Ingo Molnar, Thomas Gleixner, Christoph Hellwig

On Wed, Sep 04, 2019 at 10:38:59AM -0700, Bart Van Assche wrote:
> On 9/4/19 10:31 AM, Daniel Lezcano wrote:
> > On 04/09/2019 19:07, Bart Van Assche wrote:
> > > Only if CONFIG_IRQ_TIME_ACCOUNTING has been enabled. However, I don't
> > > know any Linux distro that enables that option. That's probably because
> > > that option introduces two rdtsc() calls in each interrupt. Given the
> > > overhead introduced by this option, I don't think this is the solution
> > > Ming is looking for.
> > 
> > Was this overhead reported somewhere ?
>  I think it is widely known that rdtsc is a relatively slow x86 instruction.
> So I expect that using that instruction will cause a measurable overhead if
> it is called frequently enough. I'm not aware of any publicly available
> measurement data however.

https://www.agner.org/optimize/instruction_tables.pdf

RDTSC, Ryzen: ~36
RDTSC, Skylake: ~20

Sadly those same tables don't list the cost of actual exceptions or even
IRET :/

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-04 18:02                                   ` Peter Zijlstra
@ 2019-09-04 19:47                                     ` Bart Van Assche
  2019-09-05  9:11                                       ` Ming Lei
  0 siblings, 1 reply; 64+ messages in thread
From: Bart Van Assche @ 2019-09-04 19:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Daniel Lezcano, Ming Lei, Keith Busch, Hannes Reinecke,
	Sagi Grimberg, linux-scsi, Long Li, John Garry, LKML, linux-nvme,
	Jens Axboe, Ingo Molnar, Thomas Gleixner, Christoph Hellwig

On 9/4/19 11:02 AM, Peter Zijlstra wrote:
> On Wed, Sep 04, 2019 at 10:38:59AM -0700, Bart Van Assche wrote:
>> I think it is widely known that rdtsc is a relatively slow x86 instruction.
>> So I expect that using that instruction will cause a measurable overhead if
>> it is called frequently enough. I'm not aware of any publicly available
>> measurement data however.
> 
> https://www.agner.org/optimize/instruction_tables.pdf
> 
> RDTSC, Ryzen: ~36
> RDTSC, Skylake: ~20
> 
> Sadly those same tables don't list the cost of actual exceptions or even
> IRET :/

Thanks Peter for having looked up these numbers. These numbers are much 
better than last time I checked. Ming, would CONFIG_IRQ_TIME_ACCOUNTING 
help your workload?

Does anyone know which CPUs the following text from 
Documentation/admin-guide/kernel-parameters.txt refers to?

tsc=		[ ... ]
		[x86] noirqtime: Do not use TSC to do irq accounting.
		Used to run time disable IRQ_TIME_ACCOUNTING on any
		platforms where RDTSC is slow and this accounting
		can add overhead.

Bart.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-04 17:31                               ` Daniel Lezcano
  2019-09-04 17:38                                 ` Bart Van Assche
@ 2019-09-05  9:06                                 ` Ming Lei
  2019-09-05 10:37                                   ` Daniel Lezcano
  1 sibling, 1 reply; 64+ messages in thread
From: Ming Lei @ 2019-09-05  9:06 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Bart Van Assche, Jens Axboe, Hannes Reinecke, Sagi Grimberg,
	linux-scsi, Peter Zijlstra, Long Li, John Garry, LKML,
	linux-nvme, Keith Busch, Ingo Molnar, Thomas Gleixner,
	Christoph Hellwig

On Wed, Sep 04, 2019 at 07:31:48PM +0200, Daniel Lezcano wrote:
> Hi,
> 
> On 04/09/2019 19:07, Bart Van Assche wrote:
> > On 9/3/19 12:50 AM, Daniel Lezcano wrote:
> >> On 03/09/2019 09:28, Ming Lei wrote:
> >>> On Tue, Sep 03, 2019 at 08:40:35AM +0200, Daniel Lezcano wrote:
> >>>> It is a scheduler problem then ?
> >>>
> >>> Scheduler can do nothing if the CPU is taken completely by handling
> >>> interrupt & softirq, so seems not a scheduler problem, IMO.
> >>
> >> Why? If there is a irq pressure on one CPU reducing its capacity, the
> >> scheduler will balance the tasks on another CPU, no?
> > 
> > Only if CONFIG_IRQ_TIME_ACCOUNTING has been enabled. However, I don't
> > know any Linux distro that enables that option. That's probably because
> > that option introduces two rdtsc() calls in each interrupt. Given the
> > overhead introduced by this option, I don't think this is the solution
> > Ming is looking for.
> 
> Was this overhead reported somewhere ?

The syscall of gettimeofday() calls ktime_get_real_ts64() which finally
calls tk_clock_read() which calls rdtsc too.

But gettimeofday() is often used in fast path, and block IO_STAT needs to
read it too.

> 
> > See also irqtime_account_irq() in kernel/sched/cputime.c.
> 
> From my POV, this framework could be interesting to detect this situation.

Now we are talking about IRQ_TIME_ACCOUNTING instead of IRQ_TIMINGS, and the
former one could be used to implement the detection. And the only sharing
should be the read of timestamp.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-04 19:47                                     ` Bart Van Assche
@ 2019-09-05  9:11                                       ` Ming Lei
  0 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2019-09-05  9:11 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Peter Zijlstra, Daniel Lezcano, Keith Busch, Hannes Reinecke,
	Sagi Grimberg, linux-scsi, Long Li, John Garry, LKML, linux-nvme,
	Jens Axboe, Ingo Molnar, Thomas Gleixner, Christoph Hellwig

On Wed, Sep 04, 2019 at 12:47:13PM -0700, Bart Van Assche wrote:
> On 9/4/19 11:02 AM, Peter Zijlstra wrote:
> > On Wed, Sep 04, 2019 at 10:38:59AM -0700, Bart Van Assche wrote:
> > > I think it is widely known that rdtsc is a relatively slow x86 instruction.
> > > So I expect that using that instruction will cause a measurable overhead if
> > > it is called frequently enough. I'm not aware of any publicly available
> > > measurement data however.
> > 
> > https://www.agner.org/optimize/instruction_tables.pdf
> > 
> > RDTSC, Ryzen: ~36
> > RDTSC, Skylake: ~20
> > 
> > Sadly those same tables don't list the cost of actual exceptions or even
> > IRET :/
> 
> Thanks Peter for having looked up these numbers. These numbers are much
> better than last time I checked. Ming, would CONFIG_IRQ_TIME_ACCOUNTING help
> your workload?

In my fio test on azure L80sv2, IRQ_TIME_ACCOUNTING isn't enabled.
However the irq flood detection introduces two RDTSC for each do_IRQ(),
not see obvious IOPS difference.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-05  9:06                                 ` Ming Lei
@ 2019-09-05 10:37                                   ` Daniel Lezcano
  2019-09-06  1:22                                     ` Long Li
  2019-09-06  1:48                                     ` Ming Lei
  0 siblings, 2 replies; 64+ messages in thread
From: Daniel Lezcano @ 2019-09-05 10:37 UTC (permalink / raw)
  To: Ming Lei
  Cc: Bart Van Assche, Jens Axboe, Hannes Reinecke, Sagi Grimberg,
	linux-scsi, Peter Zijlstra, Long Li, John Garry, LKML,
	linux-nvme, Keith Busch, Ingo Molnar, Thomas Gleixner,
	Christoph Hellwig


Hi Ming,

On 05/09/2019 11:06, Ming Lei wrote:
> On Wed, Sep 04, 2019 at 07:31:48PM +0200, Daniel Lezcano wrote:
>> Hi,
>>
>> On 04/09/2019 19:07, Bart Van Assche wrote:
>>> On 9/3/19 12:50 AM, Daniel Lezcano wrote:
>>>> On 03/09/2019 09:28, Ming Lei wrote:
>>>>> On Tue, Sep 03, 2019 at 08:40:35AM +0200, Daniel Lezcano wrote:
>>>>>> It is a scheduler problem then ?
>>>>>
>>>>> Scheduler can do nothing if the CPU is taken completely by handling
>>>>> interrupt & softirq, so seems not a scheduler problem, IMO.
>>>>
>>>> Why? If there is a irq pressure on one CPU reducing its capacity, the
>>>> scheduler will balance the tasks on another CPU, no?
>>>
>>> Only if CONFIG_IRQ_TIME_ACCOUNTING has been enabled. However, I don't
>>> know any Linux distro that enables that option. That's probably because
>>> that option introduces two rdtsc() calls in each interrupt. Given the
>>> overhead introduced by this option, I don't think this is the solution
>>> Ming is looking for.
>>
>> Was this overhead reported somewhere ?
> 
> The syscall of gettimeofday() calls ktime_get_real_ts64() which finally
> calls tk_clock_read() which calls rdtsc too.
> 
> But gettimeofday() is often used in fast path, and block IO_STAT needs to
> read it too.
> 
>>
>>> See also irqtime_account_irq() in kernel/sched/cputime.c.
>>
>> From my POV, this framework could be interesting to detect this situation.
> 
> Now we are talking about IRQ_TIME_ACCOUNTING instead of IRQ_TIMINGS, and the
> former one could be used to implement the detection. And the only sharing
> should be the read of timestamp.

You did not share yet the analysis of the problem (the kernel warnings
give the symptoms) and gave the reasoning for the solution. It is hard
to understand what you are looking for exactly and how to connect the dots.

AFAIU, there are fast medium where the responses to requests are faster
than the time to process them, right?

I don't see how detecting IRQ flooding and use a threaded irq is the
solution, can you explain?

If the responses are coming at a very high rate, whatever the solution
(interrupts, threaded interrupts, polling), we are still in the same
situation.

My suggestion was initially to see if the interrupt load will be taken
into accounts in the cpu load and favorize task migration with the
scheduler load balance to a less loaded CPU, thus the CPU processing
interrupts will end up doing only that while other CPUs will handle the
"threaded" side.

Beside that, I'm wondering if the block scheduler should be somehow
involved in that [1]

  -- Daniel

[1]
https://www.linaro.org/blog/io-bandwidth-management-for-production-quality-services/



-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-05 10:37                                   ` Daniel Lezcano
@ 2019-09-06  1:22                                     ` Long Li
  2019-09-06  4:36                                       ` Daniel Lezcano
  2019-09-06  1:48                                     ` Ming Lei
  1 sibling, 1 reply; 64+ messages in thread
From: Long Li @ 2019-09-06  1:22 UTC (permalink / raw)
  To: Daniel Lezcano, Ming Lei
  Cc: Bart Van Assche, Jens Axboe, Hannes Reinecke, Sagi Grimberg,
	linux-scsi, Peter Zijlstra, John Garry, LKML, linux-nvme,
	Keith Busch, Ingo Molnar, Thomas Gleixner, Christoph Hellwig

>Subject: Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
>
>
>Hi Ming,
>
>On 05/09/2019 11:06, Ming Lei wrote:
>> On Wed, Sep 04, 2019 at 07:31:48PM +0200, Daniel Lezcano wrote:
>>> Hi,
>>>
>>> On 04/09/2019 19:07, Bart Van Assche wrote:
>>>> On 9/3/19 12:50 AM, Daniel Lezcano wrote:
>>>>> On 03/09/2019 09:28, Ming Lei wrote:
>>>>>> On Tue, Sep 03, 2019 at 08:40:35AM +0200, Daniel Lezcano wrote:
>>>>>>> It is a scheduler problem then ?
>>>>>>
>>>>>> Scheduler can do nothing if the CPU is taken completely by
>>>>>> handling interrupt & softirq, so seems not a scheduler problem, IMO.
>>>>>
>>>>> Why? If there is a irq pressure on one CPU reducing its capacity,
>>>>> the scheduler will balance the tasks on another CPU, no?
>>>>
>>>> Only if CONFIG_IRQ_TIME_ACCOUNTING has been enabled. However, I
>>>> don't know any Linux distro that enables that option. That's
>>>> probably because that option introduces two rdtsc() calls in each
>>>> interrupt. Given the overhead introduced by this option, I don't
>>>> think this is the solution Ming is looking for.
>>>
>>> Was this overhead reported somewhere ?
>>
>> The syscall of gettimeofday() calls ktime_get_real_ts64() which
>> finally calls tk_clock_read() which calls rdtsc too.
>>
>> But gettimeofday() is often used in fast path, and block IO_STAT needs
>> to read it too.
>>
>>>
>>>> See also irqtime_account_irq() in kernel/sched/cputime.c.
>>>
>>> From my POV, this framework could be interesting to detect this situation.
>>
>> Now we are talking about IRQ_TIME_ACCOUNTING instead of
>IRQ_TIMINGS,
>> and the former one could be used to implement the detection. And the
>> only sharing should be the read of timestamp.
>
>You did not share yet the analysis of the problem (the kernel warnings give
>the symptoms) and gave the reasoning for the solution. It is hard to
>understand what you are looking for exactly and how to connect the dots.
>
>AFAIU, there are fast medium where the responses to requests are faster
>than the time to process them, right?
>
>I don't see how detecting IRQ flooding and use a threaded irq is the solution,
>can you explain?
>
>If the responses are coming at a very high rate, whatever the solution
>(interrupts, threaded interrupts, polling), we are still in the same situation.
>
>My suggestion was initially to see if the interrupt load will be taken into
>accounts in the cpu load and favorize task migration with the scheduler load
>balance to a less loaded CPU, thus the CPU processing interrupts will end up
>doing only that while other CPUs will handle the "threaded" side.
>
>Beside that, I'm wondering if the block scheduler should be somehow
>involved in that [1]
>
>  -- Daniel

Hi Daniel

I want to share some test results with IRQ_TIME_ACCOUNTING. During tests, the kernel had warnings on RCU stall and soft lockup:

An example of RCU stall;
[ 3016.148250] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 3016.148299] rcu:     66-....: (1 GPs behind) idle=cc2/0/0x3 softirq=10037/10037 fqs=4717
[ 3016.148299]  (detected by 27, t=15011 jiffies, g=45173, q=17194)
[ 3016.148299] Sending NMI from CPU 27 to CPUs 66:
[ 3016.148299] NMI backtrace for cpu 66
[ 3016.148299] CPU: 66 PID: 0 Comm: swapper/66 Tainted: G             L    5.3.0-rc6+ #68
[ 3016.148299] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007  05/18/2018
[ 3016.148299] RIP: 0010:0xffff9c4740013003
[ 3016.148299] Code: Bad RIP value.
[ 3016.148299] RSP: 0018:ffff9c4759acc8d0 EFLAGS: 00000046
[ 3016.148299] RAX: 0000000000000000 RBX: 0000000000000080 RCX: 000000000001000b
[ 3016.148299] RDX: 00000000000000fb RSI: ffff9c4740013000 RDI: ffff9c4759acc920
[ 3016.148299] RBP: ffff9c4759acc920 R08: 0000000000000008 R09: 01484a845c6de350
[ 3016.148299] R10: ffff9c4759accd30 R11: 0000000000000001 R12: 00000000000000fb
[ 3016.148299] R13: 0000000000000042 R14: ffff8a7d9b771f80 R15: 00000000000001e1
[ 3016.148299] FS:  0000000000000000(0000) GS:ffff8afd9f880000(0000) knlGS:0000000000000000
[ 3016.148299] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3016.148299] CR2: ffff9c4740012fd9 CR3: 000000208b9bc000 CR4: 00000000003406e0
[ 3016.148299] Call Trace:
[ 3016.148299]  <IRQ>
[ 3016.148299]  ? __send_ipi_mask+0x145/0x2e0
[ 3016.148299]  ? __send_ipi_one+0x3a/0x60
[ 3016.148299]  ? hv_send_ipi+0x10/0x30
[ 3016.148299]  ? generic_exec_single+0x63/0xe0
[ 3016.148299]  ? smp_call_function_single_async+0x1f/0x40
[ 3016.148299]  ? blk_mq_complete_request+0xdf/0xf0
[ 3016.148299]  ? nvme_irq+0x144/0x240 [nvme]
[ 3016.148299]  ? tick_sched_do_timer+0x80/0x80
[ 3016.148299]  ? __handle_irq_event_percpu+0x40/0x190
[ 3016.148299]  ? handle_irq_event_percpu+0x30/0x70
[ 3016.148299]  ? handle_irq_event+0x36/0x60
[ 3016.148299]  ? handle_edge_irq+0x7e/0x190
[ 3016.148299]  ? handle_irq+0x1c/0x30
[ 3016.148299]  ? do_IRQ+0x49/0xd0
[ 3016.148299]  ? common_interrupt+0xf/0xf
[ 3016.148299]  ? common_interrupt+0xa/0xf
[ 3016.148299]  ? __do_softirq+0x76/0x2e3
[ 3016.148299]  ? __do_softirq+0x4b/0x2e3
[ 3016.148299]  ? sched_clock_local+0x12/0x80
[ 3016.148299]  ? irq_exit+0xdd/0xf0
[ 3016.148299]  ? hv_stimer0_vector_handler+0x62/0x70
[ 3016.148299]  ? hv_stimer0_callback_vector+0xf/0x20
[ 3016.148299]  </IRQ>
[ 3016.148299]  ? __sched_text_end+0x2/0x2
[ 3016.148299]  ? default_idle+0x25/0x150
[ 3016.148299]  ? do_idle+0x1ad/0x250
[ 3016.148299]  ? cpu_startup_entry+0x19/0x20
[ 3016.148299]  ? start_secondary+0x156/0x1a0
[ 3016.148299]  ? secondary_startup_64+0xa4/0xb0

An example of soft lockup:
[ 3082.490820] watchdog: BUG: soft lockup - CPU#66 stuck for 22s! [swapper/66:0]
[ 3082.501353] Modules linked in: xt_owner xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_security nls_iso8859_1 nvme nvme_core pci_hyperv hv_balloon serio_raw joydev sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_decompress zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel hyperv_fb hid_generic aes_x86_64 crypto_simd hid_hyperv cfbfillrect cryptd hv_netvsc glue_helper cfbimgblt hid hv_utils pata_acpi hyperv_keyboard cfbcopyarea
[ 3082.501353] CPU: 66 PID: 0 Comm: swapper/66 Tainted: G             L    5.3.0-rc6+ #68
[ 3082.501353] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007  05/18/2018
[ 3082.501353] RIP: 0010:__do_softirq+0x76/0x2e3
[ 3082.501353] Code: 81 05 72 5c e1 55 00 01 00 00 c7 44 24 20 0a 00 00 00 44 89 74 24 04 48 c7 c0 80 98 02 00 65 66 c7 00 00 00 fb b8 ff ff ff ff <0f> bc 44 24 04 83 c0 01 89 44 24 10 0f 84 a0 00 00 00 48 c7 44 24
[ 3082.501353] RSP: 0018:ffff9c4759accf78 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff12
[ 3082.501353] RAX: 00000000ffffffff RBX: 0000000000000000 RCX: 000002aff0e90b1d
[ 3082.501353] RDX: 00000000000000b5 RSI: 000002aff0e90bd2 RDI: 00000000000000b5
[ 3082.501353] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000029900
[ 3082.501353] R10: ffff9c4759accf20 R11: 0000000002b140e3 R12: 0000000000000000
[ 3082.501353] R13: 0000000000000000 R14: 0000000000000282 R15: 0000000000000000
[ 3082.501353] FS:  0000000000000000(0000) GS:ffff8afd9f880000(0000) knlGS:0000000000000000
[ 3082.501353] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3082.501353] CR2: 0000000000638a60 CR3: 000000208b9bc000 CR4: 00000000003406e0
[ 3082.501353] Call Trace:
[ 3082.501353]  <IRQ>
[ 3082.501353]  ? sched_clock_local+0x12/0x80
[ 3082.501353]  irq_exit+0xdd/0xf0
[ 3082.501353]  hv_stimer0_vector_handler+0x62/0x70
[ 3082.501353]  hv_stimer0_callback_vector+0xf/0x20
[ 3082.501353]  </IRQ>
[ 3082.501353] RIP: 0010:default_idle+0x25/0x150
[ 3082.501353] Code: f2 75 ff 90 90 0f 1f 44 00 00 41 55 41 54 55 53 65 8b 2d 7e 49 0f 56 0f 1f 44 00 00 e9 07 00 00 00 0f 00 2d 5f 9a 4e 00 fb f4 <65> 8b 2d 64 49 0f 56 0f 1f 44 00 00 5b 5d 41 5c 41 5d c3 65 8b 05
[ 3082.501353] RSP: 0018:ffff9c4758b5bec0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff12
[ 3082.501353] RAX: ffffffffa9f1b9c0 RBX: 0000000000000042 RCX: 0000000000000000
[ 3082.501353] RDX: 0000000000000042 RSI: 0000000000000000 RDI: 000002af5fd2e9ef
[ 3082.501353] RBP: 0000000000000042 R08: 0000000000000000 R09: 0000000000029900
[ 3082.501353] R10: ffff9c4758b5be98 R11: 0000000000008a0c R12: 0000000000000000
[ 3082.501353] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 3082.501353]  ? __sched_text_end+0x2/0x2
[ 3082.501353]  do_idle+0x1ad/0x250
[ 3082.501353]  cpu_startup_entry+0x19/0x20
[ 3082.501353]  start_secondary+0x156/0x1a0
[ 3082.501353]  secondary_startup_64+0xa4/0xb0

Tracing shows that the CPU was in either hardirq or softirq all the time before warnings. During tests, the system was unresponsive at times.

Ming's patch fixed this problem. The system was responsive throughout tests.

As for performance hit, both resulted in a small drop in peak IOPS. With IRQ_TIME_ACCOUNTING I see a 3% drop. With Ming's patch it is 1% drop.

For the tests, I used the following fio command on 10 NVMe disks:
fio --bs=4k --ioengine=libaio --iodepth=128 --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1 --direct=1 --runtime=12000 --numjobs=80 --rw=randread --name=test --group_reporting --gtod_reduce=1


Thanks

Long

>
>[1]
>https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww
>.linaro.org%2Fblog%2Fio-bandwidth-management-for-production-quality-
>services%2F&amp;data=02%7C01%7Clongli%40microsoft.com%7C57db27acef
>38457bf8f808d731ed05e6%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0
>%7C637032766394916978&amp;sdata=ADJU0iEvTITaPCkLcKsGCbqr2GEbQ8U1
>S81jmarmrMU%3D&amp;reserved=0
>
>
>
>--
>
><https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww
>.linaro.org%2F&amp;data=02%7C01%7Clongli%40microsoft.com%7C57db27ac
>ef38457bf8f808d731ed05e6%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C
>0%7C637032766394926977&amp;sdata=1IoPpi8C87%2BGPz4RYEMrEYBma0jSe
>Bbd%2FS%2FaAb%2BUKSA%3D&amp;reserved=0> Linaro.org │ Open source
>software for ARM SoCs
>
>Follow Linaro:
><https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww
>.facebook.com%2Fpages%2FLinaro&amp;data=02%7C01%7Clongli%40microso
>ft.com%7C57db27acef38457bf8f808d731ed05e6%7C72f988bf86f141af91ab2d7
>cd011db47%7C1%7C0%7C637032766394926977&amp;sdata=8ycbXr2QuArabxf
>yARomp2ApmadaOG9lmzfC7IT3%2Bj0%3D&amp;reserved=0> Facebook |
><https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Ftwitt
>er.com%2F%23!%2Flinaroorg&amp;data=02%7C01%7Clongli%40microsoft.co
>m%7C57db27acef38457bf8f808d731ed05e6%7C72f988bf86f141af91ab2d7cd01
>1db47%7C1%7C0%7C637032766394926977&amp;sdata=aQllrUNHWZ3Vkyp7lJ4
>Rd9Qxx3DMcjYv9AQ7g9ytZzU%3D&amp;reserved=0> Twitter |
><https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww
>.linaro.org%2Flinaro-
>blog%2F&amp;data=02%7C01%7Clongli%40microsoft.com%7C57db27acef3845
>7bf8f808d731ed05e6%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C6
>37032766394926977&amp;sdata=o2SbnrMPZxRA%2Bu51Zzuew2AQhUvcF0TG
>XsZz%2Bzrivb8%3D&amp;reserved=0> Blog


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-05 10:37                                   ` Daniel Lezcano
  2019-09-06  1:22                                     ` Long Li
@ 2019-09-06  1:48                                     ` Ming Lei
  2019-09-06  5:14                                       ` Daniel Lezcano
  2019-09-06 14:18                                       ` Keith Busch
  1 sibling, 2 replies; 64+ messages in thread
From: Ming Lei @ 2019-09-06  1:48 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Keith Busch, Hannes Reinecke, Bart Van Assche, linux-scsi,
	Peter Zijlstra, Long Li, John Garry, LKML, linux-nvme,
	Jens Axboe, Ingo Molnar, Thomas Gleixner, Christoph Hellwig,
	Sagi Grimberg

Hi Daniel,

On Thu, Sep 05, 2019 at 12:37:13PM +0200, Daniel Lezcano wrote:
> 
> Hi Ming,
> 
> On 05/09/2019 11:06, Ming Lei wrote:
> > On Wed, Sep 04, 2019 at 07:31:48PM +0200, Daniel Lezcano wrote:
> >> Hi,
> >>
> >> On 04/09/2019 19:07, Bart Van Assche wrote:
> >>> On 9/3/19 12:50 AM, Daniel Lezcano wrote:
> >>>> On 03/09/2019 09:28, Ming Lei wrote:
> >>>>> On Tue, Sep 03, 2019 at 08:40:35AM +0200, Daniel Lezcano wrote:
> >>>>>> It is a scheduler problem then ?
> >>>>>
> >>>>> Scheduler can do nothing if the CPU is taken completely by handling
> >>>>> interrupt & softirq, so seems not a scheduler problem, IMO.
> >>>>
> >>>> Why? If there is a irq pressure on one CPU reducing its capacity, the
> >>>> scheduler will balance the tasks on another CPU, no?
> >>>
> >>> Only if CONFIG_IRQ_TIME_ACCOUNTING has been enabled. However, I don't
> >>> know any Linux distro that enables that option. That's probably because
> >>> that option introduces two rdtsc() calls in each interrupt. Given the
> >>> overhead introduced by this option, I don't think this is the solution
> >>> Ming is looking for.
> >>
> >> Was this overhead reported somewhere ?
> > 
> > The syscall of gettimeofday() calls ktime_get_real_ts64() which finally
> > calls tk_clock_read() which calls rdtsc too.
> > 
> > But gettimeofday() is often used in fast path, and block IO_STAT needs to
> > read it too.
> > 
> >>
> >>> See also irqtime_account_irq() in kernel/sched/cputime.c.
> >>
> >> From my POV, this framework could be interesting to detect this situation.
> > 
> > Now we are talking about IRQ_TIME_ACCOUNTING instead of IRQ_TIMINGS, and the
> > former one could be used to implement the detection. And the only sharing
> > should be the read of timestamp.
> 
> You did not share yet the analysis of the problem (the kernel warnings
> give the symptoms) and gave the reasoning for the solution. It is hard
> to understand what you are looking for exactly and how to connect the dots.

Let me explain it one more time:

When one IRQ flood happens on one CPU:

1) softirq handling on this CPU can't make progress

2) kernel thread bound to this CPU can't make progress

For example, network may require softirq to xmit packets, or another irq
thread for handling keyboards/mice or whatever, or rcu_sched may depend
on that CPU for making progress, then the irq flood stalls the whole
system.

> 
> AFAIU, there are fast medium where the responses to requests are faster
> than the time to process them, right?

Usually medium may not be faster than CPU, now we are talking about
interrupts, which can be originated from lots of devices concurrently,
for example, in Long Li'test, there are 8 NVMe drives involved.

> 
> I don't see how detecting IRQ flooding and use a threaded irq is the
> solution, can you explain?

When IRQ flood is detected, we reserve a bit little time for providing
chance to make softirq/threads scheduled by scheduler, then the above
problem can be avoided.

> 
> If the responses are coming at a very high rate, whatever the solution
> (interrupts, threaded interrupts, polling), we are still in the same
> situation.

When we moving the interrupt handling into irq thread, other softirq/
threaded interrupt/thread gets chance to be scheduled, so we can avoid
to stall the whole system.

> 
> My suggestion was initially to see if the interrupt load will be taken
> into accounts in the cpu load and favorize task migration with the
> scheduler load balance to a less loaded CPU, thus the CPU processing
> interrupts will end up doing only that while other CPUs will handle the
> "threaded" side.
> 
> Beside that, I'm wondering if the block scheduler should be somehow
> involved in that [1]

For NVMe or any multi-queue storage, the default scheduler is 'none',
which basically does nothing except for submitting IO asap.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-06  1:22                                     ` Long Li
@ 2019-09-06  4:36                                       ` Daniel Lezcano
  2019-09-06  4:44                                         ` Long Li
  0 siblings, 1 reply; 64+ messages in thread
From: Daniel Lezcano @ 2019-09-06  4:36 UTC (permalink / raw)
  To: Long Li, Ming Lei
  Cc: Bart Van Assche, Jens Axboe, Hannes Reinecke, Sagi Grimberg,
	linux-scsi, Peter Zijlstra, John Garry, LKML, linux-nvme,
	Keith Busch, Ingo Molnar, Thomas Gleixner, Christoph Hellwig


On 06/09/2019 03:22, Long Li wrote:
[ ... ]
> 

> Tracing shows that the CPU was in either hardirq or softirq all the
> time before warnings. During tests, the system was unresponsive at
> times.
> 
> Ming's patch fixed this problem. The system was responsive throughout
> tests.
> 
> As for performance hit, both resulted in a small drop in peak IOPS.
> With IRQ_TIME_ACCOUNTING I see a 3% drop. With Ming's patch it is 1%
> drop.

Do you mean IRQ_TIME_ACCOUNTING + irq threaded ?


> For the tests, I used the following fio command on 10 NVMe disks: fio
> --bs=4k --ioengine=libaio --iodepth=128
> --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1
> --direct=1 --runtime=12000 --numjobs=80 --rw=randread --name=test
> --group_reporting --gtod_reduce=1

-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-06  4:36                                       ` Daniel Lezcano
@ 2019-09-06  4:44                                         ` Long Li
  0 siblings, 0 replies; 64+ messages in thread
From: Long Li @ 2019-09-06  4:44 UTC (permalink / raw)
  To: Daniel Lezcano, Ming Lei
  Cc: Bart Van Assche, Jens Axboe, Hannes Reinecke, Sagi Grimberg,
	linux-scsi, Peter Zijlstra, John Garry, LKML, linux-nvme,
	Keith Busch, Ingo Molnar, Thomas Gleixner, Christoph Hellwig

>Subject: Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
>
>
>On 06/09/2019 03:22, Long Li wrote:
>[ ... ]
>>
>
>> Tracing shows that the CPU was in either hardirq or softirq all the
>> time before warnings. During tests, the system was unresponsive at
>> times.
>>
>> Ming's patch fixed this problem. The system was responsive throughout
>> tests.
>>
>> As for performance hit, both resulted in a small drop in peak IOPS.
>> With IRQ_TIME_ACCOUNTING I see a 3% drop. With Ming's patch it is 1%
>> drop.
>
>Do you mean IRQ_TIME_ACCOUNTING + irq threaded ?

It's just IRQ_TIME_ACCOUNTING.

>
>
>> For the tests, I used the following fio command on 10 NVMe disks: fio
>> --bs=4k --ioengine=libaio --iodepth=128
>> --
>filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev
>/nv
>>
>me4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/n
>vme9n1
>> --direct=1 --runtime=12000 --numjobs=80 --rw=randread --name=test
>> --group_reporting --gtod_reduce=1
>
>--
>
><https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww
>.linaro.org%2F&amp;data=02%7C01%7Clongli%40microsoft.com%7Cf142f9f9e
>15145434dd608d73283c817%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0
>%7C637033413903343340&amp;sdata=FRCGiKyxpdqyIPob1nWITGvymRdI3fSG
>vyBJovpwVw4%3D&amp;reserved=0> Linaro.org │ Open source software for
>ARM SoCs
>
>Follow Linaro:
><https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww
>.facebook.com%2Fpages%2FLinaro&amp;data=02%7C01%7Clongli%40microso
>ft.com%7Cf142f9f9e15145434dd608d73283c817%7C72f988bf86f141af91ab2d7c
>d011db47%7C1%7C0%7C637033413903343340&amp;sdata=P6t7wiGUESJoFuKi
>u3VrjRMGBYUWAW7TEYinUiFrlQs%3D&amp;reserved=0> Facebook |
><https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Ftwitt
>er.com%2F%23!%2Flinaroorg&amp;data=02%7C01%7Clongli%40microsoft.co
>m%7Cf142f9f9e15145434dd608d73283c817%7C72f988bf86f141af91ab2d7cd011
>db47%7C1%7C0%7C637033413903343340&amp;sdata=UB%2FOZZ1Mz38PQiDa
>BiJOHS4qr%2FWCejI0aKX9JRPNZ3s%3D&amp;reserved=0> Twitter |
><https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww
>.linaro.org%2Flinaro-
>blog%2F&amp;data=02%7C01%7Clongli%40microsoft.com%7Cf142f9f9e15145
>434dd608d73283c817%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C6
>37033413903343340&amp;sdata=7%2BrawoAWuFzou90GTgIUJV%2Fasv2N2Og
>ciePvYmblDFM%3D&amp;reserved=0> Blog


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-06  1:48                                     ` Ming Lei
@ 2019-09-06  5:14                                       ` Daniel Lezcano
  2019-09-06 18:30                                         ` Sagi Grimberg
  2019-09-06 14:18                                       ` Keith Busch
  1 sibling, 1 reply; 64+ messages in thread
From: Daniel Lezcano @ 2019-09-06  5:14 UTC (permalink / raw)
  To: Ming Lei
  Cc: Keith Busch, Hannes Reinecke, Bart Van Assche, linux-scsi,
	Peter Zijlstra, Long Li, John Garry, LKML, linux-nvme,
	Jens Axboe, Ingo Molnar, Thomas Gleixner, Christoph Hellwig,
	Sagi Grimberg


Hi,

On 06/09/2019 03:48, Ming Lei wrote:

[ ... ]

>> You did not share yet the analysis of the problem (the kernel warnings
>> give the symptoms) and gave the reasoning for the solution. It is hard
>> to understand what you are looking for exactly and how to connect the dots.
> 
> Let me explain it one more time:>
> When one IRQ flood happens on one CPU:
> 
> 1) softirq handling on this CPU can't make progress
> 
> 2) kernel thread bound to this CPU can't make progress
> 
> For example, network may require softirq to xmit packets, or another irq
> thread for handling keyboards/mice or whatever, or rcu_sched may depend
> on that CPU for making progress, then the irq flood stalls the whole
> system.
> 
>>
>> AFAIU, there are fast medium where the responses to requests are faster
>> than the time to process them, right?
> 
> Usually medium may not be faster than CPU, now we are talking about
> interrupts, which can be originated from lots of devices concurrently,
> for example, in Long Li'test, there are 8 NVMe drives involved.
> 
>>
>> I don't see how detecting IRQ flooding and use a threaded irq is the
>> solution, can you explain?
> 
> When IRQ flood is detected, we reserve a bit little time for providing
> chance to make softirq/threads scheduled by scheduler, then the above
> problem can be avoided.
> 
>>
>> If the responses are coming at a very high rate, whatever the solution
>> (interrupts, threaded interrupts, polling), we are still in the same
>> situation.
> 
> When we moving the interrupt handling into irq thread, other softirq/
> threaded interrupt/thread gets chance to be scheduled, so we can avoid
> to stall the whole system.

Ok, so the real problem is per-cpu bounded tasks.

I share Thomas opinion about a NAPI like approach.

I do believe you should also rely on the IRQ_TIME_ACCOUNTING (may be get
it optimized) to contribute to the CPU load and enforce task migration
at load balance.

>> My suggestion was initially to see if the interrupt load will be taken
>> into accounts in the cpu load and favorize task migration with the
>> scheduler load balance to a less loaded CPU, thus the CPU processing
>> interrupts will end up doing only that while other CPUs will handle the
>> "threaded" side.
>>
>> Beside that, I'm wondering if the block scheduler should be somehow
>> involved in that [1]
> 
> For NVMe or any multi-queue storage, the default scheduler is 'none',
> which basically does nothing except for submitting IO asap.
> 
> 
> Thanks,
> Ming
> 


-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 4/4] genirq: use irq's affinity for threaded irq with IRQF_RESCUE_THREAD
  2019-08-27  8:53 ` [PATCH 4/4] genirq: use irq's affinity for threaded irq with IRQF_RESCUE_THREAD Ming Lei
  2019-08-27 14:35   ` Keith Busch
@ 2019-09-06  8:50   ` John Garry
  1 sibling, 0 replies; 64+ messages in thread
From: John Garry @ 2019-09-06  8:50 UTC (permalink / raw)
  To: Ming Lei, Thomas Gleixner
  Cc: linux-kernel, Long Li, Ingo Molnar, Peter Zijlstra, Keith Busch,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, Hannes Reinecke,
	linux-nvme, linux-scsi, chenxiang, daniel.lezcano

On 27/08/2019 09:53, Ming Lei wrote:
> In case of IRQF_RESCUE_THREAD, the threaded handler is only used to
> handle interrupt when IRQ flood comes, use irq's affinity for this thread
> so that scheduler may select other not too busy CPUs for handling the
> interrupt.
>
> Cc: Long Li <longli@microsoft.com>
> Cc: Ingo Molnar <mingo@redhat.com>,
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Keith Busch <keith.busch@intel.com>
> Cc: Jens Axboe <axboe@fb.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Sagi Grimberg <sagi@grimberg.me>
> Cc: John Garry <john.garry@huawei.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: linux-nvme@lists.infradead.org
> Cc: linux-scsi@vger.kernel.org
> Signed-off-by: Ming Lei <ming.lei@redhat.com>


> ---
>  kernel/irq/manage.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index 1566abbf50e8..03bc041348b7 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -968,7 +968,18 @@ irq_thread_check_affinity(struct irq_desc *desc, struct irqaction *action)
>  	if (cpumask_available(desc->irq_common_data.affinity)) {
>  		const struct cpumask *m;
>
> -		m = irq_data_get_effective_affinity_mask(&desc->irq_data);
> +		/*
> +		 * Managed IRQ's affinity is setup gracefull on MUNA locality,

gracefully

> +		 * also if IRQF_RESCUE_THREAD is set, interrupt flood has been
> +		 * triggered, so ask scheduler to run the thread on CPUs
> +		 * specified by this interrupt's affinity.
> +		 */

Hi Ming,

> +		if ((action->flags & IRQF_RESCUE_THREAD) &&
> +				irqd_affinity_is_managed(&desc->irq_data))

This doesn't look to solve the other issue I reported - that being that 
we handle the interrupt in a threaded handler natively, and the hard 
irq+threaded handler fully occupies the cpu, limiting throughput.

So can we expand the scope to cover that scenario also? I don't think 
that it’s right to solve that separately. So if we're continuing this 
approach, can we add separate judgment for spreading the cpumask for the 
threaded part?

Thanks,
John

> +			m = desc->irq_common_data.affinity;
> +		else
> +			m = irq_data_get_effective_affinity_mask(
> +					&desc->irq_data);
>  		cpumask_copy(mask, m);
>  	} else {
>  		valid = false;
>



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-06  1:48                                     ` Ming Lei
  2019-09-06  5:14                                       ` Daniel Lezcano
@ 2019-09-06 14:18                                       ` Keith Busch
  2019-09-06 17:50                                         ` Long Li
  1 sibling, 1 reply; 64+ messages in thread
From: Keith Busch @ 2019-09-06 14:18 UTC (permalink / raw)
  To: Ming Lei
  Cc: Daniel Lezcano, Keith Busch, Hannes Reinecke, Bart Van Assche,
	linux-scsi, Peter Zijlstra, Long Li, John Garry, LKML,
	linux-nvme, Jens Axboe, Ingo Molnar, Thomas Gleixner,
	Christoph Hellwig, Sagi Grimberg

On Fri, Sep 06, 2019 at 09:48:21AM +0800, Ming Lei wrote:
> When one IRQ flood happens on one CPU:
> 
> 1) softirq handling on this CPU can't make progress
> 
> 2) kernel thread bound to this CPU can't make progress
> 
> For example, network may require softirq to xmit packets, or another irq
> thread for handling keyboards/mice or whatever, or rcu_sched may depend
> on that CPU for making progress, then the irq flood stalls the whole
> system.
> 
> > 
> > AFAIU, there are fast medium where the responses to requests are faster
> > than the time to process them, right?
> 
> Usually medium may not be faster than CPU, now we are talking about
> interrupts, which can be originated from lots of devices concurrently,
> for example, in Long Li'test, there are 8 NVMe drives involved.

Why are all 8 nvmes sharing the same CPU for interrupt handling?
Shouldn't matrix_find_best_cpu_managed() handle selecting the least used
CPU from the cpumask for the effective interrupt handling?

^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-06 14:18                                       ` Keith Busch
@ 2019-09-06 17:50                                         ` Long Li
  2019-09-06 22:19                                           ` Ming Lei
  0 siblings, 1 reply; 64+ messages in thread
From: Long Li @ 2019-09-06 17:50 UTC (permalink / raw)
  To: Keith Busch, Ming Lei
  Cc: Daniel Lezcano, Keith Busch, Hannes Reinecke, Bart Van Assche,
	linux-scsi, Peter Zijlstra, John Garry, LKML, linux-nvme,
	Jens Axboe, Ingo Molnar, Thomas Gleixner, Christoph Hellwig,
	Sagi Grimberg

>Subject: Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
>
>On Fri, Sep 06, 2019 at 09:48:21AM +0800, Ming Lei wrote:
>> When one IRQ flood happens on one CPU:
>>
>> 1) softirq handling on this CPU can't make progress
>>
>> 2) kernel thread bound to this CPU can't make progress
>>
>> For example, network may require softirq to xmit packets, or another
>> irq thread for handling keyboards/mice or whatever, or rcu_sched may
>> depend on that CPU for making progress, then the irq flood stalls the
>> whole system.
>>
>> >
>> > AFAIU, there are fast medium where the responses to requests are
>> > faster than the time to process them, right?
>>
>> Usually medium may not be faster than CPU, now we are talking about
>> interrupts, which can be originated from lots of devices concurrently,
>> for example, in Long Li'test, there are 8 NVMe drives involved.
>
>Why are all 8 nvmes sharing the same CPU for interrupt handling?
>Shouldn't matrix_find_best_cpu_managed() handle selecting the least used
>CPU from the cpumask for the effective interrupt handling?

The tests run on 10 NVMe disks on a system of 80 CPUs. Each NVMe disk has 32 hardware queues.
It seems matrix_find_best_cpu_managed() has done its job, but we may still have CPUs that service several hardware queues mapped from other issuing CPUs.
Another thing to consider is that there may be other managed interrupts on the system, so NVMe interrupts may not end up evenly distributed on such a system.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-06  5:14                                       ` Daniel Lezcano
@ 2019-09-06 18:30                                         ` Sagi Grimberg
  2019-09-06 18:52                                           ` Keith Busch
  2019-09-07  0:01                                           ` Ming Lei
  0 siblings, 2 replies; 64+ messages in thread
From: Sagi Grimberg @ 2019-09-06 18:30 UTC (permalink / raw)
  To: Daniel Lezcano, Ming Lei
  Cc: Jens Axboe, Hannes Reinecke, Bart Van Assche, linux-scsi,
	Peter Zijlstra, Long Li, John Garry, LKML, linux-nvme,
	Keith Busch, Ingo Molnar, Thomas Gleixner, Christoph Hellwig


> 
> Ok, so the real problem is per-cpu bounded tasks.
> 
> I share Thomas opinion about a NAPI like approach.

We already have that, its irq_poll, but it seems that for this
use-case, we get lower performance for some reason. I'm not
entirely sure why that is, maybe its because we need to mask interrupts
because we don't have an "arm" register in nvme like network devices
have?

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-06 18:30                                         ` Sagi Grimberg
@ 2019-09-06 18:52                                           ` Keith Busch
  2019-09-07  0:01                                           ` Ming Lei
  1 sibling, 0 replies; 64+ messages in thread
From: Keith Busch @ 2019-09-06 18:52 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Daniel Lezcano, Ming Lei, Jens Axboe, Hannes Reinecke,
	Bart Van Assche, linux-scsi, Peter Zijlstra, Long Li, John Garry,
	LKML, linux-nvme, Keith Busch, Ingo Molnar, Thomas Gleixner,
	Christoph Hellwig

On Fri, Sep 06, 2019 at 11:30:57AM -0700, Sagi Grimberg wrote:
> 
> > 
> > Ok, so the real problem is per-cpu bounded tasks.
> > 
> > I share Thomas opinion about a NAPI like approach.
> 
> We already have that, its irq_poll, but it seems that for this
> use-case, we get lower performance for some reason. I'm not
> entirely sure why that is, maybe its because we need to mask interrupts
> because we don't have an "arm" register in nvme like network devices
> have?

For MSI, that's the INTMS/INTMC NVMe registers. MSI-x, though, has to
disarm it in its table entry, and the Linux implementation will do a
posted read in that path, which is a bit too expensive.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-06 17:50                                         ` Long Li
@ 2019-09-06 22:19                                           ` Ming Lei
  2019-09-06 22:25                                             ` Keith Busch
  2019-09-10  0:24                                             ` Ming Lei
  0 siblings, 2 replies; 64+ messages in thread
From: Ming Lei @ 2019-09-06 22:19 UTC (permalink / raw)
  To: Long Li
  Cc: Keith Busch, Daniel Lezcano, Keith Busch, Hannes Reinecke,
	Bart Van Assche, linux-scsi, Peter Zijlstra, John Garry, LKML,
	linux-nvme, Jens Axboe, Ingo Molnar, Thomas Gleixner,
	Christoph Hellwig, Sagi Grimberg

On Fri, Sep 06, 2019 at 05:50:49PM +0000, Long Li wrote:
> >Subject: Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
> >
> >On Fri, Sep 06, 2019 at 09:48:21AM +0800, Ming Lei wrote:
> >> When one IRQ flood happens on one CPU:
> >>
> >> 1) softirq handling on this CPU can't make progress
> >>
> >> 2) kernel thread bound to this CPU can't make progress
> >>
> >> For example, network may require softirq to xmit packets, or another
> >> irq thread for handling keyboards/mice or whatever, or rcu_sched may
> >> depend on that CPU for making progress, then the irq flood stalls the
> >> whole system.
> >>
> >> >
> >> > AFAIU, there are fast medium where the responses to requests are
> >> > faster than the time to process them, right?
> >>
> >> Usually medium may not be faster than CPU, now we are talking about
> >> interrupts, which can be originated from lots of devices concurrently,
> >> for example, in Long Li'test, there are 8 NVMe drives involved.
> >
> >Why are all 8 nvmes sharing the same CPU for interrupt handling?
> >Shouldn't matrix_find_best_cpu_managed() handle selecting the least used
> >CPU from the cpumask for the effective interrupt handling?
> 
> The tests run on 10 NVMe disks on a system of 80 CPUs. Each NVMe disk has 32 hardware queues.

Then there are total 320 NVMe MSI/X vectors, and 80 CPUs, so irq matrix
can't avoid effective CPUs overlapping at all.

> It seems matrix_find_best_cpu_managed() has done its job, but we may still have CPUs that service several hardware queues mapped from other issuing CPUs.
> Another thing to consider is that there may be other managed interrupts on the system, so NVMe interrupts may not end up evenly distributed on such a system.

Another improvement could be to try to not overlap effective CPUs among
vectors of fast device first, meantime allow the overlap between slow
vectors and fast vectors.

This way could improve in case that total fast vectors are <= nr_cpu_cores.

thanks,
Ming

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-06 22:19                                           ` Ming Lei
@ 2019-09-06 22:25                                             ` Keith Busch
  2019-09-06 23:13                                               ` Ming Lei
  2019-09-10  0:24                                             ` Ming Lei
  1 sibling, 1 reply; 64+ messages in thread
From: Keith Busch @ 2019-09-06 22:25 UTC (permalink / raw)
  To: Ming Lei
  Cc: Long Li, Daniel Lezcano, Keith Busch, Hannes Reinecke,
	Bart Van Assche, linux-scsi, Peter Zijlstra, John Garry, LKML,
	linux-nvme, Jens Axboe, Ingo Molnar, Thomas Gleixner,
	Christoph Hellwig, Sagi Grimberg

On Sat, Sep 07, 2019 at 06:19:21AM +0800, Ming Lei wrote:
> On Fri, Sep 06, 2019 at 05:50:49PM +0000, Long Li wrote:
> > >Subject: Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
> > >
> > >Why are all 8 nvmes sharing the same CPU for interrupt handling?
> > >Shouldn't matrix_find_best_cpu_managed() handle selecting the least used
> > >CPU from the cpumask for the effective interrupt handling?
> > 
> > The tests run on 10 NVMe disks on a system of 80 CPUs. Each NVMe disk has 32 hardware queues.
> 
> Then there are total 320 NVMe MSI/X vectors, and 80 CPUs, so irq matrix
> can't avoid effective CPUs overlapping at all.

Sure, but it's at most half, meanwhile the CPU that's dispatching requests
would naturally be throttled by the other half who's completions are
interrupting that CPU, no?

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-06 22:25                                             ` Keith Busch
@ 2019-09-06 23:13                                               ` Ming Lei
  0 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2019-09-06 23:13 UTC (permalink / raw)
  To: Keith Busch
  Cc: Long Li, Daniel Lezcano, Keith Busch, Hannes Reinecke,
	Bart Van Assche, linux-scsi, Peter Zijlstra, John Garry, LKML,
	linux-nvme, Jens Axboe, Ingo Molnar, Thomas Gleixner,
	Christoph Hellwig, Sagi Grimberg

On Fri, Sep 06, 2019 at 04:25:55PM -0600, Keith Busch wrote:
> On Sat, Sep 07, 2019 at 06:19:21AM +0800, Ming Lei wrote:
> > On Fri, Sep 06, 2019 at 05:50:49PM +0000, Long Li wrote:
> > > >Subject: Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
> > > >
> > > >Why are all 8 nvmes sharing the same CPU for interrupt handling?
> > > >Shouldn't matrix_find_best_cpu_managed() handle selecting the least used
> > > >CPU from the cpumask for the effective interrupt handling?
> > > 
> > > The tests run on 10 NVMe disks on a system of 80 CPUs. Each NVMe disk has 32 hardware queues.
> > 
> > Then there are total 320 NVMe MSI/X vectors, and 80 CPUs, so irq matrix
> > can't avoid effective CPUs overlapping at all.
> 
> Sure, but it's at most half, meanwhile the CPU that's dispatching requests
> would naturally be throttled by the other half who's completions are
> interrupting that CPU, no?

The root cause is that multiple submission vs. single completion.

Let's see two cases:

1) 10 NVMe, each 8 queues, 80 CPU cores
- suppose genriq matrix can avoid effective cpu overlap, each cpu
  only handles one nvme interrupt
- there can be concurrent submissions from 10 CPUs, and all may be
  completed on one CPU
- IRQ flood couldn't happen for this case, given each CPU is only
  handling completion from one NVMe drive, which shouldn't be fast
  than CPU.

2) 10 NVMe, each 32 queues, 80 CPU cores
- one CPU may handle 4 NVMe interrupts, each from different NVMe drive
- then there may be 4*3 submissions aimed at single completion, then IRQ
  flood should be easy triggered on CPU for handing 4 NVMe interrupts.
  Because IO from 4 NVMe drive may be quicker than one CPU.

I can observe IRQ flood on the case #1, but there are still CPUs for
handling 2 NVMe interrupt, as the reason mentioned by Long. We could
improve for this case.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-06 18:30                                         ` Sagi Grimberg
  2019-09-06 18:52                                           ` Keith Busch
@ 2019-09-07  0:01                                           ` Ming Lei
  2019-09-10  3:10                                             ` Sagi Grimberg
  1 sibling, 1 reply; 64+ messages in thread
From: Ming Lei @ 2019-09-07  0:01 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Daniel Lezcano, Keith Busch, Hannes Reinecke, Bart Van Assche,
	linux-scsi, Peter Zijlstra, Long Li, John Garry, LKML,
	linux-nvme, Jens Axboe, Ingo Molnar, Thomas Gleixner,
	Christoph Hellwig

On Fri, Sep 06, 2019 at 11:30:57AM -0700, Sagi Grimberg wrote:
> 
> > 
> > Ok, so the real problem is per-cpu bounded tasks.
> > 
> > I share Thomas opinion about a NAPI like approach.
> 
> We already have that, its irq_poll, but it seems that for this
> use-case, we get lower performance for some reason. I'm not
> entirely sure why that is, maybe its because we need to mask interrupts
> because we don't have an "arm" register in nvme like network devices
> have?

Long observed that IOPS drops much too by switching to threaded irq. If
softirqd is waken up for handing softirq, the performance shouldn't
be better than threaded irq. Especially, Long found that context
switch is increased a lot after applying your irq poll patch.

http://lists.infradead.org/pipermail/linux-nvme/2019-August/026788.html

Thanks,
Ming

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-06 22:19                                           ` Ming Lei
  2019-09-06 22:25                                             ` Keith Busch
@ 2019-09-10  0:24                                             ` Ming Lei
  1 sibling, 0 replies; 64+ messages in thread
From: Ming Lei @ 2019-09-10  0:24 UTC (permalink / raw)
  To: Long Li
  Cc: Keith Busch, Daniel Lezcano, Keith Busch, Hannes Reinecke,
	Bart Van Assche, linux-scsi, Peter Zijlstra, John Garry, LKML,
	linux-nvme, Jens Axboe, Ingo Molnar, Thomas Gleixner,
	Christoph Hellwig, Sagi Grimberg

On Sat, Sep 07, 2019 at 06:19:20AM +0800, Ming Lei wrote:
> On Fri, Sep 06, 2019 at 05:50:49PM +0000, Long Li wrote:
> > >Subject: Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
> > >
> > >On Fri, Sep 06, 2019 at 09:48:21AM +0800, Ming Lei wrote:
> > >> When one IRQ flood happens on one CPU:
> > >>
> > >> 1) softirq handling on this CPU can't make progress
> > >>
> > >> 2) kernel thread bound to this CPU can't make progress
> > >>
> > >> For example, network may require softirq to xmit packets, or another
> > >> irq thread for handling keyboards/mice or whatever, or rcu_sched may
> > >> depend on that CPU for making progress, then the irq flood stalls the
> > >> whole system.
> > >>
> > >> >
> > >> > AFAIU, there are fast medium where the responses to requests are
> > >> > faster than the time to process them, right?
> > >>
> > >> Usually medium may not be faster than CPU, now we are talking about
> > >> interrupts, which can be originated from lots of devices concurrently,
> > >> for example, in Long Li'test, there are 8 NVMe drives involved.
> > >
> > >Why are all 8 nvmes sharing the same CPU for interrupt handling?
> > >Shouldn't matrix_find_best_cpu_managed() handle selecting the least used
> > >CPU from the cpumask for the effective interrupt handling?
> > 
> > The tests run on 10 NVMe disks on a system of 80 CPUs. Each NVMe disk has 32 hardware queues.
> 
> Then there are total 320 NVMe MSI/X vectors, and 80 CPUs, so irq matrix
> can't avoid effective CPUs overlapping at all.
> 
> > It seems matrix_find_best_cpu_managed() has done its job, but we may still have CPUs that service several hardware queues mapped from other issuing CPUs.
> > Another thing to consider is that there may be other managed interrupts on the system, so NVMe interrupts may not end up evenly distributed on such a system.
> 
> Another improvement could be to try to not overlap effective CPUs among
> vectors of fast device first, meantime allow the overlap between slow
> vectors and fast vectors.
> 
> This way could improve in case that total fast vectors are <= nr_cpu_cores.

For this particular case, it can't be done, because:

1) this machine has 10 NUMA nodes, and each NVMe has 8 hw queues, so too
many CPUs are assigned to the 1st two hw queues, see the code branch of
'if (numvecs <= nodes)' in __irq_build_affinity_masks().

2) then less CPUs are assigned to the other 6 hw queues

3) finally same effective CPU is shared by two IRQ vector.

Also looks matrix_find_best_cpu_managed() has been doing well enough for
choosing best effective CPU.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-07  0:01                                           ` Ming Lei
@ 2019-09-10  3:10                                             ` Sagi Grimberg
  2019-09-18  0:00                                               ` Long Li
  2019-09-18 14:37                                               ` Ming Lei
  0 siblings, 2 replies; 64+ messages in thread
From: Sagi Grimberg @ 2019-09-10  3:10 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Hannes Reinecke, John Garry, Bart Van Assche,
	linux-scsi, Peter Zijlstra, Long Li, Daniel Lezcano, LKML,
	linux-nvme, Keith Busch, Ingo Molnar, Thomas Gleixner,
	Christoph Hellwig

Hey Ming,

>>> Ok, so the real problem is per-cpu bounded tasks.
>>>
>>> I share Thomas opinion about a NAPI like approach.
>>
>> We already have that, its irq_poll, but it seems that for this
>> use-case, we get lower performance for some reason. I'm not
>> entirely sure why that is, maybe its because we need to mask interrupts
>> because we don't have an "arm" register in nvme like network devices
>> have?
> 
> Long observed that IOPS drops much too by switching to threaded irq. If
> softirqd is waken up for handing softirq, the performance shouldn't
> be better than threaded irq.

Its true that it shouldn't be any faster, but what irqpoll already has
and we don't need to reinvent is a proper budgeting mechanism that
needs to occur when multiple devices map irq vectors to the same cpu
core.

irqpoll already maintains a percpu list and dispatch the ->poll with
a budget that the backend enforces and irqpoll multiplexes between them.
Having this mechanism in irq (hard or threaded) context sounds
unnecessary a bit.

It seems like we're attempting to stay in irq context for as long as we
can instead of scheduling to softirq/thread context if we have more than
a minimal amount of work to do. Without at least understanding why
softirq/thread degrades us so much this code seems like the wrong
approach to me. Interrupt context will always be faster, but it is
not a sufficient reason to spend as much time as possible there, is it?

We should also keep in mind, that the networking stack has been doing
this for years, I would try to understand why this cannot work for nvme
before dismissing.

> Especially, Long found that context
> switch is increased a lot after applying your irq poll patch.
> 
> http://lists.infradead.org/pipermail/linux-nvme/2019-August/026788.html

Oh, I didn't see that one, wonder why... thanks!

5% improvement, I guess we can buy that for other users as is :)

If we suffer from lots of context switches while the CPU is flooded with
interrupts, then I would argue that we're re-raising softirq too much.
In this use-case, my assumption is that the cpu cannot keep up with the
interrupts and not that it doesn't reap enough (we also reap the first
batch in interrupt context...)

Perhaps making irqpoll continue until it must resched would improve
things further? Although this is a latency vs. efficiency tradeoff,
looks like MAX_SOFTIRQ_TIME is set to 2ms:

"
  * The MAX_SOFTIRQ_TIME provides a nice upper bound in most cases, but in
  * certain cases, such as stop_machine(), jiffies may cease to
  * increment and so we need the MAX_SOFTIRQ_RESTART limit as
  * well to make sure we eventually return from this method.
  *
  * These limits have been established via experimentation.
  * The two things to balance is latency against fairness -
  * we want to handle softirqs as soon as possible, but they
  * should not be able to lock up the box.
"

Long, does this patch make any difference?
--
diff --git a/lib/irq_poll.c b/lib/irq_poll.c
index 2f17b488d58e..d8eab563fa77 100644
--- a/lib/irq_poll.c
+++ b/lib/irq_poll.c
@@ -12,8 +12,6 @@
  #include <linux/irq_poll.h>
  #include <linux/delay.h>

-static unsigned int irq_poll_budget __read_mostly = 256;
-
  static DEFINE_PER_CPU(struct list_head, blk_cpu_iopoll);

  /**
@@ -77,42 +75,29 @@ EXPORT_SYMBOL(irq_poll_complete);

  static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
  {
-       struct list_head *list = this_cpu_ptr(&blk_cpu_iopoll);
-       int rearm = 0, budget = irq_poll_budget;
-       unsigned long start_time = jiffies;
+       struct list_head *irqpoll_list = this_cpu_ptr(&blk_cpu_iopoll);
+       LIST_HEAD(list);

         local_irq_disable();
+       list_splice_init(irqpoll_list, &list);
+       local_irq_enable();

-       while (!list_empty(list)) {
+       while (!list_empty(&list)) {
                 struct irq_poll *iop;
                 int work, weight;

-               /*
-                * If softirq window is exhausted then punt.
-                */
-               if (budget <= 0 || time_after(jiffies, start_time)) {
-                       rearm = 1;
-                       break;
-               }
-
-               local_irq_enable();
-
                 /* Even though interrupts have been re-enabled, this
                  * access is safe because interrupts can only add new
                  * entries to the tail of this list, and only ->poll()
                  * calls can remove this head entry from the list.
                  */
-               iop = list_entry(list->next, struct irq_poll, list);
+               iop = list_first_entry(&list, struct irq_poll, list);

                 weight = iop->weight;
                 work = 0;
                 if (test_bit(IRQ_POLL_F_SCHED, &iop->state))
                         work = iop->poll(iop, weight);

-               budget -= work;
-
-               local_irq_disable();
-
                 /*
                  * Drivers must not modify the iopoll state, if they
                  * consume their assigned weight (or more, some drivers 
can't
@@ -125,11 +110,21 @@ static void __latent_entropy 
irq_poll_softirq(struct softirq_action *h)
                         if (test_bit(IRQ_POLL_F_DISABLE, &iop->state))
                                 __irq_poll_complete(iop);
                         else
-                               list_move_tail(&iop->list, list);
+                               list_move_tail(&iop->list, &list);
                 }
+
+               /*
+                * If softirq window is exhausted then punt.
+                */
+               if (need_resched())
+                       break;
         }

-       if (rearm)
+       local_irq_disable();
+
+       list_splice_tail_init(irqpoll_list, &list);
+       list_splice(&list, irqpoll_list);
+       if (!list_empty(irqpoll_list))
                 __raise_softirq_irqoff(IRQ_POLL_SOFTIRQ);

         local_irq_enable();
--

Reminder to the nvme side (slightly modified):
--
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 52205f8d90b4..09dc6da67b05 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -24,6 +24,7 @@
  #include <linux/io-64-nonatomic-lo-hi.h>
  #include <linux/sed-opal.h>
  #include <linux/pci-p2pdma.h>
+#include <linux/irq_poll.h>

  #include "trace.h"
  #include "nvme.h"
@@ -32,6 +33,7 @@
  #define CQ_SIZE(q)     ((q)->q_depth * sizeof(struct nvme_completion))

  #define SGES_PER_PAGE  (PAGE_SIZE / sizeof(struct nvme_sgl_desc))
+#define NVME_POLL_BUDGET_IRQ   256

  /*
   * These can be higher, but we need to ensure that any command doesn't
@@ -189,6 +191,7 @@ struct nvme_queue {
         u32 *dbbuf_cq_db;
         u32 *dbbuf_sq_ei;
         u32 *dbbuf_cq_ei;
+       struct irq_poll iop;
         struct completion delete_done;
  };

@@ -1014,11 +1017,29 @@ static inline int nvme_process_cq(struct 
nvme_queue *nvmeq, u16 *start,
         return found;
  }

+static int nvme_irqpoll_handler(struct irq_poll *iop, int budget)
+{
+       struct nvme_queue *nvmeq = container_of(iop, struct nvme_queue, 
iop);
+       struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
+       u16 start, end;
+       int completed;
+
+       completed = nvme_process_cq(nvmeq, &start, &end, budget);
+       nvme_complete_cqes(nvmeq, start, end);
+       if (completed < budget) {
+               irq_poll_complete(&nvmeq->iop);
+               enable_irq(pci_irq_vector(pdev, nvmeq->cq_vector));
+       }
+
+       return completed;
+}
+
  static irqreturn_t nvme_irq(int irq, void *data)
  {
         struct nvme_queue *nvmeq = data;
         irqreturn_t ret = IRQ_NONE;
         u16 start, end;
+       int budget = nvmeq->q_depth;

         /*
          * The rmb/wmb pair ensures we see all updates from a previous 
run of
@@ -1027,13 +1048,23 @@ static irqreturn_t nvme_irq(int irq, void *data)
         rmb();
         if (nvmeq->cq_head != nvmeq->last_cq_head)
                 ret = IRQ_HANDLED;
-       nvme_process_cq(nvmeq, &start, &end, -1);
+
+       /* reap here up to a budget of the size the queue depth */
+       do {
+               budget -= nvme_process_cq(nvmeq, &start, &end, budget);
+               if (start != end) {
+                       nvme_complete_cqes(nvmeq, start, end);
+                       ret = IRQ_HANDLED;
+               }
+       } while (start != end && budget > 0);
+
         nvmeq->last_cq_head = nvmeq->cq_head;
         wmb();

-       if (start != end) {
-               nvme_complete_cqes(nvmeq, start, end);
-               return IRQ_HANDLED;
+       /* if we still have cqes to reap, schedule irqpoll */
+       if (start != end && nvme_cqe_pending(nvmeq)) {
+               disable_irq_nosync(irq);
+               irq_poll_sched(&nvmeq->iop);
         }

         return ret;
@@ -1346,6 +1377,7 @@ static enum blk_eh_timer_return 
nvme_timeout(struct request *req, bool reserved)

  static void nvme_free_queue(struct nvme_queue *nvmeq)
  {
+       irq_poll_disable(&nvmeq->iop);
         dma_free_coherent(nvmeq->dev->dev, CQ_SIZE(nvmeq),
                                 (void *)nvmeq->cqes, nvmeq->cq_dma_addr);
         if (!nvmeq->sq_cmds)
@@ -1480,6 +1512,7 @@ static int nvme_alloc_queue(struct nvme_dev *dev, 
int qid, int depth)
         nvmeq->dev = dev;
         spin_lock_init(&nvmeq->sq_lock);
         spin_lock_init(&nvmeq->cq_poll_lock);
+       irq_poll_init(&nvmeq->iop, NVME_POLL_BUDGET_IRQ, 
nvme_irqpoll_handler);
         nvmeq->cq_head = 0;
         nvmeq->cq_phase = 1;
         nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
--

^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-10  3:10                                             ` Sagi Grimberg
@ 2019-09-18  0:00                                               ` Long Li
  2019-09-20 17:14                                                 ` Sagi Grimberg
  2019-09-18 14:37                                               ` Ming Lei
  1 sibling, 1 reply; 64+ messages in thread
From: Long Li @ 2019-09-18  0:00 UTC (permalink / raw)
  To: Sagi Grimberg, Ming Lei
  Cc: Jens Axboe, Hannes Reinecke, John Garry, Bart Van Assche,
	linux-scsi, Peter Zijlstra, Daniel Lezcano, LKML, linux-nvme,
	Keith Busch, Ingo Molnar, Thomas Gleixner, Christoph Hellwig

>Subject: Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
>
>Hey Ming,
>
>>>> Ok, so the real problem is per-cpu bounded tasks.
>>>>
>>>> I share Thomas opinion about a NAPI like approach.
>>>
>>> We already have that, its irq_poll, but it seems that for this
>>> use-case, we get lower performance for some reason. I'm not entirely
>>> sure why that is, maybe its because we need to mask interrupts
>>> because we don't have an "arm" register in nvme like network devices
>>> have?
>>
>> Long observed that IOPS drops much too by switching to threaded irq.
>> If softirqd is waken up for handing softirq, the performance shouldn't
>> be better than threaded irq.
>
>Its true that it shouldn't be any faster, but what irqpoll already has and we
>don't need to reinvent is a proper budgeting mechanism that needs to occur
>when multiple devices map irq vectors to the same cpu core.
>
>irqpoll already maintains a percpu list and dispatch the ->poll with a budget
>that the backend enforces and irqpoll multiplexes between them.
>Having this mechanism in irq (hard or threaded) context sounds unnecessary a
>bit.
>
>It seems like we're attempting to stay in irq context for as long as we can
>instead of scheduling to softirq/thread context if we have more than a
>minimal amount of work to do. Without at least understanding why
>softirq/thread degrades us so much this code seems like the wrong approach
>to me. Interrupt context will always be faster, but it is not a sufficient reason
>to spend as much time as possible there, is it?
>
>We should also keep in mind, that the networking stack has been doing this
>for years, I would try to understand why this cannot work for nvme before
>dismissing.
>
>> Especially, Long found that context
>> switch is increased a lot after applying your irq poll patch.
>>
>> https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Flists
>> .infradead.org%2Fpipermail%2Flinux-nvme%2F2019-
>August%2F026788.html&am
>>
>p;data=02%7C01%7Clongli%40microsoft.com%7C20391b0810844821325908d73
>59c
>>
>64d2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637036818140279
>742&a
>>
>mp;sdata=GyBWILwPvwHYvrTGSAVZbdl%2Fcoz3twSXe2DrH2t1MeQ%3D&am
>p;reserved
>> =0
>
>Oh, I didn't see that one, wonder why... thanks!
>
>5% improvement, I guess we can buy that for other users as is :)
>
>If we suffer from lots of context switches while the CPU is flooded with
>interrupts, then I would argue that we're re-raising softirq too much.
>In this use-case, my assumption is that the cpu cannot keep up with the
>interrupts and not that it doesn't reap enough (we also reap the first batch in
>interrupt context...)
>
>Perhaps making irqpoll continue until it must resched would improve things
>further? Although this is a latency vs. efficiency tradeoff, looks like
>MAX_SOFTIRQ_TIME is set to 2ms:
>
>"
>  * The MAX_SOFTIRQ_TIME provides a nice upper bound in most cases, but in
>  * certain cases, such as stop_machine(), jiffies may cease to
>  * increment and so we need the MAX_SOFTIRQ_RESTART limit as
>  * well to make sure we eventually return from this method.
>  *
>  * These limits have been established via experimentation.
>  * The two things to balance is latency against fairness -
>  * we want to handle softirqs as soon as possible, but they
>  * should not be able to lock up the box.
>"
>
>Long, does this patch make any difference?

Sagi,

Sorry it took a while to bring my system back online.

With the patch, the IOPS is about the same drop with the 1st patch. I think the excessive context switches are causing the drop in IOPS.

The following are captured by "perf sched record" for 30 seconds during tests.

"perf sched latency"
With patch:
  fio:(82)              | 937632.706 ms |  1782255 | avg:    0.209 ms | max:   63.123 ms | max at:    768.274023 s

without patch:
  fio:(82)              |2348323.432 ms |    18848 | avg:    0.295 ms | max:   28.446 ms | max at:   6447.310255 s

Look closer at each CPU, we can see ksoftirqd is competing CPU with fio (and effectively throttle other fio processes)
(captured in /sys/kernel/debug/tracing, echo sched:* >set_event)

On CPU1 with patch: (note that the prev_state for fio is "R", it's preemptively scheduled)
           <...>-4077  [001] d... 66456.805062: sched_switch: prev_comm=fio prev_pid=4077 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/1 next_pid=17 next_prio=120
           <...>-17    [001] d... 66456.805859: sched_switch: prev_comm=ksoftirqd/1 prev_pid=17 prev_prio=120 prev_state=S ==> next_comm=fio next_pid=4077 next_prio=120
           <...>-4077  [001] d... 66456.844049: sched_switch: prev_comm=fio prev_pid=4077 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/1 next_pid=17 next_prio=120
           <...>-17    [001] d... 66456.844607: sched_switch: prev_comm=ksoftirqd/1 prev_pid=17 prev_prio=120 prev_state=S ==> next_comm=fio next_pid=4077 next_prio=120

On CPU1 without patch: (the prev_state for fio is "S", it's voluntarily scheduled)
          <idle>-0     [001] d...  6725.392308: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=fio next_pid=14342 next_prio=120
             fio-14342 [001] d...  6725.392332: sched_switch: prev_comm=fio prev_pid=14342 prev_prio=120 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120
          <idle>-0     [001] d...  6725.392356: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=fio next_pid=14342 next_prio=120
             fio-14342 [001] d...  6725.392425: sched_switch: prev_comm=fio prev_pid=14342 prev_prio=120 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120

Thanks

Long

>--
>diff --git a/lib/irq_poll.c b/lib/irq_poll.c index 2f17b488d58e..d8eab563fa77
>100644
>--- a/lib/irq_poll.c
>+++ b/lib/irq_poll.c
>@@ -12,8 +12,6 @@
>  #include <linux/irq_poll.h>
>  #include <linux/delay.h>
>
>-static unsigned int irq_poll_budget __read_mostly = 256;
>-
>  static DEFINE_PER_CPU(struct list_head, blk_cpu_iopoll);
>
>  /**
>@@ -77,42 +75,29 @@ EXPORT_SYMBOL(irq_poll_complete);
>
>  static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
>  {
>-       struct list_head *list = this_cpu_ptr(&blk_cpu_iopoll);
>-       int rearm = 0, budget = irq_poll_budget;
>-       unsigned long start_time = jiffies;
>+       struct list_head *irqpoll_list = this_cpu_ptr(&blk_cpu_iopoll);
>+       LIST_HEAD(list);
>
>         local_irq_disable();
>+       list_splice_init(irqpoll_list, &list);
>+       local_irq_enable();
>
>-       while (!list_empty(list)) {
>+       while (!list_empty(&list)) {
>                 struct irq_poll *iop;
>                 int work, weight;
>
>-               /*
>-                * If softirq window is exhausted then punt.
>-                */
>-               if (budget <= 0 || time_after(jiffies, start_time)) {
>-                       rearm = 1;
>-                       break;
>-               }
>-
>-               local_irq_enable();
>-
>                 /* Even though interrupts have been re-enabled, this
>                  * access is safe because interrupts can only add new
>                  * entries to the tail of this list, and only ->poll()
>                  * calls can remove this head entry from the list.
>                  */
>-               iop = list_entry(list->next, struct irq_poll, list);
>+               iop = list_first_entry(&list, struct irq_poll, list);
>
>                 weight = iop->weight;
>                 work = 0;
>                 if (test_bit(IRQ_POLL_F_SCHED, &iop->state))
>                         work = iop->poll(iop, weight);
>
>-               budget -= work;
>-
>-               local_irq_disable();
>-
>                 /*
>                  * Drivers must not modify the iopoll state, if they
>                  * consume their assigned weight (or more, some drivers can't @@ -
>125,11 +110,21 @@ static void __latent_entropy irq_poll_softirq(struct
>softirq_action *h)
>                         if (test_bit(IRQ_POLL_F_DISABLE, &iop->state))
>                                 __irq_poll_complete(iop);
>                         else
>-                               list_move_tail(&iop->list, list);
>+                               list_move_tail(&iop->list, &list);
>                 }
>+
>+               /*
>+                * If softirq window is exhausted then punt.
>+                */
>+               if (need_resched())
>+                       break;
>         }
>
>-       if (rearm)
>+       local_irq_disable();
>+
>+       list_splice_tail_init(irqpoll_list, &list);
>+       list_splice(&list, irqpoll_list);
>+       if (!list_empty(irqpoll_list))
>                 __raise_softirq_irqoff(IRQ_POLL_SOFTIRQ);
>
>         local_irq_enable();
>--
>
>Reminder to the nvme side (slightly modified):
>--
>diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index
>52205f8d90b4..09dc6da67b05 100644
>--- a/drivers/nvme/host/pci.c
>+++ b/drivers/nvme/host/pci.c
>@@ -24,6 +24,7 @@
>  #include <linux/io-64-nonatomic-lo-hi.h>
>  #include <linux/sed-opal.h>
>  #include <linux/pci-p2pdma.h>
>+#include <linux/irq_poll.h>
>
>  #include "trace.h"
>  #include "nvme.h"
>@@ -32,6 +33,7 @@
>  #define CQ_SIZE(q)     ((q)->q_depth * sizeof(struct nvme_completion))
>
>  #define SGES_PER_PAGE  (PAGE_SIZE / sizeof(struct nvme_sgl_desc))
>+#define NVME_POLL_BUDGET_IRQ   256
>
>  /*
>   * These can be higher, but we need to ensure that any command doesn't
>@@ -189,6 +191,7 @@ struct nvme_queue {
>         u32 *dbbuf_cq_db;
>         u32 *dbbuf_sq_ei;
>         u32 *dbbuf_cq_ei;
>+       struct irq_poll iop;
>         struct completion delete_done;
>  };
>
>@@ -1014,11 +1017,29 @@ static inline int nvme_process_cq(struct
>nvme_queue *nvmeq, u16 *start,
>         return found;
>  }
>
>+static int nvme_irqpoll_handler(struct irq_poll *iop, int budget) {
>+       struct nvme_queue *nvmeq = container_of(iop, struct nvme_queue,
>iop);
>+       struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
>+       u16 start, end;
>+       int completed;
>+
>+       completed = nvme_process_cq(nvmeq, &start, &end, budget);
>+       nvme_complete_cqes(nvmeq, start, end);
>+       if (completed < budget) {
>+               irq_poll_complete(&nvmeq->iop);
>+               enable_irq(pci_irq_vector(pdev, nvmeq->cq_vector));
>+       }
>+
>+       return completed;
>+}
>+
>  static irqreturn_t nvme_irq(int irq, void *data)
>  {
>         struct nvme_queue *nvmeq = data;
>         irqreturn_t ret = IRQ_NONE;
>         u16 start, end;
>+       int budget = nvmeq->q_depth;
>
>         /*
>          * The rmb/wmb pair ensures we see all updates from a previous run of
>@@ -1027,13 +1048,23 @@ static irqreturn_t nvme_irq(int irq, void *data)
>         rmb();
>         if (nvmeq->cq_head != nvmeq->last_cq_head)
>                 ret = IRQ_HANDLED;
>-       nvme_process_cq(nvmeq, &start, &end, -1);
>+
>+       /* reap here up to a budget of the size the queue depth */
>+       do {
>+               budget -= nvme_process_cq(nvmeq, &start, &end, budget);
>+               if (start != end) {
>+                       nvme_complete_cqes(nvmeq, start, end);
>+                       ret = IRQ_HANDLED;
>+               }
>+       } while (start != end && budget > 0);
>+
>         nvmeq->last_cq_head = nvmeq->cq_head;
>         wmb();
>
>-       if (start != end) {
>-               nvme_complete_cqes(nvmeq, start, end);
>-               return IRQ_HANDLED;
>+       /* if we still have cqes to reap, schedule irqpoll */
>+       if (start != end && nvme_cqe_pending(nvmeq)) {
>+               disable_irq_nosync(irq);
>+               irq_poll_sched(&nvmeq->iop);
>         }
>
>         return ret;
>@@ -1346,6 +1377,7 @@ static enum blk_eh_timer_return
>nvme_timeout(struct request *req, bool reserved)
>
>  static void nvme_free_queue(struct nvme_queue *nvmeq)
>  {
>+       irq_poll_disable(&nvmeq->iop);
>         dma_free_coherent(nvmeq->dev->dev, CQ_SIZE(nvmeq),
>                                 (void *)nvmeq->cqes, nvmeq->cq_dma_addr);
>         if (!nvmeq->sq_cmds)
>@@ -1480,6 +1512,7 @@ static int nvme_alloc_queue(struct nvme_dev *dev,
>int qid, int depth)
>         nvmeq->dev = dev;
>         spin_lock_init(&nvmeq->sq_lock);
>         spin_lock_init(&nvmeq->cq_poll_lock);
>+       irq_poll_init(&nvmeq->iop, NVME_POLL_BUDGET_IRQ,
>nvme_irqpoll_handler);
>         nvmeq->cq_head = 0;
>         nvmeq->cq_phase = 1;
>         nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
>--

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-10  3:10                                             ` Sagi Grimberg
  2019-09-18  0:00                                               ` Long Li
@ 2019-09-18 14:37                                               ` Ming Lei
  2019-09-20 17:09                                                 ` Sagi Grimberg
  1 sibling, 1 reply; 64+ messages in thread
From: Ming Lei @ 2019-09-18 14:37 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Keith Busch, Hannes Reinecke, Daniel Lezcano, Bart Van Assche,
	linux-scsi, Peter Zijlstra, Long Li, John Garry, LKML,
	linux-nvme, Jens Axboe, Ingo Molnar, Thomas Gleixner,
	Christoph Hellwig

On Mon, Sep 09, 2019 at 08:10:07PM -0700, Sagi Grimberg wrote:
> Hey Ming,
> 
> > > > Ok, so the real problem is per-cpu bounded tasks.
> > > > 
> > > > I share Thomas opinion about a NAPI like approach.
> > > 
> > > We already have that, its irq_poll, but it seems that for this
> > > use-case, we get lower performance for some reason. I'm not
> > > entirely sure why that is, maybe its because we need to mask interrupts
> > > because we don't have an "arm" register in nvme like network devices
> > > have?
> > 
> > Long observed that IOPS drops much too by switching to threaded irq. If
> > softirqd is waken up for handing softirq, the performance shouldn't
> > be better than threaded irq.
> 
> Its true that it shouldn't be any faster, but what irqpoll already has
> and we don't need to reinvent is a proper budgeting mechanism that
> needs to occur when multiple devices map irq vectors to the same cpu
> core.
> 
> irqpoll already maintains a percpu list and dispatch the ->poll with
> a budget that the backend enforces and irqpoll multiplexes between them.
> Having this mechanism in irq (hard or threaded) context sounds
> unnecessary a bit.
> 
> It seems like we're attempting to stay in irq context for as long as we
> can instead of scheduling to softirq/thread context if we have more than
> a minimal amount of work to do. Without at least understanding why
> softirq/thread degrades us so much this code seems like the wrong
> approach to me. Interrupt context will always be faster, but it is
> not a sufficient reason to spend as much time as possible there, is it?

If extra latency is added in IO completion path, this latency will be
introduced in the submission path, because the hw queue depth is fixed,
which is often small. Especially in case of multiple submission vs.
single(shared) completion, the whole hw queue tags can be exhausted
easily. 

I guess no such effect for networking IO.

> 
> We should also keep in mind, that the networking stack has been doing
> this for years, I would try to understand why this cannot work for nvme
> before dismissing.

The above may be one reason.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-18 14:37                                               ` Ming Lei
@ 2019-09-20 17:09                                                 ` Sagi Grimberg
  0 siblings, 0 replies; 64+ messages in thread
From: Sagi Grimberg @ 2019-09-20 17:09 UTC (permalink / raw)
  To: Ming Lei
  Cc: Keith Busch, Hannes Reinecke, Daniel Lezcano, Bart Van Assche,
	linux-scsi, Peter Zijlstra, Long Li, John Garry, LKML,
	linux-nvme, Jens Axboe, Ingo Molnar, Thomas Gleixner,
	Christoph Hellwig


>> It seems like we're attempting to stay in irq context for as long as we
>> can instead of scheduling to softirq/thread context if we have more than
>> a minimal amount of work to do. Without at least understanding why
>> softirq/thread degrades us so much this code seems like the wrong
>> approach to me. Interrupt context will always be faster, but it is
>> not a sufficient reason to spend as much time as possible there, is it?
> 
> If extra latency is added in IO completion path, this latency will be
> introduced in the submission path, because the hw queue depth is fixed,
> which is often small. Especially in case of multiple submission vs.
> single(shared) completion, the whole hw queue tags can be exhausted
> easily.

This is why the patch is reaping the first batch from hard-irq, but only
if it has more will defer to softirq. So I'm not sure the short QD use
case applies here...

> I guess no such effect for networking IO.

Maybe, maybe not...

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-18  0:00                                               ` Long Li
@ 2019-09-20 17:14                                                 ` Sagi Grimberg
  2019-09-20 19:12                                                   ` Long Li
  0 siblings, 1 reply; 64+ messages in thread
From: Sagi Grimberg @ 2019-09-20 17:14 UTC (permalink / raw)
  To: Long Li, Ming Lei
  Cc: Jens Axboe, Hannes Reinecke, John Garry, Bart Van Assche,
	linux-scsi, Peter Zijlstra, Daniel Lezcano, LKML, linux-nvme,
	Keith Busch, Ingo Molnar, Thomas Gleixner, Christoph Hellwig


>> Hey Ming,
>>
>>>>> Ok, so the real problem is per-cpu bounded tasks.
>>>>>
>>>>> I share Thomas opinion about a NAPI like approach.
>>>>
>>>> We already have that, its irq_poll, but it seems that for this
>>>> use-case, we get lower performance for some reason. I'm not entirely
>>>> sure why that is, maybe its because we need to mask interrupts
>>>> because we don't have an "arm" register in nvme like network devices
>>>> have?
>>>
>>> Long observed that IOPS drops much too by switching to threaded irq.
>>> If softirqd is waken up for handing softirq, the performance shouldn't
>>> be better than threaded irq.
>>
>> Its true that it shouldn't be any faster, but what irqpoll already has and we
>> don't need to reinvent is a proper budgeting mechanism that needs to occur
>> when multiple devices map irq vectors to the same cpu core.
>>
>> irqpoll already maintains a percpu list and dispatch the ->poll with a budget
>> that the backend enforces and irqpoll multiplexes between them.
>> Having this mechanism in irq (hard or threaded) context sounds unnecessary a
>> bit.
>>
>> It seems like we're attempting to stay in irq context for as long as we can
>> instead of scheduling to softirq/thread context if we have more than a
>> minimal amount of work to do. Without at least understanding why
>> softirq/thread degrades us so much this code seems like the wrong approach
>> to me. Interrupt context will always be faster, but it is not a sufficient reason
>> to spend as much time as possible there, is it?
>>
>> We should also keep in mind, that the networking stack has been doing this
>> for years, I would try to understand why this cannot work for nvme before
>> dismissing.
>>
>>> Especially, Long found that context
>>> switch is increased a lot after applying your irq poll patch.
>>>
>>> https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Flists
>>> .infradead.org%2Fpipermail%2Flinux-nvme%2F2019-
>> August%2F026788.html&am
>>>
>> p;data=02%7C01%7Clongli%40microsoft.com%7C20391b0810844821325908d73
>> 59c
>>>
>> 64d2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637036818140279
>> 742&a
>>>
>> mp;sdata=GyBWILwPvwHYvrTGSAVZbdl%2Fcoz3twSXe2DrH2t1MeQ%3D&am
>> p;reserved
>>> =0
>>
>> Oh, I didn't see that one, wonder why... thanks!
>>
>> 5% improvement, I guess we can buy that for other users as is :)
>>
>> If we suffer from lots of context switches while the CPU is flooded with
>> interrupts, then I would argue that we're re-raising softirq too much.
>> In this use-case, my assumption is that the cpu cannot keep up with the
>> interrupts and not that it doesn't reap enough (we also reap the first batch in
>> interrupt context...)
>>
>> Perhaps making irqpoll continue until it must resched would improve things
>> further? Although this is a latency vs. efficiency tradeoff, looks like
>> MAX_SOFTIRQ_TIME is set to 2ms:
>>
>> "
>>   * The MAX_SOFTIRQ_TIME provides a nice upper bound in most cases, but in
>>   * certain cases, such as stop_machine(), jiffies may cease to
>>   * increment and so we need the MAX_SOFTIRQ_RESTART limit as
>>   * well to make sure we eventually return from this method.
>>   *
>>   * These limits have been established via experimentation.
>>   * The two things to balance is latency against fairness -
>>   * we want to handle softirqs as soon as possible, but they
>>   * should not be able to lock up the box.
>> "
>>
>> Long, does this patch make any difference?
> 
> Sagi,
> 
> Sorry it took a while to bring my system back online.
> 
> With the patch, the IOPS is about the same drop with the 1st patch. I think the excessive context switches are causing the drop in IOPS.
> 
> The following are captured by "perf sched record" for 30 seconds during tests.
> 
> "perf sched latency"
> With patch:
>    fio:(82)              | 937632.706 ms |  1782255 | avg:    0.209 ms | max:   63.123 ms | max at:    768.274023 s
> 
> without patch:
>    fio:(82)              |2348323.432 ms |    18848 | avg:    0.295 ms | max:   28.446 ms | max at:   6447.310255 s

Without patch means the proposed hard-irq patch?

If we are context switching too much, it means the soft-irq operation is
not efficient, not necessarily the fact that the completion path is
running in soft-irq..

Is your kernel compiled with full preemption or voluntary preemption?

> Look closer at each CPU, we can see ksoftirqd is competing CPU with fio (and effectively throttle other fio processes)
> (captured in /sys/kernel/debug/tracing, echo sched:* >set_event)
> 
> On CPU1 with patch: (note that the prev_state for fio is "R", it's preemptively scheduled)
>             <...>-4077  [001] d... 66456.805062: sched_switch: prev_comm=fio prev_pid=4077 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/1 next_pid=17 next_prio=120
>             <...>-17    [001] d... 66456.805859: sched_switch: prev_comm=ksoftirqd/1 prev_pid=17 prev_prio=120 prev_state=S ==> next_comm=fio next_pid=4077 next_prio=120
>             <...>-4077  [001] d... 66456.844049: sched_switch: prev_comm=fio prev_pid=4077 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/1 next_pid=17 next_prio=120
>             <...>-17    [001] d... 66456.844607: sched_switch: prev_comm=ksoftirqd/1 prev_pid=17 prev_prio=120 prev_state=S ==> next_comm=fio next_pid=4077 next_prio=120
> 
> On CPU1 without patch: (the prev_state for fio is "S", it's voluntarily scheduled)
>            <idle>-0     [001] d...  6725.392308: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=fio next_pid=14342 next_prio=120
>               fio-14342 [001] d...  6725.392332: sched_switch: prev_comm=fio prev_pid=14342 prev_prio=120 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120
>            <idle>-0     [001] d...  6725.392356: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=fio next_pid=14342 next_prio=120
>               fio-14342 [001] d...  6725.392425: sched_switch: prev_comm=fio prev_pid=14342 prev_prio=120 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=12

^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-20 17:14                                                 ` Sagi Grimberg
@ 2019-09-20 19:12                                                   ` Long Li
  2019-09-20 20:45                                                     ` Sagi Grimberg
  0 siblings, 1 reply; 64+ messages in thread
From: Long Li @ 2019-09-20 19:12 UTC (permalink / raw)
  To: Sagi Grimberg, Ming Lei
  Cc: Jens Axboe, Hannes Reinecke, John Garry, Bart Van Assche,
	linux-scsi, Peter Zijlstra, Daniel Lezcano, LKML, linux-nvme,
	Keith Busch, Ingo Molnar, Thomas Gleixner, Christoph Hellwig

> >> Long, does this patch make any difference?
> >
> > Sagi,
> >
> > Sorry it took a while to bring my system back online.
> >
> > With the patch, the IOPS is about the same drop with the 1st patch. I think
> the excessive context switches are causing the drop in IOPS.
> >
> > The following are captured by "perf sched record" for 30 seconds during
> tests.
> >
> > "perf sched latency"
> > With patch:
> >    fio:(82)              | 937632.706 ms |  1782255 | avg:    0.209 ms | max:   63.123
> ms | max at:    768.274023 s
> >
> > without patch:
> >    fio:(82)              |2348323.432 ms |    18848 | avg:    0.295 ms | max:   28.446
> ms | max at:   6447.310255 s
> 
> Without patch means the proposed hard-irq patch?

It means the current upstream code without any patch. But It's prone to soft lockup.

Ming's proposed hard-irq patch gets similar results to "without patch", however it fixes the soft lockup.

> 
> If we are context switching too much, it means the soft-irq operation is not
> efficient, not necessarily the fact that the completion path is running in soft-
> irq..
> 
> Is your kernel compiled with full preemption or voluntary preemption?

The tests are based on Ubuntu 18.04 kernel configuration. Here are the parameters:

# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set

> 
> > Look closer at each CPU, we can see ksoftirqd is competing CPU with
> > fio (and effectively throttle other fio processes) (captured in
> > /sys/kernel/debug/tracing, echo sched:* >set_event)
> >
> > On CPU1 with patch: (note that the prev_state for fio is "R", it's
> preemptively scheduled)
> >             <...>-4077  [001] d... 66456.805062: sched_switch: prev_comm=fio
> prev_pid=4077 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/1
> next_pid=17 next_prio=120
> >             <...>-17    [001] d... 66456.805859: sched_switch:
> prev_comm=ksoftirqd/1 prev_pid=17 prev_prio=120 prev_state=S ==>
> next_comm=fio next_pid=4077 next_prio=120
> >             <...>-4077  [001] d... 66456.844049: sched_switch: prev_comm=fio
> prev_pid=4077 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/1
> next_pid=17 next_prio=120
> >             <...>-17    [001] d... 66456.844607: sched_switch:
> prev_comm=ksoftirqd/1 prev_pid=17 prev_prio=120 prev_state=S ==>
> next_comm=fio next_pid=4077 next_prio=120
> >
> > On CPU1 without patch: (the prev_state for fio is "S", it's voluntarily
> scheduled)
> >            <idle>-0     [001] d...  6725.392308: sched_switch:
> prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==>
> next_comm=fio next_pid=14342 next_prio=120
> >               fio-14342 [001] d...  6725.392332: sched_switch: prev_comm=fio
> prev_pid=14342 prev_prio=120 prev_state=S ==> next_comm=swapper/1
> next_pid=0 next_prio=120
> >            <idle>-0     [001] d...  6725.392356: sched_switch:
> prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==>
> next_comm=fio next_pid=14342 next_prio=120
> >               fio-14342 [001] d...  6725.392425: sched_switch:
> > prev_comm=fio prev_pid=14342 prev_prio=120 prev_state=S ==>
> > next_comm=swapper/1 next_pid=0 next_prio=12

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
  2019-09-20 19:12                                                   ` Long Li
@ 2019-09-20 20:45                                                     ` Sagi Grimberg
  0 siblings, 0 replies; 64+ messages in thread
From: Sagi Grimberg @ 2019-09-20 20:45 UTC (permalink / raw)
  To: Long Li, Ming Lei
  Cc: Jens Axboe, Hannes Reinecke, John Garry, Bart Van Assche,
	linux-scsi, Peter Zijlstra, Daniel Lezcano, LKML, linux-nvme,
	Keith Busch, Ingo Molnar, Thomas Gleixner, Christoph Hellwig


>>> Sagi,
>>>
>>> Sorry it took a while to bring my system back online.
>>>
>>> With the patch, the IOPS is about the same drop with the 1st patch. I think
>> the excessive context switches are causing the drop in IOPS.
>>>
>>> The following are captured by "perf sched record" for 30 seconds during
>> tests.
>>>
>>> "perf sched latency"
>>> With patch:
>>>     fio:(82)              | 937632.706 ms |  1782255 | avg:    0.209 ms | max:   63.123
>> ms | max at:    768.274023 s
>>>
>>> without patch:
>>>     fio:(82)              |2348323.432 ms |    18848 | avg:    0.295 ms | max:   28.446
>> ms | max at:   6447.310255 s
>>
>> Without patch means the proposed hard-irq patch?
> 
> It means the current upstream code without any patch. But It's prone to soft lockup.
> 
> Ming's proposed hard-irq patch gets similar results to "without patch", however it fixes the soft lockup.

Thanks for the clarification.

The problem with what Ming is proposing in my mind (and its an existing
problem that exists today), is that nvme is taking precedence over
anything else until it absolutely cannot hog the cpu in hardirq.

In the thread Ming referenced a case where today if the cpu core has a
net softirq activity it cannot make forward progress. So with Ming's
suggestion, net softirq will eventually make progress, but it creates an
inherent fairness issue. Who said that nvme completions should come
faster then the net rx/tx or another I/O device (or hrtimers or sched
events...)?

As much as I'd like nvme to complete as soon as possible, I might have
other activities in the system that are as important if not more. So
I don't think we can solve this with something that is not cooperative
or fair with the rest of the system.

>> If we are context switching too much, it means the soft-irq operation is not
>> efficient, not necessarily the fact that the completion path is running in soft-
>> irq..
>>
>> Is your kernel compiled with full preemption or voluntary preemption?
> 
> The tests are based on Ubuntu 18.04 kernel configuration. Here are the parameters:
> 
> # CONFIG_PREEMPT_NONE is not set
> CONFIG_PREEMPT_VOLUNTARY=y
> # CONFIG_PREEMPT is not set

I see, so it still seems that irq_poll_softirq is still not efficient in
reaping completions. reaping the completions on its own is pretty much
the same in hard and soft irq, so its really the scheduling part that is
creating the overhead (which does not exist in hard irq).

Question:
when you test with without the patch (completions are coming in 
hard-irq), do the fio threads that run on the cpu cores that are 
assigned to the cores that are handling interrupts get substantially lower
throughput than the rest of the fio threads? I would expect that
the fio threads that are running on the first 32 cores to get very low
iops (overpowered by the nvme interrupts) and the rest doing much more
given that nvme has almost no limits to how much time it can spend on
processing completions.

If need_resched() is causing us to context switch too aggressively, does 
changing that to local_softirq_pending() make things better?
--
diff --git a/lib/irq_poll.c b/lib/irq_poll.c
index d8eab563fa77..05d524fcaf04 100644
--- a/lib/irq_poll.c
+++ b/lib/irq_poll.c
@@ -116,7 +116,7 @@ static void __latent_entropy irq_poll_softirq(struct 
softirq_action *h)
                 /*
                  * If softirq window is exhausted then punt.
                  */
-               if (need_resched())
+               if (local_softirq_pending())
                         break;
         }
--

Although, this can potentially cause other threads from making forward
progress.. If it is better, perhaps we also need a time limit as well.

Perhaps we should add statistics/tracing on how many completions we are
reaping per invocation...

^ permalink raw reply	[flat|nested] 64+ messages in thread

end of thread, back to index

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-27  8:53 [PATCH 0/4] genirq/nvme: add IRQF_RESCUE_THREAD for avoiding IRQ flood Ming Lei
2019-08-27  8:53 ` [PATCH 1/4] softirq: implement IRQ flood detection mechanism Ming Lei
2019-08-27 14:42   ` Thomas Gleixner
2019-08-27 16:19     ` Thomas Gleixner
2019-08-27 23:04       ` Ming Lei
2019-08-27 23:12         ` Thomas Gleixner
2019-08-27 22:58     ` Ming Lei
2019-08-27 23:09       ` Thomas Gleixner
2019-08-28 11:06         ` Ming Lei
2019-08-28 11:23           ` Thomas Gleixner
2019-08-28 13:50             ` Ming Lei
2019-08-28 14:07               ` Thomas Gleixner
2019-09-03  3:30                 ` Ming Lei
2019-09-03  5:59                   ` Daniel Lezcano
2019-09-03  6:31                     ` Ming Lei
2019-09-03  6:40                       ` Daniel Lezcano
2019-09-03  7:28                         ` Ming Lei
2019-09-03  7:50                           ` Daniel Lezcano
2019-09-03  9:30                             ` Ming Lei
2019-09-04 17:07                             ` Bart Van Assche
2019-09-04 17:31                               ` Daniel Lezcano
2019-09-04 17:38                                 ` Bart Van Assche
2019-09-04 18:02                                   ` Peter Zijlstra
2019-09-04 19:47                                     ` Bart Van Assche
2019-09-05  9:11                                       ` Ming Lei
2019-09-05  9:06                                 ` Ming Lei
2019-09-05 10:37                                   ` Daniel Lezcano
2019-09-06  1:22                                     ` Long Li
2019-09-06  4:36                                       ` Daniel Lezcano
2019-09-06  4:44                                         ` Long Li
2019-09-06  1:48                                     ` Ming Lei
2019-09-06  5:14                                       ` Daniel Lezcano
2019-09-06 18:30                                         ` Sagi Grimberg
2019-09-06 18:52                                           ` Keith Busch
2019-09-07  0:01                                           ` Ming Lei
2019-09-10  3:10                                             ` Sagi Grimberg
2019-09-18  0:00                                               ` Long Li
2019-09-20 17:14                                                 ` Sagi Grimberg
2019-09-20 19:12                                                   ` Long Li
2019-09-20 20:45                                                     ` Sagi Grimberg
2019-09-18 14:37                                               ` Ming Lei
2019-09-20 17:09                                                 ` Sagi Grimberg
2019-09-06 14:18                                       ` Keith Busch
2019-09-06 17:50                                         ` Long Li
2019-09-06 22:19                                           ` Ming Lei
2019-09-06 22:25                                             ` Keith Busch
2019-09-06 23:13                                               ` Ming Lei
2019-09-10  0:24                                             ` Ming Lei
2019-09-03  8:09                           ` Thomas Gleixner
2019-09-03  9:24                             ` Ming Lei
2019-08-29  6:15   ` Long Li
2019-08-30  0:55     ` Ming Lei
2019-08-27  8:53 ` [PATCH 2/4] genirq: add IRQF_RESCUE_THREAD Ming Lei
2019-08-27  8:53 ` [PATCH 3/4] nvme: pci: pass IRQF_RESCURE_THREAD to request_threaded_irq Ming Lei
2019-08-27  9:06   ` Johannes Thumshirn
2019-08-27  9:09     ` Ming Lei
2019-08-27  9:12       ` Johannes Thumshirn
2019-08-27 14:34       ` Keith Busch
2019-08-27 14:44         ` Keith Busch
2019-08-27 15:10   ` Bart Van Assche
2019-08-28  1:45     ` Ming Lei
2019-08-27  8:53 ` [PATCH 4/4] genirq: use irq's affinity for threaded irq with IRQF_RESCUE_THREAD Ming Lei
2019-08-27 14:35   ` Keith Busch
2019-09-06  8:50   ` John Garry

Linux-SCSI Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-scsi/0 linux-scsi/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-scsi linux-scsi/ https://lore.kernel.org/linux-scsi \
		linux-scsi@vger.kernel.org linux-scsi@archiver.kernel.org
	public-inbox-index linux-scsi


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-scsi


AGPL code for this site: git clone https://public-inbox.org/ public-inbox