linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/5] genirq: threadable IRQ support
@ 2016-06-15 13:42 Paolo Abeni
  2016-06-15 13:42 ` [PATCH 1/5] genirq: implement support for runtime switch to threaded irqs Paolo Abeni
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Paolo Abeni @ 2016-06-15 13:42 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, David S. Miller, Eric Dumazet, Steven Rostedt,
	Peter Zijlstra (Intel),
	Ingo Molnar, Hannes Frederic Sowa, netdev

This patch series adds a new genirq interface that allows user space to change
the IRQ mode at runtime, switching to and from threaded mode.

The configuration is performed on a per-irqaction basis, writing into the
newly added procfs entry /proc/irq/<nr>/<irq action name>/threaded. Such entry
is created at IRQ request time, only if CONFIG_IRQ_FORCED_THREADING
is defined.
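
As an illustration, with a hypothetical irq number and action name (pick
real ones from /proc/interrupts on the target system), the switch looks
like this:

```shell
# Hypothetical irq number and action name; take real values from /proc/interrupts.
irq=30
name=eth0-rx-0
path="/proc/irq/${irq}/${name}/threaded"
echo "$path"
# As root, with CONFIG_IRQ_FORCED_THREADING=y, switch to threaded mode:
#   echo 1 > "$path"
# ...and back to normal (softirq) mode:
#   echo 0 > "$path"
```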

Upon IRQ creation, the device handling such IRQ may optionally provide, via
the newly added API irq_set_mode_notifier(), an additional callback to be
notified about IRQ mode changes.
The device can use such callback to configure its internal state to behave
differently in threaded mode and in normal mode, if required.

Additional IRQ flags are added to let the device specify some default
aspects of the IRQ thread. The device can request a SCHED_NORMAL scheduling
policy and opt out of the affinity setting for the IRQ thread. Both of these
options are beneficial for the first threadable IRQ user.

The initial user of this feature is the networking subsystem; some
infrastructure is added to the network core for this goal. A new napi field
storing an IRQ thread reference is used to mark a NAPI instance as threaded,
and __napi_schedule() is modified to invoke the poll loop directly, instead
of raising a softirq, when the related NAPI instance is in threaded mode.
Additionally, an irq mode change callback is provided to notify the NAPI
instance of IRQ mode changes.

Each network device driver must be migrated explicitly to leverage the new
infrastructure. In this patch series, the Intel ixgbe driver is updated to
invoke irq_set_mode_notifier(), only when using MSI-X IRQs.
This prevents other IRQ events from being delayed indefinitely when the rx
IRQ is processed in threaded mode. The default behavior after the driver
migration is unchanged.

Running the rx packet processing inside a conventional kthread is beneficial
for different workloads, since it allows the process scheduler to make good
use of the available resources. With multiqueue NICs, the ksoftirqd design
does not allow any running process to use 100% of a single CPU under
relevant network load, because the softirq poll loop will be scheduled on
each CPU.

The above can be experienced in a hypervisor/VMs scenario, when the guest is
under UDP flood. If the hypervisor's NIC has enough rx queues, the guest will
compete with ksoftirqd on each CPU. Moreover, since the ksoftirqd CPU
utilization changes with the ingress traffic, the scheduler tries to migrate
the guest processes towards the CPUs with the highest capacity, further
impacting the guest's ability to process rx packets.

Running the hypervisor rx packet processing inside a migratable kthread
allows the process scheduler to let the guest processes fully use a single
core each, migrating some rx threads as required.

The raw numbers, obtained with the netperf UDP_STREAM test, using a tun
device with a noqueue qdisc in the hypervisor, and using random IP addresses
as source in case of multiple flows, are as follows:

		vanilla		threaded
size/flow	kpps		kpps/delta
1/1		824		843/+2%
1/25		736		906/+23%
1/50		752		906/+20%
1/100		772		906/+17%
1/200		741		976/+31%
64/1		829		840/+1%
64/25		711		932/+31%
64/50		780		894/+14%
64/100		754		946/+25%
64/200		714		945/+32%
256/1		702		510/-27%
256/25		724		894/+23%
256/50		739		889/+20%
256/100		798		873/+9%
256/200		812		907/+11%
1400/1		720		727/+1%
1400/25		826		826/0
1400/50		827		833/0
1400/100	820		820/0
1400/200	796		799/0

The guest runs 2 vCPUs, so it is not prone to the userspace livelock issue
recently exposed here: http://thread.gmane.org/gmane.linux.kernel/2218719
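
For reference, each multi-flow data point above corresponds to running a
number of concurrent netperf UDP_STREAM flows with a given payload size.
The sketch below only prints the command lines it would launch; the peer
address is a placeholder, and the random source addresses used in the
multi-flow tests require extra setup not shown here:

```shell
# Print the netperf invocations for one data point (e.g. size=64, flows=3).
# 192.168.100.2 is a placeholder guest address; in a real run, background
# each command with '&' and 'wait' for all of them.
flows=3
size=64
for i in $(seq "$flows"); do
    echo "netperf -H 192.168.100.2 -t UDP_STREAM -l 30 -- -m $size"
done
```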

There are relevant improvements in all CPU-bound scenarios with multiple
flows, and a significant regression with medium-size packets and a single
flow. The latter is due to the increased 'burstiness' of packet processing,
which causes the single socket in the guest to overflow more easily if the
receiver application is scheduled on the same CPU processing the incoming
packets.

The kthread approach should give several new advantages over the
softirq-based approach:

* moving towards a more DPDK-like busy-poll packet processing direction:
we can even use busy polling without the need for a connected UDP or TCP
socket, and can leverage busy polling for forwarding setups. This could
very well improve latency and packet throughput without hurting other
processes, if the networking stack gets more and more preemptive in the
future.

* the possibility to acquire mutexes in the network processing path: e.g.
we would need that to configure hw_breakpoints if we want to add
watchpoints in memory based on some rules in the kernel

* more and better tooling to adjust the weight of the networking
kthreads, preferring certain network cards or setting CPU affinity
on packet processing threads. Using deadline scheduling or other
scheduler features might also be worthwhile.

* scheduler statistics can be used to observe network packet processing
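
As a sketch of the last two points: a threaded irq is just a kthread, so
the standard process tools apply directly. The irq number and action name
below are hypothetical; thread names follow the kernel's "irq/<nr>-<name>"
convention:

```shell
# Build the comm name of a (hypothetical) threaded irq kthread, "irq/30-eth0".
irq=30
comm=$(printf 'irq/%d-%s' "$irq" eth0)
echo "$comm"
# Once such a thread exists, standard tools can steer and observe it, e.g.:
#   pid=$(pgrep -x "$comm")
#   taskset -pc 2 "$pid"          # pin packet processing to CPU 2
#   chrt -o -p 0 "$pid"           # keep it under SCHED_NORMAL
#   cat "/proc/$pid/schedstat"    # observe it via scheduler statistics
```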



Paolo Abeni (5):
  genirq: implement support for runtime switch to threaded irqs
  genirq: add flags for controlling the default threaded irq behavior
  sched/preempt: cond_resched_softirq() must check for softirq
  netdev: implement infrastructure for threadable napi irq
  ixgbe: add support for threadable rx irq

 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  14 +-
 include/linux/interrupt.h                     |  21 +++
 include/linux/netdevice.h                     |   4 +
 kernel/irq/internals.h                        |   3 +
 kernel/irq/manage.c                           | 212 ++++++++++++++++++++++++--
 kernel/irq/proc.c                             |  51 +++++++
 kernel/sched/core.c                           |   3 +-
 net/core/dev.c                                |  59 +++++++
 8 files changed, 355 insertions(+), 12 deletions(-)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/5] genirq: implement support for runtime switch to threaded irqs
  2016-06-15 13:42 [PATCH 0/5] genirq: threadable IRQ support Paolo Abeni
@ 2016-06-15 13:42 ` Paolo Abeni
  2016-06-15 14:50   ` kbuild test robot
  2016-06-15 13:42 ` [PATCH 2/5] genirq: add flags for controlling the default threaded irq behavior Paolo Abeni
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Paolo Abeni @ 2016-06-15 13:42 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, David S. Miller, Eric Dumazet, Steven Rostedt,
	Peter Zijlstra (Intel),
	Ingo Molnar, Hannes Frederic Sowa, netdev

When the IRQ_FORCED_THREADING compile option is enabled, a new
'threaded' procfs entry is added under the action proc
directory upon irq request. Writing a true value into
that file will cause the underlying action to be reconfigured
in FORCE_THREADED mode.

The reconfiguration is performed by disabling the irq underlying
the current action, and then updating the action struct to the
specified mode, i.e. setting the thread field and the
IRQTF_FORCED_THREAD flag.

If an error occurs before notifying the device, the
irq action is left unmodified.

A device that wants to be notified about irq mode changes
can register a notifier with irq_set_mode_notifier(). Such
notifier will be invoked in atomic context just after each
irq reconfiguration.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 include/linux/interrupt.h |  15 ++++
 kernel/irq/internals.h    |   3 +
 kernel/irq/manage.c       | 197 ++++++++++++++++++++++++++++++++++++++++++++--
 kernel/irq/proc.c         |  51 ++++++++++++
 4 files changed, 261 insertions(+), 5 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 9fcabeb..85d3738 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -90,6 +90,7 @@ enum {
 };
 
 typedef irqreturn_t (*irq_handler_t)(int, void *);
+typedef void (*mode_notifier_t)(int, void *, struct task_struct *);
 
 /**
  * struct irqaction - per interrupt action descriptor
@@ -106,6 +107,8 @@ typedef irqreturn_t (*irq_handler_t)(int, void *);
  * @thread_flags:	flags related to @thread
  * @thread_mask:	bitmask for keeping track of @thread activity
  * @dir:	pointer to the proc/irq/NN/name entry
+ * @mode_notifier:	callback to notify the device about irq mode change
+ *		(threaded vs normal mode)
  */
 struct irqaction {
 	irq_handler_t		handler;
@@ -121,6 +124,7 @@ struct irqaction {
 	unsigned long		thread_mask;
 	const char		*name;
 	struct proc_dir_entry	*dir;
+	mode_notifier_t		mode_notifier;
 } ____cacheline_internodealigned_in_smp;
 
 extern irqreturn_t no_action(int cpl, void *dev_id);
@@ -212,6 +216,17 @@ extern void irq_wake_thread(unsigned int irq, void *dev_id);
 extern void suspend_device_irqs(void);
 extern void resume_device_irqs(void);
 
+#ifdef CONFIG_IRQ_FORCED_THREADING
+extern int irq_set_mode_notifier(unsigned int irq, void *dev_id,
+				 mode_notifier_t notifier);
+#else
+static inline int
+irq_set_mode_notifier(unsigned int irq, void *dev_id, mode_notifier_t notifier)
+{
+	return 0;
+}
+#endif
+
 /**
  * struct irq_affinity_notify - context for notification of IRQ affinity changes
  * @irq:		Interrupt to which notification applies
diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index 09be2c9..841c714 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -105,6 +105,9 @@ static inline void unregister_handler_proc(unsigned int irq,
 					   struct irqaction *action) { }
 #endif
 
+extern int irq_reconfigure(unsigned int irq, struct irqaction *act,
+			   bool threaded);
+
 extern int irq_select_affinity_usr(unsigned int irq, struct cpumask *mask);
 
 extern void irq_set_thread_affinity(struct irq_desc *desc);
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index ef0bc02..cce4efd 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -938,8 +938,7 @@ static int irq_thread(void *data)
 	irqreturn_t (*handler_fn)(struct irq_desc *desc,
 			struct irqaction *action);
 
-	if (force_irqthreads && test_bit(IRQTF_FORCED_THREAD,
-					&action->thread_flags))
+	if (test_bit(IRQTF_FORCED_THREAD, &action->thread_flags))
 		handler_fn = irq_forced_thread_fn;
 	else
 		handler_fn = irq_thread_fn;
@@ -1052,8 +1051,8 @@ static void irq_release_resources(struct irq_desc *desc)
 		c->irq_release_resources(d);
 }
 
-static int
-setup_irq_thread(struct irqaction *new, unsigned int irq, bool secondary)
+static struct task_struct *
+create_irq_thread(struct irqaction *new, unsigned int irq, bool secondary)
 {
 	struct task_struct *t;
 	struct sched_param param = {
@@ -1070,7 +1069,7 @@ setup_irq_thread(struct irqaction *new, unsigned int irq, bool secondary)
 	}
 
 	if (IS_ERR(t))
-		return PTR_ERR(t);
+		return t;
 
 	sched_setscheduler_nocheck(t, SCHED_FIFO, &param);
 
@@ -1080,6 +1079,17 @@ setup_irq_thread(struct irqaction *new, unsigned int irq, bool secondary)
 	 * references an already freed task_struct.
 	 */
 	get_task_struct(t);
+	return t;
+}
+
+static int
+setup_irq_thread(struct irqaction *new, unsigned int irq, bool secondary)
+{
+	struct task_struct *t = create_irq_thread(new, irq, secondary);
+
+	if (IS_ERR(t))
+		return PTR_ERR(t);
+
 	new->thread = t;
 	/*
 	 * Tell the thread to set its affinity. This is
@@ -1511,6 +1521,183 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
 	return action;
 }
 
+#ifdef CONFIG_IRQ_FORCED_THREADING
+/*
+ * Internal function to reconfigure an irqaction - change it to
+ * threaded mode if the specified task struct is not NULL and vice versa
+ */
+void __irq_reconfigure_action(struct irq_desc *desc, struct irqaction *action,
+			      struct task_struct *t)
+{
+	action->flags &= ~IRQF_ONESHOT;
+	action->thread_mask = 0;
+	if (!t) {
+		if (action->thread_fn) {
+			action->handler = action->thread_fn;
+			action->thread_fn = NULL;
+		}
+		clear_bit(IRQTF_FORCED_THREAD, &action->thread_flags);
+		action->thread = NULL;
+		return;
+	}
+
+	/* Force the irq in threaded mode */
+	if (!action->thread_fn) {
+		action->thread_fn = action->handler;
+		action->handler = irq_default_primary_handler;
+	}
+
+	action->thread = t;
+	set_bit(IRQTF_FORCED_THREAD, &action->thread_flags);
+	set_bit(IRQTF_AFFINITY, &action->thread_flags);
+
+	if (!(desc->irq_data.chip->flags & IRQCHIP_ONESHOT_SAFE)) {
+		/*
+		 * We already ensured no other action is registered on
+		 * this irq
+		 */
+		action->thread_mask = 1;
+		action->flags |= IRQF_ONESHOT;
+		desc->istate |= IRQS_ONESHOT;
+	}
+}
+
+/* Internal function to check if the specified irqaction can be threadable */
+static bool __irq_check_threadable(struct irq_desc *desc, struct irqaction *act)
+{
+	if (irq_settings_is_nested_thread(desc) ||
+	    !irq_settings_can_thread(desc))
+		return false;
+
+	/*
+	 * Enabling thread mode is going to set IRQF_ONESHOT, unless the irq
+	 * chip will help us; in the first case the irq can't be shared: the
+	 * only registered action can be the current one
+	 */
+	if (desc->irq_data.chip->flags & IRQCHIP_ONESHOT_SAFE)
+		return true;
+	return !desc->action ||
+	       (act && desc->action == act && act->next == NULL);
+}
+
+/* Internal function to configure the specified action threaded mode */
+int irq_reconfigure(unsigned int irq, struct irqaction *act, bool threaded)
+{
+	struct task_struct *thread = NULL, *old_thread = NULL;
+	struct irq_desc *desc = irq_to_desc(irq);
+	struct irqaction *action;
+	int retval = -EINVAL;
+	unsigned long flags;
+
+	/*
+	 * Preallocate the kthread, so that we can update the action atomically
+	 * later
+	 */
+	if (threaded) {
+		old_thread = thread = create_irq_thread(act, irq, false);
+		if (IS_ERR(thread))
+			return PTR_ERR(thread);
+	}
+
+	disable_irq(irq);
+
+	chip_bus_lock(desc);
+	raw_spin_lock_irqsave(&desc->lock, flags);
+
+	/* Check for no-op under lock */
+	if (threaded == test_bit(IRQTF_FORCED_THREAD, &act->thread_flags))
+		goto unlock;
+
+	/* Even more pedantic check: look-up for our action */
+	for_each_action_of_desc(desc, action)
+		if (action->dev_id == act->dev_id)
+			break;
+	if (!action || action != act)
+		goto unlock;
+
+	/*
+	 * Check again for threadable constraints: the action list/desc
+	 * can have changed since the irq_set_threadable call
+	 */
+	if (!__irq_check_threadable(desc, action))
+		goto unlock;
+
+	old_thread = action->thread;
+	__irq_reconfigure_action(desc, action, thread);
+
+	if (action->mode_notifier)
+		action->mode_notifier(action->irq, action->dev_id, thread);
+	retval = 0;
+
+unlock:
+	raw_spin_unlock_irqrestore(&desc->lock, flags);
+	chip_bus_sync_unlock(desc);
+
+	if (old_thread) {
+		kthread_stop(old_thread);
+		put_task_struct(old_thread);
+	}
+
+	if (retval)
+		pr_err("can't change configuration for irq %d: %d\n", irq,
+		       retval);
+
+	enable_irq(irq);
+	return retval;
+}
+
+/**
+ *	irq_set_mode_notifier - register a mode change notifier
+ *	@irq: Interrupt line
+ *	@dev_id: The cookie used to identify the irq handler and passed back
+ *		 to the notifier
+ *	@mode_notifier: The callback to be registered
+ *
+ *	This call registers a callback to notify the device about irq mode
+ *	changes (threaded/normal mode). Mode changes are triggered by writing
+ *	to the 'threaded' procfs entry.
+ *	When running in threaded mode the irq thread task struct will be passed
+ *	to the notifier, or NULL otherwise. It's up to the device to update its
+ *	internal state accordingly.
+ */
+int irq_set_mode_notifier(unsigned int irq, void *dev_id,
+			  mode_notifier_t notifier)
+{
+	struct irq_desc *desc = irq_to_desc(irq);
+	struct irqaction *action;
+	unsigned long flags;
+	int ret = -EINVAL;
+
+	if (!desc)
+		return ret;
+
+	chip_bus_lock(desc);
+	raw_spin_lock_irqsave(&desc->lock, flags);
+
+	for_each_action_of_desc(desc, action)
+		if (action->dev_id == dev_id)
+			break;
+
+	if (!action || action->mode_notifier)
+		goto out;
+
+	/*
+	 * Sync current status, so that the device is fine if the irq has been
+	 * reconfigured before the notifier is registered
+	 */
+	action->mode_notifier = notifier;
+	if (notifier)
+		notifier(action->irq, action->dev_id, action->thread);
+	ret = 0;
+
+out:
+	raw_spin_unlock_irqrestore(&desc->lock, flags);
+	chip_bus_sync_unlock(desc);
+	return ret;
+}
+EXPORT_SYMBOL(irq_set_mode_notifier);
+#endif
+
 /**
  *	remove_irq - free an interrupt
  *	@irq: Interrupt line to free
diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c
index 4e1b947..01f155b 100644
--- a/kernel/irq/proc.c
+++ b/kernel/irq/proc.c
@@ -302,6 +302,51 @@ static int name_unique(unsigned int irq, struct irqaction *new_action)
 	return ret;
 }
 
+#ifdef CONFIG_IRQ_FORCED_THREADING
+static int irqaction_threaded_proc_show(struct seq_file *m, void *v)
+{
+	struct irqaction *action = (struct irqaction *)m->private;
+
+	seq_printf(m, "%d\n",
+		   test_bit(IRQTF_FORCED_THREAD, &action->thread_flags));
+	return 0;
+}
+
+static int irqaction_threaded_proc_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, irqaction_threaded_proc_show, PDE_DATA(inode));
+}
+
+static ssize_t irqaction_threaded_proc_write(struct file *file,
+					     const char __user *buf,
+					     size_t count, loff_t *offs)
+{
+	struct irqaction *action = PDE_DATA(file_inode(file));
+	bool threaded;
+	int ret;
+
+	ret = kstrtobool_from_user(buf, count, &threaded);
+	if (ret)
+		return ret;
+
+	if (threaded == test_bit(IRQTF_FORCED_THREAD, &action->thread_flags))
+		goto out;
+
+	irq_reconfigure(action->irq, action, threaded);
+
+out:
+	return count;
+}
+
+static const struct file_operations irqaction_threaded_proc_fops = {
+	.open		= irqaction_threaded_proc_open,
+	.read		= seq_read,
+	.write		= irqaction_threaded_proc_write,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+#endif
+
 void register_handler_proc(unsigned int irq, struct irqaction *action)
 {
 	char name [MAX_NAMELEN];
@@ -316,8 +361,14 @@ void register_handler_proc(unsigned int irq, struct irqaction *action)
 
 	/* create /proc/irq/1234/handler/ */
 	action->dir = proc_mkdir(name, desc->dir);
+#ifdef CONFIG_IRQ_FORCED_THREADING
+	if (action->dir)
+		proc_create_data("threaded", 0644, action->dir,
+				 &irqaction_threaded_proc_fops, (void *)action);
+#endif
 }
 
+
 #undef MAX_NAMELEN
 
 #define MAX_NAMELEN 10
-- 
1.8.3.1


* [PATCH 2/5] genirq: add flags for controlling the default threaded irq behavior
  2016-06-15 13:42 [PATCH 0/5] genirq: threadable IRQ support Paolo Abeni
  2016-06-15 13:42 ` [PATCH 1/5] genirq: implement support for runtime switch to threaded irqs Paolo Abeni
@ 2016-06-15 13:42 ` Paolo Abeni
  2016-06-15 13:42 ` [PATCH 3/5] sched/preempt: cond_resched_softirq() must check for softirq Paolo Abeni
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 18+ messages in thread
From: Paolo Abeni @ 2016-06-15 13:42 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, David S. Miller, Eric Dumazet, Steven Rostedt,
	Peter Zijlstra (Intel),
	Ingo Molnar, Hannes Frederic Sowa, netdev

A threadable irq can benefit from irq_set_affinity() when running
in non-threaded mode, and prefer running unbound from any CPU when in
threaded mode. Setting the IRQF_TH_NO_AFFINITY flag at irq
registration allows the irq to achieve both behaviors.

A long-running threaded irq can starve the system if scheduled under
SCHED_FIFO. Setting the IRQF_TH_SCHED_NORMAL flag on the irq will cause
the irq thread to run by default under the SCHED_NORMAL scheduler.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 include/linux/interrupt.h |  6 ++++++
 kernel/irq/manage.c       | 17 +++++++++++------
 2 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 85d3738..33c3033 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -61,6 +61,10 @@
  *                interrupt handler after suspending interrupts. For system
  *                wakeup devices users need to implement wakeup detection in
  *                their interrupt handlers.
+ * IRQF_TH_SCHED_NORMAL - If the IRQ is threaded, it will use SCHED_NORMAL,
+ *                instead of the default SCHED_FIFO scheduler
+ * IRQF_TH_NO_AFFINITY - If the IRQ is threaded, the affinity hint will not be
+ *                enforced in the IRQ thread
  */
 #define IRQF_SHARED		0x00000080
 #define IRQF_PROBE_SHARED	0x00000100
@@ -74,6 +78,8 @@
 #define IRQF_NO_THREAD		0x00010000
 #define IRQF_EARLY_RESUME	0x00020000
 #define IRQF_COND_SUSPEND	0x00040000
+#define IRQF_TH_SCHED_NORMAL	0x00080000
+#define IRQF_TH_NO_AFFINITY	0x00100000
 
 #define IRQF_TIMER		(__IRQF_TIMER | IRQF_NO_SUSPEND | IRQF_NO_THREAD)
 
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index cce4efd..d695e12 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1055,9 +1055,7 @@ static struct task_struct *
 create_irq_thread(struct irqaction *new, unsigned int irq, bool secondary)
 {
 	struct task_struct *t;
-	struct sched_param param = {
-		.sched_priority = MAX_USER_RT_PRIO/2,
-	};
+	struct sched_param param;
 
 	if (!secondary) {
 		t = kthread_create(irq_thread, new, "irq/%d-%s", irq,
@@ -1071,7 +1069,12 @@ create_irq_thread(struct irqaction *new, unsigned int irq, bool secondary)
 	if (IS_ERR(t))
 		return t;
 
-	sched_setscheduler_nocheck(t, SCHED_FIFO, &param);
+	if (new->flags & IRQF_TH_SCHED_NORMAL) {
+		sched_setscheduler_nocheck(t, SCHED_NORMAL, &param);
+	} else {
+		param.sched_priority = MAX_USER_RT_PRIO/2;
+		sched_setscheduler_nocheck(t, SCHED_FIFO, &param);
+	}
 
 	/*
 	 * We keep the reference to the task struct even if
@@ -1100,7 +1103,8 @@ setup_irq_thread(struct irqaction *new, unsigned int irq, bool secondary)
 	 * correct as we want the thread to move to the cpu(s)
 	 * on which the requesting code placed the interrupt.
 	 */
-	set_bit(IRQTF_AFFINITY, &new->thread_flags);
+	if (!(new->flags & IRQF_TH_NO_AFFINITY))
+		set_bit(IRQTF_AFFINITY, &new->thread_flags);
 	return 0;
 }
 
@@ -1549,7 +1553,8 @@ void __irq_reconfigure_action(struct irq_desc *desc, struct irqaction *action,
 
 	action->thread = t;
 	set_bit(IRQTF_FORCED_THREAD, &action->thread_flags);
-	set_bit(IRQTF_AFFINITY, &action->thread_flags);
+	if (!(action->flags & IRQF_TH_NO_AFFINITY))
+		set_bit(IRQTF_AFFINITY, &action->thread_flags);
 
 	if (!(desc->irq_data.chip->flags & IRQCHIP_ONESHOT_SAFE)) {
 		/*
-- 
1.8.3.1


* [PATCH 3/5] sched/preempt: cond_resched_softirq() must check for softirq
  2016-06-15 13:42 [PATCH 0/5] genirq: threadable IRQ support Paolo Abeni
  2016-06-15 13:42 ` [PATCH 1/5] genirq: implement support for runtime switch to threaded irqs Paolo Abeni
  2016-06-15 13:42 ` [PATCH 2/5] genirq: add flags for controlling the default threaded irq behavior Paolo Abeni
@ 2016-06-15 13:42 ` Paolo Abeni
  2016-06-15 13:48   ` Peter Zijlstra
  2016-06-15 13:42 ` [PATCH 4/5] netdev: implement infrastructure for threadable napi irq Paolo Abeni
  2016-06-15 13:42 ` [PATCH 5/5] ixgbe: add support for threadable rx irq Paolo Abeni
  4 siblings, 1 reply; 18+ messages in thread
From: Paolo Abeni @ 2016-06-15 13:42 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, David S. Miller, Eric Dumazet, Steven Rostedt,
	Peter Zijlstra (Intel),
	Ingo Molnar, Hannes Frederic Sowa, netdev

Currently cond_resched_softirq() fails to reschedule if there
are pending softirqs but no other runnable process. This happens
e.g. when receiving an interrupt with local bh disabled.

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 kernel/sched/core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f2cae4..788625f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4837,7 +4837,8 @@ int __sched __cond_resched_softirq(void)
 {
 	BUG_ON(!in_softirq());
 
-	if (should_resched(SOFTIRQ_DISABLE_OFFSET)) {
+	if (should_resched(SOFTIRQ_DISABLE_OFFSET) ||
+	    local_softirq_pending()) {
 		local_bh_enable();
 		preempt_schedule_common();
 		local_bh_disable();
-- 
1.8.3.1


* [PATCH 4/5] netdev: implement infrastructure for threadable napi irq
  2016-06-15 13:42 [PATCH 0/5] genirq: threadable IRQ support Paolo Abeni
                   ` (2 preceding siblings ...)
  2016-06-15 13:42 ` [PATCH 3/5] sched/preempt: cond_resched_softirq() must check for softirq Paolo Abeni
@ 2016-06-15 13:42 ` Paolo Abeni
  2016-06-15 14:12   ` kbuild test robot
  2016-06-15 14:17   ` Eric Dumazet
  2016-06-15 13:42 ` [PATCH 5/5] ixgbe: add support for threadable rx irq Paolo Abeni
  4 siblings, 2 replies; 18+ messages in thread
From: Paolo Abeni @ 2016-06-15 13:42 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, David S. Miller, Eric Dumazet, Steven Rostedt,
	Peter Zijlstra (Intel),
	Ingo Molnar, Hannes Frederic Sowa, netdev

This commit adds the infrastructure needed for threadable
rx interrupts. A reference to the irq thread is used to
mark the threaded irq mode.
In threaded mode the poll loop is invoked directly from
__napi_schedule().
NAPI drivers which want to support threadable irqs
must provide an irq mode change handler which actually sets
napi->thread, and register it after requesting the irq.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 include/linux/netdevice.h |  4 ++++
 net/core/dev.c            | 59 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 63 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d101e4d..5da53be 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -322,6 +322,9 @@ struct napi_struct {
 	struct list_head	dev_list;
 	struct hlist_node	napi_hash_node;
 	unsigned int		napi_id;
+#ifdef CONFIG_IRQ_FORCED_THREADING
+	struct task_struct	*thread;
+#endif
 };
 
 enum {
@@ -330,6 +333,7 @@ enum {
 	NAPI_STATE_NPSVC,	/* Netpoll - don't dequeue from poll_list */
 	NAPI_STATE_HASHED,	/* In NAPI hash (busy polling possible) */
 	NAPI_STATE_NO_BUSY_POLL,/* Do not add in napi_hash, no busy polling */
+	NAPI_STATE_SCHED_THREAD, /* The poll thread is scheduled */
 };
 
 enum gro_result {
diff --git a/net/core/dev.c b/net/core/dev.c
index b148357..40ea1e7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -93,6 +93,7 @@
 #include <linux/etherdevice.h>
 #include <linux/ethtool.h>
 #include <linux/notifier.h>
+#include <linux/kthread.h>
 #include <linux/skbuff.h>
 #include <net/net_namespace.h>
 #include <net/sock.h>
@@ -3453,10 +3454,68 @@ int netdev_tstamp_prequeue __read_mostly = 1;
 int netdev_budget __read_mostly = 300;
 int weight_p __read_mostly = 64;            /* old backlog weight */
 
+#ifdef CONFIG_IRQ_FORCED_THREADING
+static int napi_poll(struct napi_struct *n, struct list_head *repoll);
+
+static void napi_threaded_poll(struct napi_struct *napi)
+{
+	unsigned long time_limit = jiffies + 2;
+	struct list_head dummy_repoll;
+	int budget = netdev_budget;
+	bool again = true;
+
+	if (test_and_set_bit(NAPI_STATE_SCHED_THREAD, &napi->state))
+		return;
+
+	local_irq_enable();
+	INIT_LIST_HEAD(&dummy_repoll);
+
+	while (again) {
+		/* ensure that the poll list is not empty */
+		if (list_empty(&dummy_repoll))
+			list_add(&napi->poll_list, &dummy_repoll);
+
+		budget -= napi_poll(napi, &dummy_repoll);
+
+		if (napi_disable_pending(napi))
+			again = false;
+		else if (!test_bit(NAPI_STATE_SCHED, &napi->state))
+			again = false;
+		else if (kthread_should_stop())
+			again = false;
+
+		if (!again || unlikely(budget <= 0 ||
+				       time_after_eq(jiffies, time_limit))) {
+			/* no need to reschedule if we are going to stop */
+			if (again)
+				cond_resched_softirq();
+			time_limit = jiffies + 2;
+			budget = netdev_budget;
+			rcu_bh_qs();
+			__kfree_skb_flush();
+		}
+	}
+
+	clear_bit(NAPI_STATE_SCHED_THREAD, &napi->state);
+	local_irq_disable();
+}
+
+static inline bool napi_is_threaded(struct napi_struct *napi)
+{
+	return current == napi->thread;
+}
+#else
+#define napi_is_threaded(napi) 0
+#endif
+
 /* Called with irq disabled */
 static inline void ____napi_schedule(struct softnet_data *sd,
 				     struct napi_struct *napi)
 {
+	if (napi_is_threaded(napi)) {
+		napi_threaded_poll(napi);
+		return;
+	}
 	list_add_tail(&napi->poll_list, &sd->poll_list);
 	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
 }
-- 
1.8.3.1


* [PATCH 5/5] ixgbe: add support for threadable rx irq
  2016-06-15 13:42 [PATCH 0/5] genirq: threadable IRQ support Paolo Abeni
                   ` (3 preceding siblings ...)
  2016-06-15 13:42 ` [PATCH 4/5] netdev: implement infrastructure for threadable napi irq Paolo Abeni
@ 2016-06-15 13:42 ` Paolo Abeni
  4 siblings, 0 replies; 18+ messages in thread
From: Paolo Abeni @ 2016-06-15 13:42 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, David S. Miller, Eric Dumazet, Steven Rostedt,
	Peter Zijlstra (Intel),
	Ingo Molnar, Hannes Frederic Sowa, netdev

Plug in the threadable irq infrastructure to allow run-time
configuration of rx irqs when MSI-X irqs are used.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 088c47c..d9a591c 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2890,6 +2890,14 @@ int ixgbe_poll(struct napi_struct *napi, int budget)
 	return 0;
 }
 
+static void ixgbe_irq_mode_notifier(int irq, void *data,
+				    struct task_struct *irq_thread)
+{
+	struct ixgbe_q_vector *q_vector = (struct ixgbe_q_vector *)data;
+
+	q_vector->napi.thread = irq_thread;
+}
+
 /**
  * ixgbe_request_msix_irqs - Initialize MSI-X interrupts
  * @adapter: board private structure
@@ -2921,8 +2929,12 @@ static int ixgbe_request_msix_irqs(struct ixgbe_adapter *adapter)
 			/* skip this unused q_vector */
 			continue;
 		}
-		err = request_irq(entry->vector, &ixgbe_msix_clean_rings, 0,
+		err = request_irq(entry->vector, &ixgbe_msix_clean_rings,
+				  IRQF_TH_NO_AFFINITY | IRQF_TH_SCHED_NORMAL,
 				  q_vector->name, q_vector);
+		if (!err)
+			err = irq_set_mode_notifier(entry->vector, q_vector,
+						    ixgbe_irq_mode_notifier);
 		if (err) {
 			e_err(probe, "request_irq failed for MSIX interrupt "
 			      "Error: %d\n", err);
-- 
1.8.3.1


* Re: [PATCH 3/5] sched/preempt: cond_resched_softirq() must check for softirq
  2016-06-15 13:42 ` [PATCH 3/5] sched/preempt: cond_resched_softirq() must check for softirq Paolo Abeni
@ 2016-06-15 13:48   ` Peter Zijlstra
  2016-06-15 14:00     ` Paolo Abeni
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2016-06-15 13:48 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: linux-kernel, Thomas Gleixner, David S. Miller, Eric Dumazet,
	Steven Rostedt, Ingo Molnar, Hannes Frederic Sowa, netdev

On Wed, Jun 15, 2016 at 03:42:04PM +0200, Paolo Abeni wrote:
> Currently cond_resched_softirq() fails to reschedule if there
> are pending softirq but no other running process. This happens
> i.e. when receiving an interrupt with local bh disabled.
> 
> Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>

All your patches appear to have this broken SoB chain.

As presented it suggests you wrote the patches, which matches the From
header, however it then suggests Hannes collected and sent them
onwards, not so much.

Please correct.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/5] sched/preempt: cond_resched_softirq() must check for softirq
  2016-06-15 13:48   ` Peter Zijlstra
@ 2016-06-15 14:00     ` Paolo Abeni
  0 siblings, 0 replies; 18+ messages in thread
From: Paolo Abeni @ 2016-06-15 14:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, David S. Miller, Eric Dumazet,
	Steven Rostedt, Ingo Molnar, Hannes Frederic Sowa, netdev

On Wed, 2016-06-15 at 15:48 +0200, Peter Zijlstra wrote:
> On Wed, Jun 15, 2016 at 03:42:04PM +0200, Paolo Abeni wrote:
> > Currently cond_resched_softirq() fails to reschedule if there
> > are pending softirq but no other running process. This happens
> > i.e. when receiving an interrupt with local bh disabled.
> > 
> > Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
> > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> 
> All your patches appear to have this broken SoB chain.
> 
> As presented it suggests you wrote the patches, which matches with From,
> however it then suggests Hannes collected and send them onwards, not so
> much.
> 
> Please correct.

My bad. I'll re-submit. The intention was to specify that this is joint
work done together with Hannes.

Paolo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] netdev: implement infrastructure for threadable napi irq
  2016-06-15 13:42 ` [PATCH 4/5] netdev: implement infrastructure for threadable napi irq Paolo Abeni
@ 2016-06-15 14:12   ` kbuild test robot
  2016-06-15 14:17   ` Eric Dumazet
  1 sibling, 0 replies; 18+ messages in thread
From: kbuild test robot @ 2016-06-15 14:12 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: kbuild-all, linux-kernel, Thomas Gleixner, David S. Miller,
	Eric Dumazet, Steven Rostedt, Peter Zijlstra (Intel),
	Ingo Molnar, Hannes Frederic Sowa, netdev

[-- Attachment #1: Type: text/plain, Size: 3438 bytes --]

Hi,

[auto build test ERROR on tip/irq/core]
[also build test ERROR on v4.7-rc3 next-20160615]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Paolo-Abeni/genirq-threadable-IRQ-support/20160615-214836
config: cris-etrax-100lx_v2_defconfig (attached as .config)
compiler: cris-linux-gcc (GCC) 4.6.3
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=cris 

All error/warnings (new ones prefixed by >>):

>> net/core/dev.c:3457:5: warning: "CONFIG_IRQ_FORCED_THREADING" is not defined [-Wundef]
   net/core/dev.c: In function '____napi_schedule':
>> net/core/dev.c:3516:3: error: implicit declaration of function 'napi_threaded_poll' [-Werror=implicit-function-declaration]
   cc1: some warnings being treated as errors

vim +/napi_threaded_poll +3516 net/core/dev.c

  3451	EXPORT_SYMBOL(netdev_max_backlog);
  3452	
  3453	int netdev_tstamp_prequeue __read_mostly = 1;
  3454	int netdev_budget __read_mostly = 300;
  3455	int weight_p __read_mostly = 64;            /* old backlog weight */
  3456	
> 3457	#if CONFIG_IRQ_FORCED_THREADING
  3458	static int napi_poll(struct napi_struct *n, struct list_head *repoll);
  3459	
  3460	static void napi_threaded_poll(struct napi_struct *napi)
  3461	{
  3462		unsigned long time_limit = jiffies + 2;
  3463		struct list_head dummy_repoll;
  3464		int budget = netdev_budget;
  3465		bool again = true;
  3466	
  3467		if (test_and_set_bit(NAPI_STATE_SCHED_THREAD, &napi->state))
  3468			return;
  3469	
  3470		local_irq_enable();
  3471		INIT_LIST_HEAD(&dummy_repoll);
  3472	
  3473		while (again) {
  3474			/* ensure that the poll list is not empty */
  3475			if (list_empty(&dummy_repoll))
  3476				list_add(&napi->poll_list, &dummy_repoll);
  3477	
  3478			budget -= napi_poll(napi, &dummy_repoll);
  3479	
  3480			if (napi_disable_pending(napi))
  3481				again = false;
  3482			else if (!test_bit(NAPI_STATE_SCHED, &napi->state))
  3483				again = false;
  3484			else if (kthread_should_stop())
  3485				again = false;
  3486	
  3487			if (!again || unlikely(budget <= 0 ||
  3488					       time_after_eq(jiffies, time_limit))) {
  3489				/* no need to reschedule if we are going to stop */
  3490				if (again)
  3491					cond_resched_softirq();
  3492				time_limit = jiffies + 2;
  3493				budget = netdev_budget;
  3494				rcu_bh_qs();
  3495				__kfree_skb_flush();
  3496			}
  3497		}
  3498	
  3499		clear_bit(NAPI_STATE_SCHED_THREAD, &napi->state);
  3500		local_irq_disable();
  3501	}
  3502	
  3503	static inline bool napi_is_threaded(struct napi_struct *napi)
  3504	{
  3505		return current == napi->thread;
  3506	}
  3507	#else
  3508	#define napi_is_threaded(napi) 0
  3509	#endif
  3510	
  3511	/* Called with irq disabled */
  3512	static inline void ____napi_schedule(struct softnet_data *sd,
  3513					     struct napi_struct *napi)
  3514	{
  3515		if (napi_is_threaded(napi)) {
> 3516			napi_threaded_poll(napi);
  3517			return;
  3518		}
  3519		list_add_tail(&napi->poll_list, &sd->poll_list);

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 8526 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] netdev: implement infrastructure for threadable napi irq
  2016-06-15 13:42 ` [PATCH 4/5] netdev: implement infrastructure for threadable napi irq Paolo Abeni
  2016-06-15 14:12   ` kbuild test robot
@ 2016-06-15 14:17   ` Eric Dumazet
  2016-06-15 14:21     ` Eric Dumazet
  2016-06-15 16:42     ` Paolo Abeni
  1 sibling, 2 replies; 18+ messages in thread
From: Eric Dumazet @ 2016-06-15 14:17 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: LKML, Thomas Gleixner, David S. Miller, Steven Rostedt,
	Peter Zijlstra (Intel),
	Ingo Molnar, Hannes Frederic Sowa, netdev

On Wed, Jun 15, 2016 at 6:42 AM, Paolo Abeni <pabeni@redhat.com> wrote:
> This commit adds the infrastructure needed for threadable
> rx interrupt. A reference to the irq thread is used to
> mark the threaded irq mode.
> In threaded mode the poll loop is invoked directly from
> __napi_schedule().
> napi drivers which want to support threadable irq interrupts
> must provide an irq mode change handler which actually set
> napi->thread and register it after requesting the irq.
>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> ---
>  include/linux/netdevice.h |  4 ++++
>  net/core/dev.c            | 59 +++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 63 insertions(+)
>

I really appreciate the effort, but as I already said this is not going to work.

Many NIC have 2 NAPI contexts per queue, one for TX, one for RX.

Relying on CFS to switch between the two 'threads' you need in the one
vCPU case will add latencies that your 'pure throughput UDP flood' is
not able to detect.

I was waiting for a fix from Andy Lutomirski to be merged before
sending my ksoftirqd fix, which will work and won't bring kernel bloat.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] netdev: implement infrastructure for threadable napi irq
  2016-06-15 14:17   ` Eric Dumazet
@ 2016-06-15 14:21     ` Eric Dumazet
  2016-06-15 16:42     ` Paolo Abeni
  1 sibling, 0 replies; 18+ messages in thread
From: Eric Dumazet @ 2016-06-15 14:21 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: LKML, Thomas Gleixner, David S. Miller, Steven Rostedt,
	Peter Zijlstra (Intel),
	Ingo Molnar, Hannes Frederic Sowa, netdev, Andy Lutomirski

On Wed, Jun 15, 2016 at 7:17 AM, Eric Dumazet <edumazet@google.com> wrote:
> On Wed, Jun 15, 2016 at 6:42 AM, Paolo Abeni <pabeni@redhat.com> wrote:
>> This commit adds the infrastructure needed for threadable
>> rx interrupt. A reference to the irq thread is used to
>> mark the threaded irq mode.
>> In threaded mode the poll loop is invoked directly from
>> __napi_schedule().
>> napi drivers which want to support threadable irq interrupts
>> must provide an irq mode change handler which actually set
>> napi->thread and register it after requesting the irq.
>>
>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
>> ---
>>  include/linux/netdevice.h |  4 ++++
>>  net/core/dev.c            | 59 +++++++++++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 63 insertions(+)
>>
>
> I really appreciate the effort, but as I already said this is not going to work.
>
> Many NIC have 2 NAPI contexts per queue, one for TX, one for RX.
>
> Relying on CFS to switch from the two 'threads' you need in the one
> vCPU case will add latencies that your 'pure throughput UDP flood' is
> not able to detect.
>
> I was waiting a fix from Andy Lutomirski to be merged before sending
> my ksoftirqd fix, which will work and wont bring kernel bloat.


Andy's patch was "x86/traps: Don't force in_interrupt() to return true
in IST handlers".

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/5] genirq: implement support for runtime switch to threaded irqs
  2016-06-15 13:42 ` [PATCH 1/5] genirq: implement support for runtime switch to threaded irqs Paolo Abeni
@ 2016-06-15 14:50   ` kbuild test robot
  0 siblings, 0 replies; 18+ messages in thread
From: kbuild test robot @ 2016-06-15 14:50 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: kbuild-all, linux-kernel, Thomas Gleixner, David S. Miller,
	Eric Dumazet, Steven Rostedt, Peter Zijlstra (Intel),
	Ingo Molnar, Hannes Frederic Sowa, netdev

[-- Attachment #1: Type: text/plain, Size: 2510 bytes --]

Hi,

[auto build test WARNING on tip/irq/core]
[also build test WARNING on v4.7-rc3 next-20160615]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Paolo-Abeni/genirq-threadable-IRQ-support/20160615-214836
reproduce: make htmldocs

All warnings (new ones prefixed by >>):

>> kernel/irq/manage.c:1681: warning: No description found for parameter 'notifier'
>> kernel/irq/manage.c:1681: warning: Excess function parameter 'mode_notifier' description in 'irq_set_mode_notifier'
   kernel/irq/handle.c:1: warning: no structured comments found
--
   lib/crc32.c:148: warning: No description found for parameter 'tab)[256]'
   lib/crc32.c:148: warning: Excess function parameter 'tab' description in 'crc32_le_generic'
   lib/crc32.c:293: warning: No description found for parameter 'tab)[256]'
   lib/crc32.c:293: warning: Excess function parameter 'tab' description in 'crc32_be_generic'
   lib/crc32.c:1: warning: no structured comments found
   mm/memory.c:2881: warning: No description found for parameter 'old'
>> kernel/irq/manage.c:1681: warning: No description found for parameter 'notifier'
>> kernel/irq/manage.c:1681: warning: Excess function parameter 'mode_notifier' description in 'irq_set_mode_notifier'

vim +/notifier +1681 kernel/irq/manage.c

  1665	/**
  1666	 *	irq_set_mode_notifier - register a mode change notifier
  1667	 *	@irq: Interrupt line
  1668	 *	@dev_id: The cookie used to identify the irq handler and passed back
  1669	 *		 to the notifier
  1670	 *	@mode_notifier: The callback to be registered
  1671	 *
  1672	 *	This call registers a callback to notify the device about irq mode
  1673	 *	change (threaded/normal mode). Mode change are triggered writing on
  1674	 *	the 'threaded' procfs entry.
  1675	 *	When running in threaded mode the irq thread task struct will be passed
  1676	 *	to the notifer, or NULL elsewhere. It's up to the device update its
  1677	 *	internal state accordingly
  1678	 */
  1679	int irq_set_mode_notifier(unsigned int irq, void *dev_id,
  1680				  mode_notifier_t notifier)
> 1681	{
  1682		struct irq_desc *desc = irq_to_desc(irq);
  1683		struct irqaction *action;
  1684		unsigned long flags;
  1685		int ret = -EINVAL;
  1686	
  1687		if (!desc)
  1688			return ret;
  1689	
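For what it's worth, both warnings flag the same thing: the kernel-doc block documents @mode_notifier while the prototype names the parameter 'notifier'. A sketch of the fixed comment header, assuming the tag is simply renamed to match the prototype (the body is otherwise the quoted comment, lightly copy-edited):

```c
/**
 *	irq_set_mode_notifier - register a mode change notifier
 *	@irq: Interrupt line
 *	@dev_id: The cookie used to identify the irq handler and passed
 *		 back to the notifier
 *	@notifier: The callback to be registered
 *
 *	This call registers a callback to notify the device about irq
 *	mode changes (threaded/normal mode). Mode changes are triggered
 *	by writing to the 'threaded' procfs entry.
 */
```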

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 6370 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] netdev: implement infrastructure for threadable napi irq
  2016-06-15 14:17   ` Eric Dumazet
  2016-06-15 14:21     ` Eric Dumazet
@ 2016-06-15 16:42     ` Paolo Abeni
  2016-06-15 17:04       ` Eric Dumazet
  1 sibling, 1 reply; 18+ messages in thread
From: Paolo Abeni @ 2016-06-15 16:42 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: LKML, Thomas Gleixner, David S. Miller, Steven Rostedt,
	Peter Zijlstra (Intel),
	Ingo Molnar, Hannes Frederic Sowa, netdev

On Wed, 2016-06-15 at 07:17 -0700, Eric Dumazet wrote:
> On Wed, Jun 15, 2016 at 6:42 AM, Paolo Abeni <pabeni@redhat.com> wrote:
> > This commit adds the infrastructure needed for threadable
> > rx interrupt. A reference to the irq thread is used to
> > mark the threaded irq mode.
> > In threaded mode the poll loop is invoked directly from
> > __napi_schedule().
> > napi drivers which want to support threadable irq interrupts
> > must provide an irq mode change handler which actually set
> > napi->thread and register it after requesting the irq.
> >
> > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> > ---
> >  include/linux/netdevice.h |  4 ++++
> >  net/core/dev.c            | 59 +++++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 63 insertions(+)
> >
> 
> I really appreciate the effort, but as I already said this is not going to work.
> 
> Many NIC have 2 NAPI contexts per queue, one for TX, one for RX.
> 
> Relying on CFS to switch from the two 'threads' you need in the one
> vCPU case will add latencies that your 'pure throughput UDP flood' is
> not able to detect.

We have done TCP_RR tests with similar results: when the throughput is
(guest) CPU bound and multiple flows are used, there is a measurable
gain.

> I was waiting a fix from Andy Lutomirski to be merged before sending
> my ksoftirqd fix, which will work and wont bring kernel bloat.

We experimented with that patch in this scenario, but it doesn't give a
measurable gain, since the ksoftirqd threads still prevent the qemu
process from using 100% of any of the hypervisor's cores.

Paolo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] netdev: implement infrastructure for threadable napi irq
  2016-06-15 16:42     ` Paolo Abeni
@ 2016-06-15 17:04       ` Eric Dumazet
  2016-06-16 10:39         ` Paolo Abeni
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2016-06-15 17:04 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: LKML, Thomas Gleixner, David S. Miller, Steven Rostedt,
	Peter Zijlstra (Intel),
	Ingo Molnar, Hannes Frederic Sowa, netdev

On Wed, Jun 15, 2016 at 9:42 AM, Paolo Abeni <pabeni@redhat.com> wrote:
> On Wed, 2016-06-15 at 07:17 -0700, Eric Dumazet wrote:

>>
>> I really appreciate the effort, but as I already said this is not going to work.
>>
>> Many NIC have 2 NAPI contexts per queue, one for TX, one for RX.
>>
>> Relying on CFS to switch from the two 'threads' you need in the one
>> vCPU case will add latencies that your 'pure throughput UDP flood' is
>> not able to detect.
>
> We have done TCP_RR tests with similar results: when the throughput is
> (guest) cpu bounded and multiple flows are used, there is measurable
> gain.

TCP_RR hardly triggers the problem I am mentioning.

You need a combination of different competing workloads, both bulk and
RPC-like.

The important factor for RPC is P99 latency.

Look, the simple fact that the mlx4 driver can dequeue 256 skbs per TX
napi poll and only 64 skbs per RX poll is problematic in some
workloads, since this allows a queue to build up on the RX rings.

>
>> I was waiting a fix from Andy Lutomirski to be merged before sending
>> my ksoftirqd fix, which will work and wont bring kernel bloat.
>
> We experimented that patch in this scenario, but it don't give
> measurable gain, since the ksoftirqd threads still prevent the qemu
> process from using 100% of any hypervisor's cores.

Not sure what you measured, but in my experiment, the user thread
could finally get a fair share of the core, instead of 0%

Improvement was 100000 % or so.

How are you making sure your thread uses, say, 1% of the core, and
leaves 99% to the 'qemu' process, exactly?

How the typical user will enable all this stuff exactly ?

All I am saying is that you add complex infrastructure that will need a
lot of tweaks and a considerable maintenance burden, instead of fixing
the existing one _first_.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] netdev: implement infrastructure for threadable napi irq
  2016-06-15 17:04       ` Eric Dumazet
@ 2016-06-16 10:39         ` Paolo Abeni
  2016-06-16 11:19           ` Eric Dumazet
  0 siblings, 1 reply; 18+ messages in thread
From: Paolo Abeni @ 2016-06-16 10:39 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: LKML, Thomas Gleixner, David S. Miller, Steven Rostedt,
	Peter Zijlstra (Intel),
	Ingo Molnar, Hannes Frederic Sowa, netdev

On Wed, 2016-06-15 at 10:04 -0700, Eric Dumazet wrote:
> On Wed, Jun 15, 2016 at 9:42 AM, Paolo Abeni <pabeni@redhat.com> wrote:
> > On Wed, 2016-06-15 at 07:17 -0700, Eric Dumazet wrote:
> 
> >>
> >> I really appreciate the effort, but as I already said this is not going to work.
> >>
> >> Many NIC have 2 NAPI contexts per queue, one for TX, one for RX.
> >>
> >> Relying on CFS to switch from the two 'threads' you need in the one
> >> vCPU case will add latencies that your 'pure throughput UDP flood' is
> >> not able to detect.
> >
> > We have done TCP_RR tests with similar results: when the throughput is
> > (guest) cpu bounded and multiple flows are used, there is measurable
> > gain.
> 
> TCP_RR hardly triggers the problem I am mentioning.
> 
> You need a combination of different competing works. Both bulk and rpc like.
> 
> The important factor for RPC is P99 latency.
> 
> Look, the simple fact that mlx4 driver can dequeue 256 skb per TX napi poll
> and only 64 skbs in RX poll is problematic in some workloads, since
> this allows a queue to build up on RX rings.
> 
> >
> >> I was waiting a fix from Andy Lutomirski to be merged before sending
> >> my ksoftirqd fix, which will work and wont bring kernel bloat.
> >
> > We experimented that patch in this scenario, but it don't give
> > measurable gain, since the ksoftirqd threads still prevent the qemu
> > process from using 100% of any hypervisor's cores.
> 
> Not sure what you measured, but in my experiment, the user thread
> could finally get a fair share of the core, instead of 0%
> 
> Improvement was 100000 % or so.

We used a different setup to explicitly avoid the (guest) userspace
starvation issue. Using a guest with 2 vCPUs (or more) and a single
queue avoids the starvation issue, because the scheduler moves the user
space processes to a different vCPU from the one running the ksoftirqd
thread.

In the hypervisor, with a vanilla kernel, the qemu process receives a
fair share of the CPU time, but considerably less than 100%, and its
performance is bound to a throughput considerably lower than the
theoretical one.

We tested your patch in the guest and/or the hypervisor with the above
scenario and it doesn't change the throughput numbers much. But it
nicely fixes the starvation issue on single-core hosts, and we are
definitely in favor of it and waiting to see it included.

> How are you making sure your thread uses say 1% of the core, and let
> 99% to the 'qemu' process exactly ?

We allow the irq thread to be migrated. The scheduler can move it to a
different (hypervisor) core according to the workload, and qemu can
completely avoid competing with other processes for a CPU.

We are not using the threaded irqs in the guest, only into the
hypervisor.

> How the typical user will enable all this stuff exactly ?

A desktop host or a bare-metal server probably doesn't need/want it. A
hypervisor or a (small) router would probably enable irq threading on
all supported NICs. That could be managed by the tuned daemon or the
like with an appropriate profile.
Advanced users, including real-time-sensitive users, can simply use the
procfs entry now.

Kernels without IRQ_FORCED_THREADING are unaffected; kernels with
IRQ_FORCED_THREADING can already change packet reception (and more) in
a significant way with the "threadirqs" boot parameter.
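For reference, per the cover letter the switch is a plain procfs write. A hypothetical sketch of the workflow (the IRQ number and irqaction name are made up for illustration; the real entry is /proc/irq/<nr>/<irq action name>/threaded):

```shell
# Switch the "eth0-TxRx-0" irqaction on IRQ 42 to threaded mode
echo 1 > /proc/irq/42/eth0-TxRx-0/threaded
# ...and back to normal (softirq) mode
echo 0 > /proc/irq/42/eth0-TxRx-0/threaded
```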

Paolo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] netdev: implement infrastructure for threadable napi irq
  2016-06-16 10:39         ` Paolo Abeni
@ 2016-06-16 11:19           ` Eric Dumazet
  2016-06-16 12:03             ` Paolo Abeni
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2016-06-16 11:19 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: LKML, Thomas Gleixner, David S. Miller, Steven Rostedt,
	Peter Zijlstra (Intel),
	Ingo Molnar, Hannes Frederic Sowa, netdev

On Thu, Jun 16, 2016 at 3:39 AM, Paolo Abeni <pabeni@redhat.com> wrote:
> We used a different setup to explicitly avoid the (guest) userspace
> starvation issue. Using a guest with 2vCPUs (or more) and a single queue
> avoids the starvation issue, because the scheduler moves the user space
> processes on a different vCPU in respect to the ksoftirqd thread.
>
> In the hypervisor, with a vanilla kernel, the qemu process receives a
> fair share of the cpu time, but considerably less 100%, and his
> performances are bounded to a considerable lower throughput than the
> theoretical one.
>

Completely different setup than last time. I am kind of lost.

Are you trying to find the optimal way to demonstrate your patch can be useful ?

In a case with 2 vCPUs, the _standard_ kernel will migrate the user
thread to the cpu not used by the IRQ, once the process scheduler can
see two threads competing on one cpu (ksoftirqd and the user thread)
and the other cpu being idle.

Trying to shift the IRQ 'thread' is not nice, since the hardware IRQ
will be delivered on the wrong cpu.

Unless user space forces cpu pinning ? Then tell the user it should not.

The natural choice is to put both producer and consumer on same cpu
for cache locality reasons (wake affine),
but in stress mode allow to run the consumer on another cpu if available.

If the process scheduler fails to migrate the producer, then there is
a bug needing to be fixed.

Trying to migrate the producer, while hardware IRQs generally stick to
one cpu, is counter-intuitive and a source of reorders.

(Think of tunneling processing, re-injecting packets to the stack with
netif_rx())

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] netdev: implement infrastructure for threadable napi irq
  2016-06-16 11:19           ` Eric Dumazet
@ 2016-06-16 12:03             ` Paolo Abeni
  2016-06-16 16:55               ` Eric Dumazet
  0 siblings, 1 reply; 18+ messages in thread
From: Paolo Abeni @ 2016-06-16 12:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: LKML, Thomas Gleixner, David S. Miller, Steven Rostedt,
	Peter Zijlstra (Intel),
	Ingo Molnar, Hannes Frederic Sowa, netdev

On Thu, 2016-06-16 at 04:19 -0700, Eric Dumazet wrote:
> On Thu, Jun 16, 2016 at 3:39 AM, Paolo Abeni <pabeni@redhat.com> wrote:
> > We used a different setup to explicitly avoid the (guest) userspace
> > starvation issue. Using a guest with 2vCPUs (or more) and a single queue
> > avoids the starvation issue, because the scheduler moves the user space
> > processes on a different vCPU in respect to the ksoftirqd thread.
> >
> > In the hypervisor, with a vanilla kernel, the qemu process receives a
> > fair share of the cpu time, but considerably less 100%, and his
> > performances are bounded to a considerable lower throughput than the
> > theoretical one.
> >
> 
> Completely different setup than last time. I am kind of lost.
> 
> Are you trying to find the optimal way to demonstrate your patch can be useful ?
> 
> In a case with 2 vcpus, then the _standard_ kernel will migrate the
> user thread on the cpu not used by the IRQ,
> once process scheduler can see two threads competing on one cpu
> (ksoftirqd and the user thread), and the other cpu being idle.
> 
> Trying to shift the IRQ 'thread' is not nice, since the hardware IRQ
> will be delivered on the wrong cpu.
> 
> Unless user space forces cpu pinning ? Then tell the user it should not.
> 
> The natural choice is to put both producer and consumer on same cpu
> for cache locality reasons (wake affine),
> but in stress mode allow to run the consumer on another cpu if available.
> 
> If the process scheduler fails to migrate the producer, then there is
> a bug needing to be fixed.

I guess you mean 'consumer' here. The scheduler doesn't fail to
migrate it: the consumer is actually migrated many times, but on each
cpu a competing, running ksoftirqd thread is found.

The general problem is that under significant network load (not
necessarily a udp flood; similar behavior is observed even with TCP_RR
tests), with enough rx queues available and enough flows running, no
single thread/process can use 100% of any cpu, even if the overall
capacity would allow it.

Paolo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] netdev: implement infrastructure for threadable napi irq
  2016-06-16 12:03             ` Paolo Abeni
@ 2016-06-16 16:55               ` Eric Dumazet
  0 siblings, 0 replies; 18+ messages in thread
From: Eric Dumazet @ 2016-06-16 16:55 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: LKML, Thomas Gleixner, David S. Miller, Steven Rostedt,
	Peter Zijlstra (Intel),
	Ingo Molnar, Hannes Frederic Sowa, netdev

>
> I guess you means 'consumer' here. The scheduler doesn't fail to migrate
> it: the consumer is actually migrated a lot of times, but on each cpu a
> competing and running ksoftirqd thread is found.
>
> The general problem is that under significant network load (not
> necessary udp flood, similar behavior is observed even with TCP_RR
> tests), with enough rx queue available and enough flows running, no
> single thread/process can use 100% of any cpu, even if the overall
> capacity would allow it.
>

Looks like a general process scheduler issue ?

Really, allowing the RX processing to be migrated among cpus is
problematic for TCP,
as it will increase reorders.

RFS for example has a very specific logic to avoid these problems as
much as possible.

                /*
                 * If the desired CPU (where last recvmsg was done) is
                 * different from current CPU (one in the rx-queue flow
                 * table entry), switch if one of the following holds:
                 *   - Current CPU is unset (>= nr_cpu_ids).
                 *   - Current CPU is offline.
                 *   - The current CPU's queue tail has advanced beyond the
                 *     last packet that was enqueued using this table entry.
                 *     This guarantees that all previous packets for the flow
                 *     have been dequeued, thus preserving in order delivery.
                 */
                if (unlikely(tcpu != next_cpu) &&
                    (tcpu >= nr_cpu_ids || !cpu_online(tcpu) ||
                     ((int)(per_cpu(softnet_data, tcpu).input_queue_head -
                      rflow->last_qtail)) >= 0)) {
                        tcpu = next_cpu;
                        rflow = set_rps_cpu(dev, skb, rflow, next_cpu);
                }

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2016-06-16 16:55 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-15 13:42 [PATCH 0/5] genirq: threadable IRQ support Paolo Abeni
2016-06-15 13:42 ` [PATCH 1/5] genirq: implement support for runtime switch to threaded irqs Paolo Abeni
2016-06-15 14:50   ` kbuild test robot
2016-06-15 13:42 ` [PATCH 2/5] genirq: add flags for controlling the default threaded irq behavior Paolo Abeni
2016-06-15 13:42 ` [PATCH 3/5] sched/preempt: cond_resched_softirq() must check for softirq Paolo Abeni
2016-06-15 13:48   ` Peter Zijlstra
2016-06-15 14:00     ` Paolo Abeni
2016-06-15 13:42 ` [PATCH 4/5] netdev: implement infrastructure for threadable napi irq Paolo Abeni
2016-06-15 14:12   ` kbuild test robot
2016-06-15 14:17   ` Eric Dumazet
2016-06-15 14:21     ` Eric Dumazet
2016-06-15 16:42     ` Paolo Abeni
2016-06-15 17:04       ` Eric Dumazet
2016-06-16 10:39         ` Paolo Abeni
2016-06-16 11:19           ` Eric Dumazet
2016-06-16 12:03             ` Paolo Abeni
2016-06-16 16:55               ` Eric Dumazet
2016-06-15 13:42 ` [PATCH 5/5] ixgbe: add support for threadable rx irq Paolo Abeni

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).