All of lore.kernel.org
 help / color / mirror / Atom feed
* net: Automatic IRQ siloing for network devices
@ 2011-04-15 20:17 Neil Horman
  2011-04-15 20:17 ` [PATCH 1/3] irq: Add registered affinity guidance infrastructure Neil Horman
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Neil Horman @ 2011-04-15 20:17 UTC (permalink / raw)
  To: netdev; +Cc: davem

Automatic IRQ siloing for network devices

At last years netconf:
http://vger.kernel.org/netconf2010.html

Tom Herbert gave a talk in which he outlined some of the things we can do to
improve scalability and througput in our network stack

One of the big items on the slides was the notion of siloing irqs, which is the
practice of setting irq affinity to a cpu or cpu set that was 'close' to the
process that would be consuming data.  The idea was to ensure that a hard irq
for a nic (and its subsequent softirq) would execute on the same cpu as the
process consuming the data, increasing cache hit rates and speeding up overall
throughput.

I had taken an idea away from that talk, and have finally gotten around to
implementing it.  One of the problems with the above approach is that its all
quite manual.  I.e. to properly enact this siloiong, you have to do a few things
by hand:

1) decide which process is the heaviest user of a given rx queue 
2) restrict the cpus which that task will run on
3) identify the irq which the rx queue in (1) maps to
4) manually set the affinity for the irq in (3) to cpus which match the cpus in
(2)

That configuration of course has to change in response to workload changed (what
if your consumer process gets reworked so that its no longer the largest network
user, etc).  

I thought it would be good if we could automate some amount of this, and I think
I've found a way to do that.  With this patch set I introduce the ability to:

A) Register common affinity monitoring routines against a given irq which can
implement various algorithms to determine a suggested placement of said irq's
affinity

B) Add an algorithm to the network subsystem to track the amount of data that
flows through each entry in a given rx_queues rps_flow_table, and uses that data
to suggest an affinity for the irq associated with that rx queue.

This patchset lets these affinity suggestions get exported via the
/proc/irq/<n>/affinity_hint interface (which is unused in the kernel with the
exception of ixgbe).  It also exports a new proc file affinity_alg which informs
anyone interested in the affinity_hint how the hint is being computed.

Testing:
	I've been running this patchset on my dual core system here with a cxgb4
as my network interface.  I've been running a TCP STREAMS test from netperf in 2
minute increments under various conditions.  I've found experimentally that (as
you might expect) optimal performance is reached when irq affinity is bound to a
core that is not the cpu core identified by the largest RFS flow, but is as
close to it as possible (ideally sharing an L2 cache).  In that way with we
avoid the cpu contention between the softirq and the application, while still
maximizing cache hits.  In congunction with the irqbalance patch I hacked up
here:

http://people.redhat.com/nhorman/irqbalance.patch

To steer irqs that have affinity using the rfs max weight algorithm to cpus that
are as close as possible to the hinted cpu, I'm able to get approximately a 3%
speedup in receive rates over the pessimal case, and about a 1% speedup over the
nominal case (statically setting irq affinity to a single cpu).

Note: Currently this patch set only updates cxgb4 to use the new hinting
mechanism.  If this gets accepted, I have more cards to test with and plan to
update them, but I thought for a first pass it would be better to simply update
what I tested with.

Thoughts/Opinions appreciated

Thanks & Regards
Neil


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/3] irq: Add registered affinity guidance infrastructure
  2011-04-15 20:17 net: Automatic IRQ siloing for network devices Neil Horman
@ 2011-04-15 20:17 ` Neil Horman
  2011-04-16  0:22   ` Thomas Gleixner
  2011-04-15 20:17 ` [PATCH 2/3] net: Add net device irq siloing feature Neil Horman
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 20+ messages in thread
From: Neil Horman @ 2011-04-15 20:17 UTC (permalink / raw)
  To: netdev
  Cc: davem, nhorman, Dimitris Michailidis, Thomas Gleixner,
	David Howells, Eric Dumazet, Tom Herbert

From: nhorman <nhorman@devel2.think-freely.org>

This patch adds the needed data to the irq_desc struct, as well as the needed
API calls to allow the requester of an irq to register a handler function to
determine the affinity_hint of that irq when queried from user space.

Signed-offy-by: Neil Horman <nhorman@tuxdriver.com>

CC: Dimitris Michailidis <dm@chelsio.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: David Howells <dhowells@redhat.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Tom Herbert <therbert@google.com>
---
 include/linux/interrupt.h |   38 +++++++++++++++++++++++++++++++++++++-
 include/linux/irq.h       |    9 +++++++++
 include/linux/irqdesc.h   |    4 ++++
 kernel/irq/Kconfig        |   12 +++++++++++-
 kernel/irq/manage.c       |   39 +++++++++++++++++++++++++++++++++++++++
 kernel/irq/proc.c         |   35 +++++++++++++++++++++++++++++++++++
 6 files changed, 135 insertions(+), 2 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 59b72ca..6edb364 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -118,6 +118,17 @@ struct irqaction {
 } ____cacheline_internodealigned_in_smp;
 
 extern irqreturn_t no_action(int cpl, void *dev_id);
+#ifdef CONFIG_AFFINITY_UPDATE
+extern int setup_affinity_data(int irq, irq_affinity_init_t, void *);
+#else
+static inline int setup_affinity_data(int irq,
+				      irq_affinity_init_t init, void *d)
+{
+	return 0;
+}
+#endif
+
+extern void free_irq(unsigned int, void *);
 
 #ifdef CONFIG_GENERIC_HARDIRQS
 extern int __must_check
@@ -125,6 +136,32 @@ request_threaded_irq(unsigned int irq, irq_handler_t handler,
 		     irq_handler_t thread_fn,
 		     unsigned long flags, const char *name, void *dev);
 
+#ifdef CONFIG_AFFINITY_UPDATE
+static inline int __must_check
+request_affinity_irq(unsigned int irq, irq_handler_t handler,
+		     irq_handler_t thread_fn,
+		     unsigned long flags, const char *name, void *dev,
+		     irq_affinity_init_t af_init, void *af_priv)
+{
+	int rc;
+
+	rc = request_threaded_irq(irq, handler, thread_fn, flags, name, dev);
+	if (rc)
+		goto out;
+
+	if (af_init)
+		rc = setup_affinity_data(irq, af_init, af_priv);
+	if (rc)
+		free_irq(irq, dev);
+
+out:
+	return rc;
+}
+#else
+#define request_affinity_irq(irq, hnd, tfn, flg, nm, dev, init, priv) \
+	request_threaded_irq(irq, hnd, NULL, flg, nm, dev)
+#endif
+
 static inline int __must_check
 request_irq(unsigned int irq, irq_handler_t handler, unsigned long flags,
 	    const char *name, void *dev)
@@ -167,7 +204,6 @@ request_any_context_irq(unsigned int irq, irq_handler_t handler,
 static inline void exit_irq_thread(void) { }
 #endif
 
-extern void free_irq(unsigned int, void *);
 
 struct device;
 
diff --git a/include/linux/irq.h b/include/linux/irq.h
index 1d3577f..4bff14f 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -30,6 +30,15 @@
 
 struct irq_desc;
 struct irq_data;
+struct affin_data {
+	void *priv;
+	char *affinity_alg;
+	void (*affin_update)(int irq, struct affin_data *ad);
+	void (*affin_cleanup)(int irq, struct affin_data *ad);
+};
+
+typedef int (*irq_affinity_init_t)(int, struct affin_data*, void *);
+
 typedef	void (*irq_flow_handler_t)(unsigned int irq,
 					    struct irq_desc *desc);
 typedef	void (*irq_preflow_handler_t)(struct irq_data *data);
diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h
index 0021837..14a22fb 100644
--- a/include/linux/irqdesc.h
+++ b/include/linux/irqdesc.h
@@ -64,6 +64,10 @@ struct irq_desc {
 	struct timer_rand_state *timer_rand_state;
 	unsigned int __percpu	*kstat_irqs;
 	irq_flow_handler_t	handle_irq;
+#ifdef CONFIG_AFFINITY_UPDATE
+	struct affin_data	*af_data;
+#endif
+
 #ifdef CONFIG_IRQ_PREFLOW_FASTEOI
 	irq_preflow_handler_t	preflow_handler;
 #endif
diff --git a/kernel/irq/Kconfig b/kernel/irq/Kconfig
index 09bef82..abaf19c 100644
--- a/kernel/irq/Kconfig
+++ b/kernel/irq/Kconfig
@@ -51,6 +51,17 @@ config IRQ_PREFLOW_FASTEOI
 config IRQ_FORCED_THREADING
        bool
 
+config AFFINITY_UPDATE
+	bool "Support irq affinity direction"
+	depends on GENERIC_HARDIRQS
+	---help---
+
+	Affinity updating adds the ability for requestors of irqs to
+	register affinity update methods against the irq in question
+	in so doing the requestor can be informed every time user space
+	queries an irq for its optimal affinity, giving the requstor the
+	chance to tell user space where the irq can be optimally handled
+
 config SPARSE_IRQ
 	bool "Support sparse irq numbering"
 	depends on HAVE_SPARSE_IRQ
@@ -64,6 +75,5 @@ config SPARSE_IRQ
 	    out the interrupt descriptors in a more NUMA-friendly way. )
 
 	  If you don't know what to do here, say N.
-
 endmenu
 endif
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index acd599a..257ea4d 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1159,6 +1159,17 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
 
 	unregister_handler_proc(irq, action);
 
+#ifdef CONFIG_AFFINITY_UPDATE
+	/*
+	 * Have to do this after we unregister proc accessors
+	 */
+	if (desc->af_data) {
+		if (desc->af_data->affin_cleanup)
+			desc->af_data->affin_cleanup(irq, desc->af_data);
+		kfree(desc->af_data);
+		desc->af_data = NULL;
+	}
+#endif
 	/* Make sure it's not being used on another CPU: */
 	synchronize_irq(irq);
 
@@ -1345,6 +1356,34 @@ int request_threaded_irq(unsigned int irq, irq_handler_t handler,
 }
 EXPORT_SYMBOL(request_threaded_irq);
 
+#ifdef CONFIG_AFFINITY_UPDATE
+int setup_affinity_data(int irq, irq_affinity_init_t af_init, void *af_priv)
+{
+	struct affin_data *data;
+	struct irq_desc *desc;
+	int rc;
+
+	desc = irq_to_desc(irq);
+	if (!desc)
+		return -ENOENT;
+
+	data = kzalloc(sizeof(struct affin_data), GFP_KERNEL);
+	if (!data)
+		return -ENOMEM;
+
+	rc = af_init(irq, data, af_priv);
+	if (rc) {
+		kfree(data);
+		return rc;
+	}
+
+	desc->af_data = data;
+
+	return 0;
+}
+EXPORT_SYMBOL(setup_affinity_data);
+#endif
+
 /**
  *	request_any_context_irq - allocate an interrupt line
  *	@irq: Interrupt line to allocate
diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c
index 4cc2e5e..8fecb05 100644
--- a/kernel/irq/proc.c
+++ b/kernel/irq/proc.c
@@ -42,6 +42,11 @@ static int irq_affinity_hint_proc_show(struct seq_file *m, void *v)
 	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
 		return -ENOMEM;
 
+#ifdef CONFIG_AFFINITY_UPDATE
+	if (desc->af_data && desc->af_data->affin_update)
+		desc->af_data->affin_update((long)m->private, desc->af_data);
+#endif
+
 	raw_spin_lock_irqsave(&desc->lock, flags);
 	if (desc->affinity_hint)
 		cpumask_copy(mask, desc->affinity_hint);
@@ -54,6 +59,19 @@ static int irq_affinity_hint_proc_show(struct seq_file *m, void *v)
 	return 0;
 }
 
+static int irq_affinity_alg_proc_show(struct seq_file *m, void *v)
+{
+	char *alg = "none";
+#ifdef CONFIG_AFFINITY_UPDATE
+	struct irq_desc *desc = irq_to_desc((long)m->private);
+
+	if (desc->af_data->affinity_alg)
+		alg = desc->af_data->affinity_alg;
+#endif
+	seq_printf(m, "%s\n", alg);
+	return 0;
+}
+
 #ifndef is_affinity_mask_valid
 #define is_affinity_mask_valid(val) 1
 #endif
@@ -110,6 +128,11 @@ static int irq_affinity_hint_proc_open(struct inode *inode, struct file *file)
 	return single_open(file, irq_affinity_hint_proc_show, PDE(inode)->data);
 }
 
+static int irq_affinity_alg_proc_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, irq_affinity_alg_proc_show, PDE(inode)->data);
+}
+
 static const struct file_operations irq_affinity_proc_fops = {
 	.open		= irq_affinity_proc_open,
 	.read		= seq_read,
@@ -125,6 +148,13 @@ static const struct file_operations irq_affinity_hint_proc_fops = {
 	.release	= single_release,
 };
 
+static const struct file_operations irq_affinity_alg_proc_fops = {
+	.open		= irq_affinity_alg_proc_open,
+	.read           = seq_read,
+	.llseek         = seq_lseek,
+	.release        = single_release,
+};
+
 static int default_affinity_show(struct seq_file *m, void *v)
 {
 	seq_cpumask(m, irq_default_affinity);
@@ -288,6 +318,11 @@ void register_irq_proc(unsigned int irq, struct irq_desc *desc)
 	/* create /proc/irq/<irq>/affinity_hint */
 	proc_create_data("affinity_hint", 0400, desc->dir,
 			 &irq_affinity_hint_proc_fops, (void *)(long)irq);
+#ifdef CONFIG_AFFINITY_UPDATE
+	/* Create /proc/irq/<irq>/affinity_alg */
+	proc_create_data("affinity_alg", 0400, desc->dir,
+			&irq_affinity_alg_proc_fops, (void *)(long)irq);
+#endif
 
 	proc_create_data("node", 0444, desc->dir,
 			 &irq_node_proc_fops, (void *)(long)irq);
-- 
1.7.4.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 2/3] net: Add net device irq siloing feature
  2011-04-15 20:17 net: Automatic IRQ siloing for network devices Neil Horman
  2011-04-15 20:17 ` [PATCH 1/3] irq: Add registered affinity guidance infrastructure Neil Horman
@ 2011-04-15 20:17 ` Neil Horman
  2011-04-15 22:49   ` Ben Hutchings
  2011-04-15 20:17 ` [PATCH 3/3] net: Adding siloing irqs to cxgb4 driver Neil Horman
  2011-04-15 22:54 ` net: Automatic IRQ siloing for network devices Ben Hutchings
  3 siblings, 1 reply; 20+ messages in thread
From: Neil Horman @ 2011-04-15 20:17 UTC (permalink / raw)
  To: netdev
  Cc: davem, Neil Horman, Dimitris Michailidis, Thomas Gleixner,
	David Howells, Eric Dumazet, Tom Herbert

Using the irq affinity infrastrucuture, we can now allow net devices to call
request_irq using a new wrapper function (request_net_irq), which will attach a
common affinty_update handler to each requested irq.  This affinty update
mechanism correlates each tracked irq to the flow(s) that said irq processes
most frequently.  The highest traffic flow is noted, marked and exported to user
space via the affinity_hint proc file for each irq. In this way, utilities like
irqbalance are able to determine  which cpu is recieving the most data from each
rx queue on a given NIC, and set irq affinity accordingly.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>

CC: Dimitris Michailidis <dm@chelsio.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: David Howells <dhowells@redhat.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Tom Herbert <therbert@google.com>
---
 include/linux/netdevice.h  |   18 +++++++
 kernel/irq/proc.c          |    2 +-
 net/Kconfig                |   12 +++++
 net/core/dev.c             |  107 ++++++++++++++++++++++++++++++++++++++++++++
 net/core/sysctl_net_core.c |    9 ++++
 5 files changed, 147 insertions(+), 1 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5eeb2cd..ba6191f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -609,6 +609,9 @@ struct rps_map {
 struct rps_dev_flow {
 	u16 cpu;
 	u16 filter;
+#ifdef CONFIG_RFS_SILOING
+	u32 weight;
+#endif
 	unsigned int last_qtail;
 };
 #define RPS_NO_FILTER 0xffff
@@ -1631,6 +1634,21 @@ static inline void unregister_netdevice(struct net_device *dev)
 	unregister_netdevice_queue(dev, NULL);
 }
 
+#ifdef CONFIG_RFS_SILOING
+extern int netdev_rxq_silo_init(int irq, struct affin_data *afd, void *priv);
+extern int sysctl_irq_siloing_period;
+
+static inline int __must_check
+request_net_irq(unsigned int irq, irq_handler_t handler, unsigned long flags,
+		const char *name, void *dev, struct net_device *ndev, int rxq)
+{
+	return request_affinity_irq(irq, handler, NULL, flags, name, dev,
+				    netdev_rxq_silo_init, &ndev->_rx[rxq]);
+}
+#else
+#define request_net_irq(i, h, f, n, d, nd, r) request_irq(i, h, NULL, f, n, d)
+#endif
+
 extern int 		netdev_refcnt_read(const struct net_device *dev);
 extern void		free_netdev(struct net_device *dev);
 extern void		synchronize_net(void);
diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c
index 8fecb05..d5a7e4d 100644
--- a/kernel/irq/proc.c
+++ b/kernel/irq/proc.c
@@ -65,7 +65,7 @@ static int irq_affinity_alg_proc_show(struct seq_file *m, void *v)
 #ifdef CONFIG_AFFINITY_UPDATE
 	struct irq_desc *desc = irq_to_desc((long)m->private);
 
-	if (desc->af_data->affinity_alg)
+	if (desc->af_data && desc->af_data->affinity_alg)
 		alg = desc->af_data->affinity_alg;
 #endif
 	seq_printf(m, "%s\n", alg);
diff --git a/net/Kconfig b/net/Kconfig
index 79cabf1..d6ef6f5 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -232,6 +232,18 @@ config XPS
 	depends on SMP && SYSFS && USE_GENERIC_SMP_HELPERS
 	default y
 
+config RFS_SILOING
+	boolean
+	depends on RFS_ACCEL && AFFINITY_UPDATE
+	default y
+	---help---
+	 This feature allows appropriately enabled network drivers to
+	 export affinity_hint data to user space based on the RFS flow hash
+	 table for the rx queue associated with a given interrupt.  This allows
+	 userspace to optimize irq affinity such that a given rx queue has its
+	 interrupt serviced on the same cpu/l2 cache/numa node running the process
+	 that consumes most of its data.
+
 menu "Network testing"
 
 config NET_PKTGEN
diff --git a/net/core/dev.c b/net/core/dev.c
index 0b88eba..4d86137 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -173,6 +173,9 @@
 #define PTYPE_HASH_SIZE	(16)
 #define PTYPE_HASH_MASK	(PTYPE_HASH_SIZE - 1)
 
+#ifdef CONFIG_RFS_SILOING
+int sysctl_irq_siloing_period;
+#endif
 static DEFINE_SPINLOCK(ptype_lock);
 static struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
 static struct list_head ptype_all __read_mostly;	/* Taps */
@@ -2640,6 +2643,9 @@ set_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 		rflow->filter = rc;
 		if (old_rflow->filter == rflow->filter)
 			old_rflow->filter = RPS_NO_FILTER;
+#ifdef CONFIG_RFS_SILOING
+		old_rflow->weight = rflow->weight = 0;
+#endif
 	out:
 #endif
 		rflow->last_qtail =
@@ -2723,6 +2729,10 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 		      rflow->last_qtail)) >= 0))
 			rflow = set_rps_cpu(dev, skb, rflow, next_cpu);
 
+#ifdef CONFIG_RFS_SILOING
+		rflow->weight += skb->len;
+#endif
+
 		if (tcpu != RPS_NO_CPU && cpu_online(tcpu)) {
 			*rflowp = rflow;
 			cpu = tcpu;
@@ -6224,6 +6234,103 @@ static struct hlist_head *netdev_create_hash(void)
 	return hash;
 }
 
+#ifdef CONFIG_RFS_SILOING
+struct netdev_rxq_affin_data {
+	struct netdev_rx_queue *q;
+	unsigned long last_update;
+	cpumask_var_t affinity_mask;
+};
+
+static void netdev_rxq_silo_affin_update(int irq, struct affin_data *afd)
+{
+	struct netdev_rxq_affin_data *afdp = afd->priv;
+	struct netdev_rx_queue *q = afdp->q;
+	struct rps_dev_flow_table *flow_table;
+	int i;
+	u16 tcpu;
+	u32 mw;
+	unsigned long next_update;
+
+	mw = tcpu = 0;
+
+	next_update = afdp->last_update + (sysctl_irq_siloing_period * HZ);
+
+	if (time_after(next_update, jiffies))
+		return;
+
+	afdp->last_update = jiffies;
+
+	irq_set_affinity_hint(irq, NULL);
+	cpumask_clear(afdp->affinity_mask);
+	rcu_read_lock();
+	flow_table = rcu_dereference(q->rps_flow_table);
+
+	if (!flow_table)
+		goto out;
+
+	for (i = 0; (i & flow_table->mask) == i; i++) {
+		if (mw < flow_table->flows[i].weight) {
+			tcpu = ACCESS_ONCE(flow_table->flows[i].cpu);
+			if (tcpu == RPS_NO_CPU)
+				continue;
+			mw = flow_table->flows[i].weight;
+		}
+	}
+
+
+	if (mw) {
+		cpumask_set_cpu(tcpu, afdp->affinity_mask);
+		irq_set_affinity_hint(irq, afdp->affinity_mask);
+	}
+out:
+	rcu_read_unlock();
+	return;
+}
+
+static void netdev_rxq_silo_cleanup(int irq, struct affin_data *afd)
+{
+	struct netdev_rxq_affin_data *afdp = afd->priv;
+
+	free_cpumask_var(afdp->affinity_mask);
+	kfree(afdp);
+	afd->priv = NULL;
+}
+
+/**
+ *	netdev_rxq_silo_init - setup an irq to be siloed
+ *
+ *	initalizes the irq data required to allow the networking
+ *	subsystem to determine which cpu is best suited to
+ *      service the passed in irq, and then export that data
+ *	via the affinity_hint proc interface
+ */
+int netdev_rxq_silo_init(int irq, struct affin_data *afd, void *priv)
+{
+	struct netdev_rxq_affin_data *afdp;
+
+	afd->priv = afdp = kzalloc(sizeof(struct netdev_rxq_affin_data),
+				   GFP_KERNEL);
+	if (!afdp)
+		return -ENOMEM;
+
+	if (!alloc_cpumask_var(&afdp->affinity_mask, GFP_KERNEL)) {
+		kfree(afdp);
+		return -ENOMEM;
+	}
+
+	cpumask_clear(afdp->affinity_mask);
+
+	afdp->q = priv;
+	afdp->last_update = jiffies;
+	afd->affin_update = netdev_rxq_silo_affin_update;
+	afd->affin_cleanup = netdev_rxq_silo_cleanup;
+	afd->affinity_alg = "net:rfs max weight";
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(netdev_rxq_silo_init);
+#endif
+
 /* Initialize per network namespace state */
 static int __net_init netdev_init(struct net *net)
 {
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 385b609..b5c733e 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -158,6 +158,15 @@ static struct ctl_table net_core_table[] = {
 		.proc_handler	= rps_sock_flow_sysctl
 	},
 #endif
+#ifdef CONFIG_RFS_SILOING
+	{
+		.procname	= "irq_siloing_period",
+		.data		= &sysctl_irq_siloing_period,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
+#endif
 #endif /* CONFIG_NET */
 	{
 		.procname	= "netdev_budget",
-- 
1.7.4.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 3/3] net: Adding siloing irqs to cxgb4 driver
  2011-04-15 20:17 net: Automatic IRQ siloing for network devices Neil Horman
  2011-04-15 20:17 ` [PATCH 1/3] irq: Add registered affinity guidance infrastructure Neil Horman
  2011-04-15 20:17 ` [PATCH 2/3] net: Add net device irq siloing feature Neil Horman
@ 2011-04-15 20:17 ` Neil Horman
  2011-04-15 22:54 ` net: Automatic IRQ siloing for network devices Ben Hutchings
  3 siblings, 0 replies; 20+ messages in thread
From: Neil Horman @ 2011-04-15 20:17 UTC (permalink / raw)
  To: netdev
  Cc: davem, Neil Horman, Dimitris Michailidis, Thomas Gleixner,
	David Howells, Eric Dumazet, Tom Herbert

cxgb4 hardware has been tested here and shows correct functionality with
affinity hinting infrastructure

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>

CC: Dimitris Michailidis <dm@chelsio.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: David Howells <dhowells@redhat.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Tom Herbert <therbert@google.com>
---
 drivers/net/cxgb4/cxgb4_main.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/cxgb4/cxgb4_main.c b/drivers/net/cxgb4/cxgb4_main.c
index 5352c8a..11aeef6 100644
--- a/drivers/net/cxgb4/cxgb4_main.c
+++ b/drivers/net/cxgb4/cxgb4_main.c
@@ -562,9 +562,11 @@ static int request_msix_queue_irqs(struct adapter *adap)
 		return err;
 
 	for_each_ethrxq(s, ethqidx) {
-		err = request_irq(adap->msix_info[msi].vec, t4_sge_intr_msix, 0,
+		err = request_net_irq(adap->msix_info[msi].vec, t4_sge_intr_msix, 0,
 				  adap->msix_info[msi].desc,
-				  &s->ethrxq[ethqidx].rspq);
+				  &s->ethrxq[ethqidx].rspq,
+				  adap->port[ethqidx/MAX_NPORTS],
+				  ethqidx % MAX_NPORTS);
 		if (err)
 			goto unwind;
 		msi++;
-- 
1.7.4.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] net: Add net device irq siloing feature
  2011-04-15 20:17 ` [PATCH 2/3] net: Add net device irq siloing feature Neil Horman
@ 2011-04-15 22:49   ` Ben Hutchings
  2011-04-16  1:49     ` Neil Horman
  0 siblings, 1 reply; 20+ messages in thread
From: Ben Hutchings @ 2011-04-15 22:49 UTC (permalink / raw)
  To: Neil Horman
  Cc: netdev, davem, Dimitris Michailidis, Thomas Gleixner,
	David Howells, Eric Dumazet, Tom Herbert

On Fri, 2011-04-15 at 16:17 -0400, Neil Horman wrote:
> Using the irq affinity infrastrucuture, we can now allow net devices to call
> request_irq using a new wrapper function (request_net_irq), which will attach a
> common affinty_update handler to each requested irq.  This affinty update
> mechanism correlates each tracked irq to the flow(s) that said irq processes
> most frequently.  The highest traffic flow is noted, marked and exported to user
> space via the affinity_hint proc file for each irq. In this way, utilities like
> irqbalance are able to determine  which cpu is recieving the most data from each
> rx queue on a given NIC, and set irq affinity accordingly.
[...]

Is irqbalance expected to poll the affinity hints?  How often?

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: net: Automatic IRQ siloing for network devices
  2011-04-15 20:17 net: Automatic IRQ siloing for network devices Neil Horman
                   ` (2 preceding siblings ...)
  2011-04-15 20:17 ` [PATCH 3/3] net: Adding siloing irqs to cxgb4 driver Neil Horman
@ 2011-04-15 22:54 ` Ben Hutchings
  2011-04-16  0:50   ` Ben Hutchings
  2011-04-16  1:59   ` Neil Horman
  3 siblings, 2 replies; 20+ messages in thread
From: Ben Hutchings @ 2011-04-15 22:54 UTC (permalink / raw)
  To: Neil Horman; +Cc: netdev, davem

On Fri, 2011-04-15 at 16:17 -0400, Neil Horman wrote:
> Automatic IRQ siloing for network devices
> 
> At last years netconf:
> http://vger.kernel.org/netconf2010.html
> 
> Tom Herbert gave a talk in which he outlined some of the things we can do to
> improve scalability and througput in our network stack
> 
> One of the big items on the slides was the notion of siloing irqs, which is the
> practice of setting irq affinity to a cpu or cpu set that was 'close' to the
> process that would be consuming data.  The idea was to ensure that a hard irq
> for a nic (and its subsequent softirq) would execute on the same cpu as the
> process consuming the data, increasing cache hit rates and speeding up overall
> throughput.
> 
> I had taken an idea away from that talk, and have finally gotten around to
> implementing it.  One of the problems with the above approach is that its all
> quite manual.  I.e. to properly enact this siloiong, you have to do a few things
> by hand:
> 
> 1) decide which process is the heaviest user of a given rx queue 
> 2) restrict the cpus which that task will run on
> 3) identify the irq which the rx queue in (1) maps to
> 4) manually set the affinity for the irq in (3) to cpus which match the cpus in
> (2)
[...]

This presumably works well with small numbers of flows and/or large
numbers of queues.  You could scale it up somewhat by manipulating the
device's flow hash indirection table, but that usually only has 128
entries.  (Changing the indirection table is currently quite expensive,
though that could be changed.)

I see RFS and accelerated RFS as the only reasonable way to scale to
large numbers of flows.  And as part of accelerated RFS, I already did
the work for mapping CPUs to IRQs (note, not the other way round).  If
IRQ affinity keeps changing then it will significantly undermine the
usefulness of hardware flow steering.

Now I'm not saying that your approach is useless.  There is more
hardware out there with flow hashing than with flow steering, and there
are presumably many systems with small numbers of active flows.  But I
think we need to avoid having two features that conflict and a
requirement for administrators to make a careful selection between them.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/3] irq: Add registered affinity guidance infrastructure
  2011-04-15 20:17 ` [PATCH 1/3] irq: Add registered affinity guidance infrastructure Neil Horman
@ 2011-04-16  0:22   ` Thomas Gleixner
  2011-04-16  2:11     ` Neil Horman
  0 siblings, 1 reply; 20+ messages in thread
From: Thomas Gleixner @ 2011-04-16  0:22 UTC (permalink / raw)
  To: Neil Horman
  Cc: netdev, davem, nhorman, Dimitris Michailidis, David Howells,
	Eric Dumazet, Tom Herbert

On Fri, 15 Apr 2011, Neil Horman wrote:

> From: nhorman <nhorman@devel2.think-freely.org>
> 
> This patch adds the needed data to the irq_desc struct, as well as the needed
> API calls to allow the requester of an irq to register a handler function to
> determine the affinity_hint of that irq when queried from user space.

This changelog simply sucks. It does not explain the rationale for
this churn at all.

Which problem is it solving?
Why are the current interfaces not sufficient?
....

> +#ifdef CONFIG_AFFINITY_UPDATE
> +extern int setup_affinity_data(int irq, irq_affinity_init_t, void *);

yuck, irq_affinity_init_t ???  

> +#ifdef CONFIG_AFFINITY_UPDATE
> +static inline int __must_check
> +request_affinity_irq(unsigned int irq, irq_handler_t handler,
> +		     irq_handler_t thread_fn,
> +		     unsigned long flags, const char *name, void *dev,
> +		     irq_affinity_init_t af_init, void *af_priv)

So next time we make a wrapper around request_affinity_irq() which
takes another 3 arguments?

> +{
> +	int rc;
> +
> +	rc = request_threaded_irq(irq, handler, thread_fn, flags, name, dev);
> +	if (rc)
> +		goto out;

Brilliant use case for a goto. _NOT_

> +	if (af_init)
> +		rc = setup_affinity_data(irq, af_init, af_priv);
> +	if (rc)
> +		free_irq(irq, dev);
> +
> +out:
> +	return rc;
> +}
> +#else
> +#define request_affinity_irq(irq, hnd, tfn, flg, nm, dev, init, priv) \
> +	request_threaded_irq(irq, hnd, NULL, flg, nm, dev)

Oh nice. tfn becomes magically NULL if that magic CONFIG switch is not
set.

  
>  struct irq_desc;
>  struct irq_data;
> +struct affin_data {

Gah. Do you think that I went to major pain to consolidate the irq
namespace just to accecpt another random one?

Also that's completely undocumented. Hint: 

# grep -C1 "/\*\*" $this_file

> +	void *priv;
> +	char *affinity_alg;

const perhaps ?

> +	void (*affin_update)(int irq, struct affin_data *ad);
> +	void (*affin_cleanup)(int irq, struct affin_data *ad);
> +};
> +
> +typedef int (*irq_affinity_init_t)(int, struct affin_data*, void *);

Whee. Why do you want a typedef for that ?

> --- a/kernel/irq/Kconfig
> +++ b/kernel/irq/Kconfig
> @@ -51,6 +51,17 @@ config IRQ_PREFLOW_FASTEOI
>  config IRQ_FORCED_THREADING
>         bool
>  
> +config AFFINITY_UPDATE
> +	bool "Support irq affinity direction"
> +	depends on GENERIC_HARDIRQS

Right. We need a dependency for somthing which is inside of a guarded
section which selects GENERIC_HARDIRQS.

> +	---help---
> +
> +	Affinity updating adds the ability for requestors of irqs to
> +	register affinity update methods against the irq in question
> +	in so doing the requestor can be informed every time user space
> +	queries an irq for its optimal affinity, giving the requstor the
> +	chance to tell user space where the irq can be optimally handled

-ENOPARSE. I still do not understand what you are trying to solve.

> @@ -64,6 +75,5 @@ config SPARSE_IRQ
>  	    out the interrupt descriptors in a more NUMA-friendly way. )
>  
>  	  If you don't know what to do here, say N.
> -

Unrelated

>  endmenu
>  endif
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index acd599a..257ea4d 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -1159,6 +1159,17 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
>  
>  	unregister_handler_proc(irq, action);
>  
> +#ifdef CONFIG_AFFINITY_UPDATE
> +	/*
> +	 * Have to do this after we unregister proc accessors
> +	 */
> +	if (desc->af_data) {
> +		if (desc->af_data->affin_cleanup)
> +			desc->af_data->affin_cleanup(irq, desc->af_data);
> +		kfree(desc->af_data);
> +		desc->af_data = NULL;
> +	}
> +#endif

Grr. Aside of the fact, that I think that whole thing is silly and
overengineered, please move this out of line and keep your fricking
#ifdef mess out of the main code.

>  	/* Make sure it's not being used on another CPU: */
>  	synchronize_irq(irq);
>  
> @@ -1345,6 +1356,34 @@ int request_threaded_irq(unsigned int irq, irq_handler_t handler,
>  }
>  EXPORT_SYMBOL(request_threaded_irq);
>  
> +#ifdef CONFIG_AFFINITY_UPDATE
> +int setup_affinity_data(int irq, irq_affinity_init_t af_init, void *af_priv)

That interface is completely wrong for various reasons:

1) Namespace violation: irq_....

2) This want's to be separated into a allocation and a setter function

> +{
> +	struct affin_data *data;
> +	struct irq_desc *desc;
> +	int rc;
> +
> +	desc = irq_to_desc(irq);
> +	if (!desc)
> +		return -ENOENT;
> +
> +	data = kzalloc(sizeof(struct affin_data), GFP_KERNEL);
> +	if (!data)
> +		return -ENOMEM;
> +
> +	rc = af_init(irq, data, af_priv);
> +	if (rc) {
> +		kfree(data);
> +		return rc;
> +	}
> +
> +	desc->af_data = data;

Right, we do this unlocked of course.

> +	return 0;
> +}
> +EXPORT_SYMBOL(setup_affinity_data);

No. That want's to be EXPORT_SYMBOL_GPL if at all.

> --- a/kernel/irq/proc.c
> +++ b/kernel/irq/proc.c
> @@ -42,6 +42,11 @@ static int irq_affinity_hint_proc_show(struct seq_file *m, void *v)
>  	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
>  		return -ENOMEM;
>  
> +#ifdef CONFIG_AFFINITY_UPDATE
> +	if (desc->af_data && desc->af_data->affin_update)
> +		desc->af_data->affin_update((long)m->private, desc->af_data);
> +#endif
> +

Yikes. How the hell is this related to the changelog and to the scope
of this function? 

This function shows the hint we agreed on and nothing else. We do not
call magic crap via proc.

Locking is not your favourite topic, right ?

>  	raw_spin_lock_irqsave(&desc->lock, flags);
>  	if (desc->affinity_hint)
>  		cpumask_copy(mask, desc->affinity_hint);
> @@ -54,6 +59,19 @@ static int irq_affinity_hint_proc_show(struct seq_file *m, void *v)
>  	return 0;
>  }
>  
> +static int irq_affinity_alg_proc_show(struct seq_file *m, void *v)
> +{
> +	char *alg = "none";
> +#ifdef CONFIG_AFFINITY_UPDATE
> +	struct irq_desc *desc = irq_to_desc((long)m->private);
> +
> +	if (desc->af_data->affinity_alg)
> +		alg = desc->af_data->affinity_alg;
> +#endif

Nice, we add the policy concept to the kernel another time. No, we
don't want policies in the kernel except there is some reasonable
explanation.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: net: Automatic IRQ siloing for network devices
  2011-04-15 22:54 ` net: Automatic IRQ siloing for network devices Ben Hutchings
@ 2011-04-16  0:50   ` Ben Hutchings
  2011-04-16  1:59   ` Neil Horman
  1 sibling, 0 replies; 20+ messages in thread
From: Ben Hutchings @ 2011-04-16  0:50 UTC (permalink / raw)
  To: Neil Horman; +Cc: netdev, davem

On Fri, 2011-04-15 at 23:54 +0100, Ben Hutchings wrote:
> On Fri, 2011-04-15 at 16:17 -0400, Neil Horman wrote:
> > Automatic IRQ siloing for network devices
> > 
> > At last years netconf:
> > http://vger.kernel.org/netconf2010.html
> > 
> > Tom Herbert gave a talk in which he outlined some of the things we can do to
> > improve scalability and througput in our network stack
> > 
> > One of the big items on the slides was the notion of siloing irqs, which is the
> > practice of setting irq affinity to a cpu or cpu set that was 'close' to the
> > process that would be consuming data.  The idea was to ensure that a hard irq
> > for a nic (and its subsequent softirq) would execute on the same cpu as the
> > process consuming the data, increasing cache hit rates and speeding up overall
> > throughput.
> > 
> > I had taken an idea away from that talk, and have finally gotten around to
> > implementing it.  One of the problems with the above approach is that its all
> > quite manual.  I.e. to properly enact this siloiong, you have to do a few things
> > by hand:
> > 
> > 1) decide which process is the heaviest user of a given rx queue 
> > 2) restrict the cpus which that task will run on
> > 3) identify the irq which the rx queue in (1) maps to
> > 4) manually set the affinity for the irq in (3) to cpus which match the cpus in
> > (2)
> [...]
> 
> This presumably works well with small numbers of flows and/or large
> numbers of queues.  You could scale it up somewhat by manipulating the
> device's flow hash indirection table, but that usually only has 128
> entries.  (Changing the indirection table is currently quite expensive,
> though that could be changed.)
[...]

Actually, I reckon you could do a more or less generic implementation of
accelerated RFS on top of a flow hash indirection table.  It would
require the drivers to provide a new function to update single table
entries, and some way to switch between automatic configuration by RFS
and manual configuration with ethtool.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] net: Add net device irq siloing feature
  2011-04-15 22:49   ` Ben Hutchings
@ 2011-04-16  1:49     ` Neil Horman
  2011-04-16  4:52       ` Stephen Hemminger
  0 siblings, 1 reply; 20+ messages in thread
From: Neil Horman @ 2011-04-16  1:49 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: netdev, davem, Dimitris Michailidis, Thomas Gleixner,
	David Howells, Eric Dumazet, Tom Herbert

On Fri, Apr 15, 2011 at 11:49:03PM +0100, Ben Hutchings wrote:
> On Fri, 2011-04-15 at 16:17 -0400, Neil Horman wrote:
> > Using the irq affinity infrastrucuture, we can now allow net devices to call
> > request_irq using a new wrapper function (request_net_irq), which will attach a
> > common affinty_update handler to each requested irq.  This affinty update
> > mechanism correlates each tracked irq to the flow(s) that said irq processes
> > most frequently.  The highest traffic flow is noted, marked and exported to user
> > space via the affinity_hint proc file for each irq. In this way, utilities like
> > irqbalance are able to determine  which cpu is recieving the most data from each
> > rx queue on a given NIC, and set irq affinity accordingly.
> [...]
> 
> Is irqbalance expected to poll the affinity hints?  How often?
> 
Yes, its done just that for quite some time.  Intel added that ability at the
same time they added the affinity_hint proc file.  Irqbalance polls the
affinity_hint file at the same time it rebalances all irqs (every 10 seconds).
If the affinity_hint is non-zero, irqbalance just copies it to smp_affinity for
the same irq.  Up until now thats been just about dead code because only ixgbe
sets affinity_hint.  Thats why I added the affinity_alg file, so irqbalance
could do something more intellegent than just a blind copy.  With the patch that
I referenced I added code to irqbalance to allow it to preform different
balancing methods based on the output of affinity_alg.
Neil

> Ben.
> 
> -- 
> Ben Hutchings, Senior Software Engineer, Solarflare
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.
> 
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: net: Automatic IRQ siloing for network devices
  2011-04-15 22:54 ` net: Automatic IRQ siloing for network devices Ben Hutchings
  2011-04-16  0:50   ` Ben Hutchings
@ 2011-04-16  1:59   ` Neil Horman
  2011-04-16 16:17     ` Stephen Hemminger
  1 sibling, 1 reply; 20+ messages in thread
From: Neil Horman @ 2011-04-16  1:59 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: netdev, davem

On Fri, Apr 15, 2011 at 11:54:29PM +0100, Ben Hutchings wrote:
> On Fri, 2011-04-15 at 16:17 -0400, Neil Horman wrote:
> > Automatic IRQ siloing for network devices
> > 
> > At last years netconf:
> > http://vger.kernel.org/netconf2010.html
> > 
> > Tom Herbert gave a talk in which he outlined some of the things we can do to
> > improve scalability and througput in our network stack
> > 
> > One of the big items on the slides was the notion of siloing irqs, which is the
> > practice of setting irq affinity to a cpu or cpu set that was 'close' to the
> > process that would be consuming data.  The idea was to ensure that a hard irq
> > for a nic (and its subsequent softirq) would execute on the same cpu as the
> > process consuming the data, increasing cache hit rates and speeding up overall
> > throughput.
> > 
> > I had taken an idea away from that talk, and have finally gotten around to
> > implementing it.  One of the problems with the above approach is that its all
> > quite manual.  I.e. to properly enact this siloiong, you have to do a few things
> > by hand:
> > 
> > 1) decide which process is the heaviest user of a given rx queue 
> > 2) restrict the cpus which that task will run on
> > 3) identify the irq which the rx queue in (1) maps to
> > 4) manually set the affinity for the irq in (3) to cpus which match the cpus in
> > (2)
> [...]
> 
> This presumably works well with small numbers of flows and/or large
> numbers of queues.  You could scale it up somewhat by manipulating the
> device's flow hash indirection table, but that usually only has 128
> entries.  (Changing the indirection table is currently quite expensive,
> though that could be changed.)
> 
> I see RFS and accelerated RFS as the only reasonable way to scale to
> large numbers of flows.  And as part of accelerated RFS, I already did
> the work for mapping CPUs to IRQs (note, not the other way round).  If
> IRQ affinity keeps changing then it will significantly undermine the
> usefulness of hardware flow steering.
> 
> Now I'm not saying that your approach is useless.  There is more
> hardware out there with flow hashing than with flow steering, and there
> are presumably many systems with small numbers of active flows.  But I
> think we need to avoid having two features that conflict and a
> requirement for administrators to make a careful selection between them.
> 
> Ben.
> 
I hear what your saying and I agree, theres no point in having features work
against each other.  That said, I'm not sure I agree that these features have to
work against one another, nor does a sysadmin need to make a choice between the
two.  Note the third patch in this series.  Making this work requires that
network drivers wanting to participate in this affinity algorithm opt in by
using the request_net_irq macro to attach the interrupt to the rfs affinity code
that I added.  Theres no reason that a driver which supports hardware that still
uses flow steering can't opt out of this algorithm, and as a result irqbalance
will still treat those interrupts as it normally does.  And for those drivers
which do opt in, irqbalance can take care of affinity assignment, using the
provided hint.  No need for sysadmin intervention.

I'm sure there can be improvements made to this code, but I think theres less
conflict between the work you've done and this code than there appears to be at
first blush.

Best
Neil

> -- 
> Ben Hutchings, Senior Software Engineer, Solarflare
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.
> 
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/3] irq: Add registered affinity guidance infrastructure
  2011-04-16  0:22   ` Thomas Gleixner
@ 2011-04-16  2:11     ` Neil Horman
  0 siblings, 0 replies; 20+ messages in thread
From: Neil Horman @ 2011-04-16  2:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: netdev, davem, nhorman, Dimitris Michailidis, David Howells,
	Eric Dumazet, Tom Herbert

On Sat, Apr 16, 2011 at 02:22:58AM +0200, Thomas Gleixner wrote:
> On Fri, 15 Apr 2011, Neil Horman wrote:
> 
> > From: nhorman <nhorman@devel2.think-freely.org>
> > 
> > This patch adds the needed data to the irq_desc struct, as well as the needed
> > API calls to allow the requester of an irq to register a handler function to
> > determine the affinity_hint of that irq when queried from user space.
> 
> This changelog simply sucks. It does not explain the rationale for
> this churn at all.
> 
It seems pretty clear to me.  It allows a common function to update the value of
affinity_hint when its queried.

> Which problem is it solving?
> Why are the current interfaces not sufficient?
Did you read the initial post that I sent with it?  Apparently not.  Apologies,
its seems my git-send-email didn't cc everyone I expected it to:
http://marc.info/?l=linux-netdev&m=130291921026187&w=2

I'll skip the rest of your your email, and just try to turn some of your rant
into something more acceptible to you.
Neil

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] net: Add net device irq siloing feature
  2011-04-16  1:49     ` Neil Horman
@ 2011-04-16  4:52       ` Stephen Hemminger
  2011-04-16  6:21         ` Eric Dumazet
  0 siblings, 1 reply; 20+ messages in thread
From: Stephen Hemminger @ 2011-04-16  4:52 UTC (permalink / raw)
  To: Neil Horman
  Cc: netdev, davem, Dimitris Michailidis, Thomas Gleixner,
	David Howells, Eric Dumazet, Tom Herbert, Ben Hutchings


> On Fri, Apr 15, 2011 at 11:49:03PM +0100, Ben Hutchings wrote:
> > On Fri, 2011-04-15 at 16:17 -0400, Neil Horman wrote:
> > > Using the irq affinity infrastrucuture, we can now allow net
> > > devices to call
> > > request_irq using a new wrapper function (request_net_irq), which
> > > will attach a
> > > common affinty_update handler to each requested irq. This affinty
> > > update mechanism correlates each tracked irq to the flow(s) that
> > > said irq processes
> > > most frequently. The highest traffic flow is noted, marked and
> > > exported to user
> > > space via the affinity_hint proc file for each irq. In this way,
> > > utilities like
> > > irqbalance are able to determine which cpu is recieving the most
> > > data from each
> > > rx queue on a given NIC, and set irq affinity accordingly.
> > [...]
> >
> > Is irqbalance expected to poll the affinity hints? How often?
> >
> Yes, its done just that for quite some time. Intel added that ability
> at the
> same time they added the affinity_hint proc file. Irqbalance polls the
> affinity_hint file at the same time it rebalances all irqs (every 10
> seconds). If the affinity_hint is non-zero, irqbalance just copies it
> to smp_affinity for
> the same irq. Up until now thats been just about dead code because
> only ixgbe
> sets affinity_hint. Thats why I added the affinity_alg file, so
> irqbalance could do something more intellegent than just a blind copy.
> With the patch that
> I referenced I added code to irqbalance to allow it to preform
> different balancing methods based on the output of affinity_alg.
> Neil

I hate the way more and more interfaces are becoming device driver
specific. It makes it impossible to build sane management infrastructure
and causes lots of customer and service complaints.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] net: Add net device irq siloing feature
  2011-04-16  4:52       ` Stephen Hemminger
@ 2011-04-16  6:21         ` Eric Dumazet
  2011-04-16 11:55           ` Neil Horman
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2011-04-16  6:21 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Neil Horman, netdev, davem, Dimitris Michailidis,
	Thomas Gleixner, David Howells, Tom Herbert, Ben Hutchings

Le vendredi 15 avril 2011 à 21:52 -0700, Stephen Hemminger a écrit :
> > On Fri, Apr 15, 2011 at 11:49:03PM +0100, Ben Hutchings wrote:
> > > On Fri, 2011-04-15 at 16:17 -0400, Neil Horman wrote:
> > > > Using the irq affinity infrastrucuture, we can now allow net
> > > > devices to call
> > > > request_irq using a new wrapper function (request_net_irq), which
> > > > will attach a
> > > > common affinty_update handler to each requested irq. This affinty
> > > > update mechanism correlates each tracked irq to the flow(s) that
> > > > said irq processes
> > > > most frequently. The highest traffic flow is noted, marked and
> > > > exported to user
> > > > space via the affinity_hint proc file for each irq. In this way,
> > > > utilities like
> > > > irqbalance are able to determine which cpu is recieving the most
> > > > data from each
> > > > rx queue on a given NIC, and set irq affinity accordingly.
> > > [...]
> > >
> > > Is irqbalance expected to poll the affinity hints? How often?
> > >
> > Yes, its done just that for quite some time. Intel added that ability
> > at the
> > same time they added the affinity_hint proc file. Irqbalance polls the
> > affinity_hint file at the same time it rebalances all irqs (every 10
> > seconds). If the affinity_hint is non-zero, irqbalance just copies it
> > to smp_affinity for
> > the same irq. Up until now thats been just about dead code because
> > only ixgbe
> > sets affinity_hint. Thats why I added the affinity_alg file, so
> > irqbalance could do something more intellegent than just a blind copy.
> > With the patch that
> > I referenced I added code to irqbalance to allow it to preform
> > different balancing methods based on the output of affinity_alg.
> > Neil
> 
> I hate the way more and more interfaces are becoming device driver
> specific. It makes it impossible to build sane management infrastructure
> and causes lots of customer and service complaints.
> 

For me, the whole problem is the paradigm that we adapt IRQ to CPU were
applications _were_ running in last seconds, while process scheduler
might perform other choices, ie migrate task to cpu where IRQ was
happening (the cpu calling wakeups)

We can add logic to each layer, and yet not gain perfect behavior.

Some kind of cooperation is neeed.

Irqbalance for example is of no use in the case of a network flood
happening on your machine, because we enter NAPI mode for several
minutes on a single cpu. We'll need to add special logic in NAPI loop to
force an exit to reschedule an IRQ (so that another cpu can take it)




^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] net: Add net device irq siloing feature
  2011-04-16  6:21         ` Eric Dumazet
@ 2011-04-16 11:55           ` Neil Horman
  0 siblings, 0 replies; 20+ messages in thread
From: Neil Horman @ 2011-04-16 11:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Stephen Hemminger, netdev, davem, Dimitris Michailidis,
	Thomas Gleixner, David Howells, Tom Herbert, Ben Hutchings

On Sat, Apr 16, 2011 at 08:21:37AM +0200, Eric Dumazet wrote:
> Le vendredi 15 avril 2011 à 21:52 -0700, Stephen Hemminger a écrit :
> > > On Fri, Apr 15, 2011 at 11:49:03PM +0100, Ben Hutchings wrote:
> > > > On Fri, 2011-04-15 at 16:17 -0400, Neil Horman wrote:
> > > > > Using the irq affinity infrastrucuture, we can now allow net
> > > > > devices to call
> > > > > request_irq using a new wrapper function (request_net_irq), which
> > > > > will attach a
> > > > > common affinty_update handler to each requested irq. This affinty
> > > > > update mechanism correlates each tracked irq to the flow(s) that
> > > > > said irq processes
> > > > > most frequently. The highest traffic flow is noted, marked and
> > > > > exported to user
> > > > > space via the affinity_hint proc file for each irq. In this way,
> > > > > utilities like
> > > > > irqbalance are able to determine which cpu is recieving the most
> > > > > data from each
> > > > > rx queue on a given NIC, and set irq affinity accordingly.
> > > > [...]
> > > >
> > > > Is irqbalance expected to poll the affinity hints? How often?
> > > >
> > > Yes, its done just that for quite some time. Intel added that ability
> > > at the
> > > same time they added the affinity_hint proc file. Irqbalance polls the
> > > affinity_hint file at the same time it rebalances all irqs (every 10
> > > seconds). If the affinity_hint is non-zero, irqbalance just copies it
> > > to smp_affinity for
> > > the same irq. Up until now thats been just about dead code because
> > > only ixgbe
> > > sets affinity_hint. Thats why I added the affinity_alg file, so
> > > irqbalance could do something more intellegent than just a blind copy.
> > > With the patch that
> > > I referenced I added code to irqbalance to allow it to preform
> > > different balancing methods based on the output of affinity_alg.
> > > Neil
> > 
> > I hate the way more and more interfaces are becoming device driver
> > specific. It makes it impossible to build sane management infrastructure
> > and causes lots of customer and service complaints.
> > 
> 
> For me, the whole problem is the paradigm that we adapt IRQ to CPU were
> applications _were_ running in last seconds, while process scheduler
> might perform other choices, ie migrate task to cpu where IRQ was
> happening (the cpu calling wakeups)
> 
> We can add logic to each layer, and yet not gain perfect behavior.
> 
> Some kind of cooperation is neeed.
> 
> Irqbalance for example is of no use in the case of a network flood
> happening on your machine, because we enter NAPI mode for several
> minutes on a single cpu. We'll need to add special logic in NAPI loop to
> force an exit to reschedule an IRQ (so that another cpu can take it)
> 
Would you consider an approach whereby we, instead of updating irq affinity to
match the process that consumes data from a given irq, bias the scheduler such
that process which consume data from a given irq not be moved away from the same
core/l2 cache being fed by that flow?  Do you have a suggestion for how best to
communicate that to the scheduler?  It would seem that interrogating the RFS
table from the scheduler might not be well received.

Best
Neil

> 
> 
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: net: Automatic IRQ siloing for network devices
  2011-04-16  1:59   ` Neil Horman
@ 2011-04-16 16:17     ` Stephen Hemminger
  2011-04-17 17:20       ` Neil Horman
  0 siblings, 1 reply; 20+ messages in thread
From: Stephen Hemminger @ 2011-04-16 16:17 UTC (permalink / raw)
  To: Neil Horman; +Cc: Ben Hutchings, netdev, davem

On Fri, 15 Apr 2011 21:59:38 -0400
Neil Horman <nhorman@tuxdriver.com> wrote:

> On Fri, Apr 15, 2011 at 11:54:29PM +0100, Ben Hutchings wrote:
> > On Fri, 2011-04-15 at 16:17 -0400, Neil Horman wrote:
> > > Automatic IRQ siloing for network devices
> > > 
> > > At last years netconf:
> > > http://vger.kernel.org/netconf2010.html
> > > 
> > > Tom Herbert gave a talk in which he outlined some of the things we can do to
> > > improve scalability and througput in our network stack
> > > 
> > > One of the big items on the slides was the notion of siloing irqs, which is the
> > > practice of setting irq affinity to a cpu or cpu set that was 'close' to the
> > > process that would be consuming data.  The idea was to ensure that a hard irq
> > > for a nic (and its subsequent softirq) would execute on the same cpu as the
> > > process consuming the data, increasing cache hit rates and speeding up overall
> > > throughput.
> > > 
> > > I had taken an idea away from that talk, and have finally gotten around to
> > > implementing it.  One of the problems with the above approach is that its all
> > > quite manual.  I.e. to properly enact this siloiong, you have to do a few things
> > > by hand:
> > > 
> > > 1) decide which process is the heaviest user of a given rx queue 
> > > 2) restrict the cpus which that task will run on
> > > 3) identify the irq which the rx queue in (1) maps to
> > > 4) manually set the affinity for the irq in (3) to cpus which match the cpus in
> > > (2)
> > [...]
> > 
> > This presumably works well with small numbers of flows and/or large
> > numbers of queues.  You could scale it up somewhat by manipulating the
> > device's flow hash indirection table, but that usually only has 128
> > entries.  (Changing the indirection table is currently quite expensive,
> > though that could be changed.)
> > 
> > I see RFS and accelerated RFS as the only reasonable way to scale to
> > large numbers of flows.  And as part of accelerated RFS, I already did
> > the work for mapping CPUs to IRQs (note, not the other way round).  If
> > IRQ affinity keeps changing then it will significantly undermine the
> > usefulness of hardware flow steering.
> > 
> > Now I'm not saying that your approach is useless.  There is more
> > hardware out there with flow hashing than with flow steering, and there
> > are presumably many systems with small numbers of active flows.  But I
> > think we need to avoid having two features that conflict and a
> > requirement for administrators to make a careful selection between them.
> > 
> > Ben.
> > 
> I hear what your saying and I agree, theres no point in having features work
> against each other.  That said, I'm not sure I agree that these features have to
> work against one another, nor does a sysadmin need to make a choice between the
> two.  Note the third patch in this series.  Making this work requires that
> network drivers wanting to participate in this affinity algorithm opt in by
> using the request_net_irq macro to attach the interrupt to the rfs affinity code
> that I added.  Theres no reason that a driver which supports hardware that still
> uses flow steering can't opt out of this algorithm, and as a result irqbalance
> will still treat those interrupts as it normally does.  And for those drivers
> which do opt in, irqbalance can take care of affinity assignment, using the
> provided hint.  No need for sysadmin intervention.
> 
> I'm sure there can be improvements made to this code, but I think theres less
> conflict between the work you've done and this code than there appears to be at
> first blush.
> 

My gut feeling is that:
  * kernel should default to a simple static sane irq policy without user
    space.  This is especially true for multi-queue devices where the default
    puts all IRQ's on one cpu.

  * irqbalance should do a one-shot rearrangement at boot up. It should rearrange
    when new IRQ's are requested. The kernel should have capablity to notify
    userspace (uevent?) when IRQ's are added or removed.

  * Let scheduler make decisions about migrating processes (rather than let irqbalance
    migrate IRQ's).

  * irqbalance should not do the hacks it does to try and guess at network traffic.


-- 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: net: Automatic IRQ siloing for network devices
  2011-04-16 16:17     ` Stephen Hemminger
@ 2011-04-17 17:20       ` Neil Horman
  2011-04-17 18:38         ` Ben Hutchings
  0 siblings, 1 reply; 20+ messages in thread
From: Neil Horman @ 2011-04-17 17:20 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Ben Hutchings, netdev, davem

On Sat, Apr 16, 2011 at 09:17:04AM -0700, Stephen Hemminger wrote:
> On Fri, 15 Apr 2011 21:59:38 -0400
> Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > On Fri, Apr 15, 2011 at 11:54:29PM +0100, Ben Hutchings wrote:
> > > On Fri, 2011-04-15 at 16:17 -0400, Neil Horman wrote:
> > > > Automatic IRQ siloing for network devices
> > > > 
> > > > At last years netconf:
> > > > http://vger.kernel.org/netconf2010.html
> > > > 
> > > > Tom Herbert gave a talk in which he outlined some of the things we can do to
> > > > improve scalability and througput in our network stack
> > > > 
> > > > One of the big items on the slides was the notion of siloing irqs, which is the
> > > > practice of setting irq affinity to a cpu or cpu set that was 'close' to the
> > > > process that would be consuming data.  The idea was to ensure that a hard irq
> > > > for a nic (and its subsequent softirq) would execute on the same cpu as the
> > > > process consuming the data, increasing cache hit rates and speeding up overall
> > > > throughput.
> > > > 
> > > > I had taken an idea away from that talk, and have finally gotten around to
> > > > implementing it.  One of the problems with the above approach is that its all
> > > > quite manual.  I.e. to properly enact this siloiong, you have to do a few things
> > > > by hand:
> > > > 
> > > > 1) decide which process is the heaviest user of a given rx queue 
> > > > 2) restrict the cpus which that task will run on
> > > > 3) identify the irq which the rx queue in (1) maps to
> > > > 4) manually set the affinity for the irq in (3) to cpus which match the cpus in
> > > > (2)
> > > [...]
> > > 
> > > This presumably works well with small numbers of flows and/or large
> > > numbers of queues.  You could scale it up somewhat by manipulating the
> > > device's flow hash indirection table, but that usually only has 128
> > > entries.  (Changing the indirection table is currently quite expensive,
> > > though that could be changed.)
> > > 
> > > I see RFS and accelerated RFS as the only reasonable way to scale to
> > > large numbers of flows.  And as part of accelerated RFS, I already did
> > > the work for mapping CPUs to IRQs (note, not the other way round).  If
> > > IRQ affinity keeps changing then it will significantly undermine the
> > > usefulness of hardware flow steering.
> > > 
> > > Now I'm not saying that your approach is useless.  There is more
> > > hardware out there with flow hashing than with flow steering, and there
> > > are presumably many systems with small numbers of active flows.  But I
> > > think we need to avoid having two features that conflict and a
> > > requirement for administrators to make a careful selection between them.
> > > 
> > > Ben.
> > > 
> > I hear what your saying and I agree, theres no point in having features work
> > against each other.  That said, I'm not sure I agree that these features have to
> > work against one another, nor does a sysadmin need to make a choice between the
> > two.  Note the third patch in this series.  Making this work requires that
> > network drivers wanting to participate in this affinity algorithm opt in by
> > using the request_net_irq macro to attach the interrupt to the rfs affinity code
> > that I added.  Theres no reason that a driver which supports hardware that still
> > uses flow steering can't opt out of this algorithm, and as a result irqbalance
> > will still treat those interrupts as it normally does.  And for those drivers
> > which do opt in, irqbalance can take care of affinity assignment, using the
> > provided hint.  No need for sysadmin intervention.
> > 
> > I'm sure there can be improvements made to this code, but I think theres less
> > conflict between the work you've done and this code than there appears to be at
> > first blush.
> > 
> 
> My gut feeling is that:
>   * kernel should default to a simple static sane irq policy without user
>     space.  This is especially true for multi-queue devices where the default
>     puts all IRQ's on one cpu.
> 
Thats not how it currently works, AFAICS.  The default kernel policy is
currently that cpu affinity for any newly requested irq is all cpus.  Any
restriction beyond that is the purview and doing of userspace (irqbalance or
manual affinity setting).

>   * irqbalance should do a one-shot rearrangement at boot up. It should rearrange
>     when new IRQ's are requested. The kernel should have capablity to notify
>     userspace (uevent?) when IRQ's are added or removed.
> 
I can see that, and it would be easy to implement.  That said, I'm not sure what
criteria should be used when doing said re-arrangement.  Currently irqbalance
uses interrupt counts and names to determine how interrupts should be placed.
That is of course a hack, but its done because its currently the best
information available to user space.  Thats what this patch series was hoping to
address.  By exporting RFS flow data we give the opportunity to irqbalance to do
some modicum of better irq placement.

>   * Let scheduler make decisions about migrating processes (rather than let irqbalance
>     migrate IRQ's).
> 
I can certainly get behind this idea, I've been having trouble however comming
up with a good algorithm that lets the scheduler make a rational decision about
which cpu to run a process on.  I.e. how to do you weigh moving a process to a
cpu thats more local to the rx queue its reciving data on against the fact that
its also sharing a memory segment with another process on its current cpu.  I'd
like to be able to normalize these comparisons, but I'm not at all sure (yet)
how to do so.

>   * irqbalance should not do the hacks it does to try and guess at network traffic.
> 
Well, I can certainly agree with that, but I'm not sure what that looks like.

I could envision something like:

1) Use irqbalance to do a one time placement of interrupts, keeping a simple
(possibly sub-optimal) policy, perhaps something like new irqs get assigned to
the least loaded cpu within the numa node of the device the irq is originating
from.

2) Add a udev event on the addition of new interrupts, to rerun irqbalance

3) Add some exported information to identify processes that are high users of
network traffic, and correlate that usage to a rxq/irq that produces that
information (possibly some per-task proc file)

4) Create/expand an additional user space daemon to monitor the highest users of
network traffic on various rxq/irqs (as identified in (3)) and restrict those
processes execution to those cpus which are on the same L2 cache as the irq
itself.  The cpuset cgroup could be usefull in doing this perhaps.

Actually, as I read back to myself, that acutally sounds kind of good to me.  It
keeps all the policy for this in user space, and minimizes what we have to add
to the kernel to make it happen (some process information in /proc and another
udev event).  I'd like to get some feedback before I start implementing this,
but I think this could be done.  What do you think?

Thanks & Regards
Neil


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: net: Automatic IRQ siloing for network devices
  2011-04-17 17:20       ` Neil Horman
@ 2011-04-17 18:38         ` Ben Hutchings
  2011-04-18  1:08           ` Neil Horman
  0 siblings, 1 reply; 20+ messages in thread
From: Ben Hutchings @ 2011-04-17 18:38 UTC (permalink / raw)
  To: Neil Horman; +Cc: Stephen Hemminger, netdev, davem, Thomas Gleixner

On Sun, 2011-04-17 at 13:20 -0400, Neil Horman wrote:
> On Sat, Apr 16, 2011 at 09:17:04AM -0700, Stephen Hemminger wrote:
[...]
> > My gut feeling is that:
> >   * kernel should default to a simple static sane irq policy without user
> >     space.  This is especially true for multi-queue devices where the default
> >     puts all IRQ's on one cpu.
> > 
> Thats not how it currently works, AFAICS.  The default kernel policy is
> currently that cpu affinity for any newly requested irq is all cpus.  Any
> restriction beyond that is the purview and doing of userspace (irqbalance or
> manual affinity setting).

Right.  Though it may be reasonable for the kernel to use the hint as
the initial affinity for a newly allocated IRQ (not sure quite how we
determine that).

[...]
> >   * irqbalance should not do the hacks it does to try and guess at network traffic.
> > 
> Well, I can certainly agree with that, but I'm not sure what that looks like.
> 
> I could envision something like:
> 
> 1) Use irqbalance to do a one time placement of interrupts, keeping a simple
> (possibly sub-optimal) policy, perhaps something like new irqs get assigned to
> the least loaded cpu within the numa node of the device the irq is originating
> from.
> 
> 2) Add a udev event on the addition of new interrupts, to rerun irqbalance

Yes, making irqbalance more (or entirely) event-driven seems like a good
thing.

> 3) Add some exported information to identify processes that are high users of
> network traffic, and correlate that usage to a rxq/irq that produces that
> information (possibly some per-task proc file)
> 
> 4) Create/expand an additional user space daemon to monitor the highest users of
> network traffic on various rxq/irqs (as identified in (3)) and restrict those
> processes execution to those cpus which are on the same L2 cache as the irq
> itself.  The cpuset cgroup could be usefull in doing this perhaps.

I just don't see that you're going to get processes associated with
specific RX queues unless you make use of flow steering.

The 128-entry flow hash indirection table is part of Microsoft's
requirements for RSS so most multiqueue hardware is going to let you do
limited flow steering that way.

> Actually, as I read back to myself, that acutally sounds kind of good to me.  It
> keeps all the policy for this in user space, and minimizes what we have to add
> to the kernel to make it happen (some process information in /proc and another
> udev event).  I'd like to get some feedback before I start implementing this,
> but I think this could be done.  What do you think?

I don't think it's a good idea to override the scheduler dynamically
like this.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: net: Automatic IRQ siloing for network devices
  2011-04-17 18:38         ` Ben Hutchings
@ 2011-04-18  1:08           ` Neil Horman
  2011-04-18 21:51             ` Ben Hutchings
  0 siblings, 1 reply; 20+ messages in thread
From: Neil Horman @ 2011-04-18  1:08 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Stephen Hemminger, netdev, davem, Thomas Gleixner

On Sun, Apr 17, 2011 at 07:38:59PM +0100, Ben Hutchings wrote:
> On Sun, 2011-04-17 at 13:20 -0400, Neil Horman wrote:
> > On Sat, Apr 16, 2011 at 09:17:04AM -0700, Stephen Hemminger wrote:
> [...]
> > > My gut feeling is that:
> > >   * kernel should default to a simple static sane irq policy without user
> > >     space.  This is especially true for multi-queue devices where the default
> > >     puts all IRQ's on one cpu.
> > > 
> > Thats not how it currently works, AFAICS.  The default kernel policy is
> > currently that cpu affinity for any newly requested irq is all cpus.  Any
> > restriction beyond that is the purview and doing of userspace (irqbalance or
> > manual affinity setting).
> 
> Right.  Though it may be reasonable for the kernel to use the hint as
> the initial affinity for a newly allocated IRQ (not sure quite how we
> determine that).
> 
So I understand what your saying here, but I'm having a hard time reconciling
the two notions.  Currently as it stands, affinity_hint gets set by a single
function call in the kernel (irq_set_affinity_hint), and is called by drivers
wishing to guide irqbalances behavior (currently only ixgbe does this).  The
behavior a driver is capable of guiding however are either overly simple (ixgbe
just tells irqbalance to place each irq on a separate cpu, which irqbalance
would do anyway) or overly complex (forcing policy into the kernel, which I
tried to do with this patch series, but based on the responses I've gotten here,
that seems non-desireable).

I personally like the idea of using affinity_hint to export various guidelines
to drive irqbalances behavior, but I think I'm in the minority on that.  Given
the responses here, It almost seems to me like we should do away with
affinity_hint alltogether (or perhaps just continue to ignore it), and rather
export data relevent to balancing in various other locations, and just have
irqbalance use that info directly.


> [...]
> > >   * irqbalance should not do the hacks it does to try and guess at network traffic.
> > > 
> > Well, I can certainly agree with that, but I'm not sure what that looks like.
> > 
> > I could envision something like:
> > 
> > 1) Use irqbalance to do a one time placement of interrupts, keeping a simple
> > (possibly sub-optimal) policy, perhaps something like new irqs get assigned to
> > the least loaded cpu within the numa node of the device the irq is originating
> > from.
> > 
> > 2) Add a udev event on the addition of new interrupts, to rerun irqbalance
> 
> Yes, making irqbalance more (or entirely) event-driven seems like a good
> thing.
> 
Yeah, I can do that, it shouldn't be hard.  Do we need to worry about bursty
interfaces (i.e. irqs that have high volume counts for periods of time followed
by periods of low activity).  A periodic irqbalance can reblance behavior like
that wheras a one-shot cannot.  Can we just assume that the one-shot rebalance
would give a 'good enough' placement in that situation?

> > 3) Add some exported information to identify processes that are high users of
> > network traffic, and correlate that usage to a rxq/irq that produces that
> > information (possibly some per-task proc file)
> > 
> > 4) Create/expand an additional user space daemon to monitor the highest users of
> > network traffic on various rxq/irqs (as identified in (3)) and restrict those
> > processes execution to those cpus which are on the same L2 cache as the irq
> > itself.  The cpuset cgroup could be usefull in doing this perhaps.
> 
> I just don't see that you're going to get processes associated with
> specific RX queues unless you make use of flow steering.
> 
> The 128-entry flow hash indirection table is part of Microsoft's
> requirements for RSS so most multiqueue hardware is going to let you do
> limited flow steering that way.
> 
Yes, Agreed.  I had presumed that any feature like this would assume RFS was
available and enabled. If it wasn't, we'd have to re-implement some metric
gathering code along with code to correlate sockets to rx queues and irqs.  That
would be 90% of the RFS code and this patch series :)

> > Actually, as I read back to myself, that acutally sounds kind of good to me.  It
> > keeps all the policy for this in user space, and minimizes what we have to add
> > to the kernel to make it happen (some process information in /proc and another
> > udev event).  I'd like to get some feedback before I start implementing this,
> > but I think this could be done.  What do you think?
> 
> I don't think it's a good idea to override the scheduler dynamically
> like this.
> 
Why not?  Not disagreeing here, but I'm curious as to why you think this is bad.
We already have several interfaces for doing this in user space (cgroups and
taskset come to mind).  Nominally they are used directly by sysadmins, and used
sparingly for specific configurations.  All I'm suggesting is that we create a
daemon to identify processes that would benefit from running closer to the nics
they are getting data from, and restricting them to cpus that fit that benefit.
If a sysadmin doesn't want that behavior, they can stop the daemon, or change
its configuration to avoid including processes they don't want to move/restrict.

Thanks & Regards
Neil
 
> Ben.
> 
> -- 
> Ben Hutchings, Senior Software Engineer, Solarflare
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.
> 
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: net: Automatic IRQ siloing for network devices
  2011-04-18  1:08           ` Neil Horman
@ 2011-04-18 21:51             ` Ben Hutchings
  2011-04-19  0:52               ` Neil Horman
  0 siblings, 1 reply; 20+ messages in thread
From: Ben Hutchings @ 2011-04-18 21:51 UTC (permalink / raw)
  To: Neil Horman
  Cc: Stephen Hemminger, netdev, davem, Thomas Gleixner,
	Alexander Duyck, Jeff Kirsher

On Sun, 2011-04-17 at 21:08 -0400, Neil Horman wrote:
> On Sun, Apr 17, 2011 at 07:38:59PM +0100, Ben Hutchings wrote:
> > On Sun, 2011-04-17 at 13:20 -0400, Neil Horman wrote:
> > > On Sat, Apr 16, 2011 at 09:17:04AM -0700, Stephen Hemminger wrote:
> > [...]
> > > > My gut feeling is that:
> > > >   * kernel should default to a simple static sane irq policy without user
> > > >     space.  This is especially true for multi-queue devices where the default
> > > >     puts all IRQ's on one cpu.
> > > > 
> > > Thats not how it currently works, AFAICS.  The default kernel policy is
> > > currently that cpu affinity for any newly requested irq is all cpus.  Any
> > > restriction beyond that is the purview and doing of userspace (irqbalance or
> > > manual affinity setting).
> > 
> > Right.  Though it may be reasonable for the kernel to use the hint as
> > the initial affinity for a newly allocated IRQ (not sure quite how we
> > determine that).
> > 
> So I understand what your saying here, but I'm having a hard time reconciling
> the two notions.  Currently as it stands, affinity_hint gets set by a single
> function call in the kernel (irq_set_affinity_hint), and is called by drivers
> wishing to guide irqbalances behavior (currently only ixgbe does this).  The
> behavior a driver is capable of guiding however are either overly simple (ixgbe
> just tells irqbalance to place each irq on a separate cpu, which irqbalance
> would do anyway)

It's a bit more subtle than that.

ixgbe is trying to set up hardware flow steering.  Some versions of the
hardware can steer packets to RX queues based on the TX queue that was
last used for the same flow.  The TX queue selection based on CPU in
ixgbe_select_queue() should be the inverse of the IRQ affinity mapping
of RX queues, and the affinity hints are supposed to ensure that this is
true.

I think it should be possible to replace those hints with use of
irq_cpu_rmap for TX queue selection.

> or overly complex (forcing policy into the kernel, which I
> tried to do with this patch series, but based on the responses I've gotten here,
> that seems non-desireable).

The trouble is that irqbalance has been so bad for multiqueue net
devices in the past that many vendors (including Solarflare) recommended
that it be disabled.  I think irqbalance does sensible things now but
many systems will be running without it for some time to come.

I was thinking that if the drivers could set sane hints to start with
then it would improve matters for those systems without irqbalance.  But
maybe it would be better still for some part of the networking core or
IRQ core to set up a default spreading of multiqueue IRQs.

[...]
> > > Actually, as I read back to myself, that acutally sounds kind of good to me.  It
> > > keeps all the policy for this in user space, and minimizes what we have to add
> > > to the kernel to make it happen (some process information in /proc and another
> > > udev event).  I'd like to get some feedback before I start implementing this,
> > > but I think this could be done.  What do you think?
> > 
> > I don't think it's a good idea to override the scheduler dynamically
> > like this.
> > 
> Why not?  Not disagreeing here, but I'm curious as to why you think this is bad.
> We already have several interfaces for doing this in user space (cgroups and
> taskset come to mind).  Nominally they are used directly by sysadmins, and used
> sparingly for specific configurations.

Yes, that is why I think this is different.

> All I'm suggesting is that we create a
> daemon to identify processes that would benefit from running closer to the nics
> they are getting data from, and restricting them to cpus that fit that benefit.
> If a sysadmin doesn't want that behavior, they can stop the daemon, or change
> its configuration to avoid including processes they don't want to move/restrict.

I think this could improve latency under low CPU load and throughput
under high CPU load for small numbers of relatively long-lived flows.
But for large numbers of flows or high turnover of flows the affinity
will just be noise.

You're welcome to do your own experiments, obviously!

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: net: Automatic IRQ siloing for network devices
  2011-04-18 21:51             ` Ben Hutchings
@ 2011-04-19  0:52               ` Neil Horman
  0 siblings, 0 replies; 20+ messages in thread
From: Neil Horman @ 2011-04-19  0:52 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Stephen Hemminger, netdev, davem, Thomas Gleixner,
	Alexander Duyck, Jeff Kirsher

On Mon, Apr 18, 2011 at 10:51:34PM +0100, Ben Hutchings wrote:
> On Sun, 2011-04-17 at 21:08 -0400, Neil Horman wrote:
> > On Sun, Apr 17, 2011 at 07:38:59PM +0100, Ben Hutchings wrote:
> > > On Sun, 2011-04-17 at 13:20 -0400, Neil Horman wrote:
> > > > On Sat, Apr 16, 2011 at 09:17:04AM -0700, Stephen Hemminger wrote:
> > > [...]
> > > > > My gut feeling is that:
> > > > >   * kernel should default to a simple static sane irq policy without user
> > > > >     space.  This is especially true for multi-queue devices where the default
> > > > >     puts all IRQ's on one cpu.
> > > > > 
> > > > Thats not how it currently works, AFAICS.  The default kernel policy is
> > > > currently that cpu affinity for any newly requested irq is all cpus.  Any
> > > > restriction beyond that is the purview and doing of userspace (irqbalance or
> > > > manual affinity setting).
> > > 
> > > Right.  Though it may be reasonable for the kernel to use the hint as
> > > the initial affinity for a newly allocated IRQ (not sure quite how we
> > > determine that).
> > > 
> > So I understand what your saying here, but I'm having a hard time reconciling
> > the two notions.  Currently as it stands, affinity_hint gets set by a single
> > function call in the kernel (irq_set_affinity_hint), and is called by drivers
> > wishing to guide irqbalances behavior (currently only ixgbe does this).  The
> > behavior a driver is capable of guiding however are either overly simple (ixgbe
> > just tells irqbalance to place each irq on a separate cpu, which irqbalance
> > would do anyway)
> 
> It's a bit more subtle than that.
> 
> ixgbe is trying to set up hardware flow steering.  Some versions of the
> hardware can steer packets to RX queues based on the TX queue that was
> last used for the same flow.  The TX queue selection based on CPU in
> ixgbe_select_queue() should be the inverse of the IRQ affinity mapping
> of RX queues, and the affinity hints are supposed to ensure that this is
> true.
> 
Ah, ok, that makes a bit more sense then.  Thank you for that.

> I think it should be possible to replace those hints with use of
> irq_cpu_rmap for TX queue selection.
> 
> > or overly complex (forcing policy into the kernel, which I
> > tried to do with this patch series, but based on the responses I've gotten here,
> > that seems non-desireable).
> 
> The trouble is that irqbalance has been so bad for multiqueue net
> devices in the past that many vendors (including Solarflare) recommended
> that it be disabled.  I think irqbalance does sensible things now but
> many systems will be running without it for some time to come.
> 
> I was thinking that if the drivers could set sane hints to start with
> then it would improve matters for those systems without irqbalance.  But
> maybe it would be better still for some part of the networking core or
> IRQ core to set up a default spreading of multiqueue IRQs.
>
But doesn't this force policy for irqbalancing into the kernel, as Thomas and
Eric alluded to?  It seems to me that, if we can export just a bit more
information regarding irqs and their associations to devices (which has been a
major achilles heel of irqblance in the past), then I think we can create a sane
default balancing policy with some simple udev rules.  I've been messing with
this a bit today.
 
> [...]
> > > > Actually, as I read back to myself, that acutally sounds kind of good to me.  It
> > > > keeps all the policy for this in user space, and minimizes what we have to add
> > > > to the kernel to make it happen (some process information in /proc and another
> > > > udev event).  I'd like to get some feedback before I start implementing this,
> > > > but I think this could be done.  What do you think?
> > > 
> > > I don't think it's a good idea to override the scheduler dynamically
> > > like this.
> > > 
> > Why not?  Not disagreeing here, but I'm curious as to why you think this is bad.
> > We already have several interfaces for doing this in user space (cgroups and
> > taskset come to mind).  Nominally they are used directly by sysadmins, and used
> > sparingly for specific configurations.
> 
> Yes, that is why I think this is different.
> 
Ok, fair enough.

> > All I'm suggesting is that we create a
> > daemon to identify processes that would benefit from running closer to the nics
> > they are getting data from, and restricting them to cpus that fit that benefit.
> > If a sysadmin doesn't want that behavior, they can stop the daemon, or change
> > its configuration to avoid including processes they don't want to move/restrict.
> 
> I think this could improve latency under low CPU load and throughput
> under high CPU load for small numbers of relatively long-lived flows.
> But for large numbers of flows or high turnover of flows the affinity
> will just be noise.
> 
> You're welcome to do your own experiments, obviously!
> 
I will, but I'll start with the low hanging fruit.  I'm going to try exporting
the msi table for a device.  With that I can use the netdev_registration uevent
to properly identify network based irqs without the need for 1/2 assed regex
searches and volume counts and do one shot rebalancing of them.

Thanks for your time & thoughts!
Neil

> Ben.
> 
> -- 
> Ben Hutchings, Senior Software Engineer, Solarflare
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.
> 
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2011-04-19  0:52 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-04-15 20:17 net: Automatic IRQ siloing for network devices Neil Horman
2011-04-15 20:17 ` [PATCH 1/3] irq: Add registered affinity guidance infrastructure Neil Horman
2011-04-16  0:22   ` Thomas Gleixner
2011-04-16  2:11     ` Neil Horman
2011-04-15 20:17 ` [PATCH 2/3] net: Add net device irq siloing feature Neil Horman
2011-04-15 22:49   ` Ben Hutchings
2011-04-16  1:49     ` Neil Horman
2011-04-16  4:52       ` Stephen Hemminger
2011-04-16  6:21         ` Eric Dumazet
2011-04-16 11:55           ` Neil Horman
2011-04-15 20:17 ` [PATCH 3/3] net: Adding siloing irqs to cxgb4 driver Neil Horman
2011-04-15 22:54 ` net: Automatic IRQ siloing for network devices Ben Hutchings
2011-04-16  0:50   ` Ben Hutchings
2011-04-16  1:59   ` Neil Horman
2011-04-16 16:17     ` Stephen Hemminger
2011-04-17 17:20       ` Neil Horman
2011-04-17 18:38         ` Ben Hutchings
2011-04-18  1:08           ` Neil Horman
2011-04-18 21:51             ` Ben Hutchings
2011-04-19  0:52               ` Neil Horman

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.