* [PATCH 0/2] ixgbe, fix numa issues
@ 2014-02-24 18:51 Prarit Bhargava
  2014-02-24 18:51 ` [PATCH 1/2] ixgbe, make interrupt allocations NUMA aware Prarit Bhargava
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Prarit Bhargava @ 2014-02-24 18:51 UTC (permalink / raw)
  To: netdev
  Cc: Prarit Bhargava, nhorman, e1000-devel, Bruce Allan,
	Jesse Brandeburg, agospoda, John Ronciak, David S. Miller

The ixgbe driver makes some assumptions about the layout of cpus in the
system that are not always correct for a given hardware configuration.  The
ixgbe driver allocates one MSI per cpu for queue usage, but the code does not
take into account that devices are located on NUMA nodes and that the cpus in
a node are not necessarily contiguous.

These issues were found while doing cpu hotplug testing; however, both of
these issues can also lead to obvious system performance problems, as they
defeat the purpose of having one MSI process one queue per cpu.

Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Cc: Bruce Allan <bruce.w.allan@intel.com>
Cc: Carolyn Wyborny <carolyn.wyborny@intel.com>
Cc: Don Skidmore <donald.c.skidmore@intel.com>
Cc: Greg Rose <gregory.v.rose@intel.com>
Cc: Alex Duyck <alexander.h.duyck@intel.com>
Cc: John Ronciak <john.ronciak@intel.com>
Cc: Mitch Williams <mitch.a.williams@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: nhorman@redhat.com
Cc: agospoda@redhat.com
Cc: e1000-devel@lists.sourceforge.net

Prarit Bhargava (2):
  ixgbe, make interrupt allocations NUMA aware
  ixgbe, don't assume mapping of numa node cpus

 drivers/net/ethernet/intel/ixgbe/ixgbe.h       |    2 ++
 drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c   |   44 ++++++++++++++++++------
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  |    6 ++--
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c |    5 +--
 4 files changed, 42 insertions(+), 15 deletions(-)

-- 
1.7.9.3


_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/2] ixgbe, make interrupt allocations NUMA aware
  2014-02-24 18:51 [PATCH 0/2] ixgbe, fix numa issues Prarit Bhargava
@ 2014-02-24 18:51 ` Prarit Bhargava
  2014-02-24 19:26   ` Alexander Duyck
  2014-02-24 18:51 ` [PATCH 2/2] ixgbe, don't assume mapping of numa node cpus Prarit Bhargava
  2014-02-24 19:23 ` [PATCH 0/2] ixgbe, fix numa issues Alexander Duyck
  2 siblings, 1 reply; 17+ messages in thread
From: Prarit Bhargava @ 2014-02-24 18:51 UTC (permalink / raw)
  To: netdev
  Cc: Prarit Bhargava, nhorman, e1000-devel, Bruce Allan,
	Jesse Brandeburg, agospoda, John Ronciak, David S. Miller

The ixgbe driver creates one queue/cpu on the system in order to spread
work out on all cpus rather than restricting work to a single cpu.  This
model, while efficient, does not take into account the NUMA configuration
of the system.

This patch introduces ixgbe_num_cpus() which returns
the number of online cpus if the adapter's PCI device has no NUMA
restrictions, and the number of cpus in the node if the PCI device is
allocated to a specific node.
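
As a rough userspace model of that selection logic (the kernel helper itself
uses num_online_cpus() and nr_cpus_node(); the two-node topology and the
model_num_cpus() name below are invented for illustration):

```c
#include <assert.h>
#include <stdint.h>

#define NUMA_NO_NODE (-1)

/* Toy topology: node 0 owns cpus 0-7, node 1 owns cpus 8-15. */
static const uint64_t node_cpu_mask[2] = { 0x00ffULL, 0xff00ULL };

static int popcount64(uint64_t m)
{
	int n = 0;

	for (; m; m &= m - 1)
		n++;
	return n;
}

/* Mirrors the shape of ixgbe_num_cpus(): every online cpu when the
 * device has no node affinity, otherwise only that node's cpus. */
static int model_num_cpus(int numa_node)
{
	if (numa_node == NUMA_NO_NODE)
		return popcount64(node_cpu_mask[0] | node_cpu_mask[1]);
	return popcount64(node_cpu_mask[numa_node]);
}
```

With NUMA_NO_NODE the model reports all 16 cpus; pinned to either node it
reports only that node's 8, which is the queue-count cap the patch applies.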

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Cc: Bruce Allan <bruce.w.allan@intel.com>
Cc: Carolyn Wyborny <carolyn.wyborny@intel.com>
Cc: Don Skidmore <donald.c.skidmore@intel.com>
Cc: Greg Rose <gregory.v.rose@intel.com>
Cc: Alex Duyck <alexander.h.duyck@intel.com>
Cc: John Ronciak <john.ronciak@intel.com>
Cc: Mitch Williams <mitch.a.williams@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: nhorman@redhat.com
Cc: agospoda@redhat.com
Cc: e1000-devel@lists.sourceforge.net
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h       |    2 ++
 drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c   |   28 +++++++++++++++++++++---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  |    6 ++---
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c |    5 +++--
 4 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 0186ea2..edee04b 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -970,4 +970,6 @@ void ixgbe_sriov_reinit(struct ixgbe_adapter *adapter);
 netdev_tx_t ixgbe_xmit_frame_ring(struct sk_buff *skb,
 				  struct ixgbe_adapter *adapter,
 				  struct ixgbe_ring *tx_ring);
+
+extern int ixgbe_num_cpus(struct ixgbe_adapter *adapter);
 #endif /* _IXGBE_H_ */
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
index 32e3eaa..3668288 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
@@ -480,6 +480,27 @@ static bool ixgbe_set_dcb_queues(struct ixgbe_adapter *adapter)
 }
 
 #endif
+
+/**
+ * ixgbe_num_cpus - Return the number of cpus that this adapter should
+ *		    allocate queues for.
+ * @adapter: board private structure to allocate cpus for
+ *
+ * A pci device may be restricted via ACPI and HW to a specific NUMA node,
+ * or in other words a specific set of cpus.  If the adapter's PCI device
+ * is on a specific node, then only allocate queues for that specific node.
+ *
+ **/
+int ixgbe_num_cpus(struct ixgbe_adapter *adapter)
+{
+	int numa;
+
+	numa = adapter->pdev->dev.numa_node;
+	if (numa == NUMA_NO_NODE)
+		return num_online_cpus();
+	return nr_cpus_node(numa);
+}
+
 /**
  * ixgbe_set_sriov_queues - Allocate queues for SR-IOV devices
  * @adapter: board private structure to initialize
@@ -567,7 +588,8 @@ static bool ixgbe_set_sriov_queues(struct ixgbe_adapter *adapter)
 			fcoe->offset = vmdq_i * rss_i;
 		} else {
 			/* merge FCoE queues with RSS queues */
-			fcoe_i = min_t(u16, fcoe_i + rss_i, num_online_cpus());
+			fcoe_i = min_t(u16, fcoe_i + rss_i,
+				       ixgbe_num_cpus(adapter));
 
 			/* limit indices to rss_i if MSI-X is disabled */
 			if (!(adapter->flags & IXGBE_FLAG_MSIX_ENABLED))
@@ -642,7 +664,7 @@ static bool ixgbe_set_rss_queues(struct ixgbe_adapter *adapter)
 		f = &adapter->ring_feature[RING_F_FCOE];
 
 		/* merge FCoE queues with RSS queues */
-		fcoe_i = min_t(u16, f->limit + rss_i, num_online_cpus());
+		fcoe_i = min_t(u16, f->limit + rss_i, ixgbe_num_cpus(adapter));
 		fcoe_i = min_t(u16, fcoe_i, dev->num_tx_queues);
 
 		/* limit indices to rss_i if MSI-X is disabled */
@@ -1067,7 +1089,7 @@ static void ixgbe_set_interrupt_capability(struct ixgbe_adapter *adapter)
 	 * The default is to use pairs of vectors.
 	 */
 	v_budget = max(adapter->num_rx_queues, adapter->num_tx_queues);
-	v_budget = min_t(int, v_budget, num_online_cpus());
+	v_budget = min_t(int, v_budget, ixgbe_num_cpus(adapter));
 	v_budget += NON_Q_VECTORS;
 
 	/*
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 18076c4..b68a6e9 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -4953,13 +4953,13 @@ static int ixgbe_sw_init(struct ixgbe_adapter *adapter)
 	hw->subsystem_device_id = pdev->subsystem_device;
 
 	/* Set common capability flags and settings */
-	rss = min_t(int, IXGBE_MAX_RSS_INDICES, num_online_cpus());
+	rss = min_t(int, IXGBE_MAX_RSS_INDICES, ixgbe_num_cpus(adapter));
 	adapter->ring_feature[RING_F_RSS].limit = rss;
 	adapter->flags2 |= IXGBE_FLAG2_RSC_CAPABLE;
 	adapter->flags2 |= IXGBE_FLAG2_RSC_ENABLED;
 	adapter->max_q_vectors = MAX_Q_VECTORS_82599;
 	adapter->atr_sample_rate = 20;
-	fdir = min_t(int, IXGBE_MAX_FDIR_INDICES, num_online_cpus());
+	fdir = min_t(int, IXGBE_MAX_FDIR_INDICES, ixgbe_num_cpus(adapter));
 	adapter->ring_feature[RING_F_FDIR].limit = fdir;
 	adapter->fdir_pballoc = IXGBE_FDIR_PBALLOC_64K;
 #ifdef CONFIG_IXGBE_DCA
@@ -8074,7 +8074,7 @@ skip_sriov:
 		}
 
 
-		fcoe_l = min_t(int, IXGBE_FCRETA_SIZE, num_online_cpus());
+		fcoe_l = min_t(int, IXGBE_FCRETA_SIZE, ixgbe_num_cpus(adapter));
 		adapter->ring_feature[RING_F_FCOE].limit = fcoe_l;
 
 		netdev->features |= NETIF_F_FSO |
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
index dff0977..bfbc574 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -222,9 +222,10 @@ int ixgbe_disable_sriov(struct ixgbe_adapter *adapter)
 	if (adapter->ring_feature[RING_F_VMDQ].limit == 1) {
 		adapter->flags &= ~IXGBE_FLAG_VMDQ_ENABLED;
 		adapter->flags &= ~IXGBE_FLAG_SRIOV_ENABLED;
-		rss = min_t(int, IXGBE_MAX_RSS_INDICES, num_online_cpus());
+		rss = min_t(int, IXGBE_MAX_RSS_INDICES,
+			    ixgbe_num_cpus(adapter));
 	} else {
-		rss = min_t(int, IXGBE_MAX_L2A_QUEUES, num_online_cpus());
+		rss = min_t(int, IXGBE_MAX_L2A_QUEUES, ixgbe_num_cpus(adapter));
 	}
 
 	adapter->ring_feature[RING_F_VMDQ].offset = 0;
-- 
1.7.9.3




* [PATCH 2/2] ixgbe, don't assume mapping of numa node cpus
  2014-02-24 18:51 [PATCH 0/2] ixgbe, fix numa issues Prarit Bhargava
  2014-02-24 18:51 ` [PATCH 1/2] ixgbe, make interrupt allocations NUMA aware Prarit Bhargava
@ 2014-02-24 18:51 ` Prarit Bhargava
  2014-02-24 19:39   ` Alexander Duyck
  2014-02-25 17:27   ` Amir Vadai
  2014-02-24 19:23 ` [PATCH 0/2] ixgbe, fix numa issues Alexander Duyck
  2 siblings, 2 replies; 17+ messages in thread
From: Prarit Bhargava @ 2014-02-24 18:51 UTC (permalink / raw)
  To: netdev
  Cc: Prarit Bhargava, nhorman, e1000-devel, Bruce Allan,
	Jesse Brandeburg, agospoda, John Ronciak, David S. Miller

The ixgbe driver assumes that the cpus on a node are mapped 1:1 with the
indexes into arrays.  This is not the case as nodes can contain, for
example, cpus 0-7, 33-40.

This patch fixes this problem.

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Cc: Bruce Allan <bruce.w.allan@intel.com>
Cc: Carolyn Wyborny <carolyn.wyborny@intel.com>
Cc: Don Skidmore <donald.c.skidmore@intel.com>
Cc: Greg Rose <gregory.v.rose@intel.com>
Cc: Alex Duyck <alexander.h.duyck@intel.com>
Cc: John Ronciak <john.ronciak@intel.com>
Cc: Mitch Williams <mitch.a.williams@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: nhorman@redhat.com
Cc: agospoda@redhat.com
Cc: e1000-devel@lists.sourceforge.net
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c |   16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
index 3668288..8b3992e 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
@@ -794,11 +794,15 @@ static int ixgbe_alloc_q_vector(struct ixgbe_adapter *adapter,
 {
 	struct ixgbe_q_vector *q_vector;
 	struct ixgbe_ring *ring;
-	int node = NUMA_NO_NODE;
-	int cpu = -1;
+	int node = adapter->pdev->dev.numa_node;
+	int cpu, set_affinity = 0;
 	int ring_count, size;
 	u8 tcs = netdev_get_num_tc(adapter->netdev);
 
+	if (node == NUMA_NO_NODE)
+		cpu = -1;
+	else
+		cpu = cpumask_next(v_idx - 1, cpumask_of_node(node));
 	ring_count = txr_count + rxr_count;
 	size = sizeof(struct ixgbe_q_vector) +
 	       (sizeof(struct ixgbe_ring) * ring_count);
@@ -807,10 +811,8 @@ static int ixgbe_alloc_q_vector(struct ixgbe_adapter *adapter,
 	if ((tcs <= 1) && !(adapter->flags & IXGBE_FLAG_SRIOV_ENABLED)) {
 		u16 rss_i = adapter->ring_feature[RING_F_RSS].indices;
 		if (rss_i > 1 && adapter->atr_sample_rate) {
-			if (cpu_online(v_idx)) {
-				cpu = v_idx;
-				node = cpu_to_node(cpu);
-			}
+			if (likely(cpu_online(cpu)))
+				set_affinity = 1;
 		}
 	}
 
@@ -822,7 +824,7 @@ static int ixgbe_alloc_q_vector(struct ixgbe_adapter *adapter,
 		return -ENOMEM;
 
 	/* setup affinity mask and node */
-	if (cpu != -1)
+	if (set_affinity)
 		cpumask_set_cpu(cpu, &q_vector->affinity_mask);
 	q_vector->numa_node = node;
 
-- 
1.7.9.3




* Re: [PATCH 0/2] ixgbe, fix numa issues
  2014-02-24 18:51 [PATCH 0/2] ixgbe, fix numa issues Prarit Bhargava
  2014-02-24 18:51 ` [PATCH 1/2] ixgbe, make interrupt allocations NUMA aware Prarit Bhargava
  2014-02-24 18:51 ` [PATCH 2/2] ixgbe, don't assume mapping of numa node cpus Prarit Bhargava
@ 2014-02-24 19:23 ` Alexander Duyck
  2014-02-24 19:34   ` Prarit Bhargava
  2 siblings, 1 reply; 17+ messages in thread
From: Alexander Duyck @ 2014-02-24 19:23 UTC (permalink / raw)
  To: Prarit Bhargava, netdev
  Cc: Jeff Kirsher, Jesse Brandeburg, Bruce Allan, Carolyn Wyborny,
	Don Skidmore, Greg Rose, John Ronciak, Mitch Williams,
	David S. Miller, nhorman, agospoda, e1000-devel

On 02/24/2014 10:51 AM, Prarit Bhargava wrote:
> The ixgbe driver makes some assumptions about the layout of cpus in the
> system which are not always correct given a particular system layout.  The
> ixgbe driver allocates one MSI/cpu for queue usage but the code does not take
> into account that devices are located on NUMA nodes and that the cpus in a node
> are not contiguous.
>
> These issues were found while doing cpu hotplug testing, however, both of these
> issues can lead to obvious system performance issues as they defeat the
> purpose of having one MSI processing a queue per cpu.
>
> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
> Cc: Bruce Allan <bruce.w.allan@intel.com>
> Cc: Carolyn Wyborny <carolyn.wyborny@intel.com>
> Cc: Don Skidmore <donald.c.skidmore@intel.com>
> Cc: Greg Rose <gregory.v.rose@intel.com>
> Cc: Alex Duyck <alexander.h.duyck@intel.com>
> Cc: John Ronciak <john.ronciak@intel.com>
> Cc: Mitch Williams <mitch.a.williams@intel.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: nhorman@redhat.com
> Cc: agospoda@redhat.com
> Cc: e1000-devel@lists.sourceforge.net
>
> Prarit Bhargava (2):
>   ixgbe, make interrupt allocations NUMA aware
>   ixgbe, don't assume mapping of numa node cpus
>
>  drivers/net/ethernet/intel/ixgbe/ixgbe.h       |    2 ++
>  drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c   |   44 ++++++++++++++++++------
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  |    6 ++--
>  drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c |    5 +--
>  4 files changed, 42 insertions(+), 15 deletions(-)
>

This is a step in the right direction but totally defeats the purpose of
ATR.  With this change we might as well defeature ATR altogether since
things are now back to RSS w/ NUMA-specific allocations, which is what we
had a couple of years ago.  The code as it is written now would be a
better fit for igb, which doesn't have ATR, than for ixgbe.

ATR is supposed to map queues 1:1 to CPUs.  The problem is RSS is also a
factor and not especially smart or NUMA aware.  The ideal solution would
be to allocate the first N CPUs, where N is the number of CPUs in the
local node, for ATR/RSS.  Then map all other queues as ATR with a 1:1
mapping to CPUs.

Thanks,

Alex


* Re: [PATCH 1/2] ixgbe, make interrupt allocations NUMA aware
  2014-02-24 18:51 ` [PATCH 1/2] ixgbe, make interrupt allocations NUMA aware Prarit Bhargava
@ 2014-02-24 19:26   ` Alexander Duyck
  2014-02-24 19:39     ` Prarit Bhargava
  0 siblings, 1 reply; 17+ messages in thread
From: Alexander Duyck @ 2014-02-24 19:26 UTC (permalink / raw)
  To: Prarit Bhargava, netdev
  Cc: nhorman, e1000-devel, Bruce Allan, Jesse Brandeburg, agospoda,
	John Ronciak, David S. Miller

On 02/24/2014 10:51 AM, Prarit Bhargava wrote:
> The ixgbe driver creates one queue/cpu on the system in order to spread
> work out on all cpus rather than restricting work to a single cpu.  This
> model, while efficient, does not take into account the NUMA configuration
> of the system.
>
> This patch introduces ixgbe_num_cpus() which returns
> the number of online cpus if the adapter's PCI device has no NUMA
> restrictions, and the number of cpus in the node if the PCI device is
> allocated to a specific node.
>
> Signed-off-by: Prarit Bhargava <prarit@redhat.com>
> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
> Cc: Bruce Allan <bruce.w.allan@intel.com>
> Cc: Carolyn Wyborny <carolyn.wyborny@intel.com>
> Cc: Don Skidmore <donald.c.skidmore@intel.com>
> Cc: Greg Rose <gregory.v.rose@intel.com>
> Cc: Alex Duyck <alexander.h.duyck@intel.com>
> Cc: John Ronciak <john.ronciak@intel.com>
> Cc: Mitch Williams <mitch.a.williams@intel.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: nhorman@redhat.com
> Cc: agospoda@redhat.com
> Cc: e1000-devel@lists.sourceforge.net
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe.h       |    2 ++
>  drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c   |   28 +++++++++++++++++++++---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  |    6 ++---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c |    5 +++--
>  4 files changed, 33 insertions(+), 8 deletions(-)
>

[...]

> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index 18076c4..b68a6e9 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -4953,13 +4953,13 @@ static int ixgbe_sw_init(struct ixgbe_adapter *adapter)
>  	hw->subsystem_device_id = pdev->subsystem_device;
>  
>  	/* Set common capability flags and settings */
> -	rss = min_t(int, IXGBE_MAX_RSS_INDICES, num_online_cpus());
> +	rss = min_t(int, IXGBE_MAX_RSS_INDICES, ixgbe_num_cpus(adapter));
>  	adapter->ring_feature[RING_F_RSS].limit = rss;
>  	adapter->flags2 |= IXGBE_FLAG2_RSC_CAPABLE;
>  	adapter->flags2 |= IXGBE_FLAG2_RSC_ENABLED;
>  	adapter->max_q_vectors = MAX_Q_VECTORS_82599;
>  	adapter->atr_sample_rate = 20;
> -	fdir = min_t(int, IXGBE_MAX_FDIR_INDICES, num_online_cpus());
> +	fdir = min_t(int, IXGBE_MAX_FDIR_INDICES, ixgbe_num_cpus(adapter));
>  	adapter->ring_feature[RING_F_FDIR].limit = fdir;
>  	adapter->fdir_pballoc = IXGBE_FDIR_PBALLOC_64K;
>  #ifdef CONFIG_IXGBE_DCA

This is the one bit I object to in this patch.  The flow director queue
count should be equal to the number of online CPUs, or at least as close
to it as the hardware can get.  Otherwise ATR is completely useless.

Thanks,

Alex



* Re: [PATCH 0/2] ixgbe, fix numa issues
  2014-02-24 19:23 ` [PATCH 0/2] ixgbe, fix numa issues Alexander Duyck
@ 2014-02-24 19:34   ` Prarit Bhargava
  2014-02-24 19:57     ` Alexander Duyck
  0 siblings, 1 reply; 17+ messages in thread
From: Prarit Bhargava @ 2014-02-24 19:34 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: netdev, Jeff Kirsher, Jesse Brandeburg, Bruce Allan,
	Carolyn Wyborny, Don Skidmore, Greg Rose, John Ronciak,
	Mitch Williams, David S. Miller, nhorman, agospoda, e1000-devel



On 02/24/2014 02:23 PM, Alexander Duyck wrote:
> On 02/24/2014 10:51 AM, Prarit Bhargava wrote:
>> The ixgbe driver makes some assumptions about the layout of cpus in the
>> system which are not always correct given a particular system layout.  The
>> ixgbe driver allocates one MSI/cpu for queue usage but the code does not take
>> into account that devices are located on NUMA nodes and that the cpus in a node
>> are not contiguous.
>>
>> These issues were found while doing cpu hotplug testing, however, both of these
>> issues can lead to obvious system performance issues as they defeat the
>> purpose of having one MSI processing a queue per cpu.
>>
>> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
>> Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
>> Cc: Bruce Allan <bruce.w.allan@intel.com>
>> Cc: Carolyn Wyborny <carolyn.wyborny@intel.com>
>> Cc: Don Skidmore <donald.c.skidmore@intel.com>
>> Cc: Greg Rose <gregory.v.rose@intel.com>
>> Cc: Alex Duyck <alexander.h.duyck@intel.com>
>> Cc: John Ronciak <john.ronciak@intel.com>
>> Cc: Mitch Williams <mitch.a.williams@intel.com>
>> Cc: "David S. Miller" <davem@davemloft.net>
>> Cc: nhorman@redhat.com
>> Cc: agospoda@redhat.com
>> Cc: e1000-devel@lists.sourceforge.net
>>
>> Prarit Bhargava (2):
>>   ixgbe, make interrupt allocations NUMA aware
>>   ixgbe, don't assume mapping of numa node cpus
>>
>>  drivers/net/ethernet/intel/ixgbe/ixgbe.h       |    2 ++
>>  drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c   |   44 ++++++++++++++++++------
>>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  |    6 ++--
>>  drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c |    5 +--
>>  4 files changed, 42 insertions(+), 15 deletions(-)
>>
> 
> This is a step in the right direction but totally defeats the purpose of
> ATR.  With this change we might as well defeature ATR all together since

ATR?  First hit on google is Americans for Tax Reform ;)

> things are now back to RSS w/ NUMA specific allocations which is what we
> had a couple of years ago.  The code as it is written now would be a
> better for for igb which doesn't have ATR than ixgbe.

The big(ger) problem here is that the ixgbe driver (and other drivers, IIUC)
does not do a good job of handling MSIs, making sure they are launched on the
right cpus, and cleaning up during cpu hotplug operations.  This code looks
like it needs a bit of work, so your advice is appreciated.

> 
> ATR is supposed to map 1:1 queues to CPUs.  The problem is RSS is also a
> factor and not especially smart or NUMA aware.  The ideal solution would
> be to allocate the first N CPUs, where N is the number in the local node
> for ATR/RSS.  

Okay ... I'll look into that.

> Then map all other queues as ATR with a 1:1 mapping to CPUs.

Hmmm ... but what if off-node CPUs cannot reach the device?  Part of the puzzle
here is that ACPI may not only be telling us that the device is on a specific
node, but also that the device is physically separated on a root bus.

P.


* Re: [PATCH 2/2] ixgbe, don't assume mapping of numa node cpus
  2014-02-24 18:51 ` [PATCH 2/2] ixgbe, don't assume mapping of numa node cpus Prarit Bhargava
@ 2014-02-24 19:39   ` Alexander Duyck
  2014-02-25 17:27   ` Amir Vadai
  1 sibling, 0 replies; 17+ messages in thread
From: Alexander Duyck @ 2014-02-24 19:39 UTC (permalink / raw)
  To: Prarit Bhargava, netdev
  Cc: Jeff Kirsher, Jesse Brandeburg, Bruce Allan, Carolyn Wyborny,
	Don Skidmore, Greg Rose, John Ronciak, Mitch Williams,
	David S. Miller, nhorman, agospoda, e1000-devel

On 02/24/2014 10:51 AM, Prarit Bhargava wrote:
> The ixgbe driver assumes that the cpus on a node are mapped 1:1 with the
> indexes into arrays.  This is not the case as nodes can contain, for
> example, cpus 0-7, 33-40.
> 
> This patch fixes this problem.
> 
> Signed-off-by: Prarit Bhargava <prarit@redhat.com>
> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
> Cc: Bruce Allan <bruce.w.allan@intel.com>
> Cc: Carolyn Wyborny <carolyn.wyborny@intel.com>
> Cc: Don Skidmore <donald.c.skidmore@intel.com>
> Cc: Greg Rose <gregory.v.rose@intel.com>
> Cc: Alex Duyck <alexander.h.duyck@intel.com>
> Cc: John Ronciak <john.ronciak@intel.com>
> Cc: Mitch Williams <mitch.a.williams@intel.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: nhorman@redhat.com
> Cc: agospoda@redhat.com
> Cc: e1000-devel@lists.sourceforge.net
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c |   16 +++++++++-------
>  1 file changed, 9 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
> index 3668288..8b3992e 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
> @@ -794,11 +794,15 @@ static int ixgbe_alloc_q_vector(struct ixgbe_adapter *adapter,
>  {
>  	struct ixgbe_q_vector *q_vector;
>  	struct ixgbe_ring *ring;
> -	int node = NUMA_NO_NODE;
> -	int cpu = -1;
> +	int node = adapter->pdev->dev.numa_node;
> +	int cpu, set_affinity = 0;
>  	int ring_count, size;
>  	u8 tcs = netdev_get_num_tc(adapter->netdev);
>  
> +	if (node == NUMA_NO_NODE)
> +		cpu = -1;
> +	else
> +		cpu = cpumask_next(v_idx - 1, cpumask_of_node(node));
>  	ring_count = txr_count + rxr_count;
>  	size = sizeof(struct ixgbe_q_vector) +
>  	       (sizeof(struct ixgbe_ring) * ring_count);

Are you sure this does what you think it does?  I thought the first
value is just the starting offset to check for a bit?  I don't think
cpumask_next is aware of holes in a given mask.  So for example if 8-31
are missing, I think you will end up with all of your remaining queue
vectors being assigned to CPU 32, since it is the first set bit after 7.
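
The collapse is easy to reproduce with a toy userspace model (the helper
below is a hypothetical stand-in for the kernel's cpumask_next(), and the
0-7/32-39 topology is invented to match the example): once the vector index
passes the hole, every lookup lands on cpu 32.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for cpumask_next(prev, mask): return the lowest set bit
 * strictly greater than 'prev', or -1 when the mask is exhausted. */
static int next_set_bit(int prev, uint64_t mask)
{
	int bit;

	for (bit = prev + 1; bit < 64; bit++)
		if (mask & (1ULL << bit))
			return bit;
	return -1;
}
```

With cpus 0-7 and 32-39 online, next_set_bit(7, mask), next_set_bit(15, mask)
and next_set_bit(31, mask) all return 32, so vectors 8 and up pile onto one cpu.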

What might work better here is a function that returns the local node
CPU IDs first, followed by the remote node IDs if ATR is enabled.  We
should probably have it configured to loop in the case that the number
of queues is greater than the number of local node CPUs but ATR is not
enabled.
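
One hedged sketch of that ordering, as a userspace model rather than actual
driver code (the cpu_for_vector() name and the two-node 0-7/32-39 topology
are invented): enumerate the local node's cpus first, then the remote ones,
wrapping around when the vector index runs past the total.

```c
#include <assert.h>

/* Toy topology: node 0 owns cpus 0-7, node 1 owns cpus 32-39. */
static const int node0_cpus[] = { 0, 1, 2, 3, 4, 5, 6, 7 };
static const int node1_cpus[] = { 32, 33, 34, 35, 36, 37, 38, 39 };

/* Pick a cpu for queue vector v_idx: local node cpus first, then
 * remote node cpus, looping when v_idx exceeds the total count. */
static int cpu_for_vector(int v_idx, int local_node)
{
	const int *local  = local_node ? node1_cpus : node0_cpus;
	const int *remote = local_node ? node0_cpus : node1_cpus;
	const int n_local = 8, n_remote = 8;

	v_idx %= (n_local + n_remote);	/* wrap around */
	if (v_idx < n_local)
		return local[v_idx];
	return remote[v_idx - n_local];
}
```

For a device on node 0, vectors 0-7 land on cpus 0-7, vector 8 is the first
remote cpu (32), and vector 16 wraps back to cpu 0.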

> @@ -807,10 +811,8 @@ static int ixgbe_alloc_q_vector(struct ixgbe_adapter *adapter,
>  	if ((tcs <= 1) && !(adapter->flags & IXGBE_FLAG_SRIOV_ENABLED)) {
>  		u16 rss_i = adapter->ring_feature[RING_F_RSS].indices;
>  		if (rss_i > 1 && adapter->atr_sample_rate) {
> -			if (cpu_online(v_idx)) {
> -				cpu = v_idx;
> -				node = cpu_to_node(cpu);
> -			}
> +			if (likely(cpu_online(cpu)))
> +				set_affinity = 1;
>  		}
>  	}
>  

The node assignment is still needed here.  We need to be able to assign
queues to remote nodes as applications will be there expecting data.  We
have seen a serious performance degradation when trying to feed an
application from a remote queue even if the queue is local to hardware.

> @@ -822,7 +824,7 @@ static int ixgbe_alloc_q_vector(struct ixgbe_adapter *adapter,
>  		return -ENOMEM;
>  
>  	/* setup affinity mask and node */
> -	if (cpu != -1)
> +	if (set_affinity)
>  		cpumask_set_cpu(cpu, &q_vector->affinity_mask);
>  	q_vector->numa_node = node;
>  
> 

I'm not sure what the point of this change is other than the fact that
you changed the cpu configuration to be earlier.  The affinity mask
could be configured with an offline CPU and it should have no negative
effect.

Thanks,

Alex


* Re: [PATCH 1/2] ixgbe, make interrupt allocations NUMA aware
  2014-02-24 19:26   ` Alexander Duyck
@ 2014-02-24 19:39     ` Prarit Bhargava
  2014-02-24 19:49       ` Alexander Duyck
  0 siblings, 1 reply; 17+ messages in thread
From: Prarit Bhargava @ 2014-02-24 19:39 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: nhorman, e1000-devel, netdev, Bruce Allan, Jesse Brandeburg,
	agospoda, John Ronciak, David S. Miller



On 02/24/2014 02:26 PM, Alexander Duyck wrote:
> On 02/24/2014 10:51 AM, Prarit Bhargava wrote:
>> The ixgbe driver creates one queue/cpu on the system in order to spread
>> work out on all cpus rather than restricting work to a single cpu.  This
>> model, while efficient, does not take into account the NUMA configuration
>> of the system.
>>
>> This patch introduces ixgbe_num_cpus() which returns
>> the number of online cpus if the adapter's PCI device has no NUMA
>> restrictions, and the number of cpus in the node if the PCI device is
>> allocated to a specific node.
>>
>> Signed-off-by: Prarit Bhargava <prarit@redhat.com>
>> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
>> Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
>> Cc: Bruce Allan <bruce.w.allan@intel.com>
>> Cc: Carolyn Wyborny <carolyn.wyborny@intel.com>
>> Cc: Don Skidmore <donald.c.skidmore@intel.com>
>> Cc: Greg Rose <gregory.v.rose@intel.com>
>> Cc: Alex Duyck <alexander.h.duyck@intel.com>
>> Cc: John Ronciak <john.ronciak@intel.com>
>> Cc: Mitch Williams <mitch.a.williams@intel.com>
>> Cc: "David S. Miller" <davem@davemloft.net>
>> Cc: nhorman@redhat.com
>> Cc: agospoda@redhat.com
>> Cc: e1000-devel@lists.sourceforge.net
>> ---
>>  drivers/net/ethernet/intel/ixgbe/ixgbe.h       |    2 ++
>>  drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c   |   28 +++++++++++++++++++++---
>>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  |    6 ++---
>>  drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c |    5 +++--
>>  4 files changed, 33 insertions(+), 8 deletions(-)
>>
> 
> [...]
> 
>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>> index 18076c4..b68a6e9 100644
>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>> @@ -4953,13 +4953,13 @@ static int ixgbe_sw_init(struct ixgbe_adapter *adapter)
>>  	hw->subsystem_device_id = pdev->subsystem_device;
>>  
>>  	/* Set common capability flags and settings */
>> -	rss = min_t(int, IXGBE_MAX_RSS_INDICES, num_online_cpus());
>> +	rss = min_t(int, IXGBE_MAX_RSS_INDICES, ixgbe_num_cpus(adapter));
>>  	adapter->ring_feature[RING_F_RSS].limit = rss;
>>  	adapter->flags2 |= IXGBE_FLAG2_RSC_CAPABLE;
>>  	adapter->flags2 |= IXGBE_FLAG2_RSC_ENABLED;
>>  	adapter->max_q_vectors = MAX_Q_VECTORS_82599;
>>  	adapter->atr_sample_rate = 20;
>> -	fdir = min_t(int, IXGBE_MAX_FDIR_INDICES, num_online_cpus());
>> +	fdir = min_t(int, IXGBE_MAX_FDIR_INDICES, ixgbe_num_cpus(adapter));
>>  	adapter->ring_feature[RING_F_FDIR].limit = fdir;
>>  	adapter->fdir_pballoc = IXGBE_FDIR_PBALLOC_64K;
>>  #ifdef CONFIG_IXGBE_DCA
> 
> This is the one bit I object to in this patch.  The flow director queue
> count should be equal to the number of online CPUs, or at least as close
> to it as the hardware can get.  Otherwise ATR is completely useless.

I'm reading up on ATR now and I see your point completely.  I will remove this
chunk in V2.  OOC, however, what about my concern with ATR & the location of the
PCI device (on a different root bridge)?  Isn't that a concern with ATR or am I
missing something with the overall scheme of ATR?

P.

> 
> Thanks,
> 
> Alex

------------------------------------------------------------------------------
Flow-based real-time traffic analytics software. Cisco certified tool.
Monitor traffic, SLAs, QoS, Medianet, WAAS etc. with NetFlow Analyzer
Customize your own dashboards, set traffic alerts and generate reports.
Network behavioral analysis & security monitoring. All-in-one tool.
http://pubads.g.doubleclick.net/gampad/clk?id=126839071&iu=/4140/ostg.clktrk
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] ixgbe, make interrupt allocations NUMA aware
  2014-02-24 19:39     ` Prarit Bhargava
@ 2014-02-24 19:49       ` Alexander Duyck
  0 siblings, 0 replies; 17+ messages in thread
From: Alexander Duyck @ 2014-02-24 19:49 UTC (permalink / raw)
  To: Prarit Bhargava
  Cc: netdev, Jeff Kirsher, Jesse Brandeburg, Bruce Allan,
	Carolyn Wyborny, Don Skidmore, Greg Rose, John Ronciak,
	Mitch Williams, David S. Miller, nhorman, agospoda, e1000-devel

On 02/24/2014 11:39 AM, Prarit Bhargava wrote:
> 
> 
> On 02/24/2014 02:26 PM, Alexander Duyck wrote:
>> On 02/24/2014 10:51 AM, Prarit Bhargava wrote:
>>> The ixgbe driver creates one queue/cpu on the system in order to spread
>>> work out on all cpus rather than restricting work to a single cpu.  This
>>> model, while efficient, does not take into account the NUMA configuration
>>> of the system.
>>>
>>> This patch introduces ixgbe_num_cpus() which returns
>>> the number of online cpus if the adapter's PCI device has no NUMA
>>> restrictions, and the number of cpus in the node if the PCI device is
>>> allocated to a specific node.
>>>
>>> Signed-off-by: Prarit Bhargava <prarit@redhat.com>
>>> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
>>> Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
>>> Cc: Bruce Allan <bruce.w.allan@intel.com>
>>> Cc: Carolyn Wyborny <carolyn.wyborny@intel.com>
>>> Cc: Don Skidmore <donald.c.skidmore@intel.com>
>>> Cc: Greg Rose <gregory.v.rose@intel.com>
>>> Cc: Alex Duyck <alexander.h.duyck@intel.com>
>>> Cc: John Ronciak <john.ronciak@intel.com>
>>> Cc: Mitch Williams <mitch.a.williams@intel.com>
>>> Cc: "David S. Miller" <davem@davemloft.net>
>>> Cc: nhorman@redhat.com
>>> Cc: agospoda@redhat.com
>>> Cc: e1000-devel@lists.sourceforge.net
>>> ---
>>>  drivers/net/ethernet/intel/ixgbe/ixgbe.h       |    2 ++
>>>  drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c   |   28 +++++++++++++++++++++---
>>>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  |    6 ++---
>>>  drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c |    5 +++--
>>>  4 files changed, 33 insertions(+), 8 deletions(-)
>>>
>>
>> [...]
>>
>>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>>> index 18076c4..b68a6e9 100644
>>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>>> @@ -4953,13 +4953,13 @@ static int ixgbe_sw_init(struct ixgbe_adapter *adapter)
>>>  	hw->subsystem_device_id = pdev->subsystem_device;
>>>  
>>>  	/* Set common capability flags and settings */
>>> -	rss = min_t(int, IXGBE_MAX_RSS_INDICES, num_online_cpus());
>>> +	rss = min_t(int, IXGBE_MAX_RSS_INDICES, ixgbe_num_cpus(adapter));
>>>  	adapter->ring_feature[RING_F_RSS].limit = rss;
>>>  	adapter->flags2 |= IXGBE_FLAG2_RSC_CAPABLE;
>>>  	adapter->flags2 |= IXGBE_FLAG2_RSC_ENABLED;
>>>  	adapter->max_q_vectors = MAX_Q_VECTORS_82599;
>>>  	adapter->atr_sample_rate = 20;
>>> -	fdir = min_t(int, IXGBE_MAX_FDIR_INDICES, num_online_cpus());
>>> +	fdir = min_t(int, IXGBE_MAX_FDIR_INDICES, ixgbe_num_cpus(adapter));
>>>  	adapter->ring_feature[RING_F_FDIR].limit = fdir;
>>>  	adapter->fdir_pballoc = IXGBE_FDIR_PBALLOC_64K;
>>>  #ifdef CONFIG_IXGBE_DCA
>>
>> This is the one bit I object to in this patch.  The flow director queue
>> count should be equal to the number of online CPUs, or at least as close
>> to it as the hardware can get.  Otherwise ATR is completely useless.
> 
> I'm reading up on ATR now and I see your point completely.  I will remove this
> chunk in V2.  OOC, however, what about my concern with ATR & the location of the
> PCI device (on a different root bridge)?  Isn't that a concern with ATR or am I
> missing something with the overall scheme of ATR?
> 
> P.
> 

The advantage to ATR is that it knows where the application requesting
the packet data resides.  The applications on remote nodes still need
access to the device and the only means of getting to it is through
memory.  If the root complex is on one node and the memory/CPU is on
another it is still cheaper to have the device push the descriptor and
packet to the memory/CPU then to have the CPU have to fetch it from our
local nodes memory and then copy it into the application memory.

RSS, which is the fallback if we don't have ATR, isn't application aware,
so in the case of RSS we probably want to just process all of the
requests locally and hope for the best, since we don't know what node the
data will eventually end up on.

Thanks,

Alex

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/2] ixgbe, fix numa issues
  2014-02-24 19:34   ` Prarit Bhargava
@ 2014-02-24 19:57     ` Alexander Duyck
  2014-02-25  1:06       ` Prarit Bhargava
  0 siblings, 1 reply; 17+ messages in thread
From: Alexander Duyck @ 2014-02-24 19:57 UTC (permalink / raw)
  To: Prarit Bhargava
  Cc: netdev, Jeff Kirsher, Jesse Brandeburg, Bruce Allan,
	Carolyn Wyborny, Don Skidmore, Greg Rose, John Ronciak,
	Mitch Williams, David S. Miller, nhorman, agospoda, e1000-devel

On 02/24/2014 11:34 AM, Prarit Bhargava wrote:
> 
> 
> On 02/24/2014 02:23 PM, Alexander Duyck wrote:
>> On 02/24/2014 10:51 AM, Prarit Bhargava wrote:
>>> The ixgbe driver makes some assumptions about the layout of cpus in the
>>> system which are not always correct given a particular system layout.  The
>>> ixgbe driver allocates one MSI/cpu for queue usage but the code does not take
>>> into account that devices are located on NUMA nodes and that the cpus in a node
>>> are not contiguous.
>>>
>>> These issues were found while doing cpu hotplug testing, however, both of these
>>> issues can lead to obvious system performance issues as they defeat the
>>> purpose of having one MSI processing a queue per cpu.
>>>
>>> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
>>> Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
>>> Cc: Bruce Allan <bruce.w.allan@intel.com>
>>> Cc: Carolyn Wyborny <carolyn.wyborny@intel.com>
>>> Cc: Don Skidmore <donald.c.skidmore@intel.com>
>>> Cc: Greg Rose <gregory.v.rose@intel.com>
>>> Cc: Alex Duyck <alexander.h.duyck@intel.com>
>>> Cc: John Ronciak <john.ronciak@intel.com>
>>> Cc: Mitch Williams <mitch.a.williams@intel.com>
>>> Cc: "David S. Miller" <davem@davemloft.net>
>>> Cc: nhorman@redhat.com
>>> Cc: agospoda@redhat.com
>>> Cc: e1000-devel@lists.sourceforge.net
>>>
>>> Prarit Bhargava (2):
>>>   ixgbe, make interrupt allocations NUMA aware
>>>   ixgbe, don't assume mapping of numa node cpus
>>>
>>>  drivers/net/ethernet/intel/ixgbe/ixgbe.h       |    2 ++
>>>  drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c   |   44 ++++++++++++++++++------
>>>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  |    6 ++--
>>>  drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c |    5 +--
>>>  4 files changed, 42 insertions(+), 15 deletions(-)
>>>
>>
>> This is a step in the right direction but totally defeats the purpose of
>> ATR.  With this change we might as well defeature ATR all together since
> 
> ATR?  First hit on google is Americans for Tax Reform ;)

Application Targeted Routing.  Meaning the ixgbe driver has clues about
which CPUs the applications belonging to a given flow live on.

> 
>> things are now back to RSS w/ NUMA specific allocations which is what we
>> had a couple of years ago.  The code as it is written now would be a
>> better for for igb which doesn't have ATR than ixgbe.
> 
> The big(ger) problem here is that the ixgbe (and other drivers IIUC) do not do a
> good job of handling MSIs, making sure they are launched on the right cpus, and
> cleaning up during cpu hotplug operations.  This code looks like it needs a bit
> of work so your advice is appreciated.
> 

I'm not seeing how this helps with hotplug since the driver doesn't get
notifications of any such event anyway, at least not currently.  Are
there some changes coming in that regard?

>>
>> ATR is supposed to map 1:1 queues to CPUs.  The problem is RSS is also a
>> factor and not especially smart or NUMA aware.  The ideal solution would
>> be to allocate the first N CPUs, where N is the number in the local node
>> for ATR/RSS.  
> 
> Okay ... I'll look into that.
> 
> Then map all other queues as ATR with a 1:1 mapping to CPUs.
>>
> 
> Hmmm ... but what if off-node CPUs cannot reach the device?  Part of the puzzle
> here is that ACPI may be not only telling us that the device is on a specific
> node, but that the device is physically separated on a root bus.
> 
> P.
> 

As far as I know there shouldn't be anything that explicitly prevents us
from reading/writing to any memory in the system.  In the case of the
Intel NUMA configurations, anyway, we can just use the QPI bus if the CPU
isn't connected to the same root complex.

Thanks,

Alex

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/2] ixgbe, fix numa issues
  2014-02-24 19:57     ` Alexander Duyck
@ 2014-02-25  1:06       ` Prarit Bhargava
  2014-02-25 10:21         ` David Laight
  0 siblings, 1 reply; 17+ messages in thread
From: Prarit Bhargava @ 2014-02-25  1:06 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: netdev, Jeff Kirsher, Jesse Brandeburg, Bruce Allan,
	Carolyn Wyborny, Don Skidmore, Greg Rose, John Ronciak,
	Mitch Williams, David S. Miller, nhorman, agospoda, e1000-devel



On 02/24/2014 02:57 PM, Alexander Duyck wrote:

>>
>> The big(ger) problem here is that the ixgbe (and other drivers IIUC) do not do a
>> good job of handling MSIs, making sure they are launched on the right cpus, and
>> cleaning up during cpu hotplug operations.  This code looks like it needs a bit
>> of work so your advice is appreciated.
>>
> 
> I'm not seeing how this helps with hotplug since the driver doesn't get
> notifications of any such event anyway, at least not currently.  Are
> there some changes coming in that regard?

There have been some issues with CPU hotplug.  The problem started a while ago
with several different vendors' systems being unable to perform cpu hotplug down
without running into strange storage issues.  I chased this down and came up
with linux.git commit da6139e49c7cb0f4251265cb5243b8d220adb48d, x86: Add check
for number of available vectors before CPU down [1].

What has caused that check to be necessary is that the ixgbe driver is now
allocating so many interrupts that on large systems on which full sockets are
taken in and out of service, it is possible that there are not enough empty
vectors for all the irqs on a down'd cpu.  IMO what the ixgbe driver is
effectively doing is starving the system of resources.  If I rmmod the ixgbe
driver (and free its irqs of course) I have no problem in taking all cpus
except 1 out of service.

I've started coding a cpu notifier which would take a queue out of service when
a cpu goes down, but I hit these smaller issues, which have to be fixed before
I can finish the notifier.

P.

[1] Please note that even this is incorrect.  I don't properly take into account
node-specific interrupts which can only be moved to specific cpus because of the
affinity settings of MSI.  I'm working on a patch right now to fix that issue.


> 
>>>
>>> ATR is supposed to map 1:1 queues to CPUs.  The problem is RSS is also a
>>> factor and not especially smart or NUMA aware.  The ideal solution would
>>> be to allocate the first N CPUs, where N is the number in the local node
>>> for ATR/RSS.  
>>
>> Okay ... I'll look into that.
>>
>> Then map all other queues as ATR with a 1:1 mapping to CPUs.
>>>
>>
>> Hmmm ... but what if off-node CPUs cannot reach the device?  Part of the puzzle
>> here is that ACPI may be not only telling us that the device is on a specific
>> node, but that the device is physically separated on a root bus.
>>
>> P.
>>
> 
> As far as I know there shouldn't be anything that explicitly prevents us
> from reading/writing to any memory in the system.  In the case of the
> Intel NUMA configurations anyway we just can use the QPI bus if the CPU
> isn't connected to the same root complex.


> 
> Thanks,
> 
> Alex

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/2] ixgbe, fix numa issues
  2014-02-25  1:06       ` Prarit Bhargava
@ 2014-02-25 10:21         ` David Laight
  2014-02-25 11:00           ` Prarit Bhargava
  0 siblings, 1 reply; 17+ messages in thread
From: David Laight @ 2014-02-25 10:21 UTC (permalink / raw)
  To: 'Prarit Bhargava', Alexander Duyck
  Cc: nhorman, e1000-devel, netdev, Bruce Allan, Jesse Brandeburg,
	agospoda, John Ronciak, David S. Miller

From: Prarit Bhargava
...
> What has caused that check to be necessary is that the ixgbe driver is now
> allocating so many interrupts that on large systems which full sockets are taken
> in and out of service, it is possible that there are not enough empty vectors
> for all the irqs on a down'd cpu.  IMO what the ixgbe driver is effectively
> doing is starving the system of resources.  If I rmmod the ixgbe driver (and
> free it's irqs of course) I have no problem in taking all cpus except 1 out of
> service.

If I read that correctly it looks as though ixgbe should be allocating
a number of interrupts on each cpu - for the interrupts it wants to take
on that cpu.

Then taking the cpu out of service would 'just' require that the interrupts
that are tied to that cpu be removed first?

	David





^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/2] ixgbe, fix numa issues
  2014-02-25 10:21         ` David Laight
@ 2014-02-25 11:00           ` Prarit Bhargava
  2014-02-25 15:10             ` Alexander Duyck
  0 siblings, 1 reply; 17+ messages in thread
From: Prarit Bhargava @ 2014-02-25 11:00 UTC (permalink / raw)
  To: David Laight
  Cc: nhorman, e1000-devel, netdev, Bruce Allan, Jesse Brandeburg,
	agospoda, John Ronciak, David S. Miller



On 02/25/2014 05:21 AM, David Laight wrote:
> From: Prarit Bhargava
> ...
>> What has caused that check to be necessary is that the ixgbe driver is now
>> allocating so many interrupts that on large systems which full sockets are taken
>> in and out of service, it is possible that there are not enough empty vectors
>> for all the irqs on a down'd cpu.  IMO what the ixgbe driver is effectively
>> doing is starving the system of resources.  If I rmmod the ixgbe driver (and
>> free it's irqs of course) I have no problem in taking all cpus except 1 out of
>> service.
> 
> If I read that correctly it looks as though ixgbe should be allocating
> a number of interrupts on each cpu - for the interrupts it wants to take
> on that cpu.

Yes, the code currently does it.

> 
> Then taking the cpu out of service would 'just' require that the interrupts
> that are tied to that cpu be removed first?

Yes, that would happen with a cpu notifier (I've already written a simple dummy
one that just printk's when called).  I started to implement a single queue
teardown but hit some of these enumeration issues.  I'd like to fix these first
and then get to the teardown.

P.

> 
> 	David
> 
> 
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/2] ixgbe, fix numa issues
  2014-02-25 11:00           ` Prarit Bhargava
@ 2014-02-25 15:10             ` Alexander Duyck
  2014-02-25 15:13               ` Prarit Bhargava
  0 siblings, 1 reply; 17+ messages in thread
From: Alexander Duyck @ 2014-02-25 15:10 UTC (permalink / raw)
  To: Prarit Bhargava, David Laight
  Cc: netdev, Jeff Kirsher, Jesse Brandeburg, Bruce Allan,
	Carolyn Wyborny, Don Skidmore, Greg Rose, John Ronciak,
	Mitch Williams, David S. Miller, nhorman, agospoda, e1000-devel

On 02/25/2014 03:00 AM, Prarit Bhargava wrote:
>
> On 02/25/2014 05:21 AM, David Laight wrote:
>> From: Prarit Bhargava
>> ...
>>> What has caused that check to be necessary is that the ixgbe driver is now
>>> allocating so many interrupts that on large systems which full sockets are taken
>>> in and out of service, it is possible that there are not enough empty vectors
>>> for all the irqs on a down'd cpu.  IMO what the ixgbe driver is effectively
>>> doing is starving the system of resources.  If I rmmod the ixgbe driver (and
>>> free it's irqs of course) I have no problem in taking all cpus except 1 out of
>>> service.
>> If I read that correctly it looks as though ixgbe should be allocating
>> a number of interrupts on each cpu - for the interrupts it wants to take
>> on that cpu.
> Yes, the code currently does it.
>
>> Then taking the cpu out of service would 'just' require that the interrupts
>> that are tied to that cpu be removed first?
> Yes, that would happen with a cpu notifier (I've already written a simple dummy
> one that just printk's when called).  I started to implement a single queue
> teardown but hit some of these enumeration issues.  I'd like to fix these first
> and then get to the teardown.
>
> P.
>
>

What should happen if you attempt to remove the CPU the root complex is
attached to?  Will that trigger a remove via the PCIe complex being removed?

Thanks,

Alex

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/2] ixgbe, fix numa issues
  2014-02-25 15:10             ` Alexander Duyck
@ 2014-02-25 15:13               ` Prarit Bhargava
  0 siblings, 0 replies; 17+ messages in thread
From: Prarit Bhargava @ 2014-02-25 15:13 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Laight, netdev, Jeff Kirsher, Jesse Brandeburg,
	Bruce Allan, Carolyn Wyborny, Don Skidmore, Greg Rose,
	John Ronciak, Mitch Williams, David S. Miller, nhorman, agospoda,
	e1000-devel



On 02/25/2014 10:10 AM, Alexander Duyck wrote:
> On 02/25/2014 03:00 AM, Prarit Bhargava wrote:
>>
>> On 02/25/2014 05:21 AM, David Laight wrote:
>>> From: Prarit Bhargava
>>> ...
>>>> What has caused that check to be necessary is that the ixgbe driver is now
>>>> allocating so many interrupts that on large systems which full sockets are taken
>>>> in and out of service, it is possible that there are not enough empty vectors
>>>> for all the irqs on a down'd cpu.  IMO what the ixgbe driver is effectively
>>>> doing is starving the system of resources.  If I rmmod the ixgbe driver (and
>>>> free it's irqs of course) I have no problem in taking all cpus except 1 out of
>>>> service.
>>> If I read that correctly it looks as though ixgbe should be allocating
>>> a number of interrupts on each cpu - for the interrupts it wants to take
>>> on that cpu.
>> Yes, the code currently does it.
>>
>>> Then taking the cpu out of service would 'just' require that the interrupts
>>> that are tied to that cpu be removed first?
>> Yes, that would happen with a cpu notifier (I've already written a simple dummy
>> one that just printk's when called).  I started to implement a single queue
>> teardown but hit some of these enumeration issues.  I'd like to fix these first
>> and then get to the teardown.
>>
>> P.
>>
>>
> 
> What should happen if you attempt to remove the CPU the root complex is
> attached to?  Will that trigger a remove via the PCIe complex being removed?

I haven't tried that yet :), but my understanding is that the remove will be
triggered.

P.

> 
> Thanks,
> 
> Alex

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: ixgbe, don't assume mapping of numa node cpus
  2014-02-24 18:51 ` [PATCH 2/2] ixgbe, don't assume mapping of numa node cpus Prarit Bhargava
  2014-02-24 19:39   ` Alexander Duyck
@ 2014-02-25 17:27   ` Amir Vadai
  2014-02-25 17:43     ` Prarit Bhargava
  1 sibling, 1 reply; 17+ messages in thread
From: Amir Vadai @ 2014-02-25 17:27 UTC (permalink / raw)
  To: Prarit Bhargava
  Cc: netdev, Jeff Kirsher, Jesse Brandeburg, Bruce Allan,
	Carolyn Wyborny, Don Skidmore, Greg Rose, Alex Duyck,
	John Ronciak, Mitch Williams, David S. Miller, nhorman, agospoda,
	e1000-devel, idos, Yevgeny Petrilin, Or Gerlitz

On 24/02/14 13:51 -0500, Prarit Bhargava wrote:
> The ixgbe driver assumes that the cpus on a node are mapped 1:1 with the
> indexes into arrays.  This is not the case as nodes can contain, for
> example, cpus 0-7, 33-40.
> 
> This patch fixes this problem.
> 
> Signed-off-by: Prarit Bhargava <prarit@redhat.com>
> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
> Cc: Bruce Allan <bruce.w.allan@intel.com>
> Cc: Carolyn Wyborny <carolyn.wyborny@intel.com>
> Cc: Don Skidmore <donald.c.skidmore@intel.com>
> Cc: Greg Rose <gregory.v.rose@intel.com>
> Cc: Alex Duyck <alexander.h.duyck@intel.com>
> Cc: John Ronciak <john.ronciak@intel.com>
> Cc: Mitch Williams <mitch.a.williams@intel.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: nhorman@redhat.com
> Cc: agospoda@redhat.com
> Cc: e1000-devel@lists.sourceforge.net
> ---

Hi,

I'm just about to send a patch tomorrow that adds a helper function to
get an affinity_hint suggestion by numa_node and ring index.
If you'd like, you can use it here too.

We're still doing internal review on it before sending to the mailing
list, but this will be the declaration of the function:
/*
 * netif_set_rx_queue_affinity_hint - set affinity hint of rx queue
 * @rxq: index of rx queue
 * @numa_node: preferred numa_node
 * @affinity_mask: the relevant cpu bit is set according to the policy
 *
 * This function sets the affinity_mask according to a numa aware policy.
 * affinity_mask could be used as an affinity hint for the IRQ related to this
 * rx queue.
 * The policy is to spread rx queues across cores - local cores first.
 *
 * Returns 0 on success, or a negative error code.
 */
int netif_set_rx_queue_affinity_hint(int rxq, int numa_node,
                                     cpumask_var_t affinity_mask);



Amir

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: ixgbe, don't assume mapping of numa node cpus
  2014-02-25 17:27   ` Amir Vadai
@ 2014-02-25 17:43     ` Prarit Bhargava
  0 siblings, 0 replies; 17+ messages in thread
From: Prarit Bhargava @ 2014-02-25 17:43 UTC (permalink / raw)
  To: Amir Vadai
  Cc: nhorman, idos, Yevgeny Petrilin, e1000-devel, netdev,
	Bruce Allan, Jesse Brandeburg, agospoda, John Ronciak,
	David S. Miller, Or Gerlitz



On 02/25/2014 12:27 PM, Amir Vadai wrote:
> On 24/02/14 13:51 -0500, Prarit Bhargava wrote:
>> The ixgbe driver assumes that the cpus on a node are mapped 1:1 with the
>> indexes into arrays.  This is not the case as nodes can contain, for
>> example, cpus 0-7, 33-40.
>>
>> This patch fixes this problem.
>>
>> Signed-off-by: Prarit Bhargava <prarit@redhat.com>
>> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
>> Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
>> Cc: Bruce Allan <bruce.w.allan@intel.com>
>> Cc: Carolyn Wyborny <carolyn.wyborny@intel.com>
>> Cc: Don Skidmore <donald.c.skidmore@intel.com>
>> Cc: Greg Rose <gregory.v.rose@intel.com>
>> Cc: Alex Duyck <alexander.h.duyck@intel.com>
>> Cc: John Ronciak <john.ronciak@intel.com>
>> Cc: Mitch Williams <mitch.a.williams@intel.com>
>> Cc: "David S. Miller" <davem@davemloft.net>
>> Cc: nhorman@redhat.com
>> Cc: agospoda@redhat.com
>> Cc: e1000-devel@lists.sourceforge.net
>> ---
> 
> Hi,
> 
> I'm just about to send tomorrow a patch to add an helper function to
> get affinity_hint suggestion by numa_node and ring index.
> If you'd like you will be able to use it too here.
> 
> We're still doing internal review on it before sending to the mailing
> list, but this will be the declaration of the function:
> /*
>  * netif_set_rx_queue_affinity_hint - set affinity hint of rx queue
>  * @rxq: index of rx queue
>  * @numa_node: preferred numa_node
>  * @affinity_mask: the relevant cpu bit is set according to the policy
>  *
>  * This function sets the affinity_mask according to a numa aware policy.
>  * affinity_mask could be used as an affinity hint for the IRQ related to this
>  * rx queue.
>  * The policy is to spread rx queues across cores - local cores first.
>  *
>  * Returns 0 on success, or a negative error code.
>  */
> int netif_set_rx_queue_affinity_hint(int rxq, int numa_node,
>                                      cpumask_var_t affinity_mask);

I'm going to wait for this patch then.  Amir, please cc me.

Thanks for the info,

P.

> 
> 
> 
> Amir


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2014-02-25 17:43 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-24 18:51 [PATCH 0/2] ixgbe, fix numa issues Prarit Bhargava
2014-02-24 18:51 ` [PATCH 1/2] ixgbe, make interrupt allocations NUMA aware Prarit Bhargava
2014-02-24 19:26   ` Alexander Duyck
2014-02-24 19:39     ` Prarit Bhargava
2014-02-24 19:49       ` Alexander Duyck
2014-02-24 18:51 ` [PATCH 2/2] ixgbe, don't assume mapping of numa node cpus Prarit Bhargava
2014-02-24 19:39   ` Alexander Duyck
2014-02-25 17:27   ` Amir Vadai
2014-02-25 17:43     ` Prarit Bhargava
2014-02-24 19:23 ` [PATCH 0/2] ixgbe, fix numa issues Alexander Duyck
2014-02-24 19:34   ` Prarit Bhargava
2014-02-24 19:57     ` Alexander Duyck
2014-02-25  1:06       ` Prarit Bhargava
2014-02-25 10:21         ` David Laight
2014-02-25 11:00           ` Prarit Bhargava
2014-02-25 15:10             ` Alexander Duyck
2014-02-25 15:13               ` Prarit Bhargava
