* [PATCH net-next 0/2] sfc: optimize RXQs count and affinities
@ 2022-01-28 15:19 Íñigo Huguet
  2022-01-28 15:19 ` [PATCH net-next 1/2] sfc: default config to 1 channel/core in local NUMA node only Íñigo Huguet
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Íñigo Huguet @ 2022-01-28 15:19 UTC (permalink / raw)
  To: ecree.xilinx, habetsm.xilinx; +Cc: davem, kuba, netdev, Íñigo Huguet

In the sfc driver, one RX queue per physical core was allocated by
default. Later on, IRQ affinities were set by spreading the IRQs across
all CPUs of the local NUMA node.

However, that default results in a far from optimal configuration on
many modern systems. Specifically, on systems with hyperthreading and
2 NUMA nodes, affinities are set in a way that IRQs are handled by all
logical cores of one single NUMA node. Handling IRQs on both
hyperthreading siblings of a core brings no benefit, and setting
affinities to one queue per physical core of the whole system is not a
good idea either, because there is a performance penalty for moving
data across nodes (I was able to verify this with some XDP tests using
pktgen).

These patches reduce the default number of channels to one per physical
core in the local NUMA node, and then set IRQ affinities to CPUs in the
local NUMA node only. This way we save hardware resources, since
channels are a limited resource, and we also leave more room for XDP_TX
channels without hitting the driver's limit of 32 channels per
interface.

Running performance tests using iperf with an SFC9140 device showed no
performance penalty from reducing the number of channels.

RX XDP tests showed that performance can drop to less than half if the
IRQ is handled by a CPU in a different NUMA node, which doesn't happen
with the new defaults from these patches.

Íñigo Huguet (2):
  sfc: default config to 1 channel/core in local NUMA node only
  sfc: set affinity hints in local NUMA node only

 drivers/net/ethernet/sfc/efx_channels.c | 62 +++++++++++++++++--------
 1 file changed, 43 insertions(+), 19 deletions(-)

-- 
2.31.1



* [PATCH net-next 1/2] sfc: default config to 1 channel/core in local NUMA node only
  2022-01-28 15:19 [PATCH net-next 0/2] sfc: optimize RXQs count and affinities Íñigo Huguet
@ 2022-01-28 15:19 ` Íñigo Huguet
  2022-01-28 22:27   ` Jakub Kicinski
  2022-01-28 15:19 ` [PATCH net-next 2/2] sfc: set affinity hints " Íñigo Huguet
  2022-02-16  9:41 ` [PATCH net-next resend 0/2] sfc: optimize RXQs count and affinities Íñigo Huguet
  2 siblings, 1 reply; 18+ messages in thread
From: Íñigo Huguet @ 2022-01-28 15:19 UTC (permalink / raw)
  To: ecree.xilinx, habetsm.xilinx; +Cc: davem, kuba, netdev, Íñigo Huguet

Handling channels from CPUs in a different NUMA node can penalize
performance, so it is better to configure only one channel per core in
the same NUMA node as the NIC, rather than one per each core in the
system.

Fall back to all other online cores if there are no online CPUs in the
local NUMA node.

Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
---
 drivers/net/ethernet/sfc/efx_channels.c | 50 ++++++++++++++++---------
 1 file changed, 33 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/sfc/efx_channels.c b/drivers/net/ethernet/sfc/efx_channels.c
index ead550ae2709..ec6c2f231e73 100644
--- a/drivers/net/ethernet/sfc/efx_channels.c
+++ b/drivers/net/ethernet/sfc/efx_channels.c
@@ -78,31 +78,48 @@ static const struct efx_channel_type efx_default_channel_type = {
  * INTERRUPTS
  *************/
 
-static unsigned int efx_wanted_parallelism(struct efx_nic *efx)
+static unsigned int count_online_cores(struct efx_nic *efx, bool local_node)
 {
-	cpumask_var_t thread_mask;
+	cpumask_var_t filter_mask;
 	unsigned int count;
 	int cpu;
+
+	if (unlikely(!zalloc_cpumask_var(&filter_mask, GFP_KERNEL))) {
+		netif_warn(efx, probe, efx->net_dev,
+			   "RSS disabled due to allocation failure\n");
+		return 1;
+	}
+
+	cpumask_copy(filter_mask, cpu_online_mask);
+	if (local_node) {
+		int numa_node = pcibus_to_node(efx->pci_dev->bus);
+
+		cpumask_and(filter_mask, filter_mask, cpumask_of_node(numa_node));
+	}
+
+	count = 0;
+	for_each_cpu(cpu, filter_mask) {
+		++count;
+		cpumask_andnot(filter_mask, filter_mask, topology_sibling_cpumask(cpu));
+	}
+
+	free_cpumask_var(filter_mask);
+
+	return count;
+}
+
+static unsigned int efx_wanted_parallelism(struct efx_nic *efx)
+{
+	unsigned int count;
 
 	if (rss_cpus) {
 		count = rss_cpus;
 	} else {
-		if (unlikely(!zalloc_cpumask_var(&thread_mask, GFP_KERNEL))) {
-			netif_warn(efx, probe, efx->net_dev,
-				   "RSS disabled due to allocation failure\n");
-			return 1;
-		}
-
-		count = 0;
-		for_each_online_cpu(cpu) {
-			if (!cpumask_test_cpu(cpu, thread_mask)) {
-				++count;
-				cpumask_or(thread_mask, thread_mask,
-					   topology_sibling_cpumask(cpu));
-			}
-		}
+		count = count_online_cores(efx, true);
 
-		free_cpumask_var(thread_mask);
+		/* If no online CPUs in local node, fallback to any online CPUs */
+		if (count == 0)
+			count = count_online_cores(efx, false);
 	}
 
 	if (count > EFX_MAX_RX_QUEUES) {
-- 
2.31.1



* [PATCH net-next 2/2] sfc: set affinity hints in local NUMA node only
  2022-01-28 15:19 [PATCH net-next 0/2] sfc: optimize RXQs count and affinities Íñigo Huguet
  2022-01-28 15:19 ` [PATCH net-next 1/2] sfc: default config to 1 channel/core in local NUMA node only Íñigo Huguet
@ 2022-01-28 15:19 ` Íñigo Huguet
  2022-02-16  9:41 ` [PATCH net-next resend 0/2] sfc: optimize RXQs count and affinities Íñigo Huguet
  2 siblings, 0 replies; 18+ messages in thread
From: Íñigo Huguet @ 2022-01-28 15:19 UTC (permalink / raw)
  To: ecree.xilinx, habetsm.xilinx; +Cc: davem, kuba, netdev, Íñigo Huguet

Affinity hints were being set to CPUs in the local NUMA node first, and
then to other CPUs. This was creating 2 unintended issues:
1. Channels intended to be assigned each to a different physical core
   were assigned to hyperthreading siblings because they are in the
   same NUMA node.
   Since the previous patch in this series, this no longer happens
   with the default rss_cpus modparam because fewer channels are
   created.
2. XDP channels could be assigned to CPUs in different NUMA nodes,
   decreasing performance too much (to less than half in some of my
   tests).

This patch sets the affinity hints so that the channels are spread
only across the local NUMA node's CPUs. A fallback for the case where
no CPU in the local NUMA node is online has been added too.

Example of CPUs being assigned in a non-optimal way before this and
the previous patch (note: in this system, xdp-8 to xdp-15 are created
because num_possible_cpus == 64 but num_present_cpus == 32, so they
are never used):

$ lscpu | grep -i numa
NUMA node(s):                    2
NUMA node0 CPU(s):               0-7,16-23
NUMA node1 CPU(s):               8-15,24-31

$ grep -H . /proc/irq/*/0000:07:00.0*/../smp_affinity_list
/proc/irq/141/0000:07:00.0-0/../smp_affinity_list:0
/proc/irq/142/0000:07:00.0-1/../smp_affinity_list:1
/proc/irq/143/0000:07:00.0-2/../smp_affinity_list:2
/proc/irq/144/0000:07:00.0-3/../smp_affinity_list:3
/proc/irq/145/0000:07:00.0-4/../smp_affinity_list:4
/proc/irq/146/0000:07:00.0-5/../smp_affinity_list:5
/proc/irq/147/0000:07:00.0-6/../smp_affinity_list:6
/proc/irq/148/0000:07:00.0-7/../smp_affinity_list:7
/proc/irq/149/0000:07:00.0-8/../smp_affinity_list:16
/proc/irq/150/0000:07:00.0-9/../smp_affinity_list:17
/proc/irq/151/0000:07:00.0-10/../smp_affinity_list:18
/proc/irq/152/0000:07:00.0-11/../smp_affinity_list:19
/proc/irq/153/0000:07:00.0-12/../smp_affinity_list:20
/proc/irq/154/0000:07:00.0-13/../smp_affinity_list:21
/proc/irq/155/0000:07:00.0-14/../smp_affinity_list:22
/proc/irq/156/0000:07:00.0-15/../smp_affinity_list:23
/proc/irq/157/0000:07:00.0-xdp-0/../smp_affinity_list:8
/proc/irq/158/0000:07:00.0-xdp-1/../smp_affinity_list:9
/proc/irq/159/0000:07:00.0-xdp-2/../smp_affinity_list:10
/proc/irq/160/0000:07:00.0-xdp-3/../smp_affinity_list:11
/proc/irq/161/0000:07:00.0-xdp-4/../smp_affinity_list:12
/proc/irq/162/0000:07:00.0-xdp-5/../smp_affinity_list:13
/proc/irq/163/0000:07:00.0-xdp-6/../smp_affinity_list:14
/proc/irq/164/0000:07:00.0-xdp-7/../smp_affinity_list:15
/proc/irq/165/0000:07:00.0-xdp-8/../smp_affinity_list:24
/proc/irq/166/0000:07:00.0-xdp-9/../smp_affinity_list:25
/proc/irq/167/0000:07:00.0-xdp-10/../smp_affinity_list:26
/proc/irq/168/0000:07:00.0-xdp-11/../smp_affinity_list:27
/proc/irq/169/0000:07:00.0-xdp-12/../smp_affinity_list:28
/proc/irq/170/0000:07:00.0-xdp-13/../smp_affinity_list:29
/proc/irq/171/0000:07:00.0-xdp-14/../smp_affinity_list:30
/proc/irq/172/0000:07:00.0-xdp-15/../smp_affinity_list:31

CPU assignments after this and the previous patch: normal channels are
created only one per physical core in the local NUMA node, and
affinities are set only to the local NUMA node:

$ grep -H . /proc/irq/*/0000:07:00.0*/../smp_affinity_list
/proc/irq/116/0000:07:00.0-0/../smp_affinity_list:0
/proc/irq/117/0000:07:00.0-1/../smp_affinity_list:1
/proc/irq/118/0000:07:00.0-2/../smp_affinity_list:2
/proc/irq/119/0000:07:00.0-3/../smp_affinity_list:3
/proc/irq/120/0000:07:00.0-4/../smp_affinity_list:4
/proc/irq/121/0000:07:00.0-5/../smp_affinity_list:5
/proc/irq/122/0000:07:00.0-6/../smp_affinity_list:6
/proc/irq/123/0000:07:00.0-7/../smp_affinity_list:7
/proc/irq/124/0000:07:00.0-xdp-0/../smp_affinity_list:16
/proc/irq/125/0000:07:00.0-xdp-1/../smp_affinity_list:17
/proc/irq/126/0000:07:00.0-xdp-2/../smp_affinity_list:18
/proc/irq/127/0000:07:00.0-xdp-3/../smp_affinity_list:19
/proc/irq/128/0000:07:00.0-xdp-4/../smp_affinity_list:20
/proc/irq/129/0000:07:00.0-xdp-5/../smp_affinity_list:21
/proc/irq/130/0000:07:00.0-xdp-6/../smp_affinity_list:22
/proc/irq/131/0000:07:00.0-xdp-7/../smp_affinity_list:23
/proc/irq/132/0000:07:00.0-xdp-8/../smp_affinity_list:0
/proc/irq/133/0000:07:00.0-xdp-9/../smp_affinity_list:1
/proc/irq/134/0000:07:00.0-xdp-10/../smp_affinity_list:2
/proc/irq/135/0000:07:00.0-xdp-11/../smp_affinity_list:3
/proc/irq/136/0000:07:00.0-xdp-12/../smp_affinity_list:4
/proc/irq/137/0000:07:00.0-xdp-13/../smp_affinity_list:5
/proc/irq/138/0000:07:00.0-xdp-14/../smp_affinity_list:6
/proc/irq/139/0000:07:00.0-xdp-15/../smp_affinity_list:7

Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
---
 drivers/net/ethernet/sfc/efx_channels.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/sfc/efx_channels.c b/drivers/net/ethernet/sfc/efx_channels.c
index ec6c2f231e73..ef3168fbb5a6 100644
--- a/drivers/net/ethernet/sfc/efx_channels.c
+++ b/drivers/net/ethernet/sfc/efx_channels.c
@@ -387,10 +387,18 @@ void efx_set_interrupt_affinity(struct efx_nic *efx)
 {
 	struct efx_channel *channel;
 	unsigned int cpu;
+	int numa_node = pcibus_to_node(efx->pci_dev->bus);
+	const struct cpumask *numa_mask = cpumask_of_node(numa_node);
 
+	/* If no online CPUs in local node, fallback to any online CPU */
+	if (cpumask_first_and(cpu_online_mask, numa_mask) >= nr_cpu_ids)
+		numa_mask = cpu_online_mask;
+
+	cpu = -1;
 	efx_for_each_channel(channel, efx) {
-		cpu = cpumask_local_spread(channel->channel,
-					   pcibus_to_node(efx->pci_dev->bus));
+		cpu = cpumask_next_and(cpu, cpu_online_mask, numa_mask);
+		if (cpu >= nr_cpu_ids)
+			cpu = cpumask_first_and(cpu_online_mask, numa_mask);
 		irq_set_affinity_hint(channel->irq, cpumask_of(cpu));
 	}
 }
-- 
2.31.1



* Re: [PATCH net-next 1/2] sfc: default config to 1 channel/core in local NUMA node only
  2022-01-28 15:19 ` [PATCH net-next 1/2] sfc: default config to 1 channel/core in local NUMA node only Íñigo Huguet
@ 2022-01-28 22:27   ` Jakub Kicinski
  2022-02-07 15:03     ` Íñigo Huguet
  0 siblings, 1 reply; 18+ messages in thread
From: Jakub Kicinski @ 2022-01-28 22:27 UTC (permalink / raw)
  To: Íñigo Huguet; +Cc: ecree.xilinx, habetsm.xilinx, davem, netdev

On Fri, 28 Jan 2022 16:19:21 +0100 Íñigo Huguet wrote:
> Handling channels from CPUs in different NUMA node can penalize
> performance, so better configure only one channel per core in the same
> NUMA node than the NIC, and not per each core in the system.
> 
> Fallback to all other online cores if there are not online CPUs in local
> NUMA node.

I think we should make netif_get_num_default_rss_queues() do a similar
thing. Instead of min(8, num_online_cpus()) we should default to
num_cores / 2 (that's physical cores, not threads). From what I've seen
this appears to strike a good balance between wasting resources on
pointless queues per hyperthread, and scaling up for CPUs which have
many wimpy cores.
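
For reference, a rough sketch of that kind of heuristic, counting one
CPU per group of hyperthread siblings the same way the sfc patch above
does and then halving it. The function name and the exact clamping are
just illustrative here, not the actual implementation:

static unsigned int default_rss_queues_sketch(void)
{
	cpumask_var_t cpus;
	unsigned int count = 0;
	int cpu;

	if (unlikely(!zalloc_cpumask_var(&cpus, GFP_KERNEL)))
		return 1;

	/* count physical cores: one CPU per group of HT siblings */
	cpumask_copy(cpus, cpu_online_mask);
	for_each_cpu(cpu, cpus) {
		++count;
		cpumask_andnot(cpus, cpus, topology_sibling_cpumask(cpu));
	}
	free_cpumask_var(cpus);

	/* physical cores / 2, but always at least one queue */
	return max(count / 2, 1U);
}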


* Re: [PATCH net-next 1/2] sfc: default config to 1 channel/core in local NUMA node only
  2022-01-28 22:27   ` Jakub Kicinski
@ 2022-02-07 15:03     ` Íñigo Huguet
  2022-02-07 16:53       ` Jakub Kicinski
  0 siblings, 1 reply; 18+ messages in thread
From: Íñigo Huguet @ 2022-02-07 15:03 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Edward Cree, habetsm.xilinx, David S. Miller, netdev

On Fri, Jan 28, 2022 at 11:27 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 28 Jan 2022 16:19:21 +0100 Íñigo Huguet wrote:
> > Handling channels from CPUs in different NUMA node can penalize
> > performance, so better configure only one channel per core in the same
> > NUMA node than the NIC, and not per each core in the system.
> >
> > Fallback to all other online cores if there are not online CPUs in local
> > NUMA node.
>
> I think we should make netif_get_num_default_rss_queues() do a similar
> thing. Instead of min(8, num_online_cpus()) we should default to
> num_cores / 2 (that's physical cores, not threads). From what I've seen
> this appears to strike a good balance between wasting resources on
> pointless queues per hyperthread, and scaling up for CPUs which have
> many wimpy cores.
>

I have a few busy weeks coming, but I can do this after that.

With num_cores / 2, do you divide by 2 because you're assuming 2 NUMA
nodes, or is it just the plain number 2?


-- 
Íñigo Huguet



* Re: [PATCH net-next 1/2] sfc: default config to 1 channel/core in local NUMA node only
  2022-02-07 15:03     ` Íñigo Huguet
@ 2022-02-07 16:53       ` Jakub Kicinski
  2022-02-10  9:35         ` Íñigo Huguet
  0 siblings, 1 reply; 18+ messages in thread
From: Jakub Kicinski @ 2022-02-07 16:53 UTC (permalink / raw)
  To: Íñigo Huguet
  Cc: Edward Cree, habetsm.xilinx, David S. Miller, netdev

On Mon, 7 Feb 2022 16:03:01 +0100 Íñigo Huguet wrote:
> On Fri, Jan 28, 2022 at 11:27 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > On Fri, 28 Jan 2022 16:19:21 +0100 Íñigo Huguet wrote:  
> > > Handling channels from CPUs in different NUMA node can penalize
> > > performance, so better configure only one channel per core in the same
> > > NUMA node than the NIC, and not per each core in the system.
> > >
> > > Fallback to all other online cores if there are not online CPUs in local
> > > NUMA node.  
> >
> > I think we should make netif_get_num_default_rss_queues() do a similar
> > thing. Instead of min(8, num_online_cpus()) we should default to
> > num_cores / 2 (that's physical cores, not threads). From what I've seen
> > this appears to strike a good balance between wasting resources on
> > pointless queues per hyperthread, and scaling up for CPUs which have
> > many wimpy cores.
> >  
> 
> I have a few busy weeks coming, but I can do this after that.
> 
> With num_cores / 2 you divide by 2 because you're assuming 2 NUMA
> nodes, or just the plain number 2?

Plain number 2, it's just a heuristic which seems to work okay.
One queue per core (IOW without the /2) is still way too many queues
for normal DC workloads.


* Re: [PATCH net-next 1/2] sfc: default config to 1 channel/core in local NUMA node only
  2022-02-07 16:53       ` Jakub Kicinski
@ 2022-02-10  9:35         ` Íñigo Huguet
  2022-02-10 16:22           ` Jakub Kicinski
  0 siblings, 1 reply; 18+ messages in thread
From: Íñigo Huguet @ 2022-02-10  9:35 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Edward Cree, habetsm.xilinx, David S. Miller, netdev

On Mon, Feb 7, 2022 at 5:53 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 7 Feb 2022 16:03:01 +0100 Íñigo Huguet wrote:
> > On Fri, Jan 28, 2022 at 11:27 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > On Fri, 28 Jan 2022 16:19:21 +0100 Íñigo Huguet wrote:
> > > > Handling channels from CPUs in different NUMA node can penalize
> > > > performance, so better configure only one channel per core in the same
> > > > NUMA node than the NIC, and not per each core in the system.
> > > >
> > > > Fallback to all other online cores if there are not online CPUs in local
> > > > NUMA node.
> > >
> > > I think we should make netif_get_num_default_rss_queues() do a similar
> > > thing. Instead of min(8, num_online_cpus()) we should default to
> > > num_cores / 2 (that's physical cores, not threads). From what I've seen
> > > this appears to strike a good balance between wasting resources on
> > > pointless queues per hyperthread, and scaling up for CPUs which have
> > > many wimpy cores.
> > >
> >
> > I have a few busy weeks coming, but I can do this after that.
> >
> > With num_cores / 2 you divide by 2 because you're assuming 2 NUMA
> > nodes, or just the plain number 2?
>
> Plain number 2, it's just a heuristic which seems to work okay.
> One queue per core (IOW without the /2) is still way too many queues
> for normal DC workloads.
>

Maybe it's because these are quite special workloads, but I have
encountered problems related to queues in different NUMA nodes in 2
cases: XDP performance dropping to almost half with more RX queues
because they are in a different node (the example in my patches), and
a customer losing UDP packets, which was solved by reducing the number
of RX queues so that all of them are in the same node.

-- 
Íñigo Huguet



* Re: [PATCH net-next 1/2] sfc: default config to 1 channel/core in local NUMA node only
  2022-02-10  9:35         ` Íñigo Huguet
@ 2022-02-10 16:22           ` Jakub Kicinski
  2022-02-11 11:05             ` Íñigo Huguet
  0 siblings, 1 reply; 18+ messages in thread
From: Jakub Kicinski @ 2022-02-10 16:22 UTC (permalink / raw)
  To: Íñigo Huguet
  Cc: Edward Cree, habetsm.xilinx, David S. Miller, netdev

On Thu, 10 Feb 2022 10:35:53 +0100 Íñigo Huguet wrote:
> On Mon, Feb 7, 2022 at 5:53 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > On Mon, 7 Feb 2022 16:03:01 +0100 Íñigo Huguet wrote:  
> > > I have a few busy weeks coming, but I can do this after that.
> > >
> > > With num_cores / 2 you divide by 2 because you're assuming 2 NUMA
> > > nodes, or just the plain number 2?  
> >
> > Plain number 2, it's just a heuristic which seems to work okay.
> > One queue per core (IOW without the /2) is still way too many queues
> > for normal DC workloads.
> 
> Maybe it's because of being quite special workloads, but I have
> encountered problems related to queues in different NUMA nodes in 2
> cases: XDP performance being almost half with more RX queues because
> of being in different node (the example in my patches) and a customer
> losing UDP packets which was solved reducing the number of RX queues
> so all them are in the same node.

Right, no argument, NUMA tuning will still be necessary. 
I'm primarily concerned about providing a sensible default
for workloads which are not network heavy and therefore
nobody spends time tuning their queue configuration.
Any network-heavy workload will likely always benefit from tuning.

The status quo is that our current default returned by
netif_get_num_default_rss_queues() is 8 which is inadequate 
for modern servers, and people end up implementing their own
logic in the drivers.

I'm fine with sfc doing its own thing (at least for now) and 
therefore your patches as they are, but for new drivers I want
to be able to recommend netif_get_num_default_rss_queues() with
a clear conscience.

Does that make sense?


* Re: [PATCH net-next 1/2] sfc: default config to 1 channel/core in local NUMA node only
  2022-02-10 16:22           ` Jakub Kicinski
@ 2022-02-11 11:05             ` Íñigo Huguet
  2022-02-11 19:01               ` Jakub Kicinski
  0 siblings, 1 reply; 18+ messages in thread
From: Íñigo Huguet @ 2022-02-11 11:05 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Edward Cree, habetsm.xilinx, David S. Miller, netdev

On Thu, Feb 10, 2022 at 5:22 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 10 Feb 2022 10:35:53 +0100 Íñigo Huguet wrote:
> > On Mon, Feb 7, 2022 at 5:53 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > On Mon, 7 Feb 2022 16:03:01 +0100 Íñigo Huguet wrote:
> > > > I have a few busy weeks coming, but I can do this after that.
> > > >
> > > > With num_cores / 2 you divide by 2 because you're assuming 2 NUMA
> > > > nodes, or just the plain number 2?
> > >
> > > Plain number 2, it's just a heuristic which seems to work okay.
> > > One queue per core (IOW without the /2) is still way too many queues
> > > for normal DC workloads.
> >
> > Maybe it's because of being quite special workloads, but I have
> > encountered problems related to queues in different NUMA nodes in 2
> > cases: XDP performance being almost half with more RX queues because
> > of being in different node (the example in my patches) and a customer
> > losing UDP packets which was solved reducing the number of RX queues
> > so all them are in the same node.
>
> Right, no argument, NUMA tuning will still be necessary.
> I'm primarily concerned about providing a sensible default
> for workloads which are not network heavy and therefore
> nobody spends time tuning their queue configuration.
> Any network-heavy workload will likely always benefit from tuning.
>
> The status quo is that our current default returned by
> netif_get_num_default_rss_queues() is 8 which is inadequate
> for modern servers, and people end up implementing their own
> logic in the drivers.
>
> I'm fine with sfc doing its own thing (at least for now) and
> therefore your patches as they are, but for new drivers I want
> to be able to recommend netif_get_num_default_rss_queues() with
> a clear conscience.
>
> Does that make sense?
>

Totally. My comment was intended to be more of a question about why we
should or shouldn't consider NUMA nodes in
netif_get_num_default_rss_queues. But now I understand your point
better.

However, would something like this make sense for
netif_get_num_default_rss_queues, or would it be a bit overkill?
If the system has more than one NUMA node, allocate one queue per
physical core in the local NUMA node;
else, allocate physical cores / 2.
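
In sfc terms, and just as a sketch reusing count_online_cores() from
patch 1 (ignoring the no-online-local-CPUs fallback that patch 1
already handles), that would look roughly like:

	if (nr_node_ids > 1)
		/* one queue per physical core in the local node */
		count = count_online_cores(efx, true);
	else
		/* half of the physical cores in the whole system */
		count = max(count_online_cores(efx, false) / 2, 1U);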

Another thing: this patch series appears in patchwork with state
"Changes Requested", but no changes have been requested, actually. Can
the state be changed so it has more visibility to get reviews?

Thanks!

-- 
Íñigo Huguet



* Re: [PATCH net-next 1/2] sfc: default config to 1 channel/core in local NUMA node only
  2022-02-11 11:05             ` Íñigo Huguet
@ 2022-02-11 19:01               ` Jakub Kicinski
  2022-02-14  7:22                 ` Íñigo Huguet
  2022-02-19 13:53                 ` Martin Habets
  0 siblings, 2 replies; 18+ messages in thread
From: Jakub Kicinski @ 2022-02-11 19:01 UTC (permalink / raw)
  To: Íñigo Huguet
  Cc: Edward Cree, habetsm.xilinx, David S. Miller, netdev

On Fri, 11 Feb 2022 12:05:19 +0100 Íñigo Huguet wrote:
> Totally. My comment was intended to be more like a question to see why
> we should or shouldn't consider NUMA nodes in
> netif_get_num_default_rss_queues. But now I understand your point
> better.
> 
> However, would it make sense something like this for
> netif_get_num_default_rss_queues, or it would be a bit overkill?
> if the system has more than one NUMA node, allocate one queue per
> physical core in local NUMA node.
> else, allocate physical cores / 2

I don't have a strong opinion on the NUMA question, to be honest.
It gets complicated pretty quickly. If there is one NIC we may or 
may not want to divide - for pure packet forwarding sure, best if
it's done on the node with the NIC, but that assumes the other node 
is idle or doing something else? How does it not need networking?

If each node has a separate NIC we should definitely divide. But
it's impossible to know the NIC count at the netdev level..

So my thinking was let's leave NUMA configurations to manual tuning.
If we don't do anything special for NUMA it's less likely someone will
tell us we did the wrong thing there :) But feel free to implement what
you suggested above.

One thing I'm not sure of is whether anyone uses the early AMD chiplet
CPUs
in a NUMA-per-chiplet mode? IIRC they had a mode like that. And that'd
potentially be problematic if we wanted to divide by number of nodes.
Maybe not as much if just dividing by 2.

> Another thing: this patch series appears in patchwork with state
> "Changes Requested", but no changes have been requested, actually. Can
> the state be changed so it has more visibility to get reviews?

I think resend would be best.


* Re: [PATCH net-next 1/2] sfc: default config to 1 channel/core in local NUMA node only
  2022-02-11 19:01               ` Jakub Kicinski
@ 2022-02-14  7:22                 ` Íñigo Huguet
  2022-02-19 13:53                 ` Martin Habets
  1 sibling, 0 replies; 18+ messages in thread
From: Íñigo Huguet @ 2022-02-14  7:22 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Edward Cree, habetsm.xilinx, David S. Miller, netdev

On Fri, Feb 11, 2022 at 8:01 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 11 Feb 2022 12:05:19 +0100 Íñigo Huguet wrote:
> > Totally. My comment was intended to be more like a question to see why
> > we should or shouldn't consider NUMA nodes in
> > netif_get_num_default_rss_queues. But now I understand your point
> > better.
> >
> > However, would it make sense something like this for
> > netif_get_num_default_rss_queues, or it would be a bit overkill?
> > if the system has more than one NUMA node, allocate one queue per
> > physical core in local NUMA node.
> > else, allocate physical cores / 2
>
> I don't have a strong opinion on the NUMA question, to be honest.
> It gets complicated pretty quickly. If there is one NIC we may or
> may not want to divide - for pure packet forwarding sure, best if
> its done on the node with the NIC, but that assumes the other node
> is idle or doing something else? How does it not need networking?
>
> If each node has a separate NIC we should definitely divide. But
> it's impossible to know the NIC count at the netdev level..
>
> So my thinking was let's leave NUMA configurations to manual tuning.
> If we don't do anything special for NUMA it's less likely someone will
> tell us we did the wrong thing there :) But feel free to implement what
> you suggested above.

Agreed, the more you try to be smart, the more likely you are to
handle some less common case badly.

If nobody else speaks in favor of my suggestion, I will go the simpler
way.

>
> One thing I'm not sure of is if anyone uses the early AMD chiplet CPUs
> in a NUMA-per-chiplet mode? IIRC they had a mode like that. And that'd
> potentially be problematic if we wanted to divide by number of nodes.
> Maybe not as much if just dividing by 2.
>
> > Another thing: this patch series appears in patchwork with state
> > "Changes Requested", but no changes have been requested, actually. Can
> > the state be changed so it has more visibility to get reviews?
>
> I think resend would be best.
>


-- 
Íñigo Huguet



* [PATCH net-next resend 0/2] sfc: optimize RXQs count and affinities
  2022-01-28 15:19 [PATCH net-next 0/2] sfc: optimize RXQs count and affinities Íñigo Huguet
  2022-01-28 15:19 ` [PATCH net-next 1/2] sfc: default config to 1 channel/core in local NUMA node only Íñigo Huguet
  2022-01-28 15:19 ` [PATCH net-next 2/2] sfc: set affinity hints " Íñigo Huguet
@ 2022-02-16  9:41 ` Íñigo Huguet
  2022-02-16  9:41   ` [PATCH net-next resend 1/2] sfc: default config to 1 channel/core in local NUMA node only Íñigo Huguet
                     ` (2 more replies)
  2 siblings, 3 replies; 18+ messages in thread
From: Íñigo Huguet @ 2022-02-16  9:41 UTC (permalink / raw)
  To: ecree.xilinx, habetsm.xilinx; +Cc: davem, kuba, netdev, Íñigo Huguet

In the sfc driver, one RX queue per physical core was allocated by
default. Later on, IRQ affinities were set by spreading the IRQs across
all CPUs of the local NUMA node.

However, that default results in a far from optimal configuration on
many modern systems. Specifically, on systems with hyperthreading and
2 NUMA nodes, affinities are set in a way that IRQs are handled by all
logical cores of one single NUMA node. Handling IRQs on both
hyperthreading siblings of a core brings no benefit, and setting
affinities to one queue per physical core of the whole system is not a
good idea either, because there is a performance penalty for moving
data across nodes (I was able to verify this with some XDP tests using
pktgen).

These patches reduce the default number of channels to one per physical
core in the local NUMA node, and then set IRQ affinities to CPUs in the
local NUMA node only. This way we save hardware resources, since
channels are a limited resource, and we also leave more room for XDP_TX
channels without hitting the driver's limit of 32 channels per
interface.

Running performance tests using iperf with an SFC9140 device showed no
performance penalty from reducing the number of channels.

RX XDP tests showed that performance can drop to less than half if the
IRQ is handled by a CPU in a different NUMA node, which doesn't happen
with the new defaults from these patches.

Íñigo Huguet (2):
  sfc: default config to 1 channel/core in local NUMA node only
  sfc: set affinity hints in local NUMA node only

 drivers/net/ethernet/sfc/efx_channels.c | 62 +++++++++++++++++--------
 1 file changed, 43 insertions(+), 19 deletions(-)

-- 
2.31.1



* [PATCH net-next resend 1/2] sfc: default config to 1 channel/core in local NUMA node only
  2022-02-16  9:41 ` [PATCH net-next resend 0/2] sfc: optimize RXQs count and affinities Íñigo Huguet
@ 2022-02-16  9:41   ` Íñigo Huguet
  2022-02-19 13:50     ` Martin Habets
  2022-02-16  9:41   ` [PATCH net-next resend 2/2] sfc: set affinity hints " Íñigo Huguet
  2022-02-19  5:19   ` [PATCH net-next resend 0/2] sfc: optimize RXQs count and affinities Jakub Kicinski
  2 siblings, 1 reply; 18+ messages in thread
From: Íñigo Huguet @ 2022-02-16  9:41 UTC (permalink / raw)
  To: ecree.xilinx, habetsm.xilinx; +Cc: davem, kuba, netdev, Íñigo Huguet

Handling channels from CPUs in a different NUMA node can penalize
performance, so it is better to configure only one channel per core in
the same NUMA node as the NIC, rather than one per each core in the
system.

Fall back to all other online cores if there are no online CPUs in the
local NUMA node.

Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
---
 drivers/net/ethernet/sfc/efx_channels.c | 50 ++++++++++++++++---------
 1 file changed, 33 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/sfc/efx_channels.c b/drivers/net/ethernet/sfc/efx_channels.c
index ead550ae2709..ec6c2f231e73 100644
--- a/drivers/net/ethernet/sfc/efx_channels.c
+++ b/drivers/net/ethernet/sfc/efx_channels.c
@@ -78,31 +78,48 @@ static const struct efx_channel_type efx_default_channel_type = {
  * INTERRUPTS
  *************/
 
-static unsigned int efx_wanted_parallelism(struct efx_nic *efx)
+static unsigned int count_online_cores(struct efx_nic *efx, bool local_node)
 {
-	cpumask_var_t thread_mask;
+	cpumask_var_t filter_mask;
 	unsigned int count;
 	int cpu;
+
+	if (unlikely(!zalloc_cpumask_var(&filter_mask, GFP_KERNEL))) {
+		netif_warn(efx, probe, efx->net_dev,
+			   "RSS disabled due to allocation failure\n");
+		return 1;
+	}
+
+	cpumask_copy(filter_mask, cpu_online_mask);
+	if (local_node) {
+		int numa_node = pcibus_to_node(efx->pci_dev->bus);
+
+		cpumask_and(filter_mask, filter_mask, cpumask_of_node(numa_node));
+	}
+
+	count = 0;
+	for_each_cpu(cpu, filter_mask) {
+		++count;
+		cpumask_andnot(filter_mask, filter_mask, topology_sibling_cpumask(cpu));
+	}
+
+	free_cpumask_var(filter_mask);
+
+	return count;
+}
+
+static unsigned int efx_wanted_parallelism(struct efx_nic *efx)
+{
+	unsigned int count;
 
 	if (rss_cpus) {
 		count = rss_cpus;
 	} else {
-		if (unlikely(!zalloc_cpumask_var(&thread_mask, GFP_KERNEL))) {
-			netif_warn(efx, probe, efx->net_dev,
-				   "RSS disabled due to allocation failure\n");
-			return 1;
-		}
-
-		count = 0;
-		for_each_online_cpu(cpu) {
-			if (!cpumask_test_cpu(cpu, thread_mask)) {
-				++count;
-				cpumask_or(thread_mask, thread_mask,
-					   topology_sibling_cpumask(cpu));
-			}
-		}
+		count = count_online_cores(efx, true);
 
-		free_cpumask_var(thread_mask);
+		/* If no online CPUs in local node, fallback to any online CPUs */
+		if (count == 0)
+			count = count_online_cores(efx, false);
 	}
 
 	if (count > EFX_MAX_RX_QUEUES) {
-- 
2.31.1



* [PATCH net-next resend 2/2] sfc: set affinity hints in local NUMA node only
  2022-02-16  9:41 ` [PATCH net-next resend 0/2] sfc: optimize RXQs count and affinities Íñigo Huguet
  2022-02-16  9:41   ` [PATCH net-next resend 1/2] sfc: default config to 1 channel/core in local NUMA node only Íñigo Huguet
@ 2022-02-16  9:41   ` Íñigo Huguet
  2022-02-19 13:51     ` Martin Habets
  2022-02-19  5:19   ` [PATCH net-next resend 0/2] sfc: optimize RXQs count and affinities Jakub Kicinski
  2 siblings, 1 reply; 18+ messages in thread
From: Íñigo Huguet @ 2022-02-16  9:41 UTC (permalink / raw)
  To: ecree.xilinx, habetsm.xilinx; +Cc: davem, kuba, netdev, Íñigo Huguet

Affinity hints were being set to CPUs in the local NUMA node first, and
then to other CPUs. This was creating 2 unintended issues:
1. Channels intended to be assigned each to a different physical core
   were assigned to hyperthreading siblings because they are in the
   same NUMA node.
   Since the previous patch in this series, this no longer happens
   with the default rss_cpus modparam because fewer channels are
   created.
2. XDP channels could be assigned to CPUs in different NUMA nodes,
   decreasing performance too much (to less than half in some of my
   tests).

This patch sets the affinity hints so that the channels are spread
only across the local NUMA node's CPUs. A fallback for the case where
no CPU in the local NUMA node is online has been added too.

Example of CPUs being assigned in a non-optimal way before this and
the previous patch (note: in this system, xdp-8 to xdp-15 are created
because num_possible_cpus == 64 but num_present_cpus == 32, so they
are never used):

$ lscpu | grep -i numa
NUMA node(s):                    2
NUMA node0 CPU(s):               0-7,16-23
NUMA node1 CPU(s):               8-15,24-31

$ grep -H . /proc/irq/*/0000:07:00.0*/../smp_affinity_list
/proc/irq/141/0000:07:00.0-0/../smp_affinity_list:0
/proc/irq/142/0000:07:00.0-1/../smp_affinity_list:1
/proc/irq/143/0000:07:00.0-2/../smp_affinity_list:2
/proc/irq/144/0000:07:00.0-3/../smp_affinity_list:3
/proc/irq/145/0000:07:00.0-4/../smp_affinity_list:4
/proc/irq/146/0000:07:00.0-5/../smp_affinity_list:5
/proc/irq/147/0000:07:00.0-6/../smp_affinity_list:6
/proc/irq/148/0000:07:00.0-7/../smp_affinity_list:7
/proc/irq/149/0000:07:00.0-8/../smp_affinity_list:16
/proc/irq/150/0000:07:00.0-9/../smp_affinity_list:17
/proc/irq/151/0000:07:00.0-10/../smp_affinity_list:18
/proc/irq/152/0000:07:00.0-11/../smp_affinity_list:19
/proc/irq/153/0000:07:00.0-12/../smp_affinity_list:20
/proc/irq/154/0000:07:00.0-13/../smp_affinity_list:21
/proc/irq/155/0000:07:00.0-14/../smp_affinity_list:22
/proc/irq/156/0000:07:00.0-15/../smp_affinity_list:23
/proc/irq/157/0000:07:00.0-xdp-0/../smp_affinity_list:8
/proc/irq/158/0000:07:00.0-xdp-1/../smp_affinity_list:9
/proc/irq/159/0000:07:00.0-xdp-2/../smp_affinity_list:10
/proc/irq/160/0000:07:00.0-xdp-3/../smp_affinity_list:11
/proc/irq/161/0000:07:00.0-xdp-4/../smp_affinity_list:12
/proc/irq/162/0000:07:00.0-xdp-5/../smp_affinity_list:13
/proc/irq/163/0000:07:00.0-xdp-6/../smp_affinity_list:14
/proc/irq/164/0000:07:00.0-xdp-7/../smp_affinity_list:15
/proc/irq/165/0000:07:00.0-xdp-8/../smp_affinity_list:24
/proc/irq/166/0000:07:00.0-xdp-9/../smp_affinity_list:25
/proc/irq/167/0000:07:00.0-xdp-10/../smp_affinity_list:26
/proc/irq/168/0000:07:00.0-xdp-11/../smp_affinity_list:27
/proc/irq/169/0000:07:00.0-xdp-12/../smp_affinity_list:28
/proc/irq/170/0000:07:00.0-xdp-13/../smp_affinity_list:29
/proc/irq/171/0000:07:00.0-xdp-14/../smp_affinity_list:30
/proc/irq/172/0000:07:00.0-xdp-15/../smp_affinity_list:31

CPU assignments after this and the previous patch: normal channels are
created only one per physical core in the local NUMA node, and
affinities are set only to the local NUMA node:

$ grep -H . /proc/irq/*/0000:07:00.0*/../smp_affinity_list
/proc/irq/116/0000:07:00.0-0/../smp_affinity_list:0
/proc/irq/117/0000:07:00.0-1/../smp_affinity_list:1
/proc/irq/118/0000:07:00.0-2/../smp_affinity_list:2
/proc/irq/119/0000:07:00.0-3/../smp_affinity_list:3
/proc/irq/120/0000:07:00.0-4/../smp_affinity_list:4
/proc/irq/121/0000:07:00.0-5/../smp_affinity_list:5
/proc/irq/122/0000:07:00.0-6/../smp_affinity_list:6
/proc/irq/123/0000:07:00.0-7/../smp_affinity_list:7
/proc/irq/124/0000:07:00.0-xdp-0/../smp_affinity_list:16
/proc/irq/125/0000:07:00.0-xdp-1/../smp_affinity_list:17
/proc/irq/126/0000:07:00.0-xdp-2/../smp_affinity_list:18
/proc/irq/127/0000:07:00.0-xdp-3/../smp_affinity_list:19
/proc/irq/128/0000:07:00.0-xdp-4/../smp_affinity_list:20
/proc/irq/129/0000:07:00.0-xdp-5/../smp_affinity_list:21
/proc/irq/130/0000:07:00.0-xdp-6/../smp_affinity_list:22
/proc/irq/131/0000:07:00.0-xdp-7/../smp_affinity_list:23
/proc/irq/132/0000:07:00.0-xdp-8/../smp_affinity_list:0
/proc/irq/133/0000:07:00.0-xdp-9/../smp_affinity_list:1
/proc/irq/134/0000:07:00.0-xdp-10/../smp_affinity_list:2
/proc/irq/135/0000:07:00.0-xdp-11/../smp_affinity_list:3
/proc/irq/136/0000:07:00.0-xdp-12/../smp_affinity_list:4
/proc/irq/137/0000:07:00.0-xdp-13/../smp_affinity_list:5
/proc/irq/138/0000:07:00.0-xdp-14/../smp_affinity_list:6
/proc/irq/139/0000:07:00.0-xdp-15/../smp_affinity_list:7

Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
---
 drivers/net/ethernet/sfc/efx_channels.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/sfc/efx_channels.c b/drivers/net/ethernet/sfc/efx_channels.c
index ec6c2f231e73..ef3168fbb5a6 100644
--- a/drivers/net/ethernet/sfc/efx_channels.c
+++ b/drivers/net/ethernet/sfc/efx_channels.c
@@ -387,10 +387,18 @@ void efx_set_interrupt_affinity(struct efx_nic *efx)
 {
 	struct efx_channel *channel;
 	unsigned int cpu;
+	int numa_node = pcibus_to_node(efx->pci_dev->bus);
+	const struct cpumask *numa_mask = cpumask_of_node(numa_node);
 
+	/* If no online CPUs in local node, fallback to any online CPU */
+	if (cpumask_first_and(cpu_online_mask, numa_mask) >= nr_cpu_ids)
+		numa_mask = cpu_online_mask;
+
+	cpu = -1;
 	efx_for_each_channel(channel, efx) {
-		cpu = cpumask_local_spread(channel->channel,
-					   pcibus_to_node(efx->pci_dev->bus));
+		cpu = cpumask_next_and(cpu, cpu_online_mask, numa_mask);
+		if (cpu >= nr_cpu_ids)
+			cpu = cpumask_first_and(cpu_online_mask, numa_mask);
 		irq_set_affinity_hint(channel->irq, cpumask_of(cpu));
 	}
 }
-- 
2.31.1



* Re: [PATCH net-next resend 0/2] sfc: optimize RXQs count and affinities
  2022-02-16  9:41 ` [PATCH net-next resend 0/2] sfc: optimize RXQs count and affinities Íñigo Huguet
  2022-02-16  9:41   ` [PATCH net-next resend 1/2] sfc: default config to 1 channel/core in local NUMA node only Íñigo Huguet
  2022-02-16  9:41   ` [PATCH net-next resend 2/2] sfc: set affinity hints " Íñigo Huguet
@ 2022-02-19  5:19   ` Jakub Kicinski
  2 siblings, 0 replies; 18+ messages in thread
From: Jakub Kicinski @ 2022-02-19  5:19 UTC (permalink / raw)
  To: ecree.xilinx, habetsm.xilinx; +Cc: Íñigo Huguet, davem, netdev

On Wed, 16 Feb 2022 10:41:37 +0100 Íñigo Huguet wrote:
> In sfc driver one RX queue per physical core was allocated by default.
> Later on, IRQ affinities were set spreading the IRQs in all NUMA local
> CPUs.
> 
> However, with that default configuration it result in a non very optimal
> configuration in many modern systems. Specifically, in systems with hyper
> threading and 2 NUMA nodes, affinities are set in a way that IRQs are
> handled by all logical cores of one same NUMA node. Handling IRQs from
> both hyper threading siblings has no benefit, and setting affinities to one
> queue per physical core is neither a very good idea because there is a
> performance penalty for moving data across nodes (I was able to check it
> with some XDP tests using pktgen).
> 
> This patches reduce the default number of channels to one per physical
> core in the local NUMA node. Then, they set IRQ affinities to CPUs in
> the local NUMA node only. This way we save hardware resources since
> channels are limited resources. We also leave more room for XDP_TX
> channels without hitting driver's limit of 32 channels per interface.
> 
> Running performance tests using iperf with a SFC9140 device showed no
> performance penalty for reducing the number of channels.
> 
> RX XDP tests showed that performance can go down to less than half if
> the IRQ is handled by a CPU in a different NUMA node, which doesn't
> happen with the new defaults from this patches.

Martin, Ed, any thoughts?


* Re: [PATCH net-next resend 1/2] sfc: default config to 1 channel/core in local NUMA node only
  2022-02-16  9:41   ` [PATCH net-next resend 1/2] sfc: default config to 1 channel/core in local NUMA node only Íñigo Huguet
@ 2022-02-19 13:50     ` Martin Habets
  0 siblings, 0 replies; 18+ messages in thread
From: Martin Habets @ 2022-02-19 13:50 UTC (permalink / raw)
  To: Íñigo Huguet; +Cc: ecree.xilinx, davem, kuba, netdev

On Wed, Feb 16, 2022 at 10:41:38AM +0100, Íñigo Huguet wrote:
> Handling channels from CPUs in different NUMA node can penalize
> performance, so better configure only one channel per core in the same
> NUMA node than the NIC, and not per each core in the system.
> 
> Fallback to all other online cores if there are not online CPUs in local
> NUMA node.
> 
> Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>

Acked-by: Martin Habets <habetsm.xilinx@gmail.com>

> ---
>  drivers/net/ethernet/sfc/efx_channels.c | 50 ++++++++++++++++---------
>  1 file changed, 33 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/net/ethernet/sfc/efx_channels.c b/drivers/net/ethernet/sfc/efx_channels.c
> index ead550ae2709..ec6c2f231e73 100644
> --- a/drivers/net/ethernet/sfc/efx_channels.c
> +++ b/drivers/net/ethernet/sfc/efx_channels.c
> @@ -78,31 +78,48 @@ static const struct efx_channel_type efx_default_channel_type = {
>   * INTERRUPTS
>   *************/
>  
> -static unsigned int efx_wanted_parallelism(struct efx_nic *efx)
> +static unsigned int count_online_cores(struct efx_nic *efx, bool local_node)
>  {
> -	cpumask_var_t thread_mask;
> +	cpumask_var_t filter_mask;
>  	unsigned int count;
>  	int cpu;
> +
> +	if (unlikely(!zalloc_cpumask_var(&filter_mask, GFP_KERNEL))) {
> +		netif_warn(efx, probe, efx->net_dev,
> +			   "RSS disabled due to allocation failure\n");
> +		return 1;
> +	}
> +
> +	cpumask_copy(filter_mask, cpu_online_mask);
> +	if (local_node) {
> +		int numa_node = pcibus_to_node(efx->pci_dev->bus);
> +
> +		cpumask_and(filter_mask, filter_mask, cpumask_of_node(numa_node));
> +	}
> +
> +	count = 0;
> +	for_each_cpu(cpu, filter_mask) {
> +		++count;
> +		cpumask_andnot(filter_mask, filter_mask, topology_sibling_cpumask(cpu));
> +	}
> +
> +	free_cpumask_var(filter_mask);
> +
> +	return count;
> +}
> +
> +static unsigned int efx_wanted_parallelism(struct efx_nic *efx)
> +{
> +	unsigned int count;
>  
>  	if (rss_cpus) {
>  		count = rss_cpus;
>  	} else {
> -		if (unlikely(!zalloc_cpumask_var(&thread_mask, GFP_KERNEL))) {
> -			netif_warn(efx, probe, efx->net_dev,
> -				   "RSS disabled due to allocation failure\n");
> -			return 1;
> -		}
> -
> -		count = 0;
> -		for_each_online_cpu(cpu) {
> -			if (!cpumask_test_cpu(cpu, thread_mask)) {
> -				++count;
> -				cpumask_or(thread_mask, thread_mask,
> -					   topology_sibling_cpumask(cpu));
> -			}
> -		}
> +		count = count_online_cores(efx, true);
>  
> -		free_cpumask_var(thread_mask);
> +		/* If no online CPUs in local node, fallback to any online CPUs */
> +		if (count == 0)
> +			count = count_online_cores(efx, false);
>  	}
>  
>  	if (count > EFX_MAX_RX_QUEUES) {
> -- 
> 2.31.1

-- 
Martin Habets <habetsm.xilinx@gmail.com>


* Re: [PATCH net-next resend 2/2] sfc: set affinity hints in local NUMA node only
  2022-02-16  9:41   ` [PATCH net-next resend 2/2] sfc: set affinity hints " Íñigo Huguet
@ 2022-02-19 13:51     ` Martin Habets
  0 siblings, 0 replies; 18+ messages in thread
From: Martin Habets @ 2022-02-19 13:51 UTC (permalink / raw)
  To: Íñigo Huguet; +Cc: ecree.xilinx, davem, kuba, netdev

On Wed, Feb 16, 2022 at 10:41:39AM +0100, Íñigo Huguet wrote:
> Affinity hints were being set to CPUs in local NUMA node first, and then
> in other CPUs. This was creating 2 unintended issues:
> 1. Channels created to be assigned each to a different physical core
>    were assigned to hyperthreading siblings because of being in same
>    NUMA node.
>    Since the patch previous to this one, this did not longer happen
>    with default rss_cpus modparam because less channels are created.
> 2. XDP channels could be assigned to CPUs in different NUMA nodes,
>    decreasing performance too much (to less than half in some of my
>    tests).
> 
> This patch sets the affinity hints spreading the channels only in local
> NUMA node's CPUs. A fallback for the case that no CPU in local NUMA node
> is online has been added too.
> 
> Example of CPUs being assigned in a non optimal way before this and the
> previous patch (note: in this system, xdp-8 to xdp-15 are created
> because num_possible_cpus == 64, but num_present_cpus == 32 so they're
> never used):
> 
> $ lscpu | grep -i numa
> NUMA node(s):                    2
> NUMA node0 CPU(s):               0-7,16-23
> NUMA node1 CPU(s):               8-15,24-31
> 
> $ grep -H . /proc/irq/*/0000:07:00.0*/../smp_affinity_list
> /proc/irq/141/0000:07:00.0-0/../smp_affinity_list:0
> /proc/irq/142/0000:07:00.0-1/../smp_affinity_list:1
> /proc/irq/143/0000:07:00.0-2/../smp_affinity_list:2
> /proc/irq/144/0000:07:00.0-3/../smp_affinity_list:3
> /proc/irq/145/0000:07:00.0-4/../smp_affinity_list:4
> /proc/irq/146/0000:07:00.0-5/../smp_affinity_list:5
> /proc/irq/147/0000:07:00.0-6/../smp_affinity_list:6
> /proc/irq/148/0000:07:00.0-7/../smp_affinity_list:7
> /proc/irq/149/0000:07:00.0-8/../smp_affinity_list:16
> /proc/irq/150/0000:07:00.0-9/../smp_affinity_list:17
> /proc/irq/151/0000:07:00.0-10/../smp_affinity_list:18
> /proc/irq/152/0000:07:00.0-11/../smp_affinity_list:19
> /proc/irq/153/0000:07:00.0-12/../smp_affinity_list:20
> /proc/irq/154/0000:07:00.0-13/../smp_affinity_list:21
> /proc/irq/155/0000:07:00.0-14/../smp_affinity_list:22
> /proc/irq/156/0000:07:00.0-15/../smp_affinity_list:23
> /proc/irq/157/0000:07:00.0-xdp-0/../smp_affinity_list:8
> /proc/irq/158/0000:07:00.0-xdp-1/../smp_affinity_list:9
> /proc/irq/159/0000:07:00.0-xdp-2/../smp_affinity_list:10
> /proc/irq/160/0000:07:00.0-xdp-3/../smp_affinity_list:11
> /proc/irq/161/0000:07:00.0-xdp-4/../smp_affinity_list:12
> /proc/irq/162/0000:07:00.0-xdp-5/../smp_affinity_list:13
> /proc/irq/163/0000:07:00.0-xdp-6/../smp_affinity_list:14
> /proc/irq/164/0000:07:00.0-xdp-7/../smp_affinity_list:15
> /proc/irq/165/0000:07:00.0-xdp-8/../smp_affinity_list:24
> /proc/irq/166/0000:07:00.0-xdp-9/../smp_affinity_list:25
> /proc/irq/167/0000:07:00.0-xdp-10/../smp_affinity_list:26
> /proc/irq/168/0000:07:00.0-xdp-11/../smp_affinity_list:27
> /proc/irq/169/0000:07:00.0-xdp-12/../smp_affinity_list:28
> /proc/irq/170/0000:07:00.0-xdp-13/../smp_affinity_list:29
> /proc/irq/171/0000:07:00.0-xdp-14/../smp_affinity_list:30
> /proc/irq/172/0000:07:00.0-xdp-15/../smp_affinity_list:31
> 
> CPUs assignments after this and previous patch, so normal channels
> created only one per core in NUMA node and affinities set only to local
> NUMA node:
> 
> $ grep -H . /proc/irq/*/0000:07:00.0*/../smp_affinity_list
> /proc/irq/116/0000:07:00.0-0/../smp_affinity_list:0
> /proc/irq/117/0000:07:00.0-1/../smp_affinity_list:1
> /proc/irq/118/0000:07:00.0-2/../smp_affinity_list:2
> /proc/irq/119/0000:07:00.0-3/../smp_affinity_list:3
> /proc/irq/120/0000:07:00.0-4/../smp_affinity_list:4
> /proc/irq/121/0000:07:00.0-5/../smp_affinity_list:5
> /proc/irq/122/0000:07:00.0-6/../smp_affinity_list:6
> /proc/irq/123/0000:07:00.0-7/../smp_affinity_list:7
> /proc/irq/124/0000:07:00.0-xdp-0/../smp_affinity_list:16
> /proc/irq/125/0000:07:00.0-xdp-1/../smp_affinity_list:17
> /proc/irq/126/0000:07:00.0-xdp-2/../smp_affinity_list:18
> /proc/irq/127/0000:07:00.0-xdp-3/../smp_affinity_list:19
> /proc/irq/128/0000:07:00.0-xdp-4/../smp_affinity_list:20
> /proc/irq/129/0000:07:00.0-xdp-5/../smp_affinity_list:21
> /proc/irq/130/0000:07:00.0-xdp-6/../smp_affinity_list:22
> /proc/irq/131/0000:07:00.0-xdp-7/../smp_affinity_list:23
> /proc/irq/132/0000:07:00.0-xdp-8/../smp_affinity_list:0
> /proc/irq/133/0000:07:00.0-xdp-9/../smp_affinity_list:1
> /proc/irq/134/0000:07:00.0-xdp-10/../smp_affinity_list:2
> /proc/irq/135/0000:07:00.0-xdp-11/../smp_affinity_list:3
> /proc/irq/136/0000:07:00.0-xdp-12/../smp_affinity_list:4
> /proc/irq/137/0000:07:00.0-xdp-13/../smp_affinity_list:5
> /proc/irq/138/0000:07:00.0-xdp-14/../smp_affinity_list:6
> /proc/irq/139/0000:07:00.0-xdp-15/../smp_affinity_list:7
> 
> Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
> ---
>  drivers/net/ethernet/sfc/efx_channels.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/sfc/efx_channels.c b/drivers/net/ethernet/sfc/efx_channels.c
> index ec6c2f231e73..ef3168fbb5a6 100644
> --- a/drivers/net/ethernet/sfc/efx_channels.c
> +++ b/drivers/net/ethernet/sfc/efx_channels.c
> @@ -387,10 +387,18 @@ void efx_set_interrupt_affinity(struct efx_nic *efx)
>  {
>  	struct efx_channel *channel;
>  	unsigned int cpu;
> +	int numa_node = pcibus_to_node(efx->pci_dev->bus);
> +	const struct cpumask *numa_mask = cpumask_of_node(numa_node);

This violates the reverse Xmas tree convention for the local variable
declarations.
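
E.g., ordering the declarations longest line first and moving the
initialisations into the function body, something like (untested, just
to illustrate the ordering):

	const struct cpumask *numa_mask;
	struct efx_channel *channel;
	unsigned int cpu;
	int numa_node;

	numa_node = pcibus_to_node(efx->pci_dev->bus);
	numa_mask = cpumask_of_node(numa_node);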
Other than that it looks good to me.

Martin

> +	/* If no online CPUs in local node, fallback to any online CPU */
> +	if (cpumask_first_and(cpu_online_mask, numa_mask) >= nr_cpu_ids)
> +		numa_mask = cpu_online_mask;
> +
> +	cpu = -1;
>  	efx_for_each_channel(channel, efx) {
> -		cpu = cpumask_local_spread(channel->channel,
> -					   pcibus_to_node(efx->pci_dev->bus));
> +		cpu = cpumask_next_and(cpu, cpu_online_mask, numa_mask);
> +		if (cpu >= nr_cpu_ids)
> +			cpu = cpumask_first_and(cpu_online_mask, numa_mask);
>  		irq_set_affinity_hint(channel->irq, cpumask_of(cpu));
>  	}
>  }
> -- 
> 2.31.1


* Re: [PATCH net-next 1/2] sfc: default config to 1 channel/core in local NUMA node only
  2022-02-11 19:01               ` Jakub Kicinski
  2022-02-14  7:22                 ` Íñigo Huguet
@ 2022-02-19 13:53                 ` Martin Habets
  1 sibling, 0 replies; 18+ messages in thread
From: Martin Habets @ 2022-02-19 13:53 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Íñigo Huguet, Edward Cree, David S. Miller, netdev

On Fri, Feb 11, 2022 at 11:01:00AM -0800, Jakub Kicinski wrote:
> On Fri, 11 Feb 2022 12:05:19 +0100 Íñigo Huguet wrote:
> > Totally. My comment was intended to be more like a question to see why
> > we should or shouldn't consider NUMA nodes in
> > netif_get_num_default_rss_queues. But now I understand your point
> > better.
> > 
> > However, would it make sense something like this for
> > netif_get_num_default_rss_queues, or it would be a bit overkill?
> > if the system has more than one NUMA node, allocate one queue per
> > physical core in local NUMA node.
> > else, allocate physical cores / 2
> 
> I don't have a strong opinion on the NUMA question, to be honest.
> It gets complicated pretty quickly. If there is one NIC we may or 
> may not want to divide - for pure packet forwarding sure, best if
> its done on the node with the NIC, but that assumes the other node 
> is idle or doing something else? How does it not need networking?
> 
> If each node has a separate NIC we should definitely divide. But
> it's impossible to know the NIC count at the netdev level..
> 
> So my thinking was let's leave NUMA configurations to manual tuning.
> If we don't do anything special for NUMA it's less likely someone will
> tell us we did the wrong thing there :) But feel free to implement what
> you suggested above.
> 
> One thing I'm not sure of is if anyone uses the early AMD chiplet CPUs 
> in a NUMA-per-chiplet mode? IIRC they had a mode like that. And that'd
> potentially be problematic if we wanted to divide by number of nodes.
> Maybe not as much if just dividing by 2.

As of a week ago, Xilinx is part of AMD. In time I'm sure we'll be
able to investigate AMD specifics.

Martin

> > Another thing: this patch series appears in patchwork with state
> > "Changes Requested", but no changes have been requested, actually. Can
> > the state be changed so it has more visibility to get reviews?
> 
> I think resend would be best.


Thread overview: 18 messages
2022-01-28 15:19 [PATCH net-next 0/2] sfc: optimize RXQs count and affinities Íñigo Huguet
2022-01-28 15:19 ` [PATCH net-next 1/2] sfc: default config to 1 channel/core in local NUMA node only Íñigo Huguet
2022-01-28 22:27   ` Jakub Kicinski
2022-02-07 15:03     ` Íñigo Huguet
2022-02-07 16:53       ` Jakub Kicinski
2022-02-10  9:35         ` Íñigo Huguet
2022-02-10 16:22           ` Jakub Kicinski
2022-02-11 11:05             ` Íñigo Huguet
2022-02-11 19:01               ` Jakub Kicinski
2022-02-14  7:22                 ` Íñigo Huguet
2022-02-19 13:53                 ` Martin Habets
2022-01-28 15:19 ` [PATCH net-next 2/2] sfc: set affinity hints " Íñigo Huguet
2022-02-16  9:41 ` [PATCH net-next resend 0/2] sfc: optimize RXQs count and affinities Íñigo Huguet
2022-02-16  9:41   ` [PATCH net-next resend 1/2] sfc: default config to 1 channel/core in local NUMA node only Íñigo Huguet
2022-02-19 13:50     ` Martin Habets
2022-02-16  9:41   ` [PATCH net-next resend 2/2] sfc: set affinity hints " Íñigo Huguet
2022-02-19 13:51     ` Martin Habets
2022-02-19  5:19   ` [PATCH net-next resend 0/2] sfc: optimize RXQs count and affinities Jakub Kicinski
