linux-rdma.vger.kernel.org archive mirror
* [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface
@ 2022-09-23 13:25 Valentin Schneider
  2022-09-23 13:25 ` [PATCH v4 1/7] lib/find_bit: Introduce find_next_andnot_bit() Valentin Schneider
                   ` (9 more replies)
  0 siblings, 10 replies; 25+ messages in thread
From: Valentin Schneider @ 2022-09-23 13:25 UTC (permalink / raw)
  To: netdev, linux-rdma, linux-kernel
  Cc: Saeed Mahameed, Leon Romanovsky, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Yury Norov, Andy Shevchenko,
	Rasmus Villemoes, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Greg Kroah-Hartman,
	Heiko Carstens, Tony Luck, Jonathan Cameron, Gal Pressman,
	Tariq Toukan, Jesse Brandeburg

Hi folks,

Tariq pointed out in [1] that drivers allocating IRQ vectors would benefit
from having smarter NUMA-awareness (cpumask_local_spread() doesn't quite cut
it).

The proposed interface involved an array of CPUs and a temporary cpumask, and,
being my difficult self, what I'm proposing here is an interface that doesn't
require any temporary storage other than some stack variables (at the cost of
one wild macro).
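
To give a quick idea of the end result (the details are in patches 5-7), here
is a rough driver-side sketch - not lifted verbatim from any patch, with @node
assumed to come from the caller:

  int cpu;

  rcu_read_lock();
  /* CPUs are visited by increasing NUMA distance from @node */
  for_each_numa_hop_cpu(cpu, node)
          pr_info("visiting CPU%d\n", cpu);
  rcu_read_unlock();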

Please note that this is based on top of Yury's bitmap-for-next [2] to leverage
his fancy new FIND_NEXT_BIT() macro.

[1]: https://lore.kernel.org/all/20220728191203.4055-1-tariqt@nvidia.com/
[2]: https://github.com/norov/linux.git/ -b bitmap-for-next

A note on treewide use of for_each_cpu_andnot()
===============================================

I've used the below coccinelle script to find places that could be patched (I
couldn't figure out the valid syntax to patch from coccinelle itself):

,-----
@tmpandnot@
expression tmpmask;
iterator for_each_cpu;
position p;
statement S;
@@
cpumask_andnot(tmpmask, ...);

...

(
for_each_cpu@p(..., tmpmask, ...)
	S
|
for_each_cpu@p(..., tmpmask, ...)
{
	...
}
)

@script:python depends on tmpandnot@
p << tmpandnot.p;
@@
coccilib.report.print_report(p[0], "andnot loop here")
'-----
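
For reference, this is roughly how I'd run it (assuming the script is saved as
andnot.cocci; exact flags may differ between coccinelle versions):

  spatch --sp-file andnot.cocci --dir .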

Which yields (against c40e8341e3b3):

.//arch/powerpc/kernel/smp.c:1587:1-13: andnot loop here
.//arch/powerpc/kernel/smp.c:1530:1-13: andnot loop here
.//arch/powerpc/kernel/smp.c:1440:1-13: andnot loop here
.//arch/powerpc/platforms/powernv/subcore.c:306:2-14: andnot loop here
.//arch/x86/kernel/apic/x2apic_cluster.c:62:1-13: andnot loop here
.//drivers/acpi/acpi_pad.c:110:1-13: andnot loop here
.//drivers/cpufreq/armada-8k-cpufreq.c:148:1-13: andnot loop here
.//drivers/cpufreq/powernv-cpufreq.c:931:1-13: andnot loop here
.//drivers/net/ethernet/sfc/efx_channels.c:73:1-13: andnot loop here
.//drivers/net/ethernet/sfc/siena/efx_channels.c:73:1-13: andnot loop here
.//kernel/sched/core.c:345:1-13: andnot loop here
.//kernel/sched/core.c:366:1-13: andnot loop here
.//net/core/dev.c:3058:1-13: andnot loop here

A lot of those are actually of the shape

  for_each_cpu(cpu, mask) {
      ...
      cpumask_andnot(mask, ...);
  }

I think *some* of the powerpc ones would be a match for for_each_cpu_andnot(),
but I decided to just stick to the one obvious one in __sched_core_flip().
  
Revisions
=========

v3 -> v4
++++++++

o Rebased on top of Yury's bitmap-for-next
o Added Tariq's mlx5e patch
o Made sched_numa_hop_mask() return cpu_online_mask for the NUMA_NO_NODE &&
  hops=0 case

v2 -> v3
++++++++

o Added for_each_cpu_and() and for_each_cpu_andnot() tests (Yury)
o New patches to fix issues raised by running the above

o New patch to use for_each_cpu_andnot() in sched/core.c (Yury)

v1 -> v2
++++++++

o Split _find_next_bit() @invert into @invert1 and @invert2 (Yury)
o Rebase onto v6.0-rc1

Cheers,
Valentin

Tariq Toukan (1):
  net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity
    hints

Valentin Schneider (6):
  lib/find_bit: Introduce find_next_andnot_bit()
  cpumask: Introduce for_each_cpu_andnot()
  lib/test_cpumask: Add for_each_cpu_and(not) tests
  sched/core: Merge cpumask_andnot()+for_each_cpu() into
    for_each_cpu_andnot()
  sched/topology: Introduce sched_numa_hop_mask()
  sched/topology: Introduce for_each_numa_hop_cpu()

 drivers/net/ethernet/mellanox/mlx5/core/eq.c | 13 +++++-
 include/linux/cpumask.h                      | 39 ++++++++++++++++
 include/linux/find.h                         | 33 +++++++++++++
 include/linux/topology.h                     | 49 ++++++++++++++++++++
 kernel/sched/core.c                          |  5 +-
 kernel/sched/topology.c                      | 31 +++++++++++++
 lib/cpumask_kunit.c                          | 19 ++++++++
 lib/find_bit.c                               |  9 ++++
 8 files changed, 192 insertions(+), 6 deletions(-)

--
2.31.1



* [PATCH v4 1/7] lib/find_bit: Introduce find_next_andnot_bit()
  2022-09-23 13:25 [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface Valentin Schneider
@ 2022-09-23 13:25 ` Valentin Schneider
  2022-09-23 15:44 ` [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface Yury Norov
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Valentin Schneider @ 2022-09-23 13:25 UTC (permalink / raw)
  To: netdev, linux-rdma, linux-kernel
  Cc: Saeed Mahameed, Leon Romanovsky, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Yury Norov, Andy Shevchenko,
	Rasmus Villemoes, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Greg Kroah-Hartman,
	Heiko Carstens, Tony Luck, Jonathan Cameron, Gal Pressman,
	Tariq Toukan, Jesse Brandeburg

In preparation for introducing for_each_cpu_andnot(), add a variant of
find_next_bit() that negates the bits in @addr2 when ANDing them with the
bits in @addr1.
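
For illustration only (not part of the patch), a caller walking the bits set
in @addr1 but clear in @addr2 would look roughly like:

  unsigned long bit;

  for (bit = find_next_andnot_bit(addr1, addr2, nbits, 0);
       bit < nbits;
       bit = find_next_andnot_bit(addr1, addr2, nbits, bit + 1)) {
          /* @bit is set in @addr1 and clear in @addr2 */
  }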

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 include/linux/find.h | 33 +++++++++++++++++++++++++++++++++
 lib/find_bit.c       |  9 +++++++++
 2 files changed, 42 insertions(+)

diff --git a/include/linux/find.h b/include/linux/find.h
index dead6f53a97b..e60b1ce89b29 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -12,6 +12,8 @@ unsigned long _find_next_bit(const unsigned long *addr1, unsigned long nbits,
 				unsigned long start);
 unsigned long _find_next_and_bit(const unsigned long *addr1, const unsigned long *addr2,
 					unsigned long nbits, unsigned long start);
+unsigned long _find_next_andnot_bit(const unsigned long *addr1, const unsigned long *addr2,
+					unsigned long nbits, unsigned long start);
 unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits,
 					 unsigned long start);
 extern unsigned long _find_first_bit(const unsigned long *addr, unsigned long size);
@@ -86,6 +88,37 @@ unsigned long find_next_and_bit(const unsigned long *addr1,
 }
 #endif
 
+#ifndef find_next_andnot_bit
+/**
+ * find_next_andnot_bit - find the next set bit in *addr1 excluding all the bits
+ *                        in *addr2
+ * @addr1: The first address to base the search on
+ * @addr2: The second address to base the search on
+ * @size: The bitmap size in bits
+ * @offset: The bitnumber to start searching at
+ *
+ * Returns the bit number for the next set bit
+ * If no bits are set, returns @size.
+ */
+static inline
+unsigned long find_next_andnot_bit(const unsigned long *addr1,
+		const unsigned long *addr2, unsigned long size,
+		unsigned long offset)
+{
+	if (small_const_nbits(size)) {
+		unsigned long val;
+
+		if (unlikely(offset >= size))
+			return size;
+
+		val = *addr1 & ~*addr2 & GENMASK(size - 1, offset);
+		return val ? __ffs(val) : size;
+	}
+
+	return _find_next_andnot_bit(addr1, addr2, size, offset);
+}
+#endif
+
 #ifndef find_next_zero_bit
 /**
  * find_next_zero_bit - find the next cleared bit in a memory region
diff --git a/lib/find_bit.c b/lib/find_bit.c
index d00ee23ab657..53b02405421b 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -120,6 +120,15 @@ unsigned long _find_next_and_bit(const unsigned long *addr1, const unsigned long
 EXPORT_SYMBOL(_find_next_and_bit);
 #endif
 
+#ifndef find_next_andnot_bit
+unsigned long _find_next_andnot_bit(const unsigned long *addr1, const unsigned long *addr2,
+					unsigned long nbits, unsigned long start)
+{
+	return FIND_NEXT_BIT(addr1[idx] & ~addr2[idx], /* nop */, nbits, start);
+}
+EXPORT_SYMBOL(_find_next_andnot_bit);
+#endif
+
 #ifndef find_next_zero_bit
 unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits,
 					 unsigned long start)
-- 
2.31.1



* Re: [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface
  2022-09-23 13:25 [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface Valentin Schneider
  2022-09-23 13:25 ` [PATCH v4 1/7] lib/find_bit: Introduce find_next_andnot_bit() Valentin Schneider
@ 2022-09-23 15:44 ` Yury Norov
  2022-09-23 15:49   ` Valentin Schneider
  2022-09-23 15:55 ` [PATCH v4 2/7] cpumask: Introduce for_each_cpu_andnot() Valentin Schneider
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 25+ messages in thread
From: Yury Norov @ 2022-09-23 15:44 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: netdev, linux-rdma, linux-kernel, Saeed Mahameed,
	Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Andy Shevchenko, Rasmus Villemoes, Ingo Molnar,
	Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Mel Gorman, Greg Kroah-Hartman, Heiko Carstens,
	Tony Luck, Jonathan Cameron, Gal Pressman, Tariq Toukan,
	Jesse Brandeburg

On Fri, Sep 23, 2022 at 02:25:20PM +0100, Valentin Schneider wrote:
> Hi folks,

Hi,

I received only the 1st patch of the series. Can you give me a link to
the full series so that I can see how the new API is used?

Thanks,
Yury
 


* Re: [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface
  2022-09-23 15:44 ` [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface Yury Norov
@ 2022-09-23 15:49   ` Valentin Schneider
  0 siblings, 0 replies; 25+ messages in thread
From: Valentin Schneider @ 2022-09-23 15:49 UTC (permalink / raw)
  To: Yury Norov
  Cc: netdev, linux-rdma, linux-kernel, Saeed Mahameed,
	Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Andy Shevchenko, Rasmus Villemoes, Ingo Molnar,
	Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Mel Gorman, Greg Kroah-Hartman, Heiko Carstens,
	Tony Luck, Jonathan Cameron, Gal Pressman, Tariq Toukan,
	Jesse Brandeburg

On 23/09/22 08:44, Yury Norov wrote:
> On Fri, Sep 23, 2022 at 02:25:20PM +0100, Valentin Schneider wrote:
>> Hi folks,
>
> Hi,
>
> I received only the 1st patch of the series. Can you give me a link to
> the full series so that I can see how the new API is used?
>

Sigh, I got this when sending these out:

  4.3.0 Temporary System Problem.  Try again later (10)

I'm going to re-send the missing ones and *hopefully* have them thread
properly. Sorry about that.



* [PATCH v4 2/7] cpumask: Introduce for_each_cpu_andnot()
  2022-09-23 13:25 [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface Valentin Schneider
  2022-09-23 13:25 ` [PATCH v4 1/7] lib/find_bit: Introduce find_next_andnot_bit() Valentin Schneider
  2022-09-23 15:44 ` [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface Yury Norov
@ 2022-09-23 15:55 ` Valentin Schneider
  2022-09-25 15:23   ` Yury Norov
  2022-09-23 15:55 ` [PATCH v4 3/7] lib/test_cpumask: Add for_each_cpu_and(not) tests Valentin Schneider
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 25+ messages in thread
From: Valentin Schneider @ 2022-09-23 15:55 UTC (permalink / raw)
  To: netdev, linux-rdma, linux-kernel
  Cc: Saeed Mahameed, Leon Romanovsky, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Yury Norov, Andy Shevchenko,
	Rasmus Villemoes, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Greg Kroah-Hartman,
	Heiko Carstens, Tony Luck, Jonathan Cameron, Gal Pressman,
	Tariq Toukan, Jesse Brandeburg

for_each_cpu_and() is very convenient as it saves having to allocate a
temporary cpumask to store the result of cpumask_and(). The same issue
applies to cpumask_andnot() which doesn't actually need temporary storage
for iteration purposes.

Following what has been done for for_each_cpu_and(), introduce
for_each_cpu_andnot().
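
For illustration, the transformation this enables is (do_something() stands in
for an arbitrary loop body; see also the kerneldoc below):

  /* Before: a temporary cpumask is required */
  cpumask_andnot(&tmp, mask1, mask2);
  for_each_cpu(cpu, &tmp)
          do_something(cpu);

  /* After: no temporary storage */
  for_each_cpu_andnot(cpu, mask1, mask2)
          do_something(cpu);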

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 include/linux/cpumask.h | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 1b442fb2001f..4c69e338bb8c 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -238,6 +238,25 @@ unsigned int cpumask_next_and(int n, const struct cpumask *src1p,
 		nr_cpumask_bits, n + 1);
 }
 
+/**
+ * cpumask_next_andnot - get the next cpu in *src1p & ~*src2p
+ * @n: the cpu prior to the place to search (ie. return will be > @n)
+ * @src1p: the first cpumask pointer
+ * @src2p: the second cpumask pointer
+ *
+ * Returns >= nr_cpu_ids if no further cpus set in *src1p & ~*src2p
+ */
+static inline
+unsigned int cpumask_next_andnot(int n, const struct cpumask *src1p,
+				 const struct cpumask *src2p)
+{
+	/* -1 is a legal arg here. */
+	if (n != -1)
+		cpumask_check(n);
+	return find_next_andnot_bit(cpumask_bits(src1p), cpumask_bits(src2p),
+		nr_cpumask_bits, n + 1);
+}
+
 /**
  * for_each_cpu - iterate over every cpu in a mask
  * @cpu: the (optionally unsigned) integer iterator
@@ -317,6 +336,26 @@ unsigned int __pure cpumask_next_wrap(int n, const struct cpumask *mask, int sta
 		(cpu) = cpumask_next_and((cpu), (mask1), (mask2)),	\
 		(cpu) < nr_cpu_ids;)
 
+/**
+ * for_each_cpu_andnot - iterate over every cpu present in one mask, excluding
+ *			 those present in another.
+ * @cpu: the (optionally unsigned) integer iterator
+ * @mask1: the first cpumask pointer
+ * @mask2: the second cpumask pointer
+ *
+ * This saves a temporary CPU mask in many places.  It is equivalent to:
+ *	struct cpumask tmp;
+ *	cpumask_andnot(&tmp, &mask1, &mask2);
+ *	for_each_cpu(cpu, &tmp)
+ *		...
+ *
+ * After the loop, cpu is >= nr_cpu_ids.
+ */
+#define for_each_cpu_andnot(cpu, mask1, mask2)				\
+	for ((cpu) = -1;						\
+		(cpu) = cpumask_next_andnot((cpu), (mask1), (mask2)),	\
+		(cpu) < nr_cpu_ids;)
+
 /**
  * cpumask_any_but - return a "random" in a cpumask, but not this one.
  * @mask: the cpumask to search
-- 
2.31.1



* [PATCH v4 3/7] lib/test_cpumask: Add for_each_cpu_and(not) tests
  2022-09-23 13:25 [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface Valentin Schneider
                   ` (2 preceding siblings ...)
  2022-09-23 15:55 ` [PATCH v4 2/7] cpumask: Introduce for_each_cpu_andnot() Valentin Schneider
@ 2022-09-23 15:55 ` Valentin Schneider
  2022-09-23 15:55 ` [PATCH v4 4/7] sched/core: Merge cpumask_andnot()+for_each_cpu() into for_each_cpu_andnot() Valentin Schneider
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Valentin Schneider @ 2022-09-23 15:55 UTC (permalink / raw)
  To: netdev, linux-rdma, linux-kernel
  Cc: Yury Norov, Saeed Mahameed, Leon Romanovsky, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andy Shevchenko,
	Rasmus Villemoes, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Greg Kroah-Hartman,
	Heiko Carstens, Tony Luck, Jonathan Cameron, Gal Pressman,
	Tariq Toukan, Jesse Brandeburg

Following the recent introduction of for_each_cpu_andnot(), add some tests to
ensure that for_each_cpu_and(not) visits the same CPUs as iterating over the
result of cpumask_and(not)().

Suggested-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 lib/cpumask_kunit.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/lib/cpumask_kunit.c b/lib/cpumask_kunit.c
index ecbeec72221e..d1fc6ece21f3 100644
--- a/lib/cpumask_kunit.c
+++ b/lib/cpumask_kunit.c
@@ -33,6 +33,19 @@
 		KUNIT_EXPECT_EQ_MSG((test), nr_cpu_ids - mask_weight, iter, MASK_MSG(mask));	\
 	} while (0)
 
+#define EXPECT_FOR_EACH_CPU_OP_EQ(test, op, mask1, mask2)			\
+	do {									\
+		const cpumask_t *m1 = (mask1);					\
+		const cpumask_t *m2 = (mask2);					\
+		int weight;                                                     \
+		int cpu, iter = 0;						\
+		cpumask_##op(&mask_tmp, m1, m2);                                \
+		weight = cpumask_weight(&mask_tmp);				\
+		for_each_cpu_##op(cpu, mask1, mask2)				\
+			iter++;							\
+		KUNIT_EXPECT_EQ((test), weight, iter);				\
+	} while (0)
+
 #define EXPECT_FOR_EACH_CPU_WRAP_EQ(test, mask)			\
 	do {							\
 		const cpumask_t *m = (mask);			\
@@ -54,6 +67,7 @@
 
 static cpumask_t mask_empty;
 static cpumask_t mask_all;
+static cpumask_t mask_tmp;
 
 static void test_cpumask_weight(struct kunit *test)
 {
@@ -101,10 +115,15 @@ static void test_cpumask_iterators(struct kunit *test)
 	EXPECT_FOR_EACH_CPU_EQ(test, &mask_empty);
 	EXPECT_FOR_EACH_CPU_NOT_EQ(test, &mask_empty);
 	EXPECT_FOR_EACH_CPU_WRAP_EQ(test, &mask_empty);
+	EXPECT_FOR_EACH_CPU_OP_EQ(test, and, &mask_empty, &mask_empty);
+	EXPECT_FOR_EACH_CPU_OP_EQ(test, and, cpu_possible_mask, &mask_empty);
+	EXPECT_FOR_EACH_CPU_OP_EQ(test, andnot, &mask_empty, &mask_empty);
 
 	EXPECT_FOR_EACH_CPU_EQ(test, cpu_possible_mask);
 	EXPECT_FOR_EACH_CPU_NOT_EQ(test, cpu_possible_mask);
 	EXPECT_FOR_EACH_CPU_WRAP_EQ(test, cpu_possible_mask);
+	EXPECT_FOR_EACH_CPU_OP_EQ(test, and, cpu_possible_mask, cpu_possible_mask);
+	EXPECT_FOR_EACH_CPU_OP_EQ(test, andnot, cpu_possible_mask, &mask_empty);
 }
 
 static void test_cpumask_iterators_builtin(struct kunit *test)
-- 
2.31.1



* [PATCH v4 4/7] sched/core: Merge cpumask_andnot()+for_each_cpu() into for_each_cpu_andnot()
  2022-09-23 13:25 [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface Valentin Schneider
                   ` (3 preceding siblings ...)
  2022-09-23 15:55 ` [PATCH v4 3/7] lib/test_cpumask: Add for_each_cpu_and(not) tests Valentin Schneider
@ 2022-09-23 15:55 ` Valentin Schneider
  2022-09-23 15:55 ` [PATCH v4 5/7] sched/topology: Introduce sched_numa_hop_mask() Valentin Schneider
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Valentin Schneider @ 2022-09-23 15:55 UTC (permalink / raw)
  To: netdev, linux-rdma, linux-kernel
  Cc: Yury Norov, Saeed Mahameed, Leon Romanovsky, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andy Shevchenko,
	Rasmus Villemoes, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Greg Kroah-Hartman,
	Heiko Carstens, Tony Luck, Jonathan Cameron, Gal Pressman,
	Tariq Toukan, Jesse Brandeburg

This removes the second use of the sched_core_mask temporary mask.

Suggested-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 kernel/sched/core.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ee28253c9ac0..b4c3112b0095 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -360,10 +360,7 @@ static void __sched_core_flip(bool enabled)
 	/*
 	 * Toggle the offline CPUs.
 	 */
-	cpumask_copy(&sched_core_mask, cpu_possible_mask);
-	cpumask_andnot(&sched_core_mask, &sched_core_mask, cpu_online_mask);
-
-	for_each_cpu(cpu, &sched_core_mask)
+	for_each_cpu_andnot(cpu, cpu_possible_mask, cpu_online_mask)
 		cpu_rq(cpu)->core_enabled = enabled;
 
 	cpus_read_unlock();
-- 
2.31.1



* [PATCH v4 5/7] sched/topology: Introduce sched_numa_hop_mask()
  2022-09-23 13:25 [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface Valentin Schneider
                   ` (4 preceding siblings ...)
  2022-09-23 15:55 ` [PATCH v4 4/7] sched/core: Merge cpumask_andnot()+for_each_cpu() into for_each_cpu_andnot() Valentin Schneider
@ 2022-09-23 15:55 ` Valentin Schneider
  2022-09-25 15:00   ` Yury Norov
  2022-09-25 18:05   ` Yury Norov
  2022-09-23 15:55 ` [PATCH v4 6/7] sched/topology: Introduce for_each_numa_hop_cpu() Valentin Schneider
                   ` (3 subsequent siblings)
  9 siblings, 2 replies; 25+ messages in thread
From: Valentin Schneider @ 2022-09-23 15:55 UTC (permalink / raw)
  To: netdev, linux-rdma, linux-kernel
  Cc: Saeed Mahameed, Leon Romanovsky, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Yury Norov, Andy Shevchenko,
	Rasmus Villemoes, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Greg Kroah-Hartman,
	Heiko Carstens, Tony Luck, Jonathan Cameron, Gal Pressman,
	Tariq Toukan, Jesse Brandeburg

Tariq has pointed out that drivers allocating IRQ vectors would benefit
from having smarter NUMA-awareness - cpumask_local_spread() only knows
about the local node and everything outside is in the same bucket.

sched_domains_numa_masks is pretty much what we want to hand out (a cpumask
of CPUs reachable within a given distance budget), introduce
sched_numa_hop_mask() to export those cpumasks.
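
As a rough usage sketch (with @node and @hops coming from the caller; error
handling follows the kerneldoc below):

  const struct cpumask *mask;

  rcu_read_lock();
  mask = sched_numa_hop_mask(node, hops);
  if (!IS_ERR_OR_NULL(mask))
          pr_info("node %d, <=%d hops: %*pbl\n",
                  node, hops, cpumask_pr_args(mask));
  rcu_read_unlock();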

Link: http://lore.kernel.org/r/20220728191203.4055-1-tariqt@nvidia.com
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 include/linux/topology.h | 12 ++++++++++++
 kernel/sched/topology.c  | 31 +++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 4564faafd0e1..3e91ae6d0ad5 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -245,5 +245,17 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
 	return cpumask_of_node(cpu_to_node(cpu));
 }
 
+#ifdef CONFIG_NUMA
+extern const struct cpumask *sched_numa_hop_mask(int node, int hops);
+#else
+static inline const struct cpumask *sched_numa_hop_mask(int node, int hops)
+{
+	if (node == NUMA_NO_NODE && !hops)
+		return cpu_online_mask;
+
+	return ERR_PTR(-EOPNOTSUPP);
+}
+#endif	/* CONFIG_NUMA */
+
 
 #endif /* _LINUX_TOPOLOGY_H */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8739c2a5a54e..ee77706603c0 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2067,6 +2067,37 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
 	return found;
 }
 
+/**
+ * sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops away.
+ * @node: The node to count hops from.
+ * @hops: Include CPUs up to that many hops away. 0 means local node.
+ *
+ * Requires rcu_lock to be held. Returned cpumask is only valid within that
+ * read-side section, copy it if required beyond that.
+ *
+ * Note that not all hops are equal in distance; see sched_init_numa() for how
+ * distances and masks are handled.
+ *
+ * Also note that this is a reflection of sched_domains_numa_masks, which may change
+ * during the lifetime of the system (offline nodes are taken out of the masks).
+ */
+const struct cpumask *sched_numa_hop_mask(int node, int hops)
+{
+	struct cpumask ***masks = rcu_dereference(sched_domains_numa_masks);
+
+	if (node == NUMA_NO_NODE && !hops)
+		return cpu_online_mask;
+
+	if (node >= nr_node_ids || hops >= sched_domains_numa_levels)
+		return ERR_PTR(-EINVAL);
+
+	if (!masks)
+		return NULL;
+
+	return masks[hops][node];
+}
+EXPORT_SYMBOL_GPL(sched_numa_hop_mask);
+
 #endif /* CONFIG_NUMA */
 
 static int __sdt_alloc(const struct cpumask *cpu_map)
-- 
2.31.1



* [PATCH v4 6/7] sched/topology: Introduce for_each_numa_hop_cpu()
  2022-09-23 13:25 [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface Valentin Schneider
                   ` (5 preceding siblings ...)
  2022-09-23 15:55 ` [PATCH v4 5/7] sched/topology: Introduce sched_numa_hop_mask() Valentin Schneider
@ 2022-09-23 15:55 ` Valentin Schneider
  2022-09-25 14:58   ` Yury Norov
  2022-09-23 15:55 ` [PATCH v4 7/7] net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints Valentin Schneider
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 25+ messages in thread
From: Valentin Schneider @ 2022-09-23 15:55 UTC (permalink / raw)
  To: netdev, linux-rdma, linux-kernel
  Cc: Saeed Mahameed, Leon Romanovsky, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Yury Norov, Andy Shevchenko,
	Rasmus Villemoes, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Greg Kroah-Hartman,
	Heiko Carstens, Tony Luck, Jonathan Cameron, Gal Pressman,
	Tariq Toukan, Jesse Brandeburg

The recently introduced sched_numa_hop_mask() exposes cpumasks of CPUs
reachable within a given distance budget, but this means each successive
cpumask is a superset of the previous one.

Code wanting to allocate one item per CPU (e.g. IRQs) at increasing
distances would thus need to allocate a temporary cpumask to note which
CPUs have already been visited. This can be prevented by leveraging
for_each_cpu_andnot() - package all that logic into one ugl^D fancy macro.
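
Note that, this being a double loop, a plain 'break' only exits the inner
for_each_cpu_andnot(); terminating the walk early takes a goto, as in the
sketch below (mirroring the mlx5e usage in the last patch; @cpus and @ncpus
are assumed to be the caller's):

  int cpu, i = 0;

  rcu_read_lock();
  for_each_numa_hop_cpu(cpu, node) {
          cpus[i] = cpu;
          if (++i == ncpus)
                  goto done; /* 'break' would only exit the inner loop */
  }
done:
  rcu_read_unlock();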

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 include/linux/topology.h | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 3e91ae6d0ad5..7aa7e6a4c739 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -257,5 +257,42 @@ static inline const struct cpumask *sched_numa_hop_mask(int node, int hops)
 }
 #endif	/* CONFIG_NUMA */
 
+/**
+ * for_each_numa_hop_cpu - iterate over CPUs by increasing NUMA distance,
+ *                         starting from a given node.
+ * @cpu: the iteration variable.
+ * @node: the NUMA node to start the search from.
+ *
+ * Requires rcu_lock to be held.
+ * Careful: this is a double loop, 'break' won't work as expected.
+ *
+ *
+ * Implementation notes:
+ *
+ * Providing it is valid, the mask returned by
+ *  sched_numa_hop_mask(node, hops+1)
+ * is a superset of the one returned by
+ *   sched_numa_hop_mask(node, hops)
+ * which may not be that useful for drivers that try to spread things out and
+ * want to visit a CPU not more than once.
+ *
+ * To accommodate for that, we use for_each_cpu_andnot() to iterate over the cpus
+ * of sched_numa_hop_mask(node, hops+1) with the CPUs of
+ * sched_numa_hop_mask(node, hops) removed, IOW we only iterate over CPUs
+ * a given distance away (rather than *up to* a given distance).
+ *
+ * hops=0 forces us to play silly games: we pass cpu_none_mask to
+ * for_each_cpu_andnot(), which turns it into for_each_cpu().
+ */
+#define for_each_numa_hop_cpu(cpu, node)				       \
+	for (struct { const struct cpumask *curr, *prev; int hops; } __v =     \
+		     { sched_numa_hop_mask(node, 0), NULL, 0 };		       \
+	     !IS_ERR_OR_NULL(__v.curr);					       \
+	     __v.hops++,                                                       \
+	     __v.prev = __v.curr,					       \
+	     __v.curr = sched_numa_hop_mask(node, __v.hops))                   \
+		for_each_cpu_andnot(cpu,				       \
+				    __v.curr,				       \
+				    __v.hops ? __v.prev : cpu_none_mask)
 
 #endif /* _LINUX_TOPOLOGY_H */
-- 
2.31.1



* [PATCH v4 7/7] net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints
  2022-09-23 13:25 [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface Valentin Schneider
                   ` (6 preceding siblings ...)
  2022-09-23 15:55 ` [PATCH v4 6/7] sched/topology: Introduce for_each_numa_hop_cpu() Valentin Schneider
@ 2022-09-23 15:55 ` Valentin Schneider
  2022-09-25  7:48 ` [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface Tariq Toukan
  2022-10-18  6:36 ` Tariq Toukan
  9 siblings, 0 replies; 25+ messages in thread
From: Valentin Schneider @ 2022-09-23 15:55 UTC (permalink / raw)
  To: netdev, linux-rdma, linux-kernel
  Cc: Tariq Toukan, Saeed Mahameed, Leon Romanovsky, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Yury Norov,
	Andy Shevchenko, Rasmus Villemoes, Ingo Molnar, Peter Zijlstra,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Greg Kroah-Hartman, Heiko Carstens, Tony Luck, Jonathan Cameron,
	Gal Pressman, Jesse Brandeburg

From: Tariq Toukan <tariqt@nvidia.com>

In the IRQ affinity hints, replace the binary NUMA preference (local /
remote) with the improved for_each_numa_hop_cpu() API, which takes the actual
distances into account, so that nearby remote NUMA nodes are preferred over
farther ones.

This has significant performance implications when using NUMA-aware
allocated memory (follow [1] and derivatives for example).

[1]
drivers/net/ethernet/mellanox/mlx5/core/en_main.c :: mlx5e_open_channel()
   int cpu = cpumask_first(mlx5_comp_irq_get_affinity_mask(priv->mdev, ix));

Performance tests:

TCP multi-stream, using 16 iperf3 instances pinned to 16 cores (with aRFS on).
Active cores: 64,65,72,73,80,81,88,89,96,97,104,105,112,113,120,121

+-------------------------+-----------+------------------+------------------+
|                         | BW (Gbps) | TX side CPU util | RX side CPU util |
+-------------------------+-----------+------------------+------------------+
| Baseline                | 52.3      | 6.4 %            | 17.9 %           |
+-------------------------+-----------+------------------+------------------+
| Applied on TX side only | 52.6      | 5.2 %            | 18.5 %           |
+-------------------------+-----------+------------------+------------------+
| Applied on RX side only | 94.9      | 11.9 %           | 27.2 %           |
+-------------------------+-----------+------------------+------------------+
| Applied on both sides   | 95.1      | 8.4 %            | 27.3 %           |
+-------------------------+-----------+------------------+------------------+

The RX side bottleneck is released, reaching line rate (~1.8x speedup).
~30% less CPU utilization on TX.

* CPU util on active cores only.

Setups details (similar for both sides):

NIC: ConnectX6-DX dual port, 100 Gbps each.
Single port used in the tests.

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              256
On-line CPU(s) list: 0-255
Thread(s) per core:  2
Core(s) per socket:  64
Socket(s):           2
NUMA node(s):        16
Vendor ID:           AuthenticAMD
CPU family:          25
Model:               1
Model name:          AMD EPYC 7763 64-Core Processor
Stepping:            1
CPU MHz:             2594.804
BogoMIPS:            4890.73
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            32768K
NUMA node0 CPU(s):   0-7,128-135
NUMA node1 CPU(s):   8-15,136-143
NUMA node2 CPU(s):   16-23,144-151
NUMA node3 CPU(s):   24-31,152-159
NUMA node4 CPU(s):   32-39,160-167
NUMA node5 CPU(s):   40-47,168-175
NUMA node6 CPU(s):   48-55,176-183
NUMA node7 CPU(s):   56-63,184-191
NUMA node8 CPU(s):   64-71,192-199
NUMA node9 CPU(s):   72-79,200-207
NUMA node10 CPU(s):  80-87,208-215
NUMA node11 CPU(s):  88-95,216-223
NUMA node12 CPU(s):  96-103,224-231
NUMA node13 CPU(s):  104-111,232-239
NUMA node14 CPU(s):  112-119,240-247
NUMA node15 CPU(s):  120-127,248-255
..

$ numactl -H
..
node distances:
node   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
  0:  10  11  11  11  12  12  12  12  32  32  32  32  32  32  32  32
  1:  11  10  11  11  12  12  12  12  32  32  32  32  32  32  32  32
  2:  11  11  10  11  12  12  12  12  32  32  32  32  32  32  32  32
  3:  11  11  11  10  12  12  12  12  32  32  32  32  32  32  32  32
  4:  12  12  12  12  10  11  11  11  32  32  32  32  32  32  32  32
  5:  12  12  12  12  11  10  11  11  32  32  32  32  32  32  32  32
  6:  12  12  12  12  11  11  10  11  32  32  32  32  32  32  32  32
  7:  12  12  12  12  11  11  11  10  32  32  32  32  32  32  32  32
  8:  32  32  32  32  32  32  32  32  10  11  11  11  12  12  12  12
  9:  32  32  32  32  32  32  32  32  11  10  11  11  12  12  12  12
 10:  32  32  32  32  32  32  32  32  11  11  10  11  12  12  12  12
 11:  32  32  32  32  32  32  32  32  11  11  11  10  12  12  12  12
 12:  32  32  32  32  32  32  32  32  12  12  12  12  10  11  11  11
 13:  32  32  32  32  32  32  32  32  12  12  12  12  11  10  11  11
 14:  32  32  32  32  32  32  32  32  12  12  12  12  11  11  10  11
 15:  32  32  32  32  32  32  32  32  12  12  12  12  11  11  11  10

$ cat /sys/class/net/ens5f0/device/numa_node
14

Affinity hints (127 IRQs):
Before:
331: 00000000,00000000,00000000,00000000,00010000,00000000,00000000,00000000
332: 00000000,00000000,00000000,00000000,00020000,00000000,00000000,00000000
333: 00000000,00000000,00000000,00000000,00040000,00000000,00000000,00000000
334: 00000000,00000000,00000000,00000000,00080000,00000000,00000000,00000000
335: 00000000,00000000,00000000,00000000,00100000,00000000,00000000,00000000
336: 00000000,00000000,00000000,00000000,00200000,00000000,00000000,00000000
337: 00000000,00000000,00000000,00000000,00400000,00000000,00000000,00000000
338: 00000000,00000000,00000000,00000000,00800000,00000000,00000000,00000000
339: 00010000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
340: 00020000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
341: 00040000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
342: 00080000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
343: 00100000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
344: 00200000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
345: 00400000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
346: 00800000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
347: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
348: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000002
349: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000004
350: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000008
351: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000010
352: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000020
353: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000040
354: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000080
355: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000100
356: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000200
357: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000400
358: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000800
359: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00001000
360: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00002000
361: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00004000
362: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00008000
363: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00010000
364: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00020000
365: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00040000
366: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00080000
367: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00100000
368: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00200000
369: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00400000
370: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00800000
371: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,01000000
372: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,02000000
373: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,04000000
374: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,08000000
375: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,10000000
376: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,20000000
377: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,40000000
378: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,80000000
379: 00000000,00000000,00000000,00000000,00000000,00000000,00000001,00000000
380: 00000000,00000000,00000000,00000000,00000000,00000000,00000002,00000000
381: 00000000,00000000,00000000,00000000,00000000,00000000,00000004,00000000
382: 00000000,00000000,00000000,00000000,00000000,00000000,00000008,00000000
383: 00000000,00000000,00000000,00000000,00000000,00000000,00000010,00000000
384: 00000000,00000000,00000000,00000000,00000000,00000000,00000020,00000000
385: 00000000,00000000,00000000,00000000,00000000,00000000,00000040,00000000
386: 00000000,00000000,00000000,00000000,00000000,00000000,00000080,00000000
387: 00000000,00000000,00000000,00000000,00000000,00000000,00000100,00000000
388: 00000000,00000000,00000000,00000000,00000000,00000000,00000200,00000000
389: 00000000,00000000,00000000,00000000,00000000,00000000,00000400,00000000
390: 00000000,00000000,00000000,00000000,00000000,00000000,00000800,00000000
391: 00000000,00000000,00000000,00000000,00000000,00000000,00001000,00000000
392: 00000000,00000000,00000000,00000000,00000000,00000000,00002000,00000000
393: 00000000,00000000,00000000,00000000,00000000,00000000,00004000,00000000
394: 00000000,00000000,00000000,00000000,00000000,00000000,00008000,00000000
395: 00000000,00000000,00000000,00000000,00000000,00000000,00010000,00000000
396: 00000000,00000000,00000000,00000000,00000000,00000000,00020000,00000000
397: 00000000,00000000,00000000,00000000,00000000,00000000,00040000,00000000
398: 00000000,00000000,00000000,00000000,00000000,00000000,00080000,00000000
399: 00000000,00000000,00000000,00000000,00000000,00000000,00100000,00000000
400: 00000000,00000000,00000000,00000000,00000000,00000000,00200000,00000000
401: 00000000,00000000,00000000,00000000,00000000,00000000,00400000,00000000
402: 00000000,00000000,00000000,00000000,00000000,00000000,00800000,00000000
403: 00000000,00000000,00000000,00000000,00000000,00000000,01000000,00000000
404: 00000000,00000000,00000000,00000000,00000000,00000000,02000000,00000000
405: 00000000,00000000,00000000,00000000,00000000,00000000,04000000,00000000
406: 00000000,00000000,00000000,00000000,00000000,00000000,08000000,00000000
407: 00000000,00000000,00000000,00000000,00000000,00000000,10000000,00000000
408: 00000000,00000000,00000000,00000000,00000000,00000000,20000000,00000000
409: 00000000,00000000,00000000,00000000,00000000,00000000,40000000,00000000
410: 00000000,00000000,00000000,00000000,00000000,00000000,80000000,00000000
411: 00000000,00000000,00000000,00000000,00000000,00000001,00000000,00000000
412: 00000000,00000000,00000000,00000000,00000000,00000002,00000000,00000000
413: 00000000,00000000,00000000,00000000,00000000,00000004,00000000,00000000
414: 00000000,00000000,00000000,00000000,00000000,00000008,00000000,00000000
415: 00000000,00000000,00000000,00000000,00000000,00000010,00000000,00000000
416: 00000000,00000000,00000000,00000000,00000000,00000020,00000000,00000000
417: 00000000,00000000,00000000,00000000,00000000,00000040,00000000,00000000
418: 00000000,00000000,00000000,00000000,00000000,00000080,00000000,00000000
419: 00000000,00000000,00000000,00000000,00000000,00000100,00000000,00000000
420: 00000000,00000000,00000000,00000000,00000000,00000200,00000000,00000000
421: 00000000,00000000,00000000,00000000,00000000,00000400,00000000,00000000
422: 00000000,00000000,00000000,00000000,00000000,00000800,00000000,00000000
423: 00000000,00000000,00000000,00000000,00000000,00001000,00000000,00000000
424: 00000000,00000000,00000000,00000000,00000000,00002000,00000000,00000000
425: 00000000,00000000,00000000,00000000,00000000,00004000,00000000,00000000
426: 00000000,00000000,00000000,00000000,00000000,00008000,00000000,00000000
427: 00000000,00000000,00000000,00000000,00000000,00010000,00000000,00000000
428: 00000000,00000000,00000000,00000000,00000000,00020000,00000000,00000000
429: 00000000,00000000,00000000,00000000,00000000,00040000,00000000,00000000
430: 00000000,00000000,00000000,00000000,00000000,00080000,00000000,00000000
431: 00000000,00000000,00000000,00000000,00000000,00100000,00000000,00000000
432: 00000000,00000000,00000000,00000000,00000000,00200000,00000000,00000000
433: 00000000,00000000,00000000,00000000,00000000,00400000,00000000,00000000
434: 00000000,00000000,00000000,00000000,00000000,00800000,00000000,00000000
435: 00000000,00000000,00000000,00000000,00000000,01000000,00000000,00000000
436: 00000000,00000000,00000000,00000000,00000000,02000000,00000000,00000000
437: 00000000,00000000,00000000,00000000,00000000,04000000,00000000,00000000
438: 00000000,00000000,00000000,00000000,00000000,08000000,00000000,00000000
439: 00000000,00000000,00000000,00000000,00000000,10000000,00000000,00000000
440: 00000000,00000000,00000000,00000000,00000000,20000000,00000000,00000000
441: 00000000,00000000,00000000,00000000,00000000,40000000,00000000,00000000
442: 00000000,00000000,00000000,00000000,00000000,80000000,00000000,00000000
443: 00000000,00000000,00000000,00000000,00000001,00000000,00000000,00000000
444: 00000000,00000000,00000000,00000000,00000002,00000000,00000000,00000000
445: 00000000,00000000,00000000,00000000,00000004,00000000,00000000,00000000
446: 00000000,00000000,00000000,00000000,00000008,00000000,00000000,00000000
447: 00000000,00000000,00000000,00000000,00000010,00000000,00000000,00000000
448: 00000000,00000000,00000000,00000000,00000020,00000000,00000000,00000000
449: 00000000,00000000,00000000,00000000,00000040,00000000,00000000,00000000
450: 00000000,00000000,00000000,00000000,00000080,00000000,00000000,00000000
451: 00000000,00000000,00000000,00000000,00000100,00000000,00000000,00000000
452: 00000000,00000000,00000000,00000000,00000200,00000000,00000000,00000000
453: 00000000,00000000,00000000,00000000,00000400,00000000,00000000,00000000
454: 00000000,00000000,00000000,00000000,00000800,00000000,00000000,00000000
455: 00000000,00000000,00000000,00000000,00001000,00000000,00000000,00000000
456: 00000000,00000000,00000000,00000000,00002000,00000000,00000000,00000000
457: 00000000,00000000,00000000,00000000,00004000,00000000,00000000,00000000

After:
331: 00000000,00000000,00000000,00000000,00010000,00000000,00000000,00000000
332: 00000000,00000000,00000000,00000000,00020000,00000000,00000000,00000000
333: 00000000,00000000,00000000,00000000,00040000,00000000,00000000,00000000
334: 00000000,00000000,00000000,00000000,00080000,00000000,00000000,00000000
335: 00000000,00000000,00000000,00000000,00100000,00000000,00000000,00000000
336: 00000000,00000000,00000000,00000000,00200000,00000000,00000000,00000000
337: 00000000,00000000,00000000,00000000,00400000,00000000,00000000,00000000
338: 00000000,00000000,00000000,00000000,00800000,00000000,00000000,00000000
339: 00010000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
340: 00020000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
341: 00040000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
342: 00080000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
343: 00100000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
344: 00200000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
345: 00400000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
346: 00800000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
347: 00000000,00000000,00000000,00000000,00000001,00000000,00000000,00000000
348: 00000000,00000000,00000000,00000000,00000002,00000000,00000000,00000000
349: 00000000,00000000,00000000,00000000,00000004,00000000,00000000,00000000
350: 00000000,00000000,00000000,00000000,00000008,00000000,00000000,00000000
351: 00000000,00000000,00000000,00000000,00000010,00000000,00000000,00000000
352: 00000000,00000000,00000000,00000000,00000020,00000000,00000000,00000000
353: 00000000,00000000,00000000,00000000,00000040,00000000,00000000,00000000
354: 00000000,00000000,00000000,00000000,00000080,00000000,00000000,00000000
355: 00000000,00000000,00000000,00000000,00000100,00000000,00000000,00000000
356: 00000000,00000000,00000000,00000000,00000200,00000000,00000000,00000000
357: 00000000,00000000,00000000,00000000,00000400,00000000,00000000,00000000
358: 00000000,00000000,00000000,00000000,00000800,00000000,00000000,00000000
359: 00000000,00000000,00000000,00000000,00001000,00000000,00000000,00000000
360: 00000000,00000000,00000000,00000000,00002000,00000000,00000000,00000000
361: 00000000,00000000,00000000,00000000,00004000,00000000,00000000,00000000
362: 00000000,00000000,00000000,00000000,00008000,00000000,00000000,00000000
363: 00000000,00000000,00000000,00000000,01000000,00000000,00000000,00000000
364: 00000000,00000000,00000000,00000000,02000000,00000000,00000000,00000000
365: 00000000,00000000,00000000,00000000,04000000,00000000,00000000,00000000
366: 00000000,00000000,00000000,00000000,08000000,00000000,00000000,00000000
367: 00000000,00000000,00000000,00000000,10000000,00000000,00000000,00000000
368: 00000000,00000000,00000000,00000000,20000000,00000000,00000000,00000000
369: 00000000,00000000,00000000,00000000,40000000,00000000,00000000,00000000
370: 00000000,00000000,00000000,00000000,80000000,00000000,00000000,00000000
371: 00000001,00000000,00000000,00000000,00000000,00000000,00000000,00000000
372: 00000002,00000000,00000000,00000000,00000000,00000000,00000000,00000000
373: 00000004,00000000,00000000,00000000,00000000,00000000,00000000,00000000
374: 00000008,00000000,00000000,00000000,00000000,00000000,00000000,00000000
375: 00000010,00000000,00000000,00000000,00000000,00000000,00000000,00000000
376: 00000020,00000000,00000000,00000000,00000000,00000000,00000000,00000000
377: 00000040,00000000,00000000,00000000,00000000,00000000,00000000,00000000
378: 00000080,00000000,00000000,00000000,00000000,00000000,00000000,00000000
379: 00000100,00000000,00000000,00000000,00000000,00000000,00000000,00000000
380: 00000200,00000000,00000000,00000000,00000000,00000000,00000000,00000000
381: 00000400,00000000,00000000,00000000,00000000,00000000,00000000,00000000
382: 00000800,00000000,00000000,00000000,00000000,00000000,00000000,00000000
383: 00001000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
384: 00002000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
385: 00004000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
386: 00008000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
387: 01000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
388: 02000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
389: 04000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
390: 08000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
391: 10000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
392: 20000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
393: 40000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
394: 80000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
395: 00000000,00000000,00000000,00000000,00000000,00000001,00000000,00000000
396: 00000000,00000000,00000000,00000000,00000000,00000002,00000000,00000000
397: 00000000,00000000,00000000,00000000,00000000,00000004,00000000,00000000
398: 00000000,00000000,00000000,00000000,00000000,00000008,00000000,00000000
399: 00000000,00000000,00000000,00000000,00000000,00000010,00000000,00000000
400: 00000000,00000000,00000000,00000000,00000000,00000020,00000000,00000000
401: 00000000,00000000,00000000,00000000,00000000,00000040,00000000,00000000
402: 00000000,00000000,00000000,00000000,00000000,00000080,00000000,00000000
403: 00000000,00000000,00000000,00000000,00000000,00000100,00000000,00000000
404: 00000000,00000000,00000000,00000000,00000000,00000200,00000000,00000000
405: 00000000,00000000,00000000,00000000,00000000,00000400,00000000,00000000
406: 00000000,00000000,00000000,00000000,00000000,00000800,00000000,00000000
407: 00000000,00000000,00000000,00000000,00000000,00001000,00000000,00000000
408: 00000000,00000000,00000000,00000000,00000000,00002000,00000000,00000000
409: 00000000,00000000,00000000,00000000,00000000,00004000,00000000,00000000
410: 00000000,00000000,00000000,00000000,00000000,00008000,00000000,00000000
411: 00000000,00000000,00000000,00000000,00000000,00010000,00000000,00000000
412: 00000000,00000000,00000000,00000000,00000000,00020000,00000000,00000000
413: 00000000,00000000,00000000,00000000,00000000,00040000,00000000,00000000
414: 00000000,00000000,00000000,00000000,00000000,00080000,00000000,00000000
415: 00000000,00000000,00000000,00000000,00000000,00100000,00000000,00000000
416: 00000000,00000000,00000000,00000000,00000000,00200000,00000000,00000000
417: 00000000,00000000,00000000,00000000,00000000,00400000,00000000,00000000
418: 00000000,00000000,00000000,00000000,00000000,00800000,00000000,00000000
419: 00000000,00000000,00000000,00000000,00000000,01000000,00000000,00000000
420: 00000000,00000000,00000000,00000000,00000000,02000000,00000000,00000000
421: 00000000,00000000,00000000,00000000,00000000,04000000,00000000,00000000
422: 00000000,00000000,00000000,00000000,00000000,08000000,00000000,00000000
423: 00000000,00000000,00000000,00000000,00000000,10000000,00000000,00000000
424: 00000000,00000000,00000000,00000000,00000000,20000000,00000000,00000000
425: 00000000,00000000,00000000,00000000,00000000,40000000,00000000,00000000
426: 00000000,00000000,00000000,00000000,00000000,80000000,00000000,00000000
427: 00000000,00000001,00000000,00000000,00000000,00000000,00000000,00000000
428: 00000000,00000002,00000000,00000000,00000000,00000000,00000000,00000000
429: 00000000,00000004,00000000,00000000,00000000,00000000,00000000,00000000
430: 00000000,00000008,00000000,00000000,00000000,00000000,00000000,00000000
431: 00000000,00000010,00000000,00000000,00000000,00000000,00000000,00000000
432: 00000000,00000020,00000000,00000000,00000000,00000000,00000000,00000000
433: 00000000,00000040,00000000,00000000,00000000,00000000,00000000,00000000
434: 00000000,00000080,00000000,00000000,00000000,00000000,00000000,00000000
435: 00000000,00000100,00000000,00000000,00000000,00000000,00000000,00000000
436: 00000000,00000200,00000000,00000000,00000000,00000000,00000000,00000000
437: 00000000,00000400,00000000,00000000,00000000,00000000,00000000,00000000
438: 00000000,00000800,00000000,00000000,00000000,00000000,00000000,00000000
439: 00000000,00001000,00000000,00000000,00000000,00000000,00000000,00000000
440: 00000000,00002000,00000000,00000000,00000000,00000000,00000000,00000000
441: 00000000,00004000,00000000,00000000,00000000,00000000,00000000,00000000
442: 00000000,00008000,00000000,00000000,00000000,00000000,00000000,00000000
443: 00000000,00010000,00000000,00000000,00000000,00000000,00000000,00000000
444: 00000000,00020000,00000000,00000000,00000000,00000000,00000000,00000000
445: 00000000,00040000,00000000,00000000,00000000,00000000,00000000,00000000
446: 00000000,00080000,00000000,00000000,00000000,00000000,00000000,00000000
447: 00000000,00100000,00000000,00000000,00000000,00000000,00000000,00000000
448: 00000000,00200000,00000000,00000000,00000000,00000000,00000000,00000000
449: 00000000,00400000,00000000,00000000,00000000,00000000,00000000,00000000
450: 00000000,00800000,00000000,00000000,00000000,00000000,00000000,00000000
451: 00000000,01000000,00000000,00000000,00000000,00000000,00000000,00000000
452: 00000000,02000000,00000000,00000000,00000000,00000000,00000000,00000000
453: 00000000,04000000,00000000,00000000,00000000,00000000,00000000,00000000
454: 00000000,08000000,00000000,00000000,00000000,00000000,00000000,00000000
455: 00000000,10000000,00000000,00000000,00000000,00000000,00000000,00000000
456: 00000000,20000000,00000000,00000000,00000000,00000000,00000000,00000000
457: 00000000,40000000,00000000,00000000,00000000,00000000,00000000,00000000

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/eq.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 229728c80233..4818cc0c9bc3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -812,6 +812,7 @@ static int comp_irqs_request(struct mlx5_core_dev *dev)
 	int ncomp_eqs = table->num_comp_eqs;
 	u16 *cpus;
 	int ret;
+	int cpu;
 	int i;
 
 	ncomp_eqs = table->num_comp_eqs;
@@ -830,8 +831,16 @@ static int comp_irqs_request(struct mlx5_core_dev *dev)
 		ret = -ENOMEM;
 		goto free_irqs;
 	}
-	for (i = 0; i < ncomp_eqs; i++)
-		cpus[i] = cpumask_local_spread(i, dev->priv.numa_node);
+
+	i = 0;
+	rcu_read_lock();
+	for_each_numa_hop_cpu(cpu, dev->priv.numa_node) {
+		cpus[i] = cpu;
+		if (++i == ncomp_eqs)
+			goto spread_done;
+	}
+spread_done:
+	rcu_read_unlock();
 	ret = mlx5_irqs_request_vectors(dev, cpus, ncomp_eqs, table->comp_irqs);
 	kfree(cpus);
 	if (ret < 0)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface
  2022-09-23 13:25 [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface Valentin Schneider
                   ` (7 preceding siblings ...)
  2022-09-23 15:55 ` [PATCH v4 7/7] net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints Valentin Schneider
@ 2022-09-25  7:48 ` Tariq Toukan
  2022-10-18  6:36 ` Tariq Toukan
  9 siblings, 0 replies; 25+ messages in thread
From: Tariq Toukan @ 2022-09-25  7:48 UTC (permalink / raw)
  To: Valentin Schneider, netdev, linux-rdma, linux-kernel
  Cc: Saeed Mahameed, Leon Romanovsky, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Yury Norov, Andy Shevchenko,
	Rasmus Villemoes, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Greg Kroah-Hartman,
	Heiko Carstens, Tony Luck, Jonathan Cameron, Gal Pressman,
	Tariq Toukan, Jesse Brandeburg



On 9/23/2022 4:25 PM, Valentin Schneider wrote:
> Hi folks,
> 
> Tariq pointed out in [1] that drivers allocating IRQ vectors would benefit
> from having smarter NUMA-awareness (cpumask_local_spread() doesn't quite cut
> it).
> 
> The proposed interface involved an array of CPUs and a temporary cpumask, and
> being my difficult self what I'm proposing here is an interface that doesn't
> require any temporary storage other than some stack variables (at the cost of
> one wild macro).
> 
> Please note that this is based on top of Yury's bitmap-for-next [2] to leverage
> his fancy new FIND_NEXT_BIT() macro.
> 
> [1]: https://lore.kernel.org/all/20220728191203.4055-1-tariqt@nvidia.com/
> [2]: https://github.com/norov/linux.git/ -b bitmap-for-next
> 
> A note on treewide use of for_each_cpu_andnot()
> ===============================================
> 
> I've used the below coccinelle script to find places that could be patched (I
> couldn't figure out the valid syntax to patch from coccinelle itself):
> 
> ,-----
> @tmpandnot@
> expression tmpmask;
> iterator for_each_cpu;
> position p;
> statement S;
> @@
> cpumask_andnot(tmpmask, ...);
> 
> ...
> 
> (
> for_each_cpu@p(..., tmpmask, ...)
> 	S
> |
> for_each_cpu@p(..., tmpmask, ...)
> {
> 	...
> }
> )
> 
> @script:python depends on tmpandnot@
> p << tmpandnot.p;
> @@
> coccilib.report.print_report(p[0], "andnot loop here")
> '-----
> 
> Which yields (against c40e8341e3b3):
> 
> .//arch/powerpc/kernel/smp.c:1587:1-13: andnot loop here
> .//arch/powerpc/kernel/smp.c:1530:1-13: andnot loop here
> .//arch/powerpc/kernel/smp.c:1440:1-13: andnot loop here
> .//arch/powerpc/platforms/powernv/subcore.c:306:2-14: andnot loop here
> .//arch/x86/kernel/apic/x2apic_cluster.c:62:1-13: andnot loop here
> .//drivers/acpi/acpi_pad.c:110:1-13: andnot loop here
> .//drivers/cpufreq/armada-8k-cpufreq.c:148:1-13: andnot loop here
> .//drivers/cpufreq/powernv-cpufreq.c:931:1-13: andnot loop here
> .//drivers/net/ethernet/sfc/efx_channels.c:73:1-13: andnot loop here
> .//drivers/net/ethernet/sfc/siena/efx_channels.c:73:1-13: andnot loop here
> .//kernel/sched/core.c:345:1-13: andnot loop here
> .//kernel/sched/core.c:366:1-13: andnot loop here
> .//net/core/dev.c:3058:1-13: andnot loop here
> 
> A lot of those are actually of the shape
> 
>    for_each_cpu(cpu, mask) {
>        ...
>        cpumask_andnot(mask, ...);
>    }
> 
> I think *some* of the powerpc ones would be a match for for_each_cpu_andnot(),
> but I decided to just stick to the one obvious one in __sched_core_flip().
>    
> Revisions
> =========
> 
> v3 -> v4
> ++++++++
> 
> o Rebased on top of Yury's bitmap-for-next
> o Added Tariq's mlx5e patch
> o Made sched_numa_hop_mask() return cpu_online_mask for the NUMA_NO_NODE &&
>    hops=0 case
> 
> v2 -> v3
> ++++++++
> 
> o Added for_each_cpu_and() and for_each_cpu_andnot() tests (Yury)
> o New patches to fix issues raised by running the above
> 
> o New patch to use for_each_cpu_andnot() in sched/core.c (Yury)
> 
> v1 -> v2
> ++++++++
> 
> o Split _find_next_bit() @invert into @invert1 and @invert2 (Yury)
> o Rebase onto v6.0-rc1
> 
> Cheers,
> Valentin
> 
> Tariq Toukan (1):
>    net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity
>      hints
> 
> Valentin Schneider (6):
>    lib/find_bit: Introduce find_next_andnot_bit()
>    cpumask: Introduce for_each_cpu_andnot()
>    lib/test_cpumask: Add for_each_cpu_and(not) tests
>    sched/core: Merge cpumask_andnot()+for_each_cpu() into
>      for_each_cpu_andnot()
>    sched/topology: Introduce sched_numa_hop_mask()
>    sched/topology: Introduce for_each_numa_hop_cpu()
> 
>   drivers/net/ethernet/mellanox/mlx5/core/eq.c | 13 +++++-
>   include/linux/cpumask.h                      | 39 ++++++++++++++++
>   include/linux/find.h                         | 33 +++++++++++++
>   include/linux/topology.h                     | 49 ++++++++++++++++++++
>   kernel/sched/core.c                          |  5 +-
>   kernel/sched/topology.c                      | 31 +++++++++++++
>   lib/cpumask_kunit.c                          | 19 ++++++++
>   lib/find_bit.c                               |  9 ++++
>   8 files changed, 192 insertions(+), 6 deletions(-)
> 
> --
> 2.31.1
> 

Valentin, thank you for investing your time here.

Acked-by: Tariq Toukan <tariqt@nvidia.com>

Tested in my mlx5 environment.
It works as expected, including the case of node == NUMA_NO_NODE.

Regards,
Tariq

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 6/7] sched/topology: Introduce for_each_numa_hop_cpu()
  2022-09-23 15:55 ` [PATCH v4 6/7] sched/topology: Introduce for_each_numa_hop_cpu() Valentin Schneider
@ 2022-09-25 14:58   ` Yury Norov
  2022-09-27 16:45     ` Valentin Schneider
  0 siblings, 1 reply; 25+ messages in thread
From: Yury Norov @ 2022-09-25 14:58 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: netdev, linux-rdma, linux-kernel, Saeed Mahameed,
	Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Andy Shevchenko, Rasmus Villemoes, Ingo Molnar,
	Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Mel Gorman, Greg Kroah-Hartman, Heiko Carstens,
	Tony Luck, Jonathan Cameron, Gal Pressman, Tariq Toukan,
	Jesse Brandeburg

On Fri, Sep 23, 2022 at 04:55:41PM +0100, Valentin Schneider wrote:
> The recently introduced sched_numa_hop_mask() exposes cpumasks of CPUs
> reachable within a given distance budget, but this means each successive
> cpumask is a superset of the previous one.
> 
> Code wanting to allocate one item per CPU (e.g. IRQs) at increasing
> distances would thus need to allocate a temporary cpumask to note which
> CPUs have already been visited. This can be prevented by leveraging
> for_each_cpu_andnot() - package all that logic into one ugl^D fancy macro.
> 
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
>  include/linux/topology.h | 37 +++++++++++++++++++++++++++++++++++++
>  1 file changed, 37 insertions(+)
> 
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index 3e91ae6d0ad5..7aa7e6a4c739 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -257,5 +257,42 @@ static inline const struct cpumask *sched_numa_hop_mask(int node, int hops)
>  }
>  #endif	/* CONFIG_NUMA */
>  
> +/**
> + * for_each_numa_hop_cpu - iterate over CPUs by increasing NUMA distance,
> + *                         starting from a given node.
> + * @cpu: the iteration variable.
> + * @node: the NUMA node to start the search from.
> + *
> + * Requires rcu_lock to be held.
> + * Careful: this is a double loop, 'break' won't work as expected.

This warning concerns me not only because the new iteration loop hides
complexity and breaks 'break' (sic!), but also because it looks too
specific. Why don't you split it, so that instead of:

       for_each_numa_hop_cpu(cpu, dev->priv.numa_node) {
               cpus[i] = cpu;
               if (++i == ncomp_eqs)
                       goto spread_done;
       }

in the following patch you would have something like this:

       for_each_node_hop(hop, node) {
               const struct cpumask *hop_cpus = sched_numa_hop_mask(node, hop);

               for_each_cpu_andnot(cpu, hop_cpus, ...) {
                       cpus[i] = cpu;
                       if (++i == ncomp_eqs)
                               goto spread_done;
               }
       }

It looks more bulky, but I believe there will be more users for
for_each_node_hop() alone.

On top of that, if you really like it, you can implement
for_each_numa_hop_cpu() if you want.
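
As an illustration, such a hop iterator could be built directly on top of
sched_numa_hop_mask(), along these lines (rough sketch; for_each_node_hop()
is only a proposed name, not an existing macro):

  /* Requires rcu_read_lock() to be held, like sched_numa_hop_mask() itself. */
  #define for_each_node_hop(hop, node)					\
	for ((hop) = 0;							\
	     !IS_ERR_OR_NULL(sched_numa_hop_mask((node), (hop)));	\
	     (hop)++)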

> + * Implementation notes:
> + *
> + * Providing it is valid, the mask returned by
> + *  sched_numa_hop_mask(node, hops+1)
> + * is a superset of the one returned by
> + *   sched_numa_hop_mask(node, hops)
> + * which may not be that useful for drivers that try to spread things out and
> + * want to visit a CPU not more than once.
> + *
> + * To accommodate for that, we use for_each_cpu_andnot() to iterate over the cpus
> + * of sched_numa_hop_mask(node, hops+1) with the CPUs of
> + * sched_numa_hop_mask(node, hops) removed, IOW we only iterate over CPUs
> + * a given distance away (rather than *up to* a given distance).
> + *
> + * hops=0 forces us to play silly games: we pass cpu_none_mask to
> + * for_each_cpu_andnot(), which turns it into for_each_cpu().
> + */
> +#define for_each_numa_hop_cpu(cpu, node)				       \
> +	for (struct { const struct cpumask *curr, *prev; int hops; } __v =     \
> +		     { sched_numa_hop_mask(node, 0), NULL, 0 };		       \

This anonymous structure is never used as a structure. Why do you
define it? Why not just declare hops, prev and curr without packing
them?

Thanks,
Yury

> +	     !IS_ERR_OR_NULL(__v.curr);					       \
> +	     __v.hops++,                                                       \
> +	     __v.prev = __v.curr,					       \
> +	     __v.curr = sched_numa_hop_mask(node, __v.hops))                   \
> +		for_each_cpu_andnot(cpu,				       \
> +				    __v.curr,				       \
> +				    __v.hops ? __v.prev : cpu_none_mask)
>  
>  #endif /* _LINUX_TOPOLOGY_H */
> -- 
> 2.31.1

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 5/7] sched/topology: Introduce sched_numa_hop_mask()
  2022-09-23 15:55 ` [PATCH v4 5/7] sched/topology: Introduce sched_numa_hop_mask() Valentin Schneider
@ 2022-09-25 15:00   ` Yury Norov
  2022-09-25 15:24     ` Yury Norov
  2022-09-27 16:45     ` Valentin Schneider
  2022-09-25 18:05   ` Yury Norov
  1 sibling, 2 replies; 25+ messages in thread
From: Yury Norov @ 2022-09-25 15:00 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: netdev, linux-rdma, linux-kernel, Saeed Mahameed,
	Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Andy Shevchenko, Rasmus Villemoes, Ingo Molnar,
	Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Mel Gorman, Greg Kroah-Hartman, Heiko Carstens,
	Tony Luck, Jonathan Cameron, Gal Pressman, Tariq Toukan,
	Jesse Brandeburg

On Fri, Sep 23, 2022 at 04:55:40PM +0100, Valentin Schneider wrote:
> Tariq has pointed out that drivers allocating IRQ vectors would benefit
> from having smarter NUMA-awareness - cpumask_local_spread() only knows
> about the local node and everything outside is in the same bucket.
> 
> sched_domains_numa_masks is pretty much what we want to hand out (a cpumask
> of CPUs reachable within a given distance budget), introduce
> sched_numa_hop_mask() to export those cpumasks.
> 
> Link: http://lore.kernel.org/r/20220728191203.4055-1-tariqt@nvidia.com
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
>  include/linux/topology.h | 12 ++++++++++++
>  kernel/sched/topology.c  | 31 +++++++++++++++++++++++++++++++
>  2 files changed, 43 insertions(+)
> 
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index 4564faafd0e1..3e91ae6d0ad5 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -245,5 +245,17 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
>  	return cpumask_of_node(cpu_to_node(cpu));
>  }
>  
> +#ifdef CONFIG_NUMA
> +extern const struct cpumask *sched_numa_hop_mask(int node, int hops);
> +#else
> +static inline const struct cpumask *sched_numa_hop_mask(int node, int hops)
> +{
> +	if (node == NUMA_NO_NODE && !hops)
> +		return cpu_online_mask;
> +
> +	return ERR_PTR(-EOPNOTSUPP);
> +}
> +#endif	/* CONFIG_NUMA */
> +
>  
>  #endif /* _LINUX_TOPOLOGY_H */
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 8739c2a5a54e..ee77706603c0 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2067,6 +2067,37 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
>  	return found;
>  }
>  
> +/**
> + * sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops away.
> + * @node: The node to count hops from.
> + * @hops: Include CPUs up to that many hops away. 0 means local node.
> + *
> + * Requires rcu_lock to be held. Returned cpumask is only valid within that
> + * read-side section, copy it if required beyond that.
> + *
> + * Note that not all hops are equal in distance; see sched_init_numa() for how
> + * distances and masks are handled.
> + *
> + * Also note that this is a reflection of sched_domains_numa_masks, which may change
> + * during the lifetime of the system (offline nodes are taken out of the masks).
> + */

Since it's exported, can you declare function parameters and return
values properly?

> +const struct cpumask *sched_numa_hop_mask(int node, int hops)
> +{
> +	struct cpumask ***masks = rcu_dereference(sched_domains_numa_masks);
> +
> +	if (node == NUMA_NO_NODE && !hops)
> +		return cpu_online_mask;
> +
> +	if (node >= nr_node_ids || hops >= sched_domains_numa_levels)
> +		return ERR_PTR(-EINVAL);
> +
> +	if (!masks)
> +		return NULL;
> +
> +	return masks[hops][node];
> +}
> +EXPORT_SYMBOL_GPL(sched_numa_hop_mask);
> +
>  #endif /* CONFIG_NUMA */
>  
>  static int __sdt_alloc(const struct cpumask *cpu_map)
> -- 
> 2.31.1

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 2/7] cpumask: Introduce for_each_cpu_andnot()
  2022-09-23 15:55 ` [PATCH v4 2/7] cpumask: Introduce for_each_cpu_andnot() Valentin Schneider
@ 2022-09-25 15:23   ` Yury Norov
  2022-09-27 16:45     ` Valentin Schneider
  0 siblings, 1 reply; 25+ messages in thread
From: Yury Norov @ 2022-09-25 15:23 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: netdev, linux-rdma, linux-kernel, Saeed Mahameed,
	Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Andy Shevchenko, Rasmus Villemoes, Ingo Molnar,
	Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Mel Gorman, Greg Kroah-Hartman, Heiko Carstens,
	Tony Luck, Jonathan Cameron, Gal Pressman, Tariq Toukan,
	Jesse Brandeburg

On Fri, Sep 23, 2022 at 04:55:37PM +0100, Valentin Schneider wrote:
> for_each_cpu_and() is very convenient as it saves having to allocate a
> temporary cpumask to store the result of cpumask_and(). The same issue
> applies to cpumask_andnot() which doesn't actually need temporary storage
> for iteration purposes.
> 
> Following what has been done for for_each_cpu_and(), introduce
> for_each_cpu_andnot().
> 
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
>  include/linux/cpumask.h | 39 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 39 insertions(+)
> 
> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> index 1b442fb2001f..4c69e338bb8c 100644
> --- a/include/linux/cpumask.h
> +++ b/include/linux/cpumask.h
> @@ -238,6 +238,25 @@ unsigned int cpumask_next_and(int n, const struct cpumask *src1p,
>  		nr_cpumask_bits, n + 1);
>  }
>  
> +/**
> + * cpumask_next_andnot - get the next cpu in *src1p & ~*src2p
> + * @n: the cpu prior to the place to search (ie. return will be > @n)
> + * @src1p: the first cpumask pointer
> + * @src2p: the second cpumask pointer
> + *
> + * Returns >= nr_cpu_ids if no further cpus set in *src1p & ~*src2p
> + */
> +static inline
> +unsigned int cpumask_next_andnot(int n, const struct cpumask *src1p,
> +				 const struct cpumask *src2p)
> +{
> +	/* -1 is a legal arg here. */
> +	if (n != -1)
> +		cpumask_check(n);

This is wrong: n == nr_cpumask_bits - 1 should also be illegal here, since
the valid range for @n in a 'next' helper is [-1, nr_cpumask_bits - 1). The
correct check is:
cpumask_check(n + 1);
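
Applied to the helper quoted above, that would look roughly like this
(sketch only, not the posted patch):

  static inline
  unsigned int cpumask_next_andnot(int n, const struct cpumask *src1p,
				   const struct cpumask *src2p)
  {
	/*
	 * -1 stays a legal start; the shifted check also warns for
	 * n == nr_cpumask_bits - 1, past which no further CPU can be found.
	 */
	cpumask_check(n + 1);
	return find_next_andnot_bit(cpumask_bits(src1p), cpumask_bits(src2p),
		nr_cpumask_bits, n + 1);
  }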

> +	return find_next_andnot_bit(cpumask_bits(src1p), cpumask_bits(src2p),
> +		nr_cpumask_bits, n + 1);
> +}
> +
>  /**
>   * for_each_cpu - iterate over every cpu in a mask
>   * @cpu: the (optionally unsigned) integer iterator
> @@ -317,6 +336,26 @@ unsigned int __pure cpumask_next_wrap(int n, const struct cpumask *mask, int sta
>  		(cpu) = cpumask_next_and((cpu), (mask1), (mask2)),	\
>  		(cpu) < nr_cpu_ids;)
>  
> +/**
> + * for_each_cpu_andnot - iterate over every cpu present in one mask, excluding
> + *			 those present in another.
> + * @cpu: the (optionally unsigned) integer iterator
> + * @mask1: the first cpumask pointer
> + * @mask2: the second cpumask pointer
> + *
> + * This saves a temporary CPU mask in many places.  It is equivalent to:
> + *	struct cpumask tmp;
> + *	cpumask_andnot(&tmp, &mask1, &mask2);
> + *	for_each_cpu(cpu, &tmp)
> + *		...
> + *
> + * After the loop, cpu is >= nr_cpu_ids.
> + */
> +#define for_each_cpu_andnot(cpu, mask1, mask2)				\
> +	for ((cpu) = -1;						\
> +		(cpu) = cpumask_next_andnot((cpu), (mask1), (mask2)),	\
> +		(cpu) < nr_cpu_ids;)

This would raise a cpumask_check() warning at the very last iteration.
Because cpu is initialized inside the loop, you don't need to check it
at all. You can do it like this:

 #define for_each_cpu_andnot(cpu, mask1, mask2)				\
         for_each_andnot_bit(...)

Check this series for details (and please review).
https://lore.kernel.org/all/20220919210559.1509179-8-yury.norov@gmail.com/T/
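
Fleshed out, that would be something along these lines (assuming the
for_each_andnot_bit() iterator from the linked series; its exact name and
signature here are an assumption):

  #define for_each_cpu_andnot(cpu, mask1, mask2)			\
	for_each_andnot_bit((cpu), cpumask_bits(mask1),		\
			    cpumask_bits(mask2), nr_cpumask_bits)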

Thanks,
Yury

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 5/7] sched/topology: Introduce sched_numa_hop_mask()
  2022-09-25 15:00   ` Yury Norov
@ 2022-09-25 15:24     ` Yury Norov
  2022-09-27 16:45     ` Valentin Schneider
  1 sibling, 0 replies; 25+ messages in thread
From: Yury Norov @ 2022-09-25 15:24 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: netdev, linux-rdma, linux-kernel, Saeed Mahameed,
	Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Andy Shevchenko, Rasmus Villemoes, Ingo Molnar,
	Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Mel Gorman, Greg Kroah-Hartman, Heiko Carstens,
	Tony Luck, Jonathan Cameron, Gal Pressman, Tariq Toukan,
	Jesse Brandeburg

On Sun, Sep 25, 2022 at 08:00:49AM -0700, Yury Norov wrote:
> On Fri, Sep 23, 2022 at 04:55:40PM +0100, Valentin Schneider wrote:
> > Tariq has pointed out that drivers allocating IRQ vectors would benefit
> > from having smarter NUMA-awareness - cpumask_local_spread() only knows
> > about the local node and everything outside is in the same bucket.
> > 
> > sched_domains_numa_masks is pretty much what we want to hand out (a cpumask
> > of CPUs reachable within a given distance budget), introduce
> > sched_numa_hop_mask() to export those cpumasks.
> > 
> > Link: http://lore.kernel.org/r/20220728191203.4055-1-tariqt@nvidia.com
> > Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> > ---
> >  include/linux/topology.h | 12 ++++++++++++
> >  kernel/sched/topology.c  | 31 +++++++++++++++++++++++++++++++
> >  2 files changed, 43 insertions(+)
> > 
> > diff --git a/include/linux/topology.h b/include/linux/topology.h
> > index 4564faafd0e1..3e91ae6d0ad5 100644
> > --- a/include/linux/topology.h
> > +++ b/include/linux/topology.h
> > @@ -245,5 +245,17 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
> >  	return cpumask_of_node(cpu_to_node(cpu));
> >  }
> >  
> > +#ifdef CONFIG_NUMA
> > +extern const struct cpumask *sched_numa_hop_mask(int node, int hops);
> > +#else
> > +static inline const struct cpumask *sched_numa_hop_mask(int node, int hops)
> > +{
> > +	if (node == NUMA_NO_NODE && !hops)
> > +		return cpu_online_mask;
> > +
> > +	return ERR_PTR(-EOPNOTSUPP);
> > +}
> > +#endif	/* CONFIG_NUMA */
> > +
> >  
> >  #endif /* _LINUX_TOPOLOGY_H */
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 8739c2a5a54e..ee77706603c0 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -2067,6 +2067,37 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
> >  	return found;
> >  }
> >  
> > +/**
> > + * sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops away.
> > + * @node: The node to count hops from.
> > + * @hops: Include CPUs up to that many hops away. 0 means local node.
> > + *
> > + * Requires rcu_lock to be held. Returned cpumask is only valid within that
> > + * read-side section, copy it if required beyond that.
> > + *
> > + * Note that not all hops are equal in distance; see sched_init_numa() for how
> > + * distances and masks are handled.
> > + *
> > + * Also note that this is a reflection of sched_domains_numa_masks, which may change
> > + * during the lifetime of the system (offline nodes are taken out of the masks).
> > + */
> 
> Since it's exported, can you declare function parameters and return
> values properly?

s/declare/describe

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 5/7] sched/topology: Introduce sched_numa_hop_mask()
  2022-09-23 15:55 ` [PATCH v4 5/7] sched/topology: Introduce sched_numa_hop_mask() Valentin Schneider
  2022-09-25 15:00   ` Yury Norov
@ 2022-09-25 18:05   ` Yury Norov
  2022-09-25 18:13     ` Yury Norov
  2022-09-27 16:45     ` Valentin Schneider
  1 sibling, 2 replies; 25+ messages in thread
From: Yury Norov @ 2022-09-25 18:05 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: netdev, linux-rdma, linux-kernel, Saeed Mahameed,
	Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Andy Shevchenko, Rasmus Villemoes, Ingo Molnar,
	Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Mel Gorman, Greg Kroah-Hartman, Heiko Carstens,
	Tony Luck, Jonathan Cameron, Gal Pressman, Tariq Toukan,
	Jesse Brandeburg

On Fri, Sep 23, 2022 at 04:55:40PM +0100, Valentin Schneider wrote:
> Tariq has pointed out that drivers allocating IRQ vectors would benefit
> from having smarter NUMA-awareness - cpumask_local_spread() only knows
> about the local node and everything outside is in the same bucket.
> 
> sched_domains_numa_masks is pretty much what we want to hand out (a cpumask
> of CPUs reachable within a given distance budget), introduce
> sched_numa_hop_mask() to export those cpumasks.
> 
> Link: http://lore.kernel.org/r/20220728191203.4055-1-tariqt@nvidia.com
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
>  include/linux/topology.h | 12 ++++++++++++
>  kernel/sched/topology.c  | 31 +++++++++++++++++++++++++++++++
>  2 files changed, 43 insertions(+)
> 
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index 4564faafd0e1..3e91ae6d0ad5 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -245,5 +245,17 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
>  	return cpumask_of_node(cpu_to_node(cpu));
>  }
>  
> +#ifdef CONFIG_NUMA
> +extern const struct cpumask *sched_numa_hop_mask(int node, int hops);
> +#else
> +static inline const struct cpumask *sched_numa_hop_mask(int node, int hops)
> +{
> +	if (node == NUMA_NO_NODE && !hops)
> +		return cpu_online_mask;
> +
> +	return ERR_PTR(-EOPNOTSUPP);
> +}
> +#endif	/* CONFIG_NUMA */
> +
>  
>  #endif /* _LINUX_TOPOLOGY_H */
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 8739c2a5a54e..ee77706603c0 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2067,6 +2067,37 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
>  	return found;
>  }
>  
> +/**
> + * sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops away.
> + * @node: The node to count hops from.
> + * @hops: Include CPUs up to that many hops away. 0 means local node.
> + *
> + * Requires rcu_lock to be held. Returned cpumask is only valid within that
> + * read-side section, copy it if required beyond that.
> + *
> + * Note that not all hops are equal in distance; see sched_init_numa() for how
> + * distances and masks are handled.
> + *
> + * Also note that this is a reflection of sched_domains_numa_masks, which may change
> + * during the lifetime of the system (offline nodes are taken out of the masks).
> + */
> +const struct cpumask *sched_numa_hop_mask(int node, int hops)
> +{
> +	struct cpumask ***masks = rcu_dereference(sched_domains_numa_masks);
> +
> +	if (node == NUMA_NO_NODE && !hops)
> +		return cpu_online_mask;
> +
> +	if (node >= nr_node_ids || hops >= sched_domains_numa_levels)
> +		return ERR_PTR(-EINVAL);

This looks like a sanity check. If so, it should go before the snippet
above, so that client code would behave consistently.

> +
> +	if (!masks)
> +		return NULL;

In the (node == NUMA_NO_NODE && !hops) case you return online cpus. Here
you return NULL, just to convert it to cpu_online_mask in the caller.
This looks inconsistent. So, together with the above comment, this
makes me feel you could do it like this:

 const struct cpumask *sched_numa_hop_mask(int node, int hops)
 {
	struct cpumask ***masks;

	if (node >= nr_node_ids || hops >= sched_domains_numa_levels)
        {
 #ifdef CONFIG_SCHED_DEBUG
                pr_err(...);
 #endif
		return ERR_PTR(-EINVAL);
        }

	if (node == NUMA_NO_NODE && !hops)
		return cpu_online_mask; /* or NULL */

        masks = rcu_dereference(sched_domains_numa_masks);
	if (!masks)
		return cpu_online_mask; /* or NULL */

	return masks[hops][node];
 }
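
For reference, a caller respecting the RCU rules from the kerneldoc above
would look roughly like this (illustrative sketch; snapshot_hop_mask() is a
made-up helper, not part of the series):

  /* Copy the hop mask out of the RCU read-side section before using it. */
  static int snapshot_hop_mask(int node, int hops, struct cpumask *out)
  {
	const struct cpumask *mask;
	int ret = 0;

	rcu_read_lock();
	mask = sched_numa_hop_mask(node, hops);
	if (IS_ERR_OR_NULL(mask))
		ret = mask ? PTR_ERR(mask) : -ENOENT;
	else
		cpumask_copy(out, mask);
	rcu_read_unlock();

	return ret;
  }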

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 5/7] sched/topology: Introduce sched_numa_hop_mask()
  2022-09-25 18:05   ` Yury Norov
@ 2022-09-25 18:13     ` Yury Norov
  2022-09-27 16:45     ` Valentin Schneider
  1 sibling, 0 replies; 25+ messages in thread
From: Yury Norov @ 2022-09-25 18:13 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: netdev, linux-rdma, linux-kernel, Saeed Mahameed,
	Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Andy Shevchenko, Rasmus Villemoes, Ingo Molnar,
	Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Mel Gorman, Greg Kroah-Hartman, Heiko Carstens,
	Tony Luck, Jonathan Cameron, Gal Pressman, Tariq Toukan,
	Jesse Brandeburg

On Sun, Sep 25, 2022 at 11:05:18AM -0700, Yury Norov wrote:
> On Fri, Sep 23, 2022 at 04:55:40PM +0100, Valentin Schneider wrote:
> > Tariq has pointed out that drivers allocating IRQ vectors would benefit
> > from having smarter NUMA-awareness - cpumask_local_spread() only knows
> > about the local node and everything outside is in the same bucket.
> > 
> > sched_domains_numa_masks is pretty much what we want to hand out (a cpumask
> > of CPUs reachable within a given distance budget), introduce
> > sched_numa_hop_mask() to export those cpumasks.
> > 
> > Link: http://lore.kernel.org/r/20220728191203.4055-1-tariqt@nvidia.com
> > Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> > ---
> >  include/linux/topology.h | 12 ++++++++++++
> >  kernel/sched/topology.c  | 31 +++++++++++++++++++++++++++++++
> >  2 files changed, 43 insertions(+)
> > 
> > diff --git a/include/linux/topology.h b/include/linux/topology.h
> > index 4564faafd0e1..3e91ae6d0ad5 100644
> > --- a/include/linux/topology.h
> > +++ b/include/linux/topology.h
> > @@ -245,5 +245,17 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
> >  	return cpumask_of_node(cpu_to_node(cpu));
> >  }
> >  
> > +#ifdef CONFIG_NUMA
> > +extern const struct cpumask *sched_numa_hop_mask(int node, int hops);
> > +#else
> > +static inline const struct cpumask *sched_numa_hop_mask(int node, int hops)
> > +{
> > +	if (node == NUMA_NO_NODE && !hops)
> > +		return cpu_online_mask;
> > +
> > +	return ERR_PTR(-EOPNOTSUPP);
> > +}
> > +#endif	/* CONFIG_NUMA */
> > +
> >  
> >  #endif /* _LINUX_TOPOLOGY_H */
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 8739c2a5a54e..ee77706603c0 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -2067,6 +2067,37 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
> >  	return found;
> >  }
> >  
> > +/**
> > + * sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops away.
> > + * @node: The node to count hops from.
> > + * @hops: Include CPUs up to that many hops away. 0 means local node.
> > + *
> > + * Requires rcu_lock to be held. Returned cpumask is only valid within that
> > + * read-side section, copy it if required beyond that.
> > + *
> > + * Note that not all hops are equal in distance; see sched_init_numa() for how
> > + * distances and masks are handled.
> > + *
> > + * Also note that this is a reflection of sched_domains_numa_masks, which may change
> > + * during the lifetime of the system (offline nodes are taken out of the masks).
> > + */
> > +const struct cpumask *sched_numa_hop_mask(int node, int hops)
> > +{
> > +	struct cpumask ***masks = rcu_dereference(sched_domains_numa_masks);
> > +
> > +	if (node == NUMA_NO_NODE && !hops)
> > +		return cpu_online_mask;
> > +
> > +	if (node >= nr_node_ids || hops >= sched_domains_numa_levels)
> > +		return ERR_PTR(-EINVAL);
> 
> This looks like a sanity check. If so, it should go before the snippet
> above, so that client code would behave consistently.
> 
> > +
> > +	if (!masks)
> > +		return NULL;
> 
> In the (node == NUMA_NO_NODE && !hops) case you return online cpus. Here
> you return NULL, just to convert it to cpu_online_mask in the caller.
> This looks inconsistent. So, together with the above comment, this
> makes me feel you could do it like this:
> 
>  const struct cpumask *sched_numa_hop_mask(int node, int hops)
>  {
> 	struct cpumask ***masks;
> 
> 	if (node >= nr_node_ids || hops >= sched_domains_numa_levels)
>         {
>  #ifdef CONFIG_SCHED_DEBUG
>                 pr_err(...);
>  #endif
> 		return ERR_PTR(-EINVAL);
>         }

It's an exported function, and any lame driver may crash the system by
dereferencing a random pointer.

You need to check the node for -2, -3, etc, because only -1 is a valid
negative value. For hops, it should be an unsigned int. Right?

> 
> 	if (node == NUMA_NO_NODE && !hops)
> 		return cpu_online_mask; /* or NULL */
> 
>         masks = rcu_dereference(sched_domains_numa_masks);
> 	if (!masks)
> 		return cpu_online_mask; /* or NULL */
> 
> 	return masks[hops][node];
>  }

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 2/7] cpumask: Introduce for_each_cpu_andnot()
  2022-09-25 15:23   ` Yury Norov
@ 2022-09-27 16:45     ` Valentin Schneider
  2022-09-27 20:02       ` Yury Norov
  0 siblings, 1 reply; 25+ messages in thread
From: Valentin Schneider @ 2022-09-27 16:45 UTC (permalink / raw)
  To: Yury Norov
  Cc: netdev, linux-rdma, linux-kernel, Saeed Mahameed,
	Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Andy Shevchenko, Rasmus Villemoes, Ingo Molnar,
	Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Mel Gorman, Greg Kroah-Hartman, Heiko Carstens,
	Tony Luck, Jonathan Cameron, Gal Pressman, Tariq Toukan,
	Jesse Brandeburg

On 25/09/22 08:23, Yury Norov wrote:
> On Fri, Sep 23, 2022 at 04:55:37PM +0100, Valentin Schneider wrote:
>> +/**
>> + * for_each_cpu_andnot - iterate over every cpu present in one mask, excluding
>> + *			 those present in another.
>> + * @cpu: the (optionally unsigned) integer iterator
>> + * @mask1: the first cpumask pointer
>> + * @mask2: the second cpumask pointer
>> + *
>> + * This saves a temporary CPU mask in many places.  It is equivalent to:
>> + *	struct cpumask tmp;
>> + *	cpumask_andnot(&tmp, &mask1, &mask2);
>> + *	for_each_cpu(cpu, &tmp)
>> + *		...
>> + *
>> + * After the loop, cpu is >= nr_cpu_ids.
>> + */
>> +#define for_each_cpu_andnot(cpu, mask1, mask2)				\
>> +	for ((cpu) = -1;						\
>> +		(cpu) = cpumask_next_andnot((cpu), (mask1), (mask2)),	\
>> +		(cpu) < nr_cpu_ids;)
>
> This would raise a cpumask_check() warning at the very last iteration.
> Because cpu is initialized inside the loop, you don't need to check it
> at all. You can do it like this:
>
>  #define for_each_cpu_andnot(cpu, mask1, mask2)				\
>          for_each_andnot_bit(...)
>
> Check this series for details (and please review).
> https://lore.kernel.org/all/20220919210559.1509179-8-yury.norov@gmail.com/T/
>

Thanks, I'll have a look.

> Thanks,
> Yury


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 5/7] sched/topology: Introduce sched_numa_hop_mask()
  2022-09-25 15:00   ` Yury Norov
  2022-09-25 15:24     ` Yury Norov
@ 2022-09-27 16:45     ` Valentin Schneider
  2022-09-27 19:30       ` Yury Norov
  1 sibling, 1 reply; 25+ messages in thread
From: Valentin Schneider @ 2022-09-27 16:45 UTC (permalink / raw)
  To: Yury Norov
  Cc: netdev, linux-rdma, linux-kernel, Saeed Mahameed,
	Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Andy Shevchenko, Rasmus Villemoes, Ingo Molnar,
	Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Mel Gorman, Greg Kroah-Hartman, Heiko Carstens,
	Tony Luck, Jonathan Cameron, Gal Pressman, Tariq Toukan,
	Jesse Brandeburg

On 25/09/22 08:00, Yury Norov wrote:
> On Fri, Sep 23, 2022 at 04:55:40PM +0100, Valentin Schneider wrote:
>> +/**
>> + * sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops away.
>> + * @node: The node to count hops from.
>> + * @hops: Include CPUs up to that many hops away. 0 means local node.
>> + *
>> + * Requires rcu_lock to be held. Returned cpumask is only valid within that
>> + * read-side section, copy it if required beyond that.
>> + *
>> + * Note that not all hops are equal in distance; see sched_init_numa() for how
>> + * distances and masks are handled.
>> + *
>> + * Also note that this is a reflection of sched_domains_numa_masks, which may change
>> + * during the lifetime of the system (offline nodes are taken out of the masks).
>> + */
>
> Since it's exported, can you declare function parameters and return
> values properly?
>

I'll add a bit about the return value; what is missing for the parameters?


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 5/7] sched/topology: Introduce sched_numa_hop_mask()
  2022-09-25 18:05   ` Yury Norov
  2022-09-25 18:13     ` Yury Norov
@ 2022-09-27 16:45     ` Valentin Schneider
  1 sibling, 0 replies; 25+ messages in thread
From: Valentin Schneider @ 2022-09-27 16:45 UTC (permalink / raw)
  To: Yury Norov
  Cc: netdev, linux-rdma, linux-kernel, Saeed Mahameed,
	Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Andy Shevchenko, Rasmus Villemoes, Ingo Molnar,
	Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Mel Gorman, Greg Kroah-Hartman, Heiko Carstens,
	Tony Luck, Jonathan Cameron, Gal Pressman, Tariq Toukan,
	Jesse Brandeburg

On 25/09/22 11:05, Yury Norov wrote:
> On Fri, Sep 23, 2022 at 04:55:40PM +0100, Valentin Schneider wrote:
>> +const struct cpumask *sched_numa_hop_mask(int node, int hops)
>> +{
>> +	struct cpumask ***masks = rcu_dereference(sched_domains_numa_masks);
>> +
>> +	if (node == NUMA_NO_NODE && !hops)
>> +		return cpu_online_mask;
>> +
>> +	if (node >= nr_node_ids || hops >= sched_domains_numa_levels)
>> +		return ERR_PTR(-EINVAL);
>
> This looks like a sanity check. If so, it should go before the snippet
> above, so that client code would behave consistently.
>

nr_node_ids is unsigned, so -1 >= nr_node_ids is true.

>> +
>> +	if (!masks)
>> +		return NULL;
>
> In the (node == NUMA_NO_NODE && !hops) case you return online cpus. Here
> you return NULL, just to convert it to cpu_online_mask in the caller.
> This looks inconsistent. So, together with the above comment, this
> makes me feel you could do it like this:
>
>  const struct cpumask *sched_numa_hop_mask(int node, int hops)
>  {
>       struct cpumask ***masks;
>
>       if (node >= nr_node_ids || hops >= sched_domains_numa_levels)
>         {
>  #ifdef CONFIG_SCHED_DEBUG
>                 pr_err(...);
>  #endif
>               return ERR_PTR(-EINVAL);
>         }
>
>       if (node == NUMA_NO_NODE && !hops)
>               return cpu_online_mask; /* or NULL */
>
>         masks = rcu_dereference(sched_domains_numa_masks);
>       if (!masks)
>               return cpu_online_mask; /* or NULL */
>
>       return masks[hops][node];
>  }

If we're being pedantic, sched_numa_hop_mask() shouldn't return
cpu_online_mask in those cases, but that was the least horrible
option I found to get something sensible for the NUMA_NO_NODE /
!CONFIG_NUMA case. I might be able to better handle this with your
suggestion of having a mask iterator.
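
One possible shape for such a mask iterator, with the NUMA_NO_NODE special
case handled by the iterator rather than by sched_numa_hop_mask() itself
(hypothetical sketch; for_each_numa_hop_mask() is not part of this series):

  #define for_each_numa_hop_mask(mask, node)				\
	for (int __hops = 0;						\
	     (mask) = ((node) == NUMA_NO_NODE && !__hops ?		\
		       cpu_online_mask :				\
		       sched_numa_hop_mask(node, __hops)),		\
	     !IS_ERR_OR_NULL(mask);					\
	     __hops++)

With the posted sched_numa_hop_mask() semantics, the loop stops once the hop
count runs past sched_domains_numa_levels, or right after the online mask in
the NUMA_NO_NODE case.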


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 6/7] sched/topology: Introduce for_each_numa_hop_cpu()
  2022-09-25 14:58   ` Yury Norov
@ 2022-09-27 16:45     ` Valentin Schneider
  0 siblings, 0 replies; 25+ messages in thread
From: Valentin Schneider @ 2022-09-27 16:45 UTC (permalink / raw)
  To: Yury Norov
  Cc: netdev, linux-rdma, linux-kernel, Saeed Mahameed,
	Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Andy Shevchenko, Rasmus Villemoes, Ingo Molnar,
	Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Mel Gorman, Greg Kroah-Hartman, Heiko Carstens,
	Tony Luck, Jonathan Cameron, Gal Pressman, Tariq Toukan,
	Jesse Brandeburg

On 25/09/22 07:58, Yury Norov wrote:
> On Fri, Sep 23, 2022 at 04:55:41PM +0100, Valentin Schneider wrote:
>> +/**
>> + * for_each_numa_hop_cpu - iterate over CPUs by increasing NUMA distance,
>> + *                         starting from a given node.
>> + * @cpu: the iteration variable.
>> + * @node: the NUMA node to start the search from.
>> + *
>> + * Requires rcu_lock to be held.
>> + * Careful: this is a double loop, 'break' won't work as expected.
>
> This warning concerns me not only because the new iteration loop hides
> complexity and breaks 'break' (sic!), but also because it looks too
> specific. Why don't you split it, so that instead of:
>
>        for_each_numa_hop_cpu(cpu, dev->priv.numa_node) {
>                cpus[i] = cpu;
>                if (++i == ncomp_eqs)
>                        goto spread_done;
>        }
>
> in the following patch you would have something like this:
>
>        for_each_node_hop(hop, node) {
>                const struct cpumask *hop_cpus = sched_numa_hop_mask(node, hop);
>
>                for_each_cpu_andnot(cpu, hop_cpus, ...) {
>                        cpus[i] = cpu;
>                        if (++i == ncomp_eqs)
>                                goto spread_done;
>                }
>        }
>
> It looks more bulky, but I believe there will be more users for
> for_each_node_hop() alone.
>
> On top of that, if you really like it, you can implement
> for_each_numa_hop_cpu() if you want.
>

IIUC you're suggesting to introduce an iterator for the cpumasks first, and
then maybe add one on top for the individual cpus.

I'm happy to do that, though I have to say I'm keen to keep the CPU
iterator - IMO the complexity is justified if it is centralized in one
location and saves us from boring old boilerplate code.

>> + * Implementation notes:
>> + *
>> + * Providing it is valid, the mask returned by
>> + *  sched_numa_hop_mask(node, hops+1)
>> + * is a superset of the one returned by
>> + *   sched_numa_hop_mask(node, hops)
>> + * which may not be that useful for drivers that try to spread things out and
>> + * want to visit a CPU not more than once.
>> + *
>> + * To accommodate for that, we use for_each_cpu_andnot() to iterate over the cpus
>> + * of sched_numa_hop_mask(node, hops+1) with the CPUs of
>> + * sched_numa_hop_mask(node, hops) removed, IOW we only iterate over CPUs
>> + * a given distance away (rather than *up to* a given distance).
>> + *
>> + * hops=0 forces us to play silly games: we pass cpu_none_mask to
>> + * for_each_cpu_andnot(), which turns it into for_each_cpu().
>> + */
>> +#define for_each_numa_hop_cpu(cpu, node)				       \
>> +	for (struct { const struct cpumask *curr, *prev; int hops; } __v =     \
>> +		     { sched_numa_hop_mask(node, 0), NULL, 0 };		       \
>
> This anonymous structure is never used as a structure. Why do you
> define it? Why not just declare hops, prev and curr without packing
> them?
>

I haven't found a way to do this that doesn't involve a struct - apparently
you can't mix types in a for loop declaration clause.
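
For illustration, the constraint in question is that a for-loop init clause
may only declare objects of a single type (hypothetical snippet, not from
the patch):

  /*
   * Rejected by the compiler (two distinct types in one init clause):
   *
   *	for (int hops = 0, const struct cpumask *prev = NULL; ...; ...)
   *
   * Accepted: a single anonymous struct carries all of the state, which is
   * what the macro above does.
   */
  for (struct { const struct cpumask *prev; int hops; } v = { NULL, 0 };
       v.hops < 2; v.hops++)
	v.prev = cpu_online_mask;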

> Thanks,
> Yury
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 5/7] sched/topology: Introduce sched_numa_hop_mask()
  2022-09-27 16:45     ` Valentin Schneider
@ 2022-09-27 19:30       ` Yury Norov
  0 siblings, 0 replies; 25+ messages in thread
From: Yury Norov @ 2022-09-27 19:30 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: netdev, linux-rdma, linux-kernel, Saeed Mahameed,
	Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Andy Shevchenko, Rasmus Villemoes, Ingo Molnar,
	Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Mel Gorman, Greg Kroah-Hartman, Heiko Carstens,
	Tony Luck, Jonathan Cameron, Gal Pressman, Tariq Toukan,
	Jesse Brandeburg

On Tue, Sep 27, 2022 at 05:45:10PM +0100, Valentin Schneider wrote:
> On 25/09/22 08:00, Yury Norov wrote:
> > On Fri, Sep 23, 2022 at 04:55:40PM +0100, Valentin Schneider wrote:
> >> +/**
> >> + * sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops away.
> >> + * @node: The node to count hops from.
> >> + * @hops: Include CPUs up to that many hops away. 0 means local node.
> >> + *
> >> + * Requires rcu_lock to be held. Returned cpumask is only valid within that
> >> + * read-side section, copy it if required beyond that.
> >> + *
> >> + * Note that not all hops are equal in distance; see sched_init_numa() for how
> >> + * distances and masks are handled.
> >> + *
> >> + * Also note that this is a reflection of sched_domains_numa_masks, which may change
> >> + * during the lifetime of the system (offline nodes are taken out of the masks).
> >> + */
> >
> > Since it's exported, can you declare function parameters and return
> > values properly?
> >
> 
> I'll add a bit about the return value; what is missing for the parameters?

My bad, scratch this.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 2/7] cpumask: Introduce for_each_cpu_andnot()
  2022-09-27 16:45     ` Valentin Schneider
@ 2022-09-27 20:02       ` Yury Norov
  0 siblings, 0 replies; 25+ messages in thread
From: Yury Norov @ 2022-09-27 20:02 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: netdev, linux-rdma, linux-kernel, Saeed Mahameed,
	Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Andy Shevchenko, Rasmus Villemoes, Ingo Molnar,
	Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Mel Gorman, Greg Kroah-Hartman, Heiko Carstens,
	Tony Luck, Jonathan Cameron, Gal Pressman, Tariq Toukan,
	Jesse Brandeburg

On Tue, Sep 27, 2022 at 05:45:04PM +0100, Valentin Schneider wrote:
> On 25/09/22 08:23, Yury Norov wrote:
> > On Fri, Sep 23, 2022 at 04:55:37PM +0100, Valentin Schneider wrote:
> >> +/**
> >> + * for_each_cpu_andnot - iterate over every cpu present in one mask, excluding
> >> + *			 those present in another.
> >> + * @cpu: the (optionally unsigned) integer iterator
> >> + * @mask1: the first cpumask pointer
> >> + * @mask2: the second cpumask pointer
> >> + *
> >> + * This saves a temporary CPU mask in many places.  It is equivalent to:
> >> + *	struct cpumask tmp;
> >> + *	cpumask_andnot(&tmp, &mask1, &mask2);
> >> + *	for_each_cpu(cpu, &tmp)
> >> + *		...
> >> + *
> >> + * After the loop, cpu is >= nr_cpu_ids.
> >> + */
> >> +#define for_each_cpu_andnot(cpu, mask1, mask2)				\
> >> +	for ((cpu) = -1;						\
> >> +		(cpu) = cpumask_next_andnot((cpu), (mask1), (mask2)),	\
> >> +		(cpu) < nr_cpu_ids;)
> >
> > This would raise a cpumask_check() warning at the very last iteration.
> > Because cpu is initialized inside the loop, you don't need to check it
> > at all. You can do it like this:
> >
> >  #define for_each_cpu_andnot(cpu, mask1, mask2)				\
> >          for_each_andnot_bit(...)
> >
> > Check this series for details (and please review).
> > https://lore.kernel.org/all/20220919210559.1509179-8-yury.norov@gmail.com/T/
> >
> 
> Thanks, I'll have a look.

Also, if you send the first 4 patches as a separate series on top of
bitmap-for-next, I'll be able to include them in bitmap-for-next
and then in the 6.1 pull request.

Thanks,
Yury

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface
  2022-09-23 13:25 [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface Valentin Schneider
                   ` (8 preceding siblings ...)
  2022-09-25  7:48 ` [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface Tariq Toukan
@ 2022-10-18  6:36 ` Tariq Toukan
  2022-10-18 16:50   ` Valentin Schneider
  9 siblings, 1 reply; 25+ messages in thread
From: Tariq Toukan @ 2022-10-18  6:36 UTC (permalink / raw)
  To: Valentin Schneider, netdev, linux-rdma, linux-kernel
  Cc: Saeed Mahameed, Leon Romanovsky, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Yury Norov, Andy Shevchenko,
	Rasmus Villemoes, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Greg Kroah-Hartman,
	Heiko Carstens, Tony Luck, Jonathan Cameron, Gal Pressman,
	Tariq Toukan, Jesse Brandeburg


On 9/23/2022 4:25 PM, Valentin Schneider wrote:
> Hi folks,
> 
> Tariq pointed out in [1] that drivers allocating IRQ vectors would benefit
> from having smarter NUMA-awareness (cpumask_local_spread() doesn't quite cut
> it).
> 
> The proposed interface involved an array of CPUs and a temporary cpumask, and
> being my difficult self what I'm proposing here is an interface that doesn't
> require any temporary storage other than some stack variables (at the cost of
> one wild macro).
> 
> Please note that this is based on top of Yury's bitmap-for-next [2] to leverage
> his fancy new FIND_NEXT_BIT() macro.
> 
> [1]: https://lore.kernel.org/all/20220728191203.4055-1-tariqt@nvidia.com/
> [2]: https://github.com/norov/linux.git/ -b bitmap-for-next
> 
> A note on treewide use of for_each_cpu_andnot()
> ===============================================
> 
> I've used the below coccinelle script to find places that could be patched (I
> couldn't figure out the valid syntax to patch from coccinelle itself):
> 
> ,-----
> @tmpandnot@
> expression tmpmask;
> iterator for_each_cpu;
> position p;
> statement S;
> @@
> cpumask_andnot(tmpmask, ...);
> 
> ...
> 
> (
> for_each_cpu@p(..., tmpmask, ...)
> 	S
> |
> for_each_cpu@p(..., tmpmask, ...)
> {
> 	...
> }
> )
> 
> @script:python depends on tmpandnot@
> p << tmpandnot.p;
> @@
> coccilib.report.print_report(p[0], "andnot loop here")
> '-----
> 
> Which yields (against c40e8341e3b3):
> 
> .//arch/powerpc/kernel/smp.c:1587:1-13: andnot loop here
> .//arch/powerpc/kernel/smp.c:1530:1-13: andnot loop here
> .//arch/powerpc/kernel/smp.c:1440:1-13: andnot loop here
> .//arch/powerpc/platforms/powernv/subcore.c:306:2-14: andnot loop here
> .//arch/x86/kernel/apic/x2apic_cluster.c:62:1-13: andnot loop here
> .//drivers/acpi/acpi_pad.c:110:1-13: andnot loop here
> .//drivers/cpufreq/armada-8k-cpufreq.c:148:1-13: andnot loop here
> .//drivers/cpufreq/powernv-cpufreq.c:931:1-13: andnot loop here
> .//drivers/net/ethernet/sfc/efx_channels.c:73:1-13: andnot loop here
> .//drivers/net/ethernet/sfc/siena/efx_channels.c:73:1-13: andnot loop here
> .//kernel/sched/core.c:345:1-13: andnot loop here
> .//kernel/sched/core.c:366:1-13: andnot loop here
> .//net/core/dev.c:3058:1-13: andnot loop here
> 
> A lot of those are actually of the shape
> 
>    for_each_cpu(cpu, mask) {
>        ...
>        cpumask_andnot(mask, ...);
>    }
> 
> I think *some* of the powerpc ones would be a match for for_each_cpu_andnot(),
> but I decided to just stick to the one obvious one in __sched_core_flip().
>    
> Revisions
> =========
> 
> v3 -> v4
> ++++++++
> 
> o Rebased on top of Yury's bitmap-for-next
> o Added Tariq's mlx5e patch
> o Made sched_numa_hop_mask() return cpu_online_mask for the NUMA_NO_NODE &&
>    hops=0 case
> 
> v2 -> v3
> ++++++++
> 
> o Added for_each_cpu_and() and for_each_cpu_andnot() tests (Yury)
> o New patches to fix issues raised by running the above
> 
> o New patch to use for_each_cpu_andnot() in sched/core.c (Yury)
> 
> v1 -> v2
> ++++++++
> 
> o Split _find_next_bit() @invert into @invert1 and @invert2 (Yury)
> o Rebase onto v6.0-rc1
> 
> Cheers,
> Valentin
> 

Hi,

What's the status of this?
Do we have agreement on the changes needed for the next respin?

Regards,
Tariq

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface
  2022-10-18  6:36 ` Tariq Toukan
@ 2022-10-18 16:50   ` Valentin Schneider
  0 siblings, 0 replies; 25+ messages in thread
From: Valentin Schneider @ 2022-10-18 16:50 UTC (permalink / raw)
  To: Tariq Toukan, netdev, linux-rdma, linux-kernel
  Cc: Saeed Mahameed, Leon Romanovsky, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Yury Norov, Andy Shevchenko,
	Rasmus Villemoes, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Greg Kroah-Hartman,
	Heiko Carstens, Tony Luck, Jonathan Cameron, Gal Pressman,
	Tariq Toukan, Jesse Brandeburg

On 18/10/22 09:36, Tariq Toukan wrote:
>
> Hi,
>
> What's the status of this?
> Do we have agreement on the changes needed for the next respin?
>

Yep, the bitmap patches are in 6.1-rc1; I need to respin the topology ones
to address Yury's comments. It's on my todo list, I'll get to it soonish.

> Regards,
> Tariq


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2022-10-18 16:50 UTC | newest]

Thread overview: 25+ messages
2022-09-23 13:25 [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface Valentin Schneider
2022-09-23 13:25 ` [PATCH v4 1/7] lib/find_bit: Introduce find_next_andnot_bit() Valentin Schneider
2022-09-23 15:44 ` [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface Yury Norov
2022-09-23 15:49   ` Valentin Schneider
2022-09-23 15:55 ` [PATCH v4 2/7] cpumask: Introduce for_each_cpu_andnot() Valentin Schneider
2022-09-25 15:23   ` Yury Norov
2022-09-27 16:45     ` Valentin Schneider
2022-09-27 20:02       ` Yury Norov
2022-09-23 15:55 ` [PATCH v4 3/7] lib/test_cpumask: Add for_each_cpu_and(not) tests Valentin Schneider
2022-09-23 15:55 ` [PATCH v4 4/7] sched/core: Merge cpumask_andnot()+for_each_cpu() into for_each_cpu_andnot() Valentin Schneider
2022-09-23 15:55 ` [PATCH v4 5/7] sched/topology: Introduce sched_numa_hop_mask() Valentin Schneider
2022-09-25 15:00   ` Yury Norov
2022-09-25 15:24     ` Yury Norov
2022-09-27 16:45     ` Valentin Schneider
2022-09-27 19:30       ` Yury Norov
2022-09-25 18:05   ` Yury Norov
2022-09-25 18:13     ` Yury Norov
2022-09-27 16:45     ` Valentin Schneider
2022-09-23 15:55 ` [PATCH v4 6/7] sched/topology: Introduce for_each_numa_hop_cpu() Valentin Schneider
2022-09-25 14:58   ` Yury Norov
2022-09-27 16:45     ` Valentin Schneider
2022-09-23 15:55 ` [PATCH v4 7/7] net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints Valentin Schneider
2022-09-25  7:48 ` [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface Tariq Toukan
2022-10-18  6:36 ` Tariq Toukan
2022-10-18 16:50   ` Valentin Schneider
