* [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface @ 2022-09-23 13:25 Valentin Schneider 2022-09-23 13:25 ` [PATCH v4 1/7] lib/find_bit: Introduce find_next_andnot_bit() Valentin Schneider ` (9 more replies) 0 siblings, 10 replies; 25+ messages in thread From: Valentin Schneider @ 2022-09-23 13:25 UTC (permalink / raw) To: netdev, linux-rdma, linux-kernel Cc: Saeed Mahameed, Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Yury Norov, Andy Shevchenko, Rasmus Villemoes, Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman, Greg Kroah-Hartman, Heiko Carstens, Tony Luck, Jonathan Cameron, Gal Pressman, Tariq Toukan, Jesse Brandeburg Hi folks, Tariq pointed out in [1] that drivers allocating IRQ vectors would benefit from having smarter NUMA-awareness (cpumask_local_spread() doesn't quite cut it). The proposed interface involved an array of CPUs and a temporary cpumask, and being my difficult self what I'm proposing here is an interface that doesn't require any temporary storage other than some stack variables (at the cost of one wild macro). Please note that this is based on top of Yury's bitmap-for-next [2] to leverage his fancy new FIND_NEXT_BIT() macro. [1]: https://lore.kernel.org/all/20220728191203.4055-1-tariqt@nvidia.com/ [2]: https://github.com/norov/linux.git/ -b bitmap-for-next A note on treewide use of for_each_cpu_andnot() =============================================== I've used the below coccinelle script to find places that could be patched (I couldn't figure out the valid syntax to patch from coccinelle itself): ,----- @tmpandnot@ expression tmpmask; iterator for_each_cpu; position p; statement S; @@ cpumask_andnot(tmpmask, ...); ... ( for_each_cpu@p(..., tmpmask, ...) S | for_each_cpu@p(..., tmpmask, ...) { ... 
} ) @script:python depends on tmpandnot@ p << tmpandnot.p; @@ coccilib.report.print_report(p[0], "andnot loop here") '----- Which yields (against c40e8341e3b3): .//arch/powerpc/kernel/smp.c:1587:1-13: andnot loop here .//arch/powerpc/kernel/smp.c:1530:1-13: andnot loop here .//arch/powerpc/kernel/smp.c:1440:1-13: andnot loop here .//arch/powerpc/platforms/powernv/subcore.c:306:2-14: andnot loop here .//arch/x86/kernel/apic/x2apic_cluster.c:62:1-13: andnot loop here .//drivers/acpi/acpi_pad.c:110:1-13: andnot loop here .//drivers/cpufreq/armada-8k-cpufreq.c:148:1-13: andnot loop here .//drivers/cpufreq/powernv-cpufreq.c:931:1-13: andnot loop here .//drivers/net/ethernet/sfc/efx_channels.c:73:1-13: andnot loop here .//drivers/net/ethernet/sfc/siena/efx_channels.c:73:1-13: andnot loop here .//kernel/sched/core.c:345:1-13: andnot loop here .//kernel/sched/core.c:366:1-13: andnot loop here .//net/core/dev.c:3058:1-13: andnot loop here A lot of those are actually of the shape for_each_cpu(cpu, mask) { ... cpumask_andnot(mask, ...); } I think *some* of the powerpc ones would be a match for for_each_cpu_andnot(), but I decided to just stick to the one obvious one in __sched_core_flip(). 
Revisions ========= v3 -> v4 ++++++++ o Rebased on top of Yury's bitmap-for-next o Added Tariq's mlx5e patch o Made sched_numa_hop_mask() return cpu_online_mask for the NUMA_NO_NODE && hops=0 case v2 -> v3 ++++++++ o Added for_each_cpu_and() and for_each_cpu_andnot() tests (Yury) o New patches to fix issues raised by running the above o New patch to use for_each_cpu_andnot() in sched/core.c (Yury) v1 -> v2 ++++++++ o Split _find_next_bit() @invert into @invert1 and @invert2 (Yury) o Rebase onto v6.0-rc1 Cheers, Valentin Tariq Toukan (1): net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints Valentin Schneider (6): lib/find_bit: Introduce find_next_andnot_bit() cpumask: Introduce for_each_cpu_andnot() lib/test_cpumask: Add for_each_cpu_and(not) tests sched/core: Merge cpumask_andnot()+for_each_cpu() into for_each_cpu_andnot() sched/topology: Introduce sched_numa_hop_mask() sched/topology: Introduce for_each_numa_hop_cpu() drivers/net/ethernet/mellanox/mlx5/core/eq.c | 13 +++++- include/linux/cpumask.h | 39 ++++++++++++++++ include/linux/find.h | 33 +++++++++++++ include/linux/topology.h | 49 ++++++++++++++++++++ kernel/sched/core.c | 5 +- kernel/sched/topology.c | 31 +++++++++++++ lib/cpumask_kunit.c | 19 ++++++++ lib/find_bit.c | 9 ++++ 8 files changed, 192 insertions(+), 6 deletions(-) -- 2.31.1 ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH v4 1/7] lib/find_bit: Introduce find_next_andnot_bit() 2022-09-23 13:25 [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface Valentin Schneider @ 2022-09-23 13:25 ` Valentin Schneider 2022-09-23 15:44 ` [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface Yury Norov ` (8 subsequent siblings) 9 siblings, 0 replies; 25+ messages in thread From: Valentin Schneider @ 2022-09-23 13:25 UTC (permalink / raw) To: netdev, linux-rdma, linux-kernel Cc: Saeed Mahameed, Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Yury Norov, Andy Shevchenko, Rasmus Villemoes, Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman, Greg Kroah-Hartman, Heiko Carstens, Tony Luck, Jonathan Cameron, Gal Pressman, Tariq Toukan, Jesse Brandeburg In preparation of introducing for_each_cpu_andnot(), add a variant of find_next_bit() that negate the bits in @addr2 when ANDing them with the bits in @addr1. Signed-off-by: Valentin Schneider <vschneid@redhat.com> --- include/linux/find.h | 33 +++++++++++++++++++++++++++++++++ lib/find_bit.c | 9 +++++++++ 2 files changed, 42 insertions(+) diff --git a/include/linux/find.h b/include/linux/find.h index dead6f53a97b..e60b1ce89b29 100644 --- a/include/linux/find.h +++ b/include/linux/find.h @@ -12,6 +12,8 @@ unsigned long _find_next_bit(const unsigned long *addr1, unsigned long nbits, unsigned long start); unsigned long _find_next_and_bit(const unsigned long *addr1, const unsigned long *addr2, unsigned long nbits, unsigned long start); +unsigned long _find_next_andnot_bit(const unsigned long *addr1, const unsigned long *addr2, + unsigned long nbits, unsigned long start); unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits, unsigned long start); extern unsigned long _find_first_bit(const unsigned long *addr, unsigned long size); @@ -86,6 +88,37 @@ unsigned long find_next_and_bit(const unsigned long *addr1, } #endif +#ifndef 
find_next_andnot_bit +/** + * find_next_andnot_bit - find the next set bit in *addr1 excluding all the bits + * in *addr2 + * @addr1: The first address to base the search on + * @addr2: The second address to base the search on + * @size: The bitmap size in bits + * @offset: The bitnumber to start searching at + * + * Returns the bit number for the next set bit + * If no bits are set, returns @size. + */ +static inline +unsigned long find_next_andnot_bit(const unsigned long *addr1, + const unsigned long *addr2, unsigned long size, + unsigned long offset) +{ + if (small_const_nbits(size)) { + unsigned long val; + + if (unlikely(offset >= size)) + return size; + + val = *addr1 & ~*addr2 & GENMASK(size - 1, offset); + return val ? __ffs(val) : size; + } + + return _find_next_andnot_bit(addr1, addr2, size, offset); +} +#endif + #ifndef find_next_zero_bit /** * find_next_zero_bit - find the next cleared bit in a memory region diff --git a/lib/find_bit.c b/lib/find_bit.c index d00ee23ab657..53b02405421b 100644 --- a/lib/find_bit.c +++ b/lib/find_bit.c @@ -120,6 +120,15 @@ unsigned long _find_next_and_bit(const unsigned long *addr1, const unsigned long EXPORT_SYMBOL(_find_next_and_bit); #endif +#ifndef find_next_andnot_bit +unsigned long _find_next_andnot_bit(const unsigned long *addr1, const unsigned long *addr2, + unsigned long nbits, unsigned long start) +{ + return FIND_NEXT_BIT(addr1[idx] & ~addr2[idx], /* nop */, nbits, start); +} +EXPORT_SYMBOL(_find_next_andnot_bit); +#endif + #ifndef find_next_zero_bit unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits, unsigned long start) -- 2.31.1 ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface 2022-09-23 13:25 [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface Valentin Schneider 2022-09-23 13:25 ` [PATCH v4 1/7] lib/find_bit: Introduce find_next_andnot_bit() Valentin Schneider @ 2022-09-23 15:44 ` Yury Norov 2022-09-23 15:49 ` Valentin Schneider 2022-09-23 15:55 ` [PATCH v4 2/7] cpumask: Introduce for_each_cpu_andnot() Valentin Schneider ` (7 subsequent siblings) 9 siblings, 1 reply; 25+ messages in thread From: Yury Norov @ 2022-09-23 15:44 UTC (permalink / raw) To: Valentin Schneider Cc: netdev, linux-rdma, linux-kernel, Saeed Mahameed, Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andy Shevchenko, Rasmus Villemoes, Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman, Greg Kroah-Hartman, Heiko Carstens, Tony Luck, Jonathan Cameron, Gal Pressman, Tariq Toukan, Jesse Brandeburg On Fri, Sep 23, 2022 at 02:25:20PM +0100, Valentin Schneider wrote: > Hi folks, Hi, I received only 1st patch of the series. Can you give me a link for the full series so that I'll see how the new API is used? Thanks, Yury > Tariq pointed out in [1] that drivers allocating IRQ vectors would benefit > from having smarter NUMA-awareness (cpumask_local_spread() doesn't quite cut > it). > > The proposed interface involved an array of CPUs and a temporary cpumask, and > being my difficult self what I'm proposing here is an interface that doesn't > require any temporary storage other than some stack variables (at the cost of > one wild macro). > > Please note that this is based on top of Yury's bitmap-for-next [2] to leverage > his fancy new FIND_NEXT_BIT() macro. 
> > [1]: https://lore.kernel.org/all/20220728191203.4055-1-tariqt@nvidia.com/ > [2]: https://github.com/norov/linux.git/ -b bitmap-for-next > > A note on treewide use of for_each_cpu_andnot() > =============================================== > > I've used the below coccinelle script to find places that could be patched (I > couldn't figure out the valid syntax to patch from coccinelle itself): > > ,----- > @tmpandnot@ > expression tmpmask; > iterator for_each_cpu; > position p; > statement S; > @@ > cpumask_andnot(tmpmask, ...); > > ... > > ( > for_each_cpu@p(..., tmpmask, ...) > S > | > for_each_cpu@p(..., tmpmask, ...) > { > ... > } > ) > > @script:python depends on tmpandnot@ > p << tmpandnot.p; > @@ > coccilib.report.print_report(p[0], "andnot loop here") > '----- > > Which yields (against c40e8341e3b3): > > .//arch/powerpc/kernel/smp.c:1587:1-13: andnot loop here > .//arch/powerpc/kernel/smp.c:1530:1-13: andnot loop here > .//arch/powerpc/kernel/smp.c:1440:1-13: andnot loop here > .//arch/powerpc/platforms/powernv/subcore.c:306:2-14: andnot loop here > .//arch/x86/kernel/apic/x2apic_cluster.c:62:1-13: andnot loop here > .//drivers/acpi/acpi_pad.c:110:1-13: andnot loop here > .//drivers/cpufreq/armada-8k-cpufreq.c:148:1-13: andnot loop here > .//drivers/cpufreq/powernv-cpufreq.c:931:1-13: andnot loop here > .//drivers/net/ethernet/sfc/efx_channels.c:73:1-13: andnot loop here > .//drivers/net/ethernet/sfc/siena/efx_channels.c:73:1-13: andnot loop here > .//kernel/sched/core.c:345:1-13: andnot loop here > .//kernel/sched/core.c:366:1-13: andnot loop here > .//net/core/dev.c:3058:1-13: andnot loop here > > A lot of those are actually of the shape > > for_each_cpu(cpu, mask) { > ... > cpumask_andnot(mask, ...); > } > > I think *some* of the powerpc ones would be a match for for_each_cpu_andnot(), > but I decided to just stick to the one obvious one in __sched_core_flip(). 
> > Revisions > ========= > > v3 -> v4 > ++++++++ > > o Rebased on top of Yury's bitmap-for-next > o Added Tariq's mlx5e patch > o Made sched_numa_hop_mask() return cpu_online_mask for the NUMA_NO_NODE && > hops=0 case > > v2 -> v3 > ++++++++ > > o Added for_each_cpu_and() and for_each_cpu_andnot() tests (Yury) > o New patches to fix issues raised by running the above > > o New patch to use for_each_cpu_andnot() in sched/core.c (Yury) > > v1 -> v2 > ++++++++ > > o Split _find_next_bit() @invert into @invert1 and @invert2 (Yury) > o Rebase onto v6.0-rc1 > > Cheers, > Valentin > > Tariq Toukan (1): > net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity > hints > > Valentin Schneider (6): > lib/find_bit: Introduce find_next_andnot_bit() > cpumask: Introduce for_each_cpu_andnot() > lib/test_cpumask: Add for_each_cpu_and(not) tests > sched/core: Merge cpumask_andnot()+for_each_cpu() into > for_each_cpu_andnot() > sched/topology: Introduce sched_numa_hop_mask() > sched/topology: Introduce for_each_numa_hop_cpu() > > drivers/net/ethernet/mellanox/mlx5/core/eq.c | 13 +++++- > include/linux/cpumask.h | 39 ++++++++++++++++ > include/linux/find.h | 33 +++++++++++++ > include/linux/topology.h | 49 ++++++++++++++++++++ > kernel/sched/core.c | 5 +- > kernel/sched/topology.c | 31 +++++++++++++ > lib/cpumask_kunit.c | 19 ++++++++ > lib/find_bit.c | 9 ++++ > 8 files changed, 192 insertions(+), 6 deletions(-) > > -- > 2.31.1 ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface 2022-09-23 15:44 ` [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface Yury Norov @ 2022-09-23 15:49 ` Valentin Schneider 0 siblings, 0 replies; 25+ messages in thread From: Valentin Schneider @ 2022-09-23 15:49 UTC (permalink / raw) To: Yury Norov Cc: netdev, linux-rdma, linux-kernel, Saeed Mahameed, Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andy Shevchenko, Rasmus Villemoes, Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman, Greg Kroah-Hartman, Heiko Carstens, Tony Luck, Jonathan Cameron, Gal Pressman, Tariq Toukan, Jesse Brandeburg On 23/09/22 08:44, Yury Norov wrote: > On Fri, Sep 23, 2022 at 02:25:20PM +0100, Valentin Schneider wrote: >> Hi folks, > > Hi, > > I received only 1st patch of the series. Can you give me a link for > the full series so that I'll see how the new API is used? > Sigh, I got this when sending these out: 4.3.0 Temporary System Problem. Try again later (10) I'm going to re-send the missing ones and *hopefully* have them thread properly. Sorry about that. ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH v4 2/7] cpumask: Introduce for_each_cpu_andnot() 2022-09-23 13:25 [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface Valentin Schneider 2022-09-23 13:25 ` [PATCH v4 1/7] lib/find_bit: Introduce find_next_andnot_bit() Valentin Schneider 2022-09-23 15:44 ` [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface Yury Norov @ 2022-09-23 15:55 ` Valentin Schneider 2022-09-25 15:23 ` Yury Norov 2022-09-23 15:55 ` [PATCH v4 3/7] lib/test_cpumask: Add for_each_cpu_and(not) tests Valentin Schneider ` (6 subsequent siblings) 9 siblings, 1 reply; 25+ messages in thread From: Valentin Schneider @ 2022-09-23 15:55 UTC (permalink / raw) To: netdev, linux-rdma, linux-kernel Cc: Saeed Mahameed, Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Yury Norov, Andy Shevchenko, Rasmus Villemoes, Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman, Greg Kroah-Hartman, Heiko Carstens, Tony Luck, Jonathan Cameron, Gal Pressman, Tariq Toukan, Jesse Brandeburg for_each_cpu_and() is very convenient as it saves having to allocate a temporary cpumask to store the result of cpumask_and(). The same issue applies to cpumask_andnot() which doesn't actually need temporary storage for iteration purposes. Following what has been done for for_each_cpu_and(), introduce for_each_cpu_andnot(). Signed-off-by: Valentin Schneider <vschneid@redhat.com> --- include/linux/cpumask.h | 39 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 39 insertions(+) diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h index 1b442fb2001f..4c69e338bb8c 100644 --- a/include/linux/cpumask.h +++ b/include/linux/cpumask.h @@ -238,6 +238,25 @@ unsigned int cpumask_next_and(int n, const struct cpumask *src1p, nr_cpumask_bits, n + 1); } +/** + * cpumask_next_andnot - get the next cpu in *src1p & ~*src2p + * @n: the cpu prior to the place to search (ie. 
return will be > @n) + * @src1p: the first cpumask pointer + * @src2p: the second cpumask pointer + * + * Returns >= nr_cpu_ids if no further cpus set in *src1p & ~*src2p + */ +static inline +unsigned int cpumask_next_andnot(int n, const struct cpumask *src1p, + const struct cpumask *src2p) +{ + /* -1 is a legal arg here. */ + if (n != -1) + cpumask_check(n); + return find_next_andnot_bit(cpumask_bits(src1p), cpumask_bits(src2p), + nr_cpumask_bits, n + 1); +} + /** * for_each_cpu - iterate over every cpu in a mask * @cpu: the (optionally unsigned) integer iterator @@ -317,6 +336,26 @@ unsigned int __pure cpumask_next_wrap(int n, const struct cpumask *mask, int sta (cpu) = cpumask_next_and((cpu), (mask1), (mask2)), \ (cpu) < nr_cpu_ids;) +/** + * for_each_cpu_andnot - iterate over every cpu present in one mask, excluding + * those present in another. + * @cpu: the (optionally unsigned) integer iterator + * @mask1: the first cpumask pointer + * @mask2: the second cpumask pointer + * + * This saves a temporary CPU mask in many places. It is equivalent to: + * struct cpumask tmp; + * cpumask_andnot(&tmp, &mask1, &mask2); + * for_each_cpu(cpu, &tmp) + * ... + * + * After the loop, cpu is >= nr_cpu_ids. + */ +#define for_each_cpu_andnot(cpu, mask1, mask2) \ + for ((cpu) = -1; \ + (cpu) = cpumask_next_andnot((cpu), (mask1), (mask2)), \ + (cpu) < nr_cpu_ids;) + /** * cpumask_any_but - return a "random" in a cpumask, but not this one. * @mask: the cpumask to search -- 2.31.1 ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH v4 2/7] cpumask: Introduce for_each_cpu_andnot() 2022-09-23 15:55 ` [PATCH v4 2/7] cpumask: Introduce for_each_cpu_andnot() Valentin Schneider @ 2022-09-25 15:23 ` Yury Norov 2022-09-27 16:45 ` Valentin Schneider 0 siblings, 1 reply; 25+ messages in thread From: Yury Norov @ 2022-09-25 15:23 UTC (permalink / raw) To: Valentin Schneider Cc: netdev, linux-rdma, linux-kernel, Saeed Mahameed, Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andy Shevchenko, Rasmus Villemoes, Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman, Greg Kroah-Hartman, Heiko Carstens, Tony Luck, Jonathan Cameron, Gal Pressman, Tariq Toukan, Jesse Brandeburg On Fri, Sep 23, 2022 at 04:55:37PM +0100, Valentin Schneider wrote: > for_each_cpu_and() is very convenient as it saves having to allocate a > temporary cpumask to store the result of cpumask_and(). The same issue > applies to cpumask_andnot() which doesn't actually need temporary storage > for iteration purposes. > > Following what has been done for for_each_cpu_and(), introduce > for_each_cpu_andnot(). > > Signed-off-by: Valentin Schneider <vschneid@redhat.com> > --- > include/linux/cpumask.h | 39 +++++++++++++++++++++++++++++++++++++++ > 1 file changed, 39 insertions(+) > > diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h > index 1b442fb2001f..4c69e338bb8c 100644 > --- a/include/linux/cpumask.h > +++ b/include/linux/cpumask.h > @@ -238,6 +238,25 @@ unsigned int cpumask_next_and(int n, const struct cpumask *src1p, > nr_cpumask_bits, n + 1); > } > > +/** > + * cpumask_next_andnot - get the next cpu in *src1p & ~*src2p > + * @n: the cpu prior to the place to search (ie. 
return will be > @n) > + * @src1p: the first cpumask pointer > + * @src2p: the second cpumask pointer > + * > + * Returns >= nr_cpu_ids if no further cpus set in *src1p & ~*src2p > + */ > +static inline > +unsigned int cpumask_next_andnot(int n, const struct cpumask *src1p, > + const struct cpumask *src2p) > +{ > + /* -1 is a legal arg here. */ > + if (n != -1) > + cpumask_check(n); This is wrong. n-1 should be illegal here. The correct check is: cpumask_check(n+1); > + return find_next_andnot_bit(cpumask_bits(src1p), cpumask_bits(src2p), > + nr_cpumask_bits, n + 1); > +} > + > /** > * for_each_cpu - iterate over every cpu in a mask > * @cpu: the (optionally unsigned) integer iterator > @@ -317,6 +336,26 @@ unsigned int __pure cpumask_next_wrap(int n, const struct cpumask *mask, int sta > (cpu) = cpumask_next_and((cpu), (mask1), (mask2)), \ > (cpu) < nr_cpu_ids;) > > +/** > + * for_each_cpu_andnot - iterate over every cpu present in one mask, excluding > + * those present in another. > + * @cpu: the (optionally unsigned) integer iterator > + * @mask1: the first cpumask pointer > + * @mask2: the second cpumask pointer > + * > + * This saves a temporary CPU mask in many places. It is equivalent to: > + * struct cpumask tmp; > + * cpumask_andnot(&tmp, &mask1, &mask2); > + * for_each_cpu(cpu, &tmp) > + * ... > + * > + * After the loop, cpu is >= nr_cpu_ids. > + */ > +#define for_each_cpu_andnot(cpu, mask1, mask2) \ > + for ((cpu) = -1; \ > + (cpu) = cpumask_next_andnot((cpu), (mask1), (mask2)), \ > + (cpu) < nr_cpu_ids;) This would raise a cpumask_check() warning at the very last iteration. Because cpu is initialized inside the loop, you don't need to check it at all. You can do it like this: #define for_each_cpu_andnot(cpu, mask1, mask2) \ for_each_andnot_bit(...) Check this series for details (and please review). https://lore.kernel.org/all/20220919210559.1509179-8-yury.norov@gmail.com/T/ Thanks, Yury ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4 2/7] cpumask: Introduce for_each_cpu_andnot() 2022-09-25 15:23 ` Yury Norov @ 2022-09-27 16:45 ` Valentin Schneider 2022-09-27 20:02 ` Yury Norov 0 siblings, 1 reply; 25+ messages in thread From: Valentin Schneider @ 2022-09-27 16:45 UTC (permalink / raw) To: Yury Norov Cc: netdev, linux-rdma, linux-kernel, Saeed Mahameed, Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andy Shevchenko, Rasmus Villemoes, Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman, Greg Kroah-Hartman, Heiko Carstens, Tony Luck, Jonathan Cameron, Gal Pressman, Tariq Toukan, Jesse Brandeburg On 25/09/22 08:23, Yury Norov wrote: > On Fri, Sep 23, 2022 at 04:55:37PM +0100, Valentin Schneider wrote: >> +/** >> + * for_each_cpu_andnot - iterate over every cpu present in one mask, excluding >> + * those present in another. >> + * @cpu: the (optionally unsigned) integer iterator >> + * @mask1: the first cpumask pointer >> + * @mask2: the second cpumask pointer >> + * >> + * This saves a temporary CPU mask in many places. It is equivalent to: >> + * struct cpumask tmp; >> + * cpumask_andnot(&tmp, &mask1, &mask2); >> + * for_each_cpu(cpu, &tmp) >> + * ... >> + * >> + * After the loop, cpu is >= nr_cpu_ids. >> + */ >> +#define for_each_cpu_andnot(cpu, mask1, mask2) \ >> + for ((cpu) = -1; \ >> + (cpu) = cpumask_next_andnot((cpu), (mask1), (mask2)), \ >> + (cpu) < nr_cpu_ids;) > > This would raise cpumaks_check() warning at the very last iteration. > Because cpu is initialized insize the loop, you don't need to check it > at all. You can do it like this: > > #define for_each_cpu_andnot(cpu, mask1, mask2) \ > for_each_andnot_bit(...) > > Check this series for details (and please review). > https://lore.kernel.org/all/20220919210559.1509179-8-yury.norov@gmail.com/T/ > Thanks, I'll have a look. > Thanks, > Yury ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4 2/7] cpumask: Introduce for_each_cpu_andnot() 2022-09-27 16:45 ` Valentin Schneider @ 2022-09-27 20:02 ` Yury Norov 0 siblings, 0 replies; 25+ messages in thread From: Yury Norov @ 2022-09-27 20:02 UTC (permalink / raw) To: Valentin Schneider Cc: netdev, linux-rdma, linux-kernel, Saeed Mahameed, Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andy Shevchenko, Rasmus Villemoes, Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman, Greg Kroah-Hartman, Heiko Carstens, Tony Luck, Jonathan Cameron, Gal Pressman, Tariq Toukan, Jesse Brandeburg On Tue, Sep 27, 2022 at 05:45:04PM +0100, Valentin Schneider wrote: > On 25/09/22 08:23, Yury Norov wrote: > > On Fri, Sep 23, 2022 at 04:55:37PM +0100, Valentin Schneider wrote: > >> +/** > >> + * for_each_cpu_andnot - iterate over every cpu present in one mask, excluding > >> + * those present in another. > >> + * @cpu: the (optionally unsigned) integer iterator > >> + * @mask1: the first cpumask pointer > >> + * @mask2: the second cpumask pointer > >> + * > >> + * This saves a temporary CPU mask in many places. It is equivalent to: > >> + * struct cpumask tmp; > >> + * cpumask_andnot(&tmp, &mask1, &mask2); > >> + * for_each_cpu(cpu, &tmp) > >> + * ... > >> + * > >> + * After the loop, cpu is >= nr_cpu_ids. > >> + */ > >> +#define for_each_cpu_andnot(cpu, mask1, mask2) \ > >> + for ((cpu) = -1; \ > >> + (cpu) = cpumask_next_andnot((cpu), (mask1), (mask2)), \ > >> + (cpu) < nr_cpu_ids;) > > > > This would raise cpumaks_check() warning at the very last iteration. > > Because cpu is initialized insize the loop, you don't need to check it > > at all. You can do it like this: > > > > #define for_each_cpu_andnot(cpu, mask1, mask2) \ > > for_each_andnot_bit(...) > > > > Check this series for details (and please review). > > https://lore.kernel.org/all/20220919210559.1509179-8-yury.norov@gmail.com/T/ > > > > Thanks, I'll have a look. 
Also, if you send the first 4 patches as a separate series on top of bitmap-for-next, I'll be able to include them in bitmap-for-next and then in the 6.1 pull request. Thanks, Yury ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH v4 3/7] lib/test_cpumask: Add for_each_cpu_and(not) tests 2022-09-23 13:25 [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface Valentin Schneider ` (2 preceding siblings ...) 2022-09-23 15:55 ` [PATCH v4 2/7] cpumask: Introduce for_each_cpu_andnot() Valentin Schneider @ 2022-09-23 15:55 ` Valentin Schneider 2022-09-23 15:55 ` [PATCH v4 4/7] sched/core: Merge cpumask_andnot()+for_each_cpu() into for_each_cpu_andnot() Valentin Schneider ` (5 subsequent siblings) 9 siblings, 0 replies; 25+ messages in thread From: Valentin Schneider @ 2022-09-23 15:55 UTC (permalink / raw) To: netdev, linux-rdma, linux-kernel Cc: Yury Norov, Saeed Mahameed, Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andy Shevchenko, Rasmus Villemoes, Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman, Greg Kroah-Hartman, Heiko Carstens, Tony Luck, Jonathan Cameron, Gal Pressman, Tariq Toukan, Jesse Brandeburg Following the recent introduction of for_each_andnot(), add some tests to ensure for_each_cpu_and(not) results in the same as iterating over the result of cpumask_and(not)(). 
Suggested-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Valentin Schneider <vschneid@redhat.com> --- lib/cpumask_kunit.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/lib/cpumask_kunit.c b/lib/cpumask_kunit.c index ecbeec72221e..d1fc6ece21f3 100644 --- a/lib/cpumask_kunit.c +++ b/lib/cpumask_kunit.c @@ -33,6 +33,19 @@ KUNIT_EXPECT_EQ_MSG((test), nr_cpu_ids - mask_weight, iter, MASK_MSG(mask)); \ } while (0) +#define EXPECT_FOR_EACH_CPU_OP_EQ(test, op, mask1, mask2) \ + do { \ + const cpumask_t *m1 = (mask1); \ + const cpumask_t *m2 = (mask2); \ + int weight; \ + int cpu, iter = 0; \ + cpumask_##op(&mask_tmp, m1, m2); \ + weight = cpumask_weight(&mask_tmp); \ + for_each_cpu_##op(cpu, mask1, mask2) \ + iter++; \ + KUNIT_EXPECT_EQ((test), weight, iter); \ + } while (0) + #define EXPECT_FOR_EACH_CPU_WRAP_EQ(test, mask) \ do { \ const cpumask_t *m = (mask); \ @@ -54,6 +67,7 @@ static cpumask_t mask_empty; static cpumask_t mask_all; +static cpumask_t mask_tmp; static void test_cpumask_weight(struct kunit *test) { @@ -101,10 +115,15 @@ static void test_cpumask_iterators(struct kunit *test) EXPECT_FOR_EACH_CPU_EQ(test, &mask_empty); EXPECT_FOR_EACH_CPU_NOT_EQ(test, &mask_empty); EXPECT_FOR_EACH_CPU_WRAP_EQ(test, &mask_empty); + EXPECT_FOR_EACH_CPU_OP_EQ(test, and, &mask_empty, &mask_empty); + EXPECT_FOR_EACH_CPU_OP_EQ(test, and, cpu_possible_mask, &mask_empty); + EXPECT_FOR_EACH_CPU_OP_EQ(test, andnot, &mask_empty, &mask_empty); EXPECT_FOR_EACH_CPU_EQ(test, cpu_possible_mask); EXPECT_FOR_EACH_CPU_NOT_EQ(test, cpu_possible_mask); EXPECT_FOR_EACH_CPU_WRAP_EQ(test, cpu_possible_mask); + EXPECT_FOR_EACH_CPU_OP_EQ(test, and, cpu_possible_mask, cpu_possible_mask); + EXPECT_FOR_EACH_CPU_OP_EQ(test, andnot, cpu_possible_mask, &mask_empty); } static void test_cpumask_iterators_builtin(struct kunit *test) -- 2.31.1 ^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH v4 4/7] sched/core: Merge cpumask_andnot()+for_each_cpu() into for_each_cpu_andnot() 2022-09-23 13:25 [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface Valentin Schneider ` (3 preceding siblings ...) 2022-09-23 15:55 ` [PATCH v4 3/7] lib/test_cpumask: Add for_each_cpu_and(not) tests Valentin Schneider @ 2022-09-23 15:55 ` Valentin Schneider 2022-09-23 15:55 ` [PATCH v4 5/7] sched/topology: Introduce sched_numa_hop_mask() Valentin Schneider ` (4 subsequent siblings) 9 siblings, 0 replies; 25+ messages in thread From: Valentin Schneider @ 2022-09-23 15:55 UTC (permalink / raw) To: netdev, linux-rdma, linux-kernel Cc: Yury Norov, Saeed Mahameed, Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andy Shevchenko, Rasmus Villemoes, Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman, Greg Kroah-Hartman, Heiko Carstens, Tony Luck, Jonathan Cameron, Gal Pressman, Tariq Toukan, Jesse Brandeburg This removes the second use of the sched_core_mask temporary mask. Suggested-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Valentin Schneider <vschneid@redhat.com> --- kernel/sched/core.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index ee28253c9ac0..b4c3112b0095 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -360,10 +360,7 @@ static void __sched_core_flip(bool enabled) /* * Toggle the offline CPUs. */ - cpumask_copy(&sched_core_mask, cpu_possible_mask); - cpumask_andnot(&sched_core_mask, &sched_core_mask, cpu_online_mask); - - for_each_cpu(cpu, &sched_core_mask) + for_each_cpu_andnot(cpu, cpu_possible_mask, cpu_online_mask) cpu_rq(cpu)->core_enabled = enabled; cpus_read_unlock(); -- 2.31.1 ^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH v4 5/7] sched/topology: Introduce sched_numa_hop_mask()

From: Valentin Schneider @ 2022-09-23 15:55 UTC
To: netdev, linux-rdma, linux-kernel
Cc: Saeed Mahameed, Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Yury Norov, Andy Shevchenko, Rasmus Villemoes, Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman, Greg Kroah-Hartman, Heiko Carstens, Tony Luck, Jonathan Cameron, Gal Pressman, Tariq Toukan, Jesse Brandeburg

Tariq has pointed out that drivers allocating IRQ vectors would benefit
from having smarter NUMA-awareness - cpumask_local_spread() only knows
about the local node and everything outside is in the same bucket.

sched_domains_numa_masks is pretty much what we want to hand out (a cpumask
of CPUs reachable within a given distance budget), introduce
sched_numa_hop_mask() to export those cpumasks.

Link: http://lore.kernel.org/r/20220728191203.4055-1-tariqt@nvidia.com
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 include/linux/topology.h | 12 ++++++++++++
 kernel/sched/topology.c  | 31 +++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 4564faafd0e1..3e91ae6d0ad5 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -245,5 +245,17 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
 	return cpumask_of_node(cpu_to_node(cpu));
 }

+#ifdef CONFIG_NUMA
+extern const struct cpumask *sched_numa_hop_mask(int node, int hops);
+#else
+static inline const struct cpumask *sched_numa_hop_mask(int node, int hops)
+{
+	if (node == NUMA_NO_NODE && !hops)
+		return cpu_online_mask;
+
+	return ERR_PTR(-EOPNOTSUPP);
+}
+#endif /* CONFIG_NUMA */
+

 #endif /* _LINUX_TOPOLOGY_H */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8739c2a5a54e..ee77706603c0 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2067,6 +2067,37 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
 	return found;
 }

+/**
+ * sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops away.
+ * @node: The node to count hops from.
+ * @hops: Include CPUs up to that many hops away. 0 means local node.
+ *
+ * Requires rcu_lock to be held. Returned cpumask is only valid within that
+ * read-side section, copy it if required beyond that.
+ *
+ * Note that not all hops are equal in distance; see sched_init_numa() for how
+ * distances and masks are handled.
+ *
+ * Also note that this is a reflection of sched_domains_numa_masks, which may change
+ * during the lifetime of the system (offline nodes are taken out of the masks).
+ */
+const struct cpumask *sched_numa_hop_mask(int node, int hops)
+{
+	struct cpumask ***masks = rcu_dereference(sched_domains_numa_masks);
+
+	if (node == NUMA_NO_NODE && !hops)
+		return cpu_online_mask;
+
+	if (node >= nr_node_ids || hops >= sched_domains_numa_levels)
+		return ERR_PTR(-EINVAL);
+
+	if (!masks)
+		return NULL;
+
+	return masks[hops][node];
+}
+EXPORT_SYMBOL_GPL(sched_numa_hop_mask);
+
 #endif /* CONFIG_NUMA */

 static int __sdt_alloc(const struct cpumask *cpu_map)
--
2.31.1
* Re: [PATCH v4 5/7] sched/topology: Introduce sched_numa_hop_mask() 2022-09-23 15:55 ` [PATCH v4 5/7] sched/topology: Introduce sched_numa_hop_mask() Valentin Schneider @ 2022-09-25 15:00 ` Yury Norov 2022-09-25 15:24 ` Yury Norov 2022-09-27 16:45 ` Valentin Schneider 2022-09-25 18:05 ` Yury Norov 1 sibling, 2 replies; 25+ messages in thread From: Yury Norov @ 2022-09-25 15:00 UTC (permalink / raw) To: Valentin Schneider Cc: netdev, linux-rdma, linux-kernel, Saeed Mahameed, Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andy Shevchenko, Rasmus Villemoes, Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman, Greg Kroah-Hartman, Heiko Carstens, Tony Luck, Jonathan Cameron, Gal Pressman, Tariq Toukan, Jesse Brandeburg On Fri, Sep 23, 2022 at 04:55:40PM +0100, Valentin Schneider wrote: > Tariq has pointed out that drivers allocating IRQ vectors would benefit > from having smarter NUMA-awareness - cpumask_local_spread() only knows > about the local node and everything outside is in the same bucket. > > sched_domains_numa_masks is pretty much what we want to hand out (a cpumask > of CPUs reachable within a given distance budget), introduce > sched_numa_hop_mask() to export those cpumasks. 
> > Link: http://lore.kernel.org/r/20220728191203.4055-1-tariqt@nvidia.com > Signed-off-by: Valentin Schneider <vschneid@redhat.com> > --- > include/linux/topology.h | 12 ++++++++++++ > kernel/sched/topology.c | 31 +++++++++++++++++++++++++++++++ > 2 files changed, 43 insertions(+) > > diff --git a/include/linux/topology.h b/include/linux/topology.h > index 4564faafd0e1..3e91ae6d0ad5 100644 > --- a/include/linux/topology.h > +++ b/include/linux/topology.h > @@ -245,5 +245,17 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu) > return cpumask_of_node(cpu_to_node(cpu)); > } > > +#ifdef CONFIG_NUMA > +extern const struct cpumask *sched_numa_hop_mask(int node, int hops); > +#else > +static inline const struct cpumask *sched_numa_hop_mask(int node, int hops) > +{ > + if (node == NUMA_NO_NODE && !hops) > + return cpu_online_mask; > + > + return ERR_PTR(-EOPNOTSUPP); > +} > +#endif /* CONFIG_NUMA */ > + > > #endif /* _LINUX_TOPOLOGY_H */ > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > index 8739c2a5a54e..ee77706603c0 100644 > --- a/kernel/sched/topology.c > +++ b/kernel/sched/topology.c > @@ -2067,6 +2067,37 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu) > return found; > } > > +/** > + * sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops away. > + * @node: The node to count hops from. > + * @hops: Include CPUs up to that many hops away. 0 means local node. > + * > + * Requires rcu_lock to be held. Returned cpumask is only valid within that > + * read-side section, copy it if required beyond that. > + * > + * Note that not all hops are equal in distance; see sched_init_numa() for how > + * distances and masks are handled. > + * > + * Also note that this is a reflection of sched_domains_numa_masks, which may change > + * during the lifetime of the system (offline nodes are taken out of the masks). > + */ Since it's exported, can you declare function parameters and return values properly? 
> +const struct cpumask *sched_numa_hop_mask(int node, int hops) > +{ > + struct cpumask ***masks = rcu_dereference(sched_domains_numa_masks); > + > + if (node == NUMA_NO_NODE && !hops) > + return cpu_online_mask; > + > + if (node >= nr_node_ids || hops >= sched_domains_numa_levels) > + return ERR_PTR(-EINVAL); > + > + if (!masks) > + return NULL; > + > + return masks[hops][node]; > +} > +EXPORT_SYMBOL_GPL(sched_numa_hop_mask); > + > #endif /* CONFIG_NUMA */ > > static int __sdt_alloc(const struct cpumask *cpu_map) > -- > 2.31.1 ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4 5/7] sched/topology: Introduce sched_numa_hop_mask() 2022-09-25 15:00 ` Yury Norov @ 2022-09-25 15:24 ` Yury Norov 2022-09-27 16:45 ` Valentin Schneider 1 sibling, 0 replies; 25+ messages in thread From: Yury Norov @ 2022-09-25 15:24 UTC (permalink / raw) To: Valentin Schneider Cc: netdev, linux-rdma, linux-kernel, Saeed Mahameed, Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andy Shevchenko, Rasmus Villemoes, Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman, Greg Kroah-Hartman, Heiko Carstens, Tony Luck, Jonathan Cameron, Gal Pressman, Tariq Toukan, Jesse Brandeburg On Sun, Sep 25, 2022 at 08:00:49AM -0700, Yury Norov wrote: > On Fri, Sep 23, 2022 at 04:55:40PM +0100, Valentin Schneider wrote: > > Tariq has pointed out that drivers allocating IRQ vectors would benefit > > from having smarter NUMA-awareness - cpumask_local_spread() only knows > > about the local node and everything outside is in the same bucket. > > > > sched_domains_numa_masks is pretty much what we want to hand out (a cpumask > > of CPUs reachable within a given distance budget), introduce > > sched_numa_hop_mask() to export those cpumasks. 
> > > > Link: http://lore.kernel.org/r/20220728191203.4055-1-tariqt@nvidia.com > > Signed-off-by: Valentin Schneider <vschneid@redhat.com> > > --- > > include/linux/topology.h | 12 ++++++++++++ > > kernel/sched/topology.c | 31 +++++++++++++++++++++++++++++++ > > 2 files changed, 43 insertions(+) > > > > diff --git a/include/linux/topology.h b/include/linux/topology.h > > index 4564faafd0e1..3e91ae6d0ad5 100644 > > --- a/include/linux/topology.h > > +++ b/include/linux/topology.h > > @@ -245,5 +245,17 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu) > > return cpumask_of_node(cpu_to_node(cpu)); > > } > > > > +#ifdef CONFIG_NUMA > > +extern const struct cpumask *sched_numa_hop_mask(int node, int hops); > > +#else > > +static inline const struct cpumask *sched_numa_hop_mask(int node, int hops) > > +{ > > + if (node == NUMA_NO_NODE && !hops) > > + return cpu_online_mask; > > + > > + return ERR_PTR(-EOPNOTSUPP); > > +} > > +#endif /* CONFIG_NUMA */ > > + > > > > #endif /* _LINUX_TOPOLOGY_H */ > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > > index 8739c2a5a54e..ee77706603c0 100644 > > --- a/kernel/sched/topology.c > > +++ b/kernel/sched/topology.c > > @@ -2067,6 +2067,37 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu) > > return found; > > } > > > > +/** > > + * sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops away. > > + * @node: The node to count hops from. > > + * @hops: Include CPUs up to that many hops away. 0 means local node. > > + * > > + * Requires rcu_lock to be held. Returned cpumask is only valid within that > > + * read-side section, copy it if required beyond that. > > + * > > + * Note that not all hops are equal in distance; see sched_init_numa() for how > > + * distances and masks are handled. > > + * > > + * Also note that this is a reflection of sched_domains_numa_masks, which may change > > + * during the lifetime of the system (offline nodes are taken out of the masks). 
> > + */ > > Since it's exported, can you declare function parameters and return > values properly? s/declare/describe ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4 5/7] sched/topology: Introduce sched_numa_hop_mask() 2022-09-25 15:00 ` Yury Norov 2022-09-25 15:24 ` Yury Norov @ 2022-09-27 16:45 ` Valentin Schneider 2022-09-27 19:30 ` Yury Norov 1 sibling, 1 reply; 25+ messages in thread From: Valentin Schneider @ 2022-09-27 16:45 UTC (permalink / raw) To: Yury Norov Cc: netdev, linux-rdma, linux-kernel, Saeed Mahameed, Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andy Shevchenko, Rasmus Villemoes, Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman, Greg Kroah-Hartman, Heiko Carstens, Tony Luck, Jonathan Cameron, Gal Pressman, Tariq Toukan, Jesse Brandeburg On 25/09/22 08:00, Yury Norov wrote: > On Fri, Sep 23, 2022 at 04:55:40PM +0100, Valentin Schneider wrote: >> +/** >> + * sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops away. >> + * @node: The node to count hops from. >> + * @hops: Include CPUs up to that many hops away. 0 means local node. >> + * >> + * Requires rcu_lock to be held. Returned cpumask is only valid within that >> + * read-side section, copy it if required beyond that. >> + * >> + * Note that not all hops are equal in distance; see sched_init_numa() for how >> + * distances and masks are handled. >> + * >> + * Also note that this is a reflection of sched_domains_numa_masks, which may change >> + * during the lifetime of the system (offline nodes are taken out of the masks). >> + */ > > Since it's exported, can you declare function parameters and return > values properly? > I'll add a bit about the return value; what is missing for the parameters? ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4 5/7] sched/topology: Introduce sched_numa_hop_mask() 2022-09-27 16:45 ` Valentin Schneider @ 2022-09-27 19:30 ` Yury Norov 0 siblings, 0 replies; 25+ messages in thread From: Yury Norov @ 2022-09-27 19:30 UTC (permalink / raw) To: Valentin Schneider Cc: netdev, linux-rdma, linux-kernel, Saeed Mahameed, Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andy Shevchenko, Rasmus Villemoes, Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman, Greg Kroah-Hartman, Heiko Carstens, Tony Luck, Jonathan Cameron, Gal Pressman, Tariq Toukan, Jesse Brandeburg On Tue, Sep 27, 2022 at 05:45:10PM +0100, Valentin Schneider wrote: > On 25/09/22 08:00, Yury Norov wrote: > > On Fri, Sep 23, 2022 at 04:55:40PM +0100, Valentin Schneider wrote: > >> +/** > >> + * sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops away. > >> + * @node: The node to count hops from. > >> + * @hops: Include CPUs up to that many hops away. 0 means local node. > >> + * > >> + * Requires rcu_lock to be held. Returned cpumask is only valid within that > >> + * read-side section, copy it if required beyond that. > >> + * > >> + * Note that not all hops are equal in distance; see sched_init_numa() for how > >> + * distances and masks are handled. > >> + * > >> + * Also note that this is a reflection of sched_domains_numa_masks, which may change > >> + * during the lifetime of the system (offline nodes are taken out of the masks). > >> + */ > > > > Since it's exported, can you declare function parameters and return > > values properly? > > > > I'll add a bit about the return value; what is missing for the parameters? My bad, scratch this. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4 5/7] sched/topology: Introduce sched_numa_hop_mask() 2022-09-23 15:55 ` [PATCH v4 5/7] sched/topology: Introduce sched_numa_hop_mask() Valentin Schneider 2022-09-25 15:00 ` Yury Norov @ 2022-09-25 18:05 ` Yury Norov 2022-09-25 18:13 ` Yury Norov 2022-09-27 16:45 ` Valentin Schneider 1 sibling, 2 replies; 25+ messages in thread From: Yury Norov @ 2022-09-25 18:05 UTC (permalink / raw) To: Valentin Schneider Cc: netdev, linux-rdma, linux-kernel, Saeed Mahameed, Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andy Shevchenko, Rasmus Villemoes, Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman, Greg Kroah-Hartman, Heiko Carstens, Tony Luck, Jonathan Cameron, Gal Pressman, Tariq Toukan, Jesse Brandeburg On Fri, Sep 23, 2022 at 04:55:40PM +0100, Valentin Schneider wrote: > Tariq has pointed out that drivers allocating IRQ vectors would benefit > from having smarter NUMA-awareness - cpumask_local_spread() only knows > about the local node and everything outside is in the same bucket. > > sched_domains_numa_masks is pretty much what we want to hand out (a cpumask > of CPUs reachable within a given distance budget), introduce > sched_numa_hop_mask() to export those cpumasks. 
> > Link: http://lore.kernel.org/r/20220728191203.4055-1-tariqt@nvidia.com > Signed-off-by: Valentin Schneider <vschneid@redhat.com> > --- > include/linux/topology.h | 12 ++++++++++++ > kernel/sched/topology.c | 31 +++++++++++++++++++++++++++++++ > 2 files changed, 43 insertions(+) > > diff --git a/include/linux/topology.h b/include/linux/topology.h > index 4564faafd0e1..3e91ae6d0ad5 100644 > --- a/include/linux/topology.h > +++ b/include/linux/topology.h > @@ -245,5 +245,17 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu) > return cpumask_of_node(cpu_to_node(cpu)); > } > > +#ifdef CONFIG_NUMA > +extern const struct cpumask *sched_numa_hop_mask(int node, int hops); > +#else > +static inline const struct cpumask *sched_numa_hop_mask(int node, int hops) > +{ > + if (node == NUMA_NO_NODE && !hops) > + return cpu_online_mask; > + > + return ERR_PTR(-EOPNOTSUPP); > +} > +#endif /* CONFIG_NUMA */ > + > > #endif /* _LINUX_TOPOLOGY_H */ > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > index 8739c2a5a54e..ee77706603c0 100644 > --- a/kernel/sched/topology.c > +++ b/kernel/sched/topology.c > @@ -2067,6 +2067,37 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu) > return found; > } > > +/** > + * sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops away. > + * @node: The node to count hops from. > + * @hops: Include CPUs up to that many hops away. 0 means local node. > + * > + * Requires rcu_lock to be held. Returned cpumask is only valid within that > + * read-side section, copy it if required beyond that. > + * > + * Note that not all hops are equal in distance; see sched_init_numa() for how > + * distances and masks are handled. > + * > + * Also note that this is a reflection of sched_domains_numa_masks, which may change > + * during the lifetime of the system (offline nodes are taken out of the masks). 
> + */ > +const struct cpumask *sched_numa_hop_mask(int node, int hops) > +{ > + struct cpumask ***masks = rcu_dereference(sched_domains_numa_masks); > + > + if (node == NUMA_NO_NODE && !hops) > + return cpu_online_mask; > + > + if (node >= nr_node_ids || hops >= sched_domains_numa_levels) > + return ERR_PTR(-EINVAL); This looks like a sanity check. If so, it should go before the snippet above, so that client code would behave consistently. > + > + if (!masks) > + return NULL; In (node == NUMA_NO_NODE && !hops) case you return online cpus. Here you return NULL just to convert it to cpu_online_mask in the caller. This looks inconsistent. So, together with the above comment, this makes me feel that you'd do it like this: const struct cpumask *sched_numa_hop_mask(int node, int hops) { struct cpumask ***masks; if (node >= nr_node_ids || hops >= sched_domains_numa_levels) { #ifdef CONFIG_SCHED_DEBUG pr_err(...); #endif return ERR_PTR(-EINVAL); } if (node == NUMA_NO_NODE && !hops) return cpu_online_mask; /* or NULL */ masks = rcu_dereference(sched_domains_numa_masks); if (!masks) return cpu_online_mask; /* or NULL */ return masks[hops][node]; } ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4 5/7] sched/topology: Introduce sched_numa_hop_mask() 2022-09-25 18:05 ` Yury Norov @ 2022-09-25 18:13 ` Yury Norov 2022-09-27 16:45 ` Valentin Schneider 1 sibling, 0 replies; 25+ messages in thread From: Yury Norov @ 2022-09-25 18:13 UTC (permalink / raw) To: Valentin Schneider Cc: netdev, linux-rdma, linux-kernel, Saeed Mahameed, Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andy Shevchenko, Rasmus Villemoes, Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman, Greg Kroah-Hartman, Heiko Carstens, Tony Luck, Jonathan Cameron, Gal Pressman, Tariq Toukan, Jesse Brandeburg On Sun, Sep 25, 2022 at 11:05:18AM -0700, Yury Norov wrote: > On Fri, Sep 23, 2022 at 04:55:40PM +0100, Valentin Schneider wrote: > > Tariq has pointed out that drivers allocating IRQ vectors would benefit > > from having smarter NUMA-awareness - cpumask_local_spread() only knows > > about the local node and everything outside is in the same bucket. > > > > sched_domains_numa_masks is pretty much what we want to hand out (a cpumask > > of CPUs reachable within a given distance budget), introduce > > sched_numa_hop_mask() to export those cpumasks. 
> > > > Link: http://lore.kernel.org/r/20220728191203.4055-1-tariqt@nvidia.com > > Signed-off-by: Valentin Schneider <vschneid@redhat.com> > > --- > > include/linux/topology.h | 12 ++++++++++++ > > kernel/sched/topology.c | 31 +++++++++++++++++++++++++++++++ > > 2 files changed, 43 insertions(+) > > > > diff --git a/include/linux/topology.h b/include/linux/topology.h > > index 4564faafd0e1..3e91ae6d0ad5 100644 > > --- a/include/linux/topology.h > > +++ b/include/linux/topology.h > > @@ -245,5 +245,17 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu) > > return cpumask_of_node(cpu_to_node(cpu)); > > } > > > > +#ifdef CONFIG_NUMA > > +extern const struct cpumask *sched_numa_hop_mask(int node, int hops); > > +#else > > +static inline const struct cpumask *sched_numa_hop_mask(int node, int hops) > > +{ > > + if (node == NUMA_NO_NODE && !hops) > > + return cpu_online_mask; > > + > > + return ERR_PTR(-EOPNOTSUPP); > > +} > > +#endif /* CONFIG_NUMA */ > > + > > > > #endif /* _LINUX_TOPOLOGY_H */ > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > > index 8739c2a5a54e..ee77706603c0 100644 > > --- a/kernel/sched/topology.c > > +++ b/kernel/sched/topology.c > > @@ -2067,6 +2067,37 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu) > > return found; > > } > > > > +/** > > + * sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops away. > > + * @node: The node to count hops from. > > + * @hops: Include CPUs up to that many hops away. 0 means local node. > > + * > > + * Requires rcu_lock to be held. Returned cpumask is only valid within that > > + * read-side section, copy it if required beyond that. > > + * > > + * Note that not all hops are equal in distance; see sched_init_numa() for how > > + * distances and masks are handled. > > + * > > + * Also note that this is a reflection of sched_domains_numa_masks, which may change > > + * during the lifetime of the system (offline nodes are taken out of the masks). 
> > + */ > > +const struct cpumask *sched_numa_hop_mask(int node, int hops) > > +{ > > + struct cpumask ***masks = rcu_dereference(sched_domains_numa_masks); > > + > > + if (node == NUMA_NO_NODE && !hops) > > + return cpu_online_mask; > > + > > + if (node >= nr_node_ids || hops >= sched_domains_numa_levels) > > + return ERR_PTR(-EINVAL); > > This looks like a sanity check. If so, it should go before the snippet > above, so that client code would behave consistently. > > > + > > + if (!masks) > > + return NULL; > > In (node == NUMA_NO_NODE && !hops) case you return online cpus. Here > you return NULL just to convert it to cpu_online_mask in the caller. > This looks inconsistent. So, together with the above comment, this > makes me feel that you'd do it like this: > > const struct cpumask *sched_numa_hop_mask(int node, int hops) > { > struct cpumask ***masks; > > if (node >= nr_node_ids || hops >= sched_domains_numa_levels) > { > #ifdef CONFIG_SCHED_DEBUG > pr_err(...); > #endif > return ERR_PTR(-EINVAL); > } It's an exported function, and any lame driver may crash the system by dereferencing a random pointer. You need to check the node for -2, -3, etc, because only -1 is a valid negative value. For hops, it should be an unsigned int. Right? > > if (node == NUMA_NO_NODE && !hops) > return cpu_online_mask; /* or NULL */ > > masks = rcu_dereference(sched_domains_numa_masks); > if (!masks) > return cpu_online_mask; /* or NULL */ > > return masks[hops][node]; > } ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4 5/7] sched/topology: Introduce sched_numa_hop_mask() 2022-09-25 18:05 ` Yury Norov 2022-09-25 18:13 ` Yury Norov @ 2022-09-27 16:45 ` Valentin Schneider 1 sibling, 0 replies; 25+ messages in thread From: Valentin Schneider @ 2022-09-27 16:45 UTC (permalink / raw) To: Yury Norov Cc: netdev, linux-rdma, linux-kernel, Saeed Mahameed, Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andy Shevchenko, Rasmus Villemoes, Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman, Greg Kroah-Hartman, Heiko Carstens, Tony Luck, Jonathan Cameron, Gal Pressman, Tariq Toukan, Jesse Brandeburg On 25/09/22 11:05, Yury Norov wrote: > On Fri, Sep 23, 2022 at 04:55:40PM +0100, Valentin Schneider wrote: >> +const struct cpumask *sched_numa_hop_mask(int node, int hops) >> +{ >> + struct cpumask ***masks = rcu_dereference(sched_domains_numa_masks); >> + >> + if (node == NUMA_NO_NODE && !hops) >> + return cpu_online_mask; >> + >> + if (node >= nr_node_ids || hops >= sched_domains_numa_levels) >> + return ERR_PTR(-EINVAL); > > This looks like a sanity check. If so, it should go before the snippet > above, so that client code would behave consistently. > nr_node_ids is unsigned, so -1 >= nr_node_ids is true. >> + >> + if (!masks) >> + return NULL; > > In (node == NUMA_NO_NODE && !hops) case you return online cpus. Here > you return NULL just to convert it to cpu_online_mask in the caller. > This looks inconsistent. 
So, together with the above comment, this > makes me feel that you'd do it like this: > > const struct cpumask *sched_numa_hop_mask(int node, int hops) > { > struct cpumask ***masks; > > if (node >= nr_node_ids || hops >= sched_domains_numa_levels) > { > #ifdef CONFIG_SCHED_DEBUG > pr_err(...); > #endif > return ERR_PTR(-EINVAL); > } > > if (node == NUMA_NO_NODE && !hops) > return cpu_online_mask; /* or NULL */ > > masks = rcu_dereference(sched_domains_numa_masks); > if (!masks) > return cpu_online_mask; /* or NULL */ > > return masks[hops][node]; > } If we're being pedantic, sched_numa_hop_mask() shouldn't return cpu_online_mask in those cases, but that was the least horrible option I found to get something sensible for the NUMA_NO_NODE / !CONFIG_NUMA case. I might be able to better handle this with your suggestion of having a mask iterator. ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH v4 6/7] sched/topology: Introduce for_each_numa_hop_cpu()

From: Valentin Schneider @ 2022-09-23 15:55 UTC
To: netdev, linux-rdma, linux-kernel
Cc: Saeed Mahameed, Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Yury Norov, Andy Shevchenko, Rasmus Villemoes, Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman, Greg Kroah-Hartman, Heiko Carstens, Tony Luck, Jonathan Cameron, Gal Pressman, Tariq Toukan, Jesse Brandeburg

The recently introduced sched_numa_hop_mask() exposes cpumasks of CPUs
reachable within a given distance budget, but this means each successive
cpumask is a superset of the previous one.

Code wanting to allocate one item per CPU (e.g. IRQs) at increasing
distances would thus need to allocate a temporary cpumask to note which
CPUs have already been visited. This can be prevented by leveraging
for_each_cpu_andnot() - package all that logic into one ugl^D fancy macro.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 include/linux/topology.h | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 3e91ae6d0ad5..7aa7e6a4c739 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -257,5 +257,42 @@ static inline const struct cpumask *sched_numa_hop_mask(int node, int hops)
 }
 #endif /* CONFIG_NUMA */

+/**
+ * for_each_numa_hop_cpu - iterate over CPUs by increasing NUMA distance,
+ *                         starting from a given node.
+ * @cpu: the iteration variable.
+ * @node: the NUMA node to start the search from.
+ *
+ * Requires rcu_lock to be held.
+ * Careful: this is a double loop, 'break' won't work as expected.
+ *
+ *
+ * Implementation notes:
+ *
+ * Providing it is valid, the mask returned by
+ *   sched_numa_hop_mask(node, hops+1)
+ * is a superset of the one returned by
+ *   sched_numa_hop_mask(node, hops)
+ * which may not be that useful for drivers that try to spread things out and
+ * want to visit a CPU not more than once.
+ *
+ * To accommodate for that, we use for_each_cpu_andnot() to iterate over the cpus
+ * of sched_numa_hop_mask(node, hops+1) with the CPUs of
+ * sched_numa_hop_mask(node, hops) removed, IOW we only iterate over CPUs
+ * a given distance away (rather than *up to* a given distance).
+ *
+ * hops=0 forces us to play silly games: we pass cpu_none_mask to
+ * for_each_cpu_andnot(), which turns it into for_each_cpu().
+ */
+#define for_each_numa_hop_cpu(cpu, node)				       \
+	for (struct { const struct cpumask *curr, *prev; int hops; } __v =     \
+		     { sched_numa_hop_mask(node, 0), NULL, 0 };		       \
+	     !IS_ERR_OR_NULL(__v.curr);					       \
+	     __v.hops++,						       \
+	     __v.prev = __v.curr,					       \
+	     __v.curr = sched_numa_hop_mask(node, __v.hops))		       \
+		for_each_cpu_andnot(cpu,				       \
+				    __v.curr,				       \
+				    __v.hops ? __v.prev : cpu_none_mask)

 #endif /* _LINUX_TOPOLOGY_H */
--
2.31.1
* Re: [PATCH v4 6/7] sched/topology: Introduce for_each_numa_hop_cpu() 2022-09-23 15:55 ` [PATCH v4 6/7] sched/topology: Introduce for_each_numa_hop_cpu() Valentin Schneider @ 2022-09-25 14:58 ` Yury Norov 2022-09-27 16:45 ` Valentin Schneider 0 siblings, 1 reply; 25+ messages in thread From: Yury Norov @ 2022-09-25 14:58 UTC (permalink / raw) To: Valentin Schneider Cc: netdev, linux-rdma, linux-kernel, Saeed Mahameed, Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andy Shevchenko, Rasmus Villemoes, Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman, Greg Kroah-Hartman, Heiko Carstens, Tony Luck, Jonathan Cameron, Gal Pressman, Tariq Toukan, Jesse Brandeburg On Fri, Sep 23, 2022 at 04:55:41PM +0100, Valentin Schneider wrote: > The recently introduced sched_numa_hop_mask() exposes cpumasks of CPUs > reachable within a given distance budget, but this means each successive > cpumask is a superset of the previous one. > > Code wanting to allocate one item per CPU (e.g. IRQs) at increasing > distances would thus need to allocate a temporary cpumask to note which > CPUs have already been visited. This can be prevented by leveraging > for_each_cpu_andnot() - package all that logic into one ugl^D fancy macro. > > Signed-off-by: Valentin Schneider <vschneid@redhat.com> > --- > include/linux/topology.h | 37 +++++++++++++++++++++++++++++++++++++ > 1 file changed, 37 insertions(+) > > diff --git a/include/linux/topology.h b/include/linux/topology.h > index 3e91ae6d0ad5..7aa7e6a4c739 100644 > --- a/include/linux/topology.h > +++ b/include/linux/topology.h > @@ -257,5 +257,42 @@ static inline const struct cpumask *sched_numa_hop_mask(int node, int hops) > } > #endif /* CONFIG_NUMA */ > > +/** > + * for_each_numa_hop_cpu - iterate over CPUs by increasing NUMA distance, > + * starting from a given node. > + * @cpu: the iteration variable. > + * @node: the NUMA node to start the search from. 
> + * > + * Requires rcu_lock to be held. > + * Careful: this is a double loop, 'break' won't work as expected. This warning concerns me not only because new iteration loop hides complexity and breaks 'break' (sic!), but also because it looks too specific. Why don't you split it, so instead: for_each_numa_hop_cpu(cpu, dev->priv.numa_node) { cpus[i] = cpu; if (++i == ncomp_eqs) goto spread_done; } in the following patch you would have something like this: for_each_node_hop(hop, node) { struct cpumask hop_cpus = sched_numa_hop_mask(node, hop); for_each_cpu_andnot(cpu, hop_cpus, ...) { cpus[i] = cpu; if (++i == ncomp_eqs) goto spread_done; } } It looks more bulky, but I believe there will be more users for for_each_node_hop() alone. On top of that, if you really like it, you can implement for_each_numa_hop_cpu() if you want. > + * Implementation notes: > + * > + * Providing it is valid, the mask returned by > + * sched_numa_hop_mask(node, hops+1) > + * is a superset of the one returned by > + * sched_numa_hop_mask(node, hops) > + * which may not be that useful for drivers that try to spread things out and > + * want to visit a CPU not more than once. > + * > + * To accommodate for that, we use for_each_cpu_andnot() to iterate over the cpus > + * of sched_numa_hop_mask(node, hops+1) with the CPUs of > + * sched_numa_hop_mask(node, hops) removed, IOW we only iterate over CPUs > + * a given distance away (rather than *up to* a given distance). > + * > + * hops=0 forces us to play silly games: we pass cpu_none_mask to > + * for_each_cpu_andnot(), which turns it into for_each_cpu(). > + */ > +#define for_each_numa_hop_cpu(cpu, node) \ > + for (struct { const struct cpumask *curr, *prev; int hops; } __v = \ > + { sched_numa_hop_mask(node, 0), NULL, 0 }; \ This anonymous structure is never used as structure. What for you define it? Why not just declare hops, prev and curr without packing them? 
Thanks, Yury > + !IS_ERR_OR_NULL(__v.curr); \ > + __v.hops++, \ > + __v.prev = __v.curr, \ > + __v.curr = sched_numa_hop_mask(node, __v.hops)) \ > + for_each_cpu_andnot(cpu, \ > + __v.curr, \ > + __v.hops ? __v.prev : cpu_none_mask) > > #endif /* _LINUX_TOPOLOGY_H */ > -- > 2.31.1 ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4 6/7] sched/topology: Introduce for_each_numa_hop_cpu()
  2022-09-25 14:58   ` Yury Norov
@ 2022-09-27 16:45     ` Valentin Schneider
  0 siblings, 0 replies; 25+ messages in thread
From: Valentin Schneider @ 2022-09-27 16:45 UTC (permalink / raw)
  To: Yury Norov
  Cc: netdev, linux-rdma, linux-kernel, Saeed Mahameed, Leon Romanovsky,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Andy Shevchenko, Rasmus Villemoes, Ingo Molnar, Peter Zijlstra,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Greg Kroah-Hartman, Heiko Carstens, Tony Luck, Jonathan Cameron,
	Gal Pressman, Tariq Toukan, Jesse Brandeburg

On 25/09/22 07:58, Yury Norov wrote:
> On Fri, Sep 23, 2022 at 04:55:41PM +0100, Valentin Schneider wrote:
>> +/**
>> + * for_each_numa_hop_cpu - iterate over CPUs by increasing NUMA distance,
>> + *                         starting from a given node.
>> + * @cpu: the iteration variable.
>> + * @node: the NUMA node to start the search from.
>> + *
>> + * Requires rcu_lock to be held.
>> + * Careful: this is a double loop, 'break' won't work as expected.
>
> This warning concerns me, not only because the new iteration loop hides
> complexity and breaks 'break' (sic!), but also because it looks too
> specific. Why don't you split it, so that instead of:
>
> 	for_each_numa_hop_cpu(cpu, dev->priv.numa_node) {
> 		cpus[i] = cpu;
> 		if (++i == ncomp_eqs)
> 			goto spread_done;
> 	}
>
> in the following patch you would have something like this:
>
> 	for_each_node_hop(hop, node) {
> 		struct cpumask hop_cpus = sched_numa_hop_mask(node, hop);
>
> 		for_each_cpu_andnot(cpu, hop_cpus, ...) {
> 			cpus[i] = cpu;
> 			if (++i == ncomp_eqs)
> 				goto spread_done;
> 		}
> 	}
>
> It looks more bulky, but I believe there will be more users for
> for_each_node_hop() alone.
>
> On top of that, if you really like it, you can implement
> for_each_numa_hop_cpu() if you want.
>

IIUC you're suggesting to introduce an iterator for the cpumasks first,
and then maybe add one on top for the individual CPUs. I'm happy to do
that, though I have to say I'm keen to keep the CPU iterator - IMO the
complexity is justified if it is centralized in one location and saves
us from boring old boilerplate code.

>> + * Implementation notes:
>> + *
>> + * Providing it is valid, the mask returned by
>> + *	sched_numa_hop_mask(node, hops+1)
>> + * is a superset of the one returned by
>> + *	sched_numa_hop_mask(node, hops)
>> + * which may not be that useful for drivers that try to spread things out and
>> + * want to visit a CPU not more than once.
>> + *
>> + * To accommodate for that, we use for_each_cpu_andnot() to iterate over the cpus
>> + * of sched_numa_hop_mask(node, hops+1) with the CPUs of
>> + * sched_numa_hop_mask(node, hops) removed, IOW we only iterate over CPUs
>> + * a given distance away (rather than *up to* a given distance).
>> + *
>> + * hops=0 forces us to play silly games: we pass cpu_none_mask to
>> + * for_each_cpu_andnot(), which turns it into for_each_cpu().
>> + */
>> +#define for_each_numa_hop_cpu(cpu, node)                                   \
>> +	for (struct { const struct cpumask *curr, *prev; int hops; } __v = \
>> +		     { sched_numa_hop_mask(node, 0), NULL, 0 };            \
>
> This anonymous structure is never used as a structure. Why do you
> define it? Why not just declare hops, prev and curr without packing
> them?
>

I haven't found a way to do this that doesn't involve a struct -
apparently you can't mix types in a for-loop declaration clause.

> Thanks,
> Yury
>

^ permalink raw reply	[flat|nested] 25+ messages in thread
* [PATCH v4 7/7] net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints
  2022-09-23 13:25 [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface Valentin Schneider
                   ` (6 preceding siblings ...)
  2022-09-23 15:55 ` [PATCH v4 6/7] sched/topology: Introduce for_each_numa_hop_cpu() Valentin Schneider
@ 2022-09-23 15:55 ` Valentin Schneider
  2022-09-25  7:48 ` [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface Tariq Toukan
  2022-10-18  6:36 ` Tariq Toukan
  9 siblings, 0 replies; 25+ messages in thread
From: Valentin Schneider @ 2022-09-23 15:55 UTC (permalink / raw)
  To: netdev, linux-rdma, linux-kernel
  Cc: Tariq Toukan, Saeed Mahameed, Leon Romanovsky, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Yury Norov,
	Andy Shevchenko, Rasmus Villemoes, Ingo Molnar, Peter Zijlstra,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Greg Kroah-Hartman, Heiko Carstens, Tony Luck, Jonathan Cameron,
	Gal Pressman, Jesse Brandeburg

From: Tariq Toukan <tariqt@nvidia.com>

In the IRQ affinity hints, replace the binary NUMA preference (local /
remote) with the improved for_each_numa_hop_cpu() API that minds the
actual distances, so that remote NUMA nodes with a short distance are
preferred over farther ones.

This has significant performance implications when using NUMA-aware
allocated memory (follow [1] and derivatives for example).

[1] drivers/net/ethernet/mellanox/mlx5/core/en_main.c :: mlx5e_open_channel()
    int cpu = cpumask_first(mlx5_comp_irq_get_affinity_mask(priv->mdev, ix));

Performance tests:
TCP multi-stream, using 16 iperf3 instances pinned to 16 cores (with aRFS on).
Active cores: 64,65,72,73,80,81,88,89,96,97,104,105,112,113,120,121

+-------------------------+-----------+------------------+------------------+
|                         | BW (Gbps) | TX side CPU util | RX side CPU util |
+-------------------------+-----------+------------------+------------------+
| Baseline                | 52.3      | 6.4 %            | 17.9 %           |
+-------------------------+-----------+------------------+------------------+
| Applied on TX side only | 52.6      | 5.2 %            | 18.5 %           |
+-------------------------+-----------+------------------+------------------+
| Applied on RX side only | 94.9      | 11.9 %           | 27.2 %           |
+-------------------------+-----------+------------------+------------------+
| Applied on both sides   | 95.1      | 8.4 %            | 27.3 %           |
+-------------------------+-----------+------------------+------------------+

The bottleneck on the RX side is released, reaching linerate (~1.8x
speedup), with ~30% less CPU util on TX.

* CPU util on active cores only.

Setup details (similar for both sides):

NIC: ConnectX6-DX dual port, 100 Gbps each. Single port used in the tests.

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              256
On-line CPU(s) list: 0-255
Thread(s) per core:  2
Core(s) per socket:  64
Socket(s):           2
NUMA node(s):        16
Vendor ID:           AuthenticAMD
CPU family:          25
Model:               1
Model name:          AMD EPYC 7763 64-Core Processor
Stepping:            1
CPU MHz:             2594.804
BogoMIPS:            4890.73
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            32768K
NUMA node0 CPU(s):   0-7,128-135
NUMA node1 CPU(s):   8-15,136-143
NUMA node2 CPU(s):   16-23,144-151
NUMA node3 CPU(s):   24-31,152-159
NUMA node4 CPU(s):   32-39,160-167
NUMA node5 CPU(s):   40-47,168-175
NUMA node6 CPU(s):   48-55,176-183
NUMA node7 CPU(s):   56-63,184-191
NUMA node8 CPU(s):   64-71,192-199
NUMA node9 CPU(s):   72-79,200-207
NUMA node10 CPU(s):  80-87,208-215
NUMA node11 CPU(s):  88-95,216-223
NUMA node12 CPU(s):  96-103,224-231
NUMA node13 CPU(s):  104-111,232-239
NUMA node14 CPU(s):  112-119,240-247
NUMA node15 CPU(s):  120-127,248-255
..
$ numactl -H
..
node distances:
node   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
  0:  10  11  11  11  12  12  12  12  32  32  32  32  32  32  32  32
  1:  11  10  11  11  12  12  12  12  32  32  32  32  32  32  32  32
  2:  11  11  10  11  12  12  12  12  32  32  32  32  32  32  32  32
  3:  11  11  11  10  12  12  12  12  32  32  32  32  32  32  32  32
  4:  12  12  12  12  10  11  11  11  32  32  32  32  32  32  32  32
  5:  12  12  12  12  11  10  11  11  32  32  32  32  32  32  32  32
  6:  12  12  12  12  11  11  10  11  32  32  32  32  32  32  32  32
  7:  12  12  12  12  11  11  11  10  32  32  32  32  32  32  32  32
  8:  32  32  32  32  32  32  32  32  10  11  11  11  12  12  12  12
  9:  32  32  32  32  32  32  32  32  11  10  11  11  12  12  12  12
 10:  32  32  32  32  32  32  32  32  11  11  10  11  12  12  12  12
 11:  32  32  32  32  32  32  32  32  11  11  11  10  12  12  12  12
 12:  32  32  32  32  32  32  32  32  12  12  12  12  10  11  11  11
 13:  32  32  32  32  32  32  32  32  12  12  12  12  11  10  11  11
 14:  32  32  32  32  32  32  32  32  12  12  12  12  11  11  10  11
 15:  32  32  32  32  32  32  32  32  12  12  12  12  11  11  11  10

$ cat /sys/class/net/ens5f0/device/numa_node
14

Affinity hints (127 IRQs):

Before:
331: 00000000,00000000,00000000,00000000,00010000,00000000,00000000,00000000
332: 00000000,00000000,00000000,00000000,00020000,00000000,00000000,00000000
333: 00000000,00000000,00000000,00000000,00040000,00000000,00000000,00000000
334: 00000000,00000000,00000000,00000000,00080000,00000000,00000000,00000000
335: 00000000,00000000,00000000,00000000,00100000,00000000,00000000,00000000
336: 00000000,00000000,00000000,00000000,00200000,00000000,00000000,00000000
337: 00000000,00000000,00000000,00000000,00400000,00000000,00000000,00000000
338: 00000000,00000000,00000000,00000000,00800000,00000000,00000000,00000000
339: 00010000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
340: 00020000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
341: 00040000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
342: 00080000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
343: 00100000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
344:
00200000,00000000,00000000,00000000,00000000,00000000,00000000,00000000 345: 00400000,00000000,00000000,00000000,00000000,00000000,00000000,00000000 346: 00800000,00000000,00000000,00000000,00000000,00000000,00000000,00000000 347: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001 348: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000002 349: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000004 350: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000008 351: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000010 352: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000020 353: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000040 354: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000080 355: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000100 356: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000200 357: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000400 358: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000800 359: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00001000 360: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00002000 361: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00004000 362: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00008000 363: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00010000 364: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00020000 365: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00040000 366: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00080000 367: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00100000 368: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00200000 369: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00400000 
370: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00800000 371: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,01000000 372: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,02000000 373: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,04000000 374: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,08000000 375: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,10000000 376: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,20000000 377: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,40000000 378: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,80000000 379: 00000000,00000000,00000000,00000000,00000000,00000000,00000001,00000000 380: 00000000,00000000,00000000,00000000,00000000,00000000,00000002,00000000 381: 00000000,00000000,00000000,00000000,00000000,00000000,00000004,00000000 382: 00000000,00000000,00000000,00000000,00000000,00000000,00000008,00000000 383: 00000000,00000000,00000000,00000000,00000000,00000000,00000010,00000000 384: 00000000,00000000,00000000,00000000,00000000,00000000,00000020,00000000 385: 00000000,00000000,00000000,00000000,00000000,00000000,00000040,00000000 386: 00000000,00000000,00000000,00000000,00000000,00000000,00000080,00000000 387: 00000000,00000000,00000000,00000000,00000000,00000000,00000100,00000000 388: 00000000,00000000,00000000,00000000,00000000,00000000,00000200,00000000 389: 00000000,00000000,00000000,00000000,00000000,00000000,00000400,00000000 390: 00000000,00000000,00000000,00000000,00000000,00000000,00000800,00000000 391: 00000000,00000000,00000000,00000000,00000000,00000000,00001000,00000000 392: 00000000,00000000,00000000,00000000,00000000,00000000,00002000,00000000 393: 00000000,00000000,00000000,00000000,00000000,00000000,00004000,00000000 394: 00000000,00000000,00000000,00000000,00000000,00000000,00008000,00000000 395: 
00000000,00000000,00000000,00000000,00000000,00000000,00010000,00000000 396: 00000000,00000000,00000000,00000000,00000000,00000000,00020000,00000000 397: 00000000,00000000,00000000,00000000,00000000,00000000,00040000,00000000 398: 00000000,00000000,00000000,00000000,00000000,00000000,00080000,00000000 399: 00000000,00000000,00000000,00000000,00000000,00000000,00100000,00000000 400: 00000000,00000000,00000000,00000000,00000000,00000000,00200000,00000000 401: 00000000,00000000,00000000,00000000,00000000,00000000,00400000,00000000 402: 00000000,00000000,00000000,00000000,00000000,00000000,00800000,00000000 403: 00000000,00000000,00000000,00000000,00000000,00000000,01000000,00000000 404: 00000000,00000000,00000000,00000000,00000000,00000000,02000000,00000000 405: 00000000,00000000,00000000,00000000,00000000,00000000,04000000,00000000 406: 00000000,00000000,00000000,00000000,00000000,00000000,08000000,00000000 407: 00000000,00000000,00000000,00000000,00000000,00000000,10000000,00000000 408: 00000000,00000000,00000000,00000000,00000000,00000000,20000000,00000000 409: 00000000,00000000,00000000,00000000,00000000,00000000,40000000,00000000 410: 00000000,00000000,00000000,00000000,00000000,00000000,80000000,00000000 411: 00000000,00000000,00000000,00000000,00000000,00000001,00000000,00000000 412: 00000000,00000000,00000000,00000000,00000000,00000002,00000000,00000000 413: 00000000,00000000,00000000,00000000,00000000,00000004,00000000,00000000 414: 00000000,00000000,00000000,00000000,00000000,00000008,00000000,00000000 415: 00000000,00000000,00000000,00000000,00000000,00000010,00000000,00000000 416: 00000000,00000000,00000000,00000000,00000000,00000020,00000000,00000000 417: 00000000,00000000,00000000,00000000,00000000,00000040,00000000,00000000 418: 00000000,00000000,00000000,00000000,00000000,00000080,00000000,00000000 419: 00000000,00000000,00000000,00000000,00000000,00000100,00000000,00000000 420: 00000000,00000000,00000000,00000000,00000000,00000200,00000000,00000000 
421: 00000000,00000000,00000000,00000000,00000000,00000400,00000000,00000000 422: 00000000,00000000,00000000,00000000,00000000,00000800,00000000,00000000 423: 00000000,00000000,00000000,00000000,00000000,00001000,00000000,00000000 424: 00000000,00000000,00000000,00000000,00000000,00002000,00000000,00000000 425: 00000000,00000000,00000000,00000000,00000000,00004000,00000000,00000000 426: 00000000,00000000,00000000,00000000,00000000,00008000,00000000,00000000 427: 00000000,00000000,00000000,00000000,00000000,00010000,00000000,00000000 428: 00000000,00000000,00000000,00000000,00000000,00020000,00000000,00000000 429: 00000000,00000000,00000000,00000000,00000000,00040000,00000000,00000000 430: 00000000,00000000,00000000,00000000,00000000,00080000,00000000,00000000 431: 00000000,00000000,00000000,00000000,00000000,00100000,00000000,00000000 432: 00000000,00000000,00000000,00000000,00000000,00200000,00000000,00000000 433: 00000000,00000000,00000000,00000000,00000000,00400000,00000000,00000000 434: 00000000,00000000,00000000,00000000,00000000,00800000,00000000,00000000 435: 00000000,00000000,00000000,00000000,00000000,01000000,00000000,00000000 436: 00000000,00000000,00000000,00000000,00000000,02000000,00000000,00000000 437: 00000000,00000000,00000000,00000000,00000000,04000000,00000000,00000000 438: 00000000,00000000,00000000,00000000,00000000,08000000,00000000,00000000 439: 00000000,00000000,00000000,00000000,00000000,10000000,00000000,00000000 440: 00000000,00000000,00000000,00000000,00000000,20000000,00000000,00000000 441: 00000000,00000000,00000000,00000000,00000000,40000000,00000000,00000000 442: 00000000,00000000,00000000,00000000,00000000,80000000,00000000,00000000 443: 00000000,00000000,00000000,00000000,00000001,00000000,00000000,00000000 444: 00000000,00000000,00000000,00000000,00000002,00000000,00000000,00000000 445: 00000000,00000000,00000000,00000000,00000004,00000000,00000000,00000000 446: 
00000000,00000000,00000000,00000000,00000008,00000000,00000000,00000000 447: 00000000,00000000,00000000,00000000,00000010,00000000,00000000,00000000 448: 00000000,00000000,00000000,00000000,00000020,00000000,00000000,00000000 449: 00000000,00000000,00000000,00000000,00000040,00000000,00000000,00000000 450: 00000000,00000000,00000000,00000000,00000080,00000000,00000000,00000000 451: 00000000,00000000,00000000,00000000,00000100,00000000,00000000,00000000 452: 00000000,00000000,00000000,00000000,00000200,00000000,00000000,00000000 453: 00000000,00000000,00000000,00000000,00000400,00000000,00000000,00000000 454: 00000000,00000000,00000000,00000000,00000800,00000000,00000000,00000000 455: 00000000,00000000,00000000,00000000,00001000,00000000,00000000,00000000 456: 00000000,00000000,00000000,00000000,00002000,00000000,00000000,00000000 457: 00000000,00000000,00000000,00000000,00004000,00000000,00000000,00000000 After: 331: 00000000,00000000,00000000,00000000,00010000,00000000,00000000,00000000 332: 00000000,00000000,00000000,00000000,00020000,00000000,00000000,00000000 333: 00000000,00000000,00000000,00000000,00040000,00000000,00000000,00000000 334: 00000000,00000000,00000000,00000000,00080000,00000000,00000000,00000000 335: 00000000,00000000,00000000,00000000,00100000,00000000,00000000,00000000 336: 00000000,00000000,00000000,00000000,00200000,00000000,00000000,00000000 337: 00000000,00000000,00000000,00000000,00400000,00000000,00000000,00000000 338: 00000000,00000000,00000000,00000000,00800000,00000000,00000000,00000000 339: 00010000,00000000,00000000,00000000,00000000,00000000,00000000,00000000 340: 00020000,00000000,00000000,00000000,00000000,00000000,00000000,00000000 341: 00040000,00000000,00000000,00000000,00000000,00000000,00000000,00000000 342: 00080000,00000000,00000000,00000000,00000000,00000000,00000000,00000000 343: 00100000,00000000,00000000,00000000,00000000,00000000,00000000,00000000 344: 
00200000,00000000,00000000,00000000,00000000,00000000,00000000,00000000 345: 00400000,00000000,00000000,00000000,00000000,00000000,00000000,00000000 346: 00800000,00000000,00000000,00000000,00000000,00000000,00000000,00000000 347: 00000000,00000000,00000000,00000000,00000001,00000000,00000000,00000000 348: 00000000,00000000,00000000,00000000,00000002,00000000,00000000,00000000 349: 00000000,00000000,00000000,00000000,00000004,00000000,00000000,00000000 350: 00000000,00000000,00000000,00000000,00000008,00000000,00000000,00000000 351: 00000000,00000000,00000000,00000000,00000010,00000000,00000000,00000000 352: 00000000,00000000,00000000,00000000,00000020,00000000,00000000,00000000 353: 00000000,00000000,00000000,00000000,00000040,00000000,00000000,00000000 354: 00000000,00000000,00000000,00000000,00000080,00000000,00000000,00000000 355: 00000000,00000000,00000000,00000000,00000100,00000000,00000000,00000000 356: 00000000,00000000,00000000,00000000,00000200,00000000,00000000,00000000 357: 00000000,00000000,00000000,00000000,00000400,00000000,00000000,00000000 358: 00000000,00000000,00000000,00000000,00000800,00000000,00000000,00000000 359: 00000000,00000000,00000000,00000000,00001000,00000000,00000000,00000000 360: 00000000,00000000,00000000,00000000,00002000,00000000,00000000,00000000 361: 00000000,00000000,00000000,00000000,00004000,00000000,00000000,00000000 362: 00000000,00000000,00000000,00000000,00008000,00000000,00000000,00000000 363: 00000000,00000000,00000000,00000000,01000000,00000000,00000000,00000000 364: 00000000,00000000,00000000,00000000,02000000,00000000,00000000,00000000 365: 00000000,00000000,00000000,00000000,04000000,00000000,00000000,00000000 366: 00000000,00000000,00000000,00000000,08000000,00000000,00000000,00000000 367: 00000000,00000000,00000000,00000000,10000000,00000000,00000000,00000000 368: 00000000,00000000,00000000,00000000,20000000,00000000,00000000,00000000 369: 00000000,00000000,00000000,00000000,40000000,00000000,00000000,00000000 
370: 00000000,00000000,00000000,00000000,80000000,00000000,00000000,00000000 371: 00000001,00000000,00000000,00000000,00000000,00000000,00000000,00000000 372: 00000002,00000000,00000000,00000000,00000000,00000000,00000000,00000000 373: 00000004,00000000,00000000,00000000,00000000,00000000,00000000,00000000 374: 00000008,00000000,00000000,00000000,00000000,00000000,00000000,00000000 375: 00000010,00000000,00000000,00000000,00000000,00000000,00000000,00000000 376: 00000020,00000000,00000000,00000000,00000000,00000000,00000000,00000000 377: 00000040,00000000,00000000,00000000,00000000,00000000,00000000,00000000 378: 00000080,00000000,00000000,00000000,00000000,00000000,00000000,00000000 379: 00000100,00000000,00000000,00000000,00000000,00000000,00000000,00000000 380: 00000200,00000000,00000000,00000000,00000000,00000000,00000000,00000000 381: 00000400,00000000,00000000,00000000,00000000,00000000,00000000,00000000 382: 00000800,00000000,00000000,00000000,00000000,00000000,00000000,00000000 383: 00001000,00000000,00000000,00000000,00000000,00000000,00000000,00000000 384: 00002000,00000000,00000000,00000000,00000000,00000000,00000000,00000000 385: 00004000,00000000,00000000,00000000,00000000,00000000,00000000,00000000 386: 00008000,00000000,00000000,00000000,00000000,00000000,00000000,00000000 387: 01000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000 388: 02000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000 389: 04000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000 390: 08000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000 391: 10000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000 392: 20000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000 393: 40000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000 394: 80000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000 395: 
00000000,00000000,00000000,00000000,00000000,00000001,00000000,00000000 396: 00000000,00000000,00000000,00000000,00000000,00000002,00000000,00000000 397: 00000000,00000000,00000000,00000000,00000000,00000004,00000000,00000000 398: 00000000,00000000,00000000,00000000,00000000,00000008,00000000,00000000 399: 00000000,00000000,00000000,00000000,00000000,00000010,00000000,00000000 400: 00000000,00000000,00000000,00000000,00000000,00000020,00000000,00000000 401: 00000000,00000000,00000000,00000000,00000000,00000040,00000000,00000000 402: 00000000,00000000,00000000,00000000,00000000,00000080,00000000,00000000 403: 00000000,00000000,00000000,00000000,00000000,00000100,00000000,00000000 404: 00000000,00000000,00000000,00000000,00000000,00000200,00000000,00000000 405: 00000000,00000000,00000000,00000000,00000000,00000400,00000000,00000000 406: 00000000,00000000,00000000,00000000,00000000,00000800,00000000,00000000 407: 00000000,00000000,00000000,00000000,00000000,00001000,00000000,00000000 408: 00000000,00000000,00000000,00000000,00000000,00002000,00000000,00000000 409: 00000000,00000000,00000000,00000000,00000000,00004000,00000000,00000000 410: 00000000,00000000,00000000,00000000,00000000,00008000,00000000,00000000 411: 00000000,00000000,00000000,00000000,00000000,00010000,00000000,00000000 412: 00000000,00000000,00000000,00000000,00000000,00020000,00000000,00000000 413: 00000000,00000000,00000000,00000000,00000000,00040000,00000000,00000000 414: 00000000,00000000,00000000,00000000,00000000,00080000,00000000,00000000 415: 00000000,00000000,00000000,00000000,00000000,00100000,00000000,00000000 416: 00000000,00000000,00000000,00000000,00000000,00200000,00000000,00000000 417: 00000000,00000000,00000000,00000000,00000000,00400000,00000000,00000000 418: 00000000,00000000,00000000,00000000,00000000,00800000,00000000,00000000 419: 00000000,00000000,00000000,00000000,00000000,01000000,00000000,00000000 420: 00000000,00000000,00000000,00000000,00000000,02000000,00000000,00000000 
421: 00000000,00000000,00000000,00000000,00000000,04000000,00000000,00000000 422: 00000000,00000000,00000000,00000000,00000000,08000000,00000000,00000000 423: 00000000,00000000,00000000,00000000,00000000,10000000,00000000,00000000 424: 00000000,00000000,00000000,00000000,00000000,20000000,00000000,00000000 425: 00000000,00000000,00000000,00000000,00000000,40000000,00000000,00000000 426: 00000000,00000000,00000000,00000000,00000000,80000000,00000000,00000000 427: 00000000,00000001,00000000,00000000,00000000,00000000,00000000,00000000 428: 00000000,00000002,00000000,00000000,00000000,00000000,00000000,00000000 429: 00000000,00000004,00000000,00000000,00000000,00000000,00000000,00000000 430: 00000000,00000008,00000000,00000000,00000000,00000000,00000000,00000000 431: 00000000,00000010,00000000,00000000,00000000,00000000,00000000,00000000 432: 00000000,00000020,00000000,00000000,00000000,00000000,00000000,00000000 433: 00000000,00000040,00000000,00000000,00000000,00000000,00000000,00000000 434: 00000000,00000080,00000000,00000000,00000000,00000000,00000000,00000000 435: 00000000,00000100,00000000,00000000,00000000,00000000,00000000,00000000 436: 00000000,00000200,00000000,00000000,00000000,00000000,00000000,00000000 437: 00000000,00000400,00000000,00000000,00000000,00000000,00000000,00000000 438: 00000000,00000800,00000000,00000000,00000000,00000000,00000000,00000000 439: 00000000,00001000,00000000,00000000,00000000,00000000,00000000,00000000 440: 00000000,00002000,00000000,00000000,00000000,00000000,00000000,00000000 441: 00000000,00004000,00000000,00000000,00000000,00000000,00000000,00000000 442: 00000000,00008000,00000000,00000000,00000000,00000000,00000000,00000000 443: 00000000,00010000,00000000,00000000,00000000,00000000,00000000,00000000 444: 00000000,00020000,00000000,00000000,00000000,00000000,00000000,00000000 445: 00000000,00040000,00000000,00000000,00000000,00000000,00000000,00000000 446: 
00000000,00080000,00000000,00000000,00000000,00000000,00000000,00000000
447: 00000000,00100000,00000000,00000000,00000000,00000000,00000000,00000000
448: 00000000,00200000,00000000,00000000,00000000,00000000,00000000,00000000
449: 00000000,00400000,00000000,00000000,00000000,00000000,00000000,00000000
450: 00000000,00800000,00000000,00000000,00000000,00000000,00000000,00000000
451: 00000000,01000000,00000000,00000000,00000000,00000000,00000000,00000000
452: 00000000,02000000,00000000,00000000,00000000,00000000,00000000,00000000
453: 00000000,04000000,00000000,00000000,00000000,00000000,00000000,00000000
454: 00000000,08000000,00000000,00000000,00000000,00000000,00000000,00000000
455: 00000000,10000000,00000000,00000000,00000000,00000000,00000000,00000000
456: 00000000,20000000,00000000,00000000,00000000,00000000,00000000,00000000
457: 00000000,40000000,00000000,00000000,00000000,00000000,00000000,00000000

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/eq.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 229728c80233..4818cc0c9bc3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -812,6 +812,7 @@ static int comp_irqs_request(struct mlx5_core_dev *dev)
 	int ncomp_eqs = table->num_comp_eqs;
 	u16 *cpus;
 	int ret;
+	int cpu;
 	int i;
 
 	ncomp_eqs = table->num_comp_eqs;
@@ -830,8 +831,16 @@ static int comp_irqs_request(struct mlx5_core_dev *dev)
 		ret = -ENOMEM;
 		goto free_irqs;
 	}
-	for (i = 0; i < ncomp_eqs; i++)
-		cpus[i] = cpumask_local_spread(i, dev->priv.numa_node);
+
+	i = 0;
+	rcu_read_lock();
+	for_each_numa_hop_cpu(cpu, dev->priv.numa_node) {
+		cpus[i] = cpu;
+		if (++i == ncomp_eqs)
+			goto spread_done;
+	}
+spread_done:
+	rcu_read_unlock();
 	ret = mlx5_irqs_request_vectors(dev, cpus, ncomp_eqs, table->comp_irqs);
 	kfree(cpus);
 	if (ret < 0)
--
2.31.1

^ permalink raw reply related	[flat|nested] 25+ messages in thread
* Re: [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface 2022-09-23 13:25 [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface Valentin Schneider ` (7 preceding siblings ...) 2022-09-23 15:55 ` [PATCH v4 7/7] net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints Valentin Schneider @ 2022-09-25 7:48 ` Tariq Toukan 2022-10-18 6:36 ` Tariq Toukan 9 siblings, 0 replies; 25+ messages in thread From: Tariq Toukan @ 2022-09-25 7:48 UTC (permalink / raw) To: Valentin Schneider, netdev, linux-rdma, linux-kernel Cc: Saeed Mahameed, Leon Romanovsky, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Yury Norov, Andy Shevchenko, Rasmus Villemoes, Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Mel Gorman, Greg Kroah-Hartman, Heiko Carstens, Tony Luck, Jonathan Cameron, Gal Pressman, Tariq Toukan, Jesse Brandeburg On 9/23/2022 4:25 PM, Valentin Schneider wrote: > Hi folks, > > Tariq pointed out in [1] that drivers allocating IRQ vectors would benefit > from having smarter NUMA-awareness (cpumask_local_spread() doesn't quite cut > it). > > The proposed interface involved an array of CPUs and a temporary cpumask, and > being my difficult self what I'm proposing here is an interface that doesn't > require any temporary storage other than some stack variables (at the cost of > one wild macro). > > Please note that this is based on top of Yury's bitmap-for-next [2] to leverage > his fancy new FIND_NEXT_BIT() macro. 
> > [1]: https://lore.kernel.org/all/20220728191203.4055-1-tariqt@nvidia.com/ > [2]: https://github.com/norov/linux.git/ -b bitmap-for-next > > A note on treewide use of for_each_cpu_andnot() > =============================================== > > I've used the below coccinelle script to find places that could be patched (I > couldn't figure out the valid syntax to patch from coccinelle itself): > > ,----- > @tmpandnot@ > expression tmpmask; > iterator for_each_cpu; > position p; > statement S; > @@ > cpumask_andnot(tmpmask, ...); > > ... > > ( > for_each_cpu@p(..., tmpmask, ...) > S > | > for_each_cpu@p(..., tmpmask, ...) > { > ... > } > ) > > @script:python depends on tmpandnot@ > p << tmpandnot.p; > @@ > coccilib.report.print_report(p[0], "andnot loop here") > '----- > > Which yields (against c40e8341e3b3): > > .//arch/powerpc/kernel/smp.c:1587:1-13: andnot loop here > .//arch/powerpc/kernel/smp.c:1530:1-13: andnot loop here > .//arch/powerpc/kernel/smp.c:1440:1-13: andnot loop here > .//arch/powerpc/platforms/powernv/subcore.c:306:2-14: andnot loop here > .//arch/x86/kernel/apic/x2apic_cluster.c:62:1-13: andnot loop here > .//drivers/acpi/acpi_pad.c:110:1-13: andnot loop here > .//drivers/cpufreq/armada-8k-cpufreq.c:148:1-13: andnot loop here > .//drivers/cpufreq/powernv-cpufreq.c:931:1-13: andnot loop here > .//drivers/net/ethernet/sfc/efx_channels.c:73:1-13: andnot loop here > .//drivers/net/ethernet/sfc/siena/efx_channels.c:73:1-13: andnot loop here > .//kernel/sched/core.c:345:1-13: andnot loop here > .//kernel/sched/core.c:366:1-13: andnot loop here > .//net/core/dev.c:3058:1-13: andnot loop here > > A lot of those are actually of the shape > > for_each_cpu(cpu, mask) { > ... > cpumask_andnot(mask, ...); > } > > I think *some* of the powerpc ones would be a match for for_each_cpu_andnot(), > but I decided to just stick to the one obvious one in __sched_core_flip(). 
>
> Revisions
> =========
>
> v3 -> v4
> ++++++++
>
> o Rebased on top of Yury's bitmap-for-next
> o Added Tariq's mlx5e patch
> o Made sched_numa_hop_mask() return cpu_online_mask for the NUMA_NO_NODE &&
>   hops=0 case
>
> v2 -> v3
> ++++++++
>
> o Added for_each_cpu_and() and for_each_cpu_andnot() tests (Yury)
> o New patches to fix issues raised by running the above
> o New patch to use for_each_cpu_andnot() in sched/core.c (Yury)
>
> v1 -> v2
> ++++++++
>
> o Split _find_next_bit() @invert into @invert1 and @invert2 (Yury)
> o Rebase onto v6.0-rc1
>
> Cheers,
> Valentin
>
> Tariq Toukan (1):
>   net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity
>     hints
>
> Valentin Schneider (6):
>   lib/find_bit: Introduce find_next_andnot_bit()
>   cpumask: Introduce for_each_cpu_andnot()
>   lib/test_cpumask: Add for_each_cpu_and(not) tests
>   sched/core: Merge cpumask_andnot()+for_each_cpu() into
>     for_each_cpu_andnot()
>   sched/topology: Introduce sched_numa_hop_mask()
>   sched/topology: Introduce for_each_numa_hop_cpu()
>
>  drivers/net/ethernet/mellanox/mlx5/core/eq.c | 13 +++++-
>  include/linux/cpumask.h                      | 39 ++++++++++++++++
>  include/linux/find.h                         | 33 +++++++++++++
>  include/linux/topology.h                     | 49 ++++++++++++++++++++
>  kernel/sched/core.c                          |  5 +-
>  kernel/sched/topology.c                      | 31 +++++++++++++
>  lib/cpumask_kunit.c                          | 19 ++++++++
>  lib/find_bit.c                               |  9 ++++
>  8 files changed, 192 insertions(+), 6 deletions(-)
>
> --
> 2.31.1

Valentin, thank you for investing your time here.

Acked-by: Tariq Toukan <tariqt@nvidia.com>

Tested on my mlx5 environment. It works as expected, including the case
of node == NUMA_NO_NODE.

Regards,
Tariq

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface
  2022-09-23 13:25 [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface Valentin Schneider
                   ` (8 preceding siblings ...)
  2022-09-25  7:48 ` [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface Tariq Toukan
@ 2022-10-18  6:36 ` Tariq Toukan
  2022-10-18 16:50   ` Valentin Schneider
  9 siblings, 1 reply; 25+ messages in thread
From: Tariq Toukan @ 2022-10-18  6:36 UTC (permalink / raw)
  To: Valentin Schneider, netdev, linux-rdma, linux-kernel
  Cc: Saeed Mahameed, Leon Romanovsky, David S. Miller, Eric Dumazet,
      Jakub Kicinski, Paolo Abeni, Yury Norov, Andy Shevchenko,
      Rasmus Villemoes, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
      Dietmar Eggemann, Steven Rostedt, Mel Gorman, Greg Kroah-Hartman,
      Heiko Carstens, Tony Luck, Jonathan Cameron, Gal Pressman,
      Tariq Toukan, Jesse Brandeburg

On 9/23/2022 4:25 PM, Valentin Schneider wrote:
> Hi folks,
>
> Tariq pointed out in [1] that drivers allocating IRQ vectors would benefit
> from having smarter NUMA-awareness (cpumask_local_spread() doesn't quite
> cut it).
>
> [...]
>
> Cheers,
> Valentin

Hi,

What's the status of this?
Do we have agreement on the changes needed for the next respin?

Regards,
Tariq

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface
  2022-10-18  6:36 ` Tariq Toukan
@ 2022-10-18 16:50   ` Valentin Schneider
  0 siblings, 0 replies; 25+ messages in thread
From: Valentin Schneider @ 2022-10-18 16:50 UTC (permalink / raw)
  To: Tariq Toukan, netdev, linux-rdma, linux-kernel
  Cc: Saeed Mahameed, Leon Romanovsky, David S. Miller, Eric Dumazet,
      Jakub Kicinski, Paolo Abeni, Yury Norov, Andy Shevchenko,
      Rasmus Villemoes, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
      Dietmar Eggemann, Steven Rostedt, Mel Gorman, Greg Kroah-Hartman,
      Heiko Carstens, Tony Luck, Jonathan Cameron, Gal Pressman,
      Tariq Toukan, Jesse Brandeburg

On 18/10/22 09:36, Tariq Toukan wrote:
>
> Hi,
>
> What's the status of this?
> Do we have agreement on the changes needed for the next respin?
>

Yep, the bitmap patches are in 6.1-rc1, I need to respin the topology ones
to address Yury's comments. It's in my todolist, I'll get to it soonish.

> Regards,
> Tariq

^ permalink raw reply	[flat|nested] 25+ messages in thread
end of thread, other threads:[~2022-10-18 16:50 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-23 13:25 [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface Valentin Schneider
2022-09-23 13:25 ` [PATCH v4 1/7] lib/find_bit: Introduce find_next_andnot_bit() Valentin Schneider
2022-09-23 15:44   ` [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface Yury Norov
2022-09-23 15:49     ` Valentin Schneider
2022-09-23 15:55 ` [PATCH v4 2/7] cpumask: Introduce for_each_cpu_andnot() Valentin Schneider
2022-09-25 15:23   ` Yury Norov
2022-09-27 16:45     ` Valentin Schneider
2022-09-27 20:02       ` Yury Norov
2022-09-23 15:55 ` [PATCH v4 3/7] lib/test_cpumask: Add for_each_cpu_and(not) tests Valentin Schneider
2022-09-23 15:55 ` [PATCH v4 4/7] sched/core: Merge cpumask_andnot()+for_each_cpu() into for_each_cpu_andnot() Valentin Schneider
2022-09-23 15:55 ` [PATCH v4 5/7] sched/topology: Introduce sched_numa_hop_mask() Valentin Schneider
2022-09-25 15:00   ` Yury Norov
2022-09-25 15:24     ` Yury Norov
2022-09-27 16:45       ` Valentin Schneider
2022-09-27 19:30         ` Yury Norov
2022-09-25 18:05   ` Yury Norov
2022-09-25 18:13     ` Yury Norov
2022-09-27 16:45       ` Valentin Schneider
2022-09-23 15:55 ` [PATCH v4 6/7] sched/topology: Introduce for_each_numa_hop_cpu() Valentin Schneider
2022-09-25 14:58   ` Yury Norov
2022-09-27 16:45     ` Valentin Schneider
2022-09-23 15:55 ` [PATCH v4 7/7] net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints Valentin Schneider
2022-09-25  7:48 ` [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface Tariq Toukan
2022-10-18  6:36 ` Tariq Toukan
2022-10-18 16:50   ` Valentin Schneider