linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/4] cpumask: improve on cpumask_local_spread() locality
@ 2022-11-12 19:09 Yury Norov
  2022-11-12 19:09 ` [PATCH v2 1/4] lib/find: introduce find_nth_and_andnot_bit Yury Norov
                   ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Yury Norov @ 2022-11-12 19:09 UTC (permalink / raw)
  To: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Gal Pressman, Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar,
	Jakub Kicinski, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Mel Gorman,
	Peter Zijlstra, Rasmus Villemoes, Saeed Mahameed, Steven Rostedt,
	Tariq Toukan, Tariq Toukan, Tony Luck, Valentin Schneider,
	Vincent Guittot
  Cc: Yury Norov, linux-crypto, netdev, linux-rdma

cpumask_local_spread() currently checks the local node for presence of the
i'th CPU, and if it finds nothing, makes a flat search among all non-local
CPUs. We can do better by checking CPUs per NUMA hop.

This series is inspired by Tariq Toukan and Valentin Schneider's "net/mlx5e:
Improve remote NUMA preferences used for the IRQ affinity hints"

https://patchwork.kernel.org/project/netdevbpf/patch/20220728191203.4055-3-tariqt@nvidia.com/

According to their measurements, for mlx5e:

        Bottleneck in RX side is released, reached linerate (~1.8x speedup).
        ~30% less cpu util on TX.

This series makes cpumask_local_spread() traverse CPUs based on NUMA
distance, just as the mlx5e series does, and I expect a comparable
improvement for its users.

I tested new behavior on my VM with the following NUMA configuration:

root@debian:~# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3
node 0 size: 3869 MB
node 0 free: 3740 MB
node 1 cpus: 4 5
node 1 size: 1969 MB
node 1 free: 1937 MB
node 2 cpus: 6 7
node 2 size: 1967 MB
node 2 free: 1873 MB
node 3 cpus: 8 9 10 11 12 13 14 15
node 3 size: 7842 MB
node 3 free: 7723 MB
node distances:
node   0   1   2   3
  0:  10  50  30  70
  1:  50  10  70  30
  2:  30  70  10  50
  3:  70  30  50  10

And the cpumask_local_spread() traversal for each node and offset looks
like this:

node 0:   0   1   2   3   6   7   4   5   8   9  10  11  12  13  14  15
node 1:   4   5   8   9  10  11  12  13  14  15   0   1   2   3   6   7
node 2:   6   7   0   1   2   3   8   9  10  11  12  13  14  15   4   5
node 3:   8   9  10  11  12  13  14  15   4   5   6   7   0   1   2   3
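For reference, a typical consumer pattern that benefits from this is
spreading IRQ affinity hints over a device's vectors. A hypothetical
driver sketch ('nvecs', 'irqs' and 'dev' are made-up names, not taken
from this series):

	unsigned int i, cpu;

	for (i = 0; i < nvecs; i++) {
		/* i'th closest online CPU to the device's home node */
		cpu = cpumask_local_spread(i, dev_to_node(dev));
		irq_set_affinity_hint(irqs[i], cpumask_of(cpu));
	}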

v1: https://lore.kernel.org/lkml/20221111040027.621646-5-yury.norov@gmail.com/T/
v2: 
 - use bsearch() in sched_numa_find_nth_cpu();
 - fix missing 'static inline' in 3rd patch.

Yury Norov (4):
  lib/find: introduce find_nth_and_andnot_bit
  cpumask: introduce cpumask_nth_and_andnot
  sched: add sched_numa_find_nth_cpu()
  cpumask: improve on cpumask_local_spread() locality

 include/linux/cpumask.h  | 20 +++++++++++++++
 include/linux/find.h     | 33 ++++++++++++++++++++++++
 include/linux/topology.h |  8 ++++++
 kernel/sched/topology.c  | 55 ++++++++++++++++++++++++++++++++++++++++
 lib/cpumask.c            | 12 ++-------
 lib/find_bit.c           |  9 +++++++
 6 files changed, 127 insertions(+), 10 deletions(-)

-- 
2.34.1



* [PATCH v2 1/4] lib/find: introduce find_nth_and_andnot_bit
  2022-11-12 19:09 [PATCH v2 0/4] cpumask: improve on cpumask_local_spread() locality Yury Norov
@ 2022-11-12 19:09 ` Yury Norov
  2022-11-12 19:09 ` [PATCH v2 2/4] cpumask: introduce cpumask_nth_and_andnot Yury Norov
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Yury Norov @ 2022-11-12 19:09 UTC (permalink / raw)
  To: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Gal Pressman, Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar,
	Jakub Kicinski, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Mel Gorman,
	Peter Zijlstra, Rasmus Villemoes, Saeed Mahameed, Steven Rostedt,
	Tariq Toukan, Tariq Toukan, Tony Luck, Valentin Schneider,
	Vincent Guittot
  Cc: Yury Norov, linux-crypto, netdev, linux-rdma

The function is used in the following patches to implement in-place bitmap
traversing without storing intermediate results in temporary bitmaps.
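To illustrate the intended semantics with example values (not from the
patch itself): if *addr1 == 0b1111, *addr2 == 0b1110 and *addr3 == 0b0100,
the bits set in (*addr1 & *addr2 & ~*addr3) are {1, 3}, so:

	find_nth_and_andnot_bit(addr1, addr2, addr3, 4, 0);	/* returns 1 */
	find_nth_and_andnot_bit(addr1, addr2, addr3, 4, 1);	/* returns 3 */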

Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 include/linux/find.h | 33 +++++++++++++++++++++++++++++++++
 lib/find_bit.c       |  9 +++++++++
 2 files changed, 42 insertions(+)

diff --git a/include/linux/find.h b/include/linux/find.h
index ccaf61a0f5fd..7a97b492b447 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -22,6 +22,9 @@ unsigned long __find_nth_and_bit(const unsigned long *addr1, const unsigned long
 				unsigned long size, unsigned long n);
 unsigned long __find_nth_andnot_bit(const unsigned long *addr1, const unsigned long *addr2,
 					unsigned long size, unsigned long n);
+unsigned long __find_nth_and_andnot_bit(const unsigned long *addr1, const unsigned long *addr2,
+					const unsigned long *addr3, unsigned long size,
+					unsigned long n);
 extern unsigned long _find_first_and_bit(const unsigned long *addr1,
 					 const unsigned long *addr2, unsigned long size);
 extern unsigned long _find_first_zero_bit(const unsigned long *addr, unsigned long size);
@@ -255,6 +258,36 @@ unsigned long find_nth_andnot_bit(const unsigned long *addr1, const unsigned lon
 	return __find_nth_andnot_bit(addr1, addr2, size, n);
 }
 
+/**
+ * find_nth_and_andnot_bit - find N'th set bit in 2 memory regions,
+ *			     excluding those set in 3rd region
+ * @addr1: The 1st address to start the search at
+ * @addr2: The 2nd address to start the search at
+ * @addr3: The 3rd address to start the search at
+ * @size: The maximum number of bits to search
+ * @n: The index of the set bit to find, counting from 0
+ *
+ * Returns the bit number of the N'th set bit.
+ * If no such bit exists, returns @size.
+ */
+static inline
+unsigned long find_nth_and_andnot_bit(const unsigned long *addr1,
+					const unsigned long *addr2,
+					const unsigned long *addr3,
+					unsigned long size, unsigned long n)
+{
+	if (n >= size)
+		return size;
+
+	if (small_const_nbits(size)) {
+		unsigned long val = *addr1 & *addr2 & ~*addr3 & GENMASK(size - 1, 0);
+
+		return val ? fns(val, n) : size;
+	}
+
+	return __find_nth_and_andnot_bit(addr1, addr2, addr3, size, n);
+}
+
 #ifndef find_first_and_bit
 /**
  * find_first_and_bit - find the first set bit in both memory regions
diff --git a/lib/find_bit.c b/lib/find_bit.c
index 18bc0a7ac8ee..c10920e66788 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -155,6 +155,15 @@ unsigned long __find_nth_andnot_bit(const unsigned long *addr1, const unsigned l
 }
 EXPORT_SYMBOL(__find_nth_andnot_bit);
 
+unsigned long __find_nth_and_andnot_bit(const unsigned long *addr1,
+					const unsigned long *addr2,
+					const unsigned long *addr3,
+					unsigned long size, unsigned long n)
+{
+	return FIND_NTH_BIT(addr1[idx] & addr2[idx] & ~addr3[idx], size, n);
+}
+EXPORT_SYMBOL(__find_nth_and_andnot_bit);
+
 #ifndef find_next_and_bit
 unsigned long _find_next_and_bit(const unsigned long *addr1, const unsigned long *addr2,
 					unsigned long nbits, unsigned long start)
-- 
2.34.1



* [PATCH v2 2/4] cpumask: introduce cpumask_nth_and_andnot
  2022-11-12 19:09 [PATCH v2 0/4] cpumask: improve on cpumask_local_spread() locality Yury Norov
  2022-11-12 19:09 ` [PATCH v2 1/4] lib/find: introduce find_nth_and_andnot_bit Yury Norov
@ 2022-11-12 19:09 ` Yury Norov
  2022-11-12 19:09 ` [PATCH v2 3/4] sched: add sched_numa_find_nth_cpu() Yury Norov
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Yury Norov @ 2022-11-12 19:09 UTC (permalink / raw)
  To: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Gal Pressman, Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar,
	Jakub Kicinski, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Mel Gorman,
	Peter Zijlstra, Rasmus Villemoes, Saeed Mahameed, Steven Rostedt,
	Tariq Toukan, Tariq Toukan, Tony Luck, Valentin Schneider,
	Vincent Guittot
  Cc: Yury Norov, linux-crypto, netdev, linux-rdma

Introduce cpumask_nth_and_andnot() based on find_nth_and_andnot_bit().
It's used in the following patch to traverse cpumasks without storing
intermediate results in a temporary cpumask.
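Usage sketch (illustrative only; 'hop_mask' and 'prev_mask' are made-up
names for two consecutive per-node NUMA masks):

	/* 2nd online CPU that is set in hop_mask but clear in prev_mask */
	cpu = cpumask_nth_and_andnot(1, cpu_online_mask, hop_mask, prev_mask);
	if (cpu >= nr_cpu_ids)
		return -ENOENT;	/* no such CPU */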

Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 include/linux/cpumask.h | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index c2aa0aa26b45..debfa2261569 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -391,6 +391,26 @@ unsigned int cpumask_nth_andnot(unsigned int cpu, const struct cpumask *srcp1,
 				nr_cpumask_bits, cpumask_check(cpu));
 }
 
+/**
+ * cpumask_nth_and_andnot - get the Nth cpu set in both 1st and 2nd cpumasks, and clear in 3rd.
+ * @srcp1: the cpumask pointer
+ * @srcp2: the cpumask pointer
+ * @srcp3: the cpumask pointer
+ * @cpu: the N'th cpu to find, starting from 0
+ *
+ * Returns >= nr_cpu_ids if such cpu doesn't exist.
+ */
+static inline
+unsigned int cpumask_nth_and_andnot(unsigned int cpu, const struct cpumask *srcp1,
+							const struct cpumask *srcp2,
+							const struct cpumask *srcp3)
+{
+	return find_nth_and_andnot_bit(cpumask_bits(srcp1),
+					cpumask_bits(srcp2),
+					cpumask_bits(srcp3),
+					nr_cpumask_bits, cpumask_check(cpu));
+}
+
 #define CPU_BITS_NONE						\
 {								\
 	[0 ... BITS_TO_LONGS(NR_CPUS)-1] = 0UL			\
-- 
2.34.1



* [PATCH v2 3/4] sched: add sched_numa_find_nth_cpu()
  2022-11-12 19:09 [PATCH v2 0/4] cpumask: improve on cpumask_local_spread() locality Yury Norov
  2022-11-12 19:09 ` [PATCH v2 1/4] lib/find: introduce find_nth_and_andnot_bit Yury Norov
  2022-11-12 19:09 ` [PATCH v2 2/4] cpumask: introduce cpumask_nth_and_andnot Yury Norov
@ 2022-11-12 19:09 ` Yury Norov
  2022-11-14 14:32   ` Andy Shevchenko
  2022-11-15 17:25   ` Valentin Schneider
  2022-11-12 19:09 ` [PATCH v2 4/4] cpumask: improve on cpumask_local_spread() locality Yury Norov
  2022-11-15 17:24 ` [PATCH v2 0/4] " Valentin Schneider
  4 siblings, 2 replies; 16+ messages in thread
From: Yury Norov @ 2022-11-12 19:09 UTC (permalink / raw)
  To: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Gal Pressman, Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar,
	Jakub Kicinski, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Mel Gorman,
	Peter Zijlstra, Rasmus Villemoes, Saeed Mahameed, Steven Rostedt,
	Tariq Toukan, Tariq Toukan, Tony Luck, Valentin Schneider,
	Vincent Guittot
  Cc: Yury Norov, linux-crypto, netdev, linux-rdma

The function finds the Nth set CPU in a given cpumask, starting from a given
node.

Leveraging the fact that each hop in sched_domains_numa_masks includes the
same or greater number of CPUs than the previous one, we can use binary
search on hops instead of a linear walk, which makes the overall complexity
O(log n) in terms of the number of cpumask_weight() calls.
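For reference, the bsearch is equivalent to the following linear walk
(a sketch, with locking and bounds checks omitted):

	/* find the first hop whose cumulative CPU count exceeds @cpu */
	for (hop = 0; hop < sched_domains_numa_levels; hop++) {
		if (cpumask_weight_and(cpus, masks[hop][node]) > cpu)
			break;
	}

but takes O(log(nr_hops)) cpumask_weight_and() calls instead of O(nr_hops).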

Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 include/linux/topology.h |  8 ++++++
 kernel/sched/topology.c  | 55 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 63 insertions(+)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 4564faafd0e1..b2e87728caea 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -245,5 +245,13 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
 	return cpumask_of_node(cpu_to_node(cpu));
 }
 
+#ifdef CONFIG_NUMA
+int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node);
+#else
+static inline int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
+{
+	return cpumask_nth(cpu, cpus);
+}
+#endif	/* CONFIG_NUMA */
 
 #endif /* _LINUX_TOPOLOGY_H */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8739c2a5a54e..024f1da0e941 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1764,6 +1764,8 @@ bool find_numa_distance(int distance)
  *   there is an intermediary node C, which is < N hops away from both
  *   nodes A and B, the system is a glueless mesh.
  */
+#include <linux/bsearch.h>
+
 static void init_numa_topology_type(int offline_node)
 {
 	int a, b, c, n;
@@ -2067,6 +2069,59 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
 	return found;
 }
 
+struct __cmp_key {
+	const struct cpumask *cpus;
+	struct cpumask ***masks;
+	int node;
+	int cpu;
+	int w;
+};
+
+static int cmp(const void *a, const void *b)
+{
+	struct cpumask **prev_hop = *((struct cpumask ***)b - 1);
+	struct cpumask **cur_hop = *(struct cpumask ***)b;
+	struct __cmp_key *k = (struct __cmp_key *)a;
+
+	if (cpumask_weight_and(k->cpus, cur_hop[k->node]) <= k->cpu)
+		return 1;
+
+	k->w = (b == k->masks) ? 0 : cpumask_weight_and(k->cpus, prev_hop[k->node]);
+	if (k->w <= k->cpu)
+		return 0;
+
+	return -1;
+}
+
+/**
+ * sched_numa_find_nth_cpu() - given the NUMA topology, find the Nth
+ *                             closest CPU to @node out of @cpus.
+ * @cpus: cpumask to find a cpu from
+ * @cpu: Nth cpu to find, counting from 0
+ * @node: the node to start searching from
+ * Return: cpu, or nr_cpu_ids when nothing is found.
+ */
+int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
+{
+	struct __cmp_key k = { cpus, NULL, node, cpu, 0 };
+	int hop, ret = nr_cpu_ids;
+
+	rcu_read_lock();
+	k.masks = rcu_dereference(sched_domains_numa_masks);
+	if (!k.masks)
+		goto unlock;
+
+	hop = (struct cpumask ***)
+		bsearch(&k, k.masks, sched_domains_numa_levels, sizeof(k.masks[0]), cmp) - k.masks;
+
+	ret = hop ?
+		cpumask_nth_and_andnot(cpu - k.w, cpus, k.masks[hop][node], k.masks[hop-1][node]) :
+		cpumask_nth_and(cpu - k.w, cpus, k.masks[0][node]);
+unlock:
+	rcu_read_unlock();
+	return ret;
+}
+EXPORT_SYMBOL_GPL(sched_numa_find_nth_cpu);
 #endif /* CONFIG_NUMA */
 
 static int __sdt_alloc(const struct cpumask *cpu_map)
-- 
2.34.1



* [PATCH v2 4/4] cpumask: improve on cpumask_local_spread() locality
  2022-11-12 19:09 [PATCH v2 0/4] cpumask: improve on cpumask_local_spread() locality Yury Norov
                   ` (2 preceding siblings ...)
  2022-11-12 19:09 ` [PATCH v2 3/4] sched: add sched_numa_find_nth_cpu() Yury Norov
@ 2022-11-12 19:09 ` Yury Norov
  2022-11-15 17:24 ` [PATCH v2 0/4] " Valentin Schneider
  4 siblings, 0 replies; 16+ messages in thread
From: Yury Norov @ 2022-11-12 19:09 UTC (permalink / raw)
  To: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Gal Pressman, Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar,
	Jakub Kicinski, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Mel Gorman,
	Peter Zijlstra, Rasmus Villemoes, Saeed Mahameed, Steven Rostedt,
	Tariq Toukan, Tariq Toukan, Tony Luck, Valentin Schneider,
	Vincent Guittot
  Cc: Yury Norov, linux-crypto, netdev, linux-rdma

Switch cpumask_local_spread() to use newly added sched_numa_find_nth_cpu(),
which takes into account distances to each node in the system.

For the following NUMA configuration:

root@debian:~# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3
node 0 size: 3869 MB
node 0 free: 3740 MB
node 1 cpus: 4 5
node 1 size: 1969 MB
node 1 free: 1937 MB
node 2 cpus: 6 7
node 2 size: 1967 MB
node 2 free: 1873 MB
node 3 cpus: 8 9 10 11 12 13 14 15
node 3 size: 7842 MB
node 3 free: 7723 MB
node distances:
node   0   1   2   3
  0:  10  50  30  70
  1:  50  10  70  30
  2:  30  70  10  50
  3:  70  30  50  10

The new cpumask_local_spread() traverses cpus for each node like this:

node 0:   0   1   2   3   6   7   4   5   8   9  10  11  12  13  14  15
node 1:   4   5   8   9  10  11  12  13  14  15   0   1   2   3   6   7
node 2:   6   7   0   1   2   3   8   9  10  11  12  13  14  15   4   5
node 3:   8   9  10  11  12  13  14  15   4   5   6   7   0   1   2   3
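For comparison, the pre-patch behavior (local node first, then a flat
walk over the rest) produces for node 1:

node 1:   4   5   0   1   2   3   6   7   8   9  10  11  12  13  14  15

ignoring that CPUs 8-15 (distance 30) are closer to node 1 than CPUs 0-3
(distance 50) and CPUs 6-7 (distance 70).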

Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 lib/cpumask.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/lib/cpumask.c b/lib/cpumask.c
index c7c392514fd3..255974cd6734 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -110,7 +110,7 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
 #endif
 
 /**
- * cpumask_local_spread - select the i'th cpu with local numa cpu's first
+ * cpumask_local_spread - select the i'th cpu based on NUMA distances
  * @i: index number
  * @node: local numa_node
  *
@@ -132,15 +132,7 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
 		if (cpu < nr_cpu_ids)
 			return cpu;
 	} else {
-		/* NUMA first. */
-		cpu = cpumask_nth_and(i, cpu_online_mask, cpumask_of_node(node));
-		if (cpu < nr_cpu_ids)
-			return cpu;
-
-		i -= cpumask_weight_and(cpu_online_mask, cpumask_of_node(node));
-
-		/* Skip NUMA nodes, done above. */
-		cpu = cpumask_nth_andnot(i, cpu_online_mask, cpumask_of_node(node));
+		cpu = sched_numa_find_nth_cpu(cpu_online_mask, i, node);
 		if (cpu < nr_cpu_ids)
 			return cpu;
 	}
-- 
2.34.1



* Re: [PATCH v2 3/4] sched: add sched_numa_find_nth_cpu()
  2022-11-12 19:09 ` [PATCH v2 3/4] sched: add sched_numa_find_nth_cpu() Yury Norov
@ 2022-11-14 14:32   ` Andy Shevchenko
  2022-11-14 15:02     ` Andy Shevchenko
  2022-12-08  2:55     ` Yury Norov
  2022-11-15 17:25   ` Valentin Schneider
  1 sibling, 2 replies; 16+ messages in thread
From: Andy Shevchenko @ 2022-11-14 14:32 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, David S. Miller, Barry Song, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Gal Pressman,
	Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar, Jakub Kicinski,
	Jason Gunthorpe, Jesse Brandeburg, Jonathan Cameron, Juri Lelli,
	Leon Romanovsky, Mel Gorman, Peter Zijlstra, Rasmus Villemoes,
	Saeed Mahameed, Steven Rostedt, Tariq Toukan, Tariq Toukan,
	Tony Luck, Valentin Schneider, Vincent Guittot, linux-crypto,
	netdev, linux-rdma

On Sat, Nov 12, 2022 at 11:09:45AM -0800, Yury Norov wrote:
> The function finds the Nth set CPU in a given cpumask, starting from a given
> node.
> 
> Leveraging the fact that each hop in sched_domains_numa_masks includes the
> same or greater number of CPUs than the previous one, we can use binary
> search on hops instead of a linear walk, which makes the overall complexity
> O(log n) in terms of the number of cpumask_weight() calls.

...

> +struct __cmp_key {
> +	const struct cpumask *cpus;
> +	struct cpumask ***masks;
> +	int node;
> +	int cpu;
> +	int w;
> +};
> +
> +static int cmp(const void *a, const void *b)

Calling them key and pivot (as in the caller) would make more sense.

> +{

What about

	const (?) struct cpumask ***masks = (...)pivot;

> +	struct cpumask **prev_hop = *((struct cpumask ***)b - 1);

	= masks[-1];

> +	struct cpumask **cur_hop = *(struct cpumask ***)b;

	= masks[0];

?

> +	struct __cmp_key *k = (struct __cmp_key *)a;

> +	if (cpumask_weight_and(k->cpus, cur_hop[k->node]) <= k->cpu)
> +		return 1;

> +	k->w = (b == k->masks) ? 0 : cpumask_weight_and(k->cpus, prev_hop[k->node]);
> +	if (k->w <= k->cpu)
> +		return 0;

Can k->cpu be negative? If no, we can rewrite above as

	k->w = 0;
	if (b == k->masks)
		return 0;

	k->w = cpumask_weight_and(k->cpus, prev_hop[k->node]);

> +	return -1;
> +}

...

> +int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
> +{
> +	struct __cmp_key k = { cpus, NULL, node, cpu, 0 };

You can drop NULL and 0 while using C99 assignments.

> +	int hop, ret = nr_cpu_ids;

> +	rcu_read_lock();

+ Blank line?

> +	k.masks = rcu_dereference(sched_domains_numa_masks);
> +	if (!k.masks)
> +		goto unlock;

> +	hop = (struct cpumask ***)
> +		bsearch(&k, k.masks, sched_domains_numa_levels, sizeof(k.masks[0]), cmp) - k.masks;

Strange indentation. I would rather see the split on parameters and
maybe '-' operator.

sizeof(*k.masks) is a bit shorter, right?

Also we may go with


	struct cpumask ***masks;
	struct __cmp_key k = { .cpus = cpus, .node = node, .cpu = cpu };



> +	ret = hop ?
> +		cpumask_nth_and_andnot(cpu - k.w, cpus, k.masks[hop][node], k.masks[hop-1][node]) :
> +		cpumask_nth_and(cpu - k.w, cpus, k.masks[0][node]);

> +unlock:

out_unlock: shows the intention more clearly, no?

> +	rcu_read_unlock();
> +	return ret;
> +}

-- 
With Best Regards,
Andy Shevchenko




* Re: [PATCH v2 3/4] sched: add sched_numa_find_nth_cpu()
  2022-11-14 14:32   ` Andy Shevchenko
@ 2022-11-14 15:02     ` Andy Shevchenko
  2022-12-08  2:55     ` Yury Norov
  1 sibling, 0 replies; 16+ messages in thread
From: Andy Shevchenko @ 2022-11-14 15:02 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, David S. Miller, Barry Song, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Gal Pressman,
	Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar, Jakub Kicinski,
	Jason Gunthorpe, Jesse Brandeburg, Jonathan Cameron, Juri Lelli,
	Leon Romanovsky, Mel Gorman, Peter Zijlstra, Rasmus Villemoes,
	Saeed Mahameed, Steven Rostedt, Tariq Toukan, Tariq Toukan,
	Tony Luck, Valentin Schneider, Vincent Guittot, linux-crypto,
	netdev, linux-rdma

On Mon, Nov 14, 2022 at 04:32:10PM +0200, Andy Shevchenko wrote:
> On Sat, Nov 12, 2022 at 11:09:45AM -0800, Yury Norov wrote:
> > The function finds the Nth set CPU in a given cpumask, starting from a given
> > node.
> > 
> > Leveraging the fact that each hop in sched_domains_numa_masks includes the
> > same or greater number of CPUs than the previous one, we can use binary
> > search on hops instead of a linear walk, which makes the overall complexity
> > O(log n) in terms of the number of cpumask_weight() calls.
> 
> ...
> 
> > +struct __cmp_key {
> > +	const struct cpumask *cpus;
> > +	struct cpumask ***masks;
> > +	int node;
> > +	int cpu;
> > +	int w;
> > +};
> > +
> > +static int cmp(const void *a, const void *b)
> 
> Calling them key and pivot (as in the caller) would make more sense.
> 
> > +{
> 
> What about
> 
> 	const (?) struct cpumask ***masks = (...)pivot;
> 
> > +	struct cpumask **prev_hop = *((struct cpumask ***)b - 1);
> 
> 	= masks[-1];
> 
> > +	struct cpumask **cur_hop = *(struct cpumask ***)b;
> 
> 	= masks[0];
> 
> ?
> 
> > +	struct __cmp_key *k = (struct __cmp_key *)a;
> 
> > +	if (cpumask_weight_and(k->cpus, cur_hop[k->node]) <= k->cpu)
> > +		return 1;
> 
> > +	k->w = (b == k->masks) ? 0 : cpumask_weight_and(k->cpus, prev_hop[k->node]);
> > +	if (k->w <= k->cpu)
> > +		return 0;
> 
> Can k->cpu be negative? If no, we can rewrite above as
> 
> 	k->w = 0;
> 	if (b == k->masks)
> 		return 0;
> 
> 	k->w = cpumask_weight_and(k->cpus, prev_hop[k->node]);
> 
> > +	return -1;
> > +}
> 
> ...
> 
> > +int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
> > +{
> > +	struct __cmp_key k = { cpus, NULL, node, cpu, 0 };
> 
> You can drop NULL and 0 while using C99 assignments.
> 
> > +	int hop, ret = nr_cpu_ids;
> 
> > +	rcu_read_lock();
> 
> + Blank line?
> 
> > +	k.masks = rcu_dereference(sched_domains_numa_masks);
> > +	if (!k.masks)
> > +		goto unlock;
> 
> > +	hop = (struct cpumask ***)
> > +		bsearch(&k, k.masks, sched_domains_numa_levels, sizeof(k.masks[0]), cmp) - k.masks;
> 
> Strange indentation. I would rather see the split on parameters and
> maybe '-' operator.
> 
> sizeof(*k.masks) is a bit shorter, right?
> 
> Also we may go with
> 
> 
> 	struct cpumask ***masks;
> 	struct __cmp_key k = { .cpus = cpus, .node = node, .cpu = cpu };
> 
> 
> 
> > +	ret = hop ?
> > +		cpumask_nth_and_andnot(cpu - k.w, cpus, k.masks[hop][node], k.masks[hop-1][node]) :
> > +		cpumask_nth_and(cpu - k.w, cpus, k.masks[0][node]);
> 
> > +unlock:
> 
> out_unlock: shows the intention more clearly, no?
> 
> > +	rcu_read_unlock();
> > +	return ret;
> > +}

Below is a diff I have got on top of your patch, only compile tested:

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 024f1da0e941..e04262578b52 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2070,26 +2070,28 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
 }
 
 struct __cmp_key {
-	const struct cpumask *cpus;
 	struct cpumask ***masks;
+	const struct cpumask *cpus;
 	int node;
 	int cpu;
 	int w;
 };
 
-static int cmp(const void *a, const void *b)
+static int cmp(const void *key, const void *pivot)
 {
-	struct cpumask **prev_hop = *((struct cpumask ***)b - 1);
-	struct cpumask **cur_hop = *(struct cpumask ***)b;
-	struct __cmp_key *k = (struct __cmp_key *)a;
+	struct __cmp_key *k = container_of(key, struct __cmp_key, masks);
+	const struct cpumask ***masks = (const struct cpumask ***)pivot;
+	const struct cpumask **prev = masks[-1];
+	const struct cpumask **cur = masks[0];
 
-	if (cpumask_weight_and(k->cpus, cur_hop[k->node]) <= k->cpu)
+	if (cpumask_weight_and(k->cpus, cur[k->node]) <= k->cpu)
 		return 1;
 
-	k->w = (b == k->masks) ? 0 : cpumask_weight_and(k->cpus, prev_hop[k->node]);
-	if (k->w <= k->cpu)
+	k->w = 0;
+	if (masks == (const struct cpumask ***)k->masks)
 		return 0;
 
+	k->w = cpumask_weight_and(k->cpus, prev[k->node]);
 	return -1;
 }
 
@@ -2103,17 +2105,17 @@ static int cmp(const void *a, const void *b)
  */
 int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
 {
-	struct __cmp_key k = { cpus, NULL, node, cpu, 0 };
+	struct __cmp_key k = { .cpus = cpus, .node = node, .cpu = cpu };
 	int hop, ret = nr_cpu_ids;
+	struct cpumask ***masks;
 
 	rcu_read_lock();
 	k.masks = rcu_dereference(sched_domains_numa_masks);
 	if (!k.masks)
 		goto unlock;
 
-	hop = (struct cpumask ***)
-		bsearch(&k, k.masks, sched_domains_numa_levels, sizeof(k.masks[0]), cmp) - k.masks;
-
+	masks = bsearch(&k.masks, k.masks, sched_domains_numa_levels, sizeof(*k.masks), cmp);
+	hop = masks - k.masks;
 	ret = hop ?
 		cpumask_nth_and_andnot(cpu - k.w, cpus, k.masks[hop][node], k.masks[hop-1][node]) :
 		cpumask_nth_and(cpu - k.w, cpus, k.masks[0][node]);

-- 
With Best Regards,
Andy Shevchenko




* Re: [PATCH v2 0/4] cpumask: improve on cpumask_local_spread() locality
  2022-11-12 19:09 [PATCH v2 0/4] cpumask: improve on cpumask_local_spread() locality Yury Norov
                   ` (3 preceding siblings ...)
  2022-11-12 19:09 ` [PATCH v2 4/4] cpumask: improve on cpumask_local_spread() locality Yury Norov
@ 2022-11-15 17:24 ` Valentin Schneider
  2022-11-15 18:32   ` Yury Norov
  4 siblings, 1 reply; 16+ messages in thread
From: Valentin Schneider @ 2022-11-15 17:24 UTC (permalink / raw)
  To: Yury Norov, linux-kernel, David S. Miller, Andy Shevchenko,
	Barry Song, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Gal Pressman, Greg Kroah-Hartman,
	Heiko Carstens, Ingo Molnar, Jakub Kicinski, Jason Gunthorpe,
	Jesse Brandeburg, Jonathan Cameron, Juri Lelli, Leon Romanovsky,
	Mel Gorman, Peter Zijlstra, Rasmus Villemoes, Saeed Mahameed,
	Steven Rostedt, Tariq Toukan, Tariq Toukan, Tony Luck,
	Vincent Guittot
  Cc: Yury Norov, linux-crypto, netdev, linux-rdma

Hi,

On 12/11/22 11:09, Yury Norov wrote:
> cpumask_local_spread() currently checks the local node for presence of the
> i'th CPU, and if it finds nothing, makes a flat search among all non-local
> CPUs. We can do better by checking CPUs per NUMA hop.
>
> This series is inspired by Tariq Toukan and Valentin Schneider's "net/mlx5e:
> Improve remote NUMA preferences used for the IRQ affinity hints"
>
> https://patchwork.kernel.org/project/netdevbpf/patch/20220728191203.4055-3-tariqt@nvidia.com/
>
> According to their measurements, for mlx5e:
>
>         Bottleneck in RX side is released, reached linerate (~1.8x speedup).
>         ~30% less cpu util on TX.
>
> This series makes cpumask_local_spread() traverse CPUs based on NUMA
> distance, just as the mlx5e series does, and I expect a comparable
> improvement for its users.
>
> I tested new behavior on my VM with the following NUMA configuration:
>
> root@debian:~# numactl -H
> available: 4 nodes (0-3)
> node 0 cpus: 0 1 2 3
> node 0 size: 3869 MB
> node 0 free: 3740 MB
> node 1 cpus: 4 5
> node 1 size: 1969 MB
> node 1 free: 1937 MB
> node 2 cpus: 6 7
> node 2 size: 1967 MB
> node 2 free: 1873 MB
> node 3 cpus: 8 9 10 11 12 13 14 15
> node 3 size: 7842 MB
> node 3 free: 7723 MB
> node distances:
> node   0   1   2   3
>   0:  10  50  30  70
>   1:  50  10  70  30
>   2:  30  70  10  50
>   3:  70  30  50  10
>
> And the cpumask_local_spread() traversal for each node and offset looks
> like this:
>
> node 0:   0   1   2   3   6   7   4   5   8   9  10  11  12  13  14  15
> node 1:   4   5   8   9  10  11  12  13  14  15   0   1   2   3   6   7
> node 2:   6   7   0   1   2   3   8   9  10  11  12  13  14  15   4   5
> node 3:   8   9  10  11  12  13  14  15   4   5   6   7   0   1   2   3
>

Is this meant as a replacement for [1]?

I like that this is changing an existing interface so that all current
users directly benefit from the change. Now, about half of the users of
cpumask_local_spread() use it in a loop with incremental @i parameter,
which makes the repeated bsearch a bit of a shame, but then I'm tempted to
say the first point makes it worth it.

[1]: https://lore.kernel.org/all/20221028164959.1367250-1-vschneid@redhat.com/
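The pattern in question, schematically ('nvecs' is a stand-in name):

	for (i = 0; i < nvecs; i++)
		cpu = cpumask_local_spread(i, node);	/* redoes the bsearch */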

> v1: https://lore.kernel.org/lkml/20221111040027.621646-5-yury.norov@gmail.com/T/
> v2:
>  - use bsearch() in sched_numa_find_nth_cpu();
>  - fix missing 'static inline' in 3rd patch.
>
> Yury Norov (4):
>   lib/find: introduce find_nth_and_andnot_bit
>   cpumask: introduce cpumask_nth_and_andnot
>   sched: add sched_numa_find_nth_cpu()
>   cpumask: improve on cpumask_local_spread() locality
>
>  include/linux/cpumask.h  | 20 +++++++++++++++
>  include/linux/find.h     | 33 ++++++++++++++++++++++++
>  include/linux/topology.h |  8 ++++++
>  kernel/sched/topology.c  | 55 ++++++++++++++++++++++++++++++++++++++++
>  lib/cpumask.c            | 12 ++-------
>  lib/find_bit.c           |  9 +++++++
>  6 files changed, 127 insertions(+), 10 deletions(-)
>
> --
> 2.34.1



* Re: [PATCH v2 3/4] sched: add sched_numa_find_nth_cpu()
  2022-11-12 19:09 ` [PATCH v2 3/4] sched: add sched_numa_find_nth_cpu() Yury Norov
  2022-11-14 14:32   ` Andy Shevchenko
@ 2022-11-15 17:25   ` Valentin Schneider
  1 sibling, 0 replies; 16+ messages in thread
From: Valentin Schneider @ 2022-11-15 17:25 UTC (permalink / raw)
  To: Yury Norov, linux-kernel, David S. Miller, Andy Shevchenko,
	Barry Song, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Gal Pressman, Greg Kroah-Hartman,
	Heiko Carstens, Ingo Molnar, Jakub Kicinski, Jason Gunthorpe,
	Jesse Brandeburg, Jonathan Cameron, Juri Lelli, Leon Romanovsky,
	Mel Gorman, Peter Zijlstra, Rasmus Villemoes, Saeed Mahameed,
	Steven Rostedt, Tariq Toukan, Tariq Toukan, Tony Luck,
	Vincent Guittot
  Cc: Yury Norov, linux-crypto, netdev, linux-rdma

On 12/11/22 11:09, Yury Norov wrote:
> The function finds the Nth set CPU in a given cpumask, starting from a given
> node.
> 
> Leveraging the fact that each hop in sched_domains_numa_masks includes the
> same or greater number of CPUs than the previous one, we can use binary
> search on hops instead of a linear walk, which makes the overall complexity
> O(log n) in terms of the number of cpumask_weight() calls.
>

So one thing regarding the bsearch and NUMA levels; until not so long ago
we couldn't even support 3 hops [1], and this only got detected when such
machines started showing up.

Your bsearch here operates on NUMA levels, which represent hops, and so far
we know of systems that have up to 4 levels. I'd be surprised (and also
appalled) if we even doubled that in the next decade, so with that in mind,
a linear walk might not be so horrible.

[1]: https://lore.kernel.org/all/20210224030944.15232-1-song.bao.hua@hisilicon.com/


> Signed-off-by: Yury Norov <yury.norov@gmail.com>
> ---
> +int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
> +{
> +	struct __cmp_key k = { cpus, NULL, node, cpu, 0 };
> +	int hop, ret = nr_cpu_ids;
> +
> +	rcu_read_lock();
> +	k.masks = rcu_dereference(sched_domains_numa_masks);
> +	if (!k.masks)
> +		goto unlock;
> +
> +	hop = (struct cpumask ***)
> +		bsearch(&k, k.masks, sched_domains_numa_levels, sizeof(k.masks[0]), cmp) - k.masks;
> +
> +	ret = hop ?
> +		cpumask_nth_and_andnot(cpu - k.w, cpus, k.masks[hop][node], k.masks[hop-1][node]) :
> +		cpumask_nth_and(cpu - k.w, cpus, k.masks[0][node]);
                                      ^^^
                  wouldn't this always be 0 here?

> +unlock:
> +	rcu_read_unlock();
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(sched_numa_find_nth_cpu);
>  #endif /* CONFIG_NUMA */
>
>  static int __sdt_alloc(const struct cpumask *cpu_map)
> --
> 2.34.1



* Re: [PATCH v2 0/4] cpumask: improve on cpumask_local_spread() locality
  2022-11-15 17:24 ` [PATCH v2 0/4] " Valentin Schneider
@ 2022-11-15 18:32   ` Yury Norov
  2022-11-17 12:23     ` Valentin Schneider
  0 siblings, 1 reply; 16+ messages in thread
From: Yury Norov @ 2022-11-15 18:32 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Gal Pressman, Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar,
	Jakub Kicinski, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Mel Gorman,
	Peter Zijlstra, Rasmus Villemoes, Saeed Mahameed, Steven Rostedt,
	Tariq Toukan, Tariq Toukan, Tony Luck, Vincent Guittot,
	linux-crypto, netdev, linux-rdma

On Tue, Nov 15, 2022 at 05:24:56PM +0000, Valentin Schneider wrote:
> Hi,
> 
> On 12/11/22 11:09, Yury Norov wrote:
> > cpumask_local_spread() currently checks the local node for presence of the
> > i'th CPU, and if it finds nothing, makes a flat search among all non-local
> > CPUs. We can do better by checking CPUs per NUMA hop.
> >
> > This series is inspired by Tariq Toukan and Valentin Schneider's "net/mlx5e:
> > Improve remote NUMA preferences used for the IRQ affinity hints"
> >
> > https://patchwork.kernel.org/project/netdevbpf/patch/20220728191203.4055-3-tariqt@nvidia.com/
> >
> > According to their measurements, for mlx5e:
> >
> >         Bottleneck in RX side is released, reached linerate (~1.8x speedup).
> >         ~30% less cpu util on TX.
> >
> > This series makes cpumask_local_spread() traverse CPUs based on NUMA
> > distance, just as the mlx5e series does, and I expect a comparable
> > improvement for its users.
> >
> > I tested new behavior on my VM with the following NUMA configuration:
> >
> > root@debian:~# numactl -H
> > available: 4 nodes (0-3)
> > node 0 cpus: 0 1 2 3
> > node 0 size: 3869 MB
> > node 0 free: 3740 MB
> > node 1 cpus: 4 5
> > node 1 size: 1969 MB
> > node 1 free: 1937 MB
> > node 2 cpus: 6 7
> > node 2 size: 1967 MB
> > node 2 free: 1873 MB
> > node 3 cpus: 8 9 10 11 12 13 14 15
> > node 3 size: 7842 MB
> > node 3 free: 7723 MB
> > node distances:
> > node   0   1   2   3
> >   0:  10  50  30  70
> >   1:  50  10  70  30
> >   2:  30  70  10  50
> >   3:  70  30  50  10
> >
> > And the cpumask_local_spread() traversal for each node and offset looks
> > like this:
> >
> > node 0:   0   1   2   3   6   7   4   5   8   9  10  11  12  13  14  15
> > node 1:   4   5   8   9  10  11  12  13  14  15   0   1   2   3   6   7
> > node 2:   6   7   0   1   2   3   8   9  10  11  12  13  14  15   4   5
> > node 3:   8   9  10  11  12  13  14  15   4   5   6   7   0   1   2   3
> >
> 
> Is this meant as a replacement for [1]?

No. Your series adds an iterator, and in my experience the code that
uses iterators of that sort is almost always better and easier to
understand than cpumask_nth() or cpumask_next()-like users.

The only advantage of my series is that it keeps the existing codebase
untouched.
 
> I like that this is changing an existing interface so that all current
> users directly benefit from the change. Now, about half of the users of
> cpumask_local_spread() use it in a loop with incremental @i parameter,
> which makes the repeated bsearch a bit of a shame, but then I'm tempted to
> say the first point makes it worth it.
> 
> [1]: https://lore.kernel.org/all/20221028164959.1367250-1-vschneid@redhat.com/

In terms of the very common case of sequential invocation of local_spread()
for cpus from 0 to nr_cpu_ids, the complexity of my approach is O(n * log n),
and your approach is amortized O(n), which is better. Not a big deal _now_,
as you mentioned in the other email. But we never know how things will
evolve, right?

So, I would take both and maybe in a comment to cpumask_local_spread()
mention that there's a better alternative for those who call the
function for all CPUs incrementally.
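Something like this, perhaps (just a sketch of the wording; [1] meaning
your series above):

	/*
	 * NOTE: callers that need a CPU for every index in [0, nr_cpu_ids)
	 * are better served by iterating over NUMA hops directly (see [1]),
	 * which amortizes the search instead of redoing it per index.
	 */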

Thanks,
Yury


* Re: [PATCH v2 0/4] cpumask: improve on cpumask_local_spread() locality
  2022-11-15 18:32   ` Yury Norov
@ 2022-11-17 12:23     ` Valentin Schneider
  2022-11-28  6:39       ` Tariq Toukan
  0 siblings, 1 reply; 16+ messages in thread
From: Valentin Schneider @ 2022-11-17 12:23 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Gal Pressman, Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar,
	Jakub Kicinski, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Mel Gorman,
	Peter Zijlstra, Rasmus Villemoes, Saeed Mahameed, Steven Rostedt,
	Tariq Toukan, Tariq Toukan, Tony Luck, Vincent Guittot,
	linux-crypto, netdev, linux-rdma

On 15/11/22 10:32, Yury Norov wrote:
> On Tue, Nov 15, 2022 at 05:24:56PM +0000, Valentin Schneider wrote:
>>
>> Is this meant as a replacement for [1]?
>
> No. Your series adds an iterator, and in my experience the code that
> uses iterators of that sort is almost always better and easier to
> understand than cpumask_nth() or cpumask_next()-like users.
>
> The only advantage of my series is that it keeps the existing codebase
> untouched.
>

Right

>> I like that this is changing an existing interface so that all current
>> users directly benefit from the change. Now, about half of the users of
>> cpumask_local_spread() use it in a loop with incremental @i parameter,
>> which makes the repeated bsearch a bit of a shame, but then I'm tempted to
>> say the first point makes it worth it.
>>
>> [1]: https://lore.kernel.org/all/20221028164959.1367250-1-vschneid@redhat.com/
>
> In terms of the very common case of sequential invocation of local_spread()
> for cpus from 0 to nr_cpu_ids, the complexity of my approach is O(n * log n),
> and your approach is amortized O(n), which is better. Not a big deal _now_,
> as you mentioned in the other email. But we never know how things will
> evolve, right?
>
> So, I would take both and maybe in a comment to cpumask_local_spread()
> mention that there's a better alternative for those who call the
> function for all CPUs incrementally.
>

Ack, sounds good.

> Thanks,
> Yury



* Re: [PATCH v2 0/4] cpumask: improve on cpumask_local_spread() locality
  2022-11-17 12:23     ` Valentin Schneider
@ 2022-11-28  6:39       ` Tariq Toukan
  2022-11-30  1:47         ` Yury Norov
  0 siblings, 1 reply; 16+ messages in thread
From: Tariq Toukan @ 2022-11-28  6:39 UTC (permalink / raw)
  To: Valentin Schneider, Yury Norov
  Cc: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Gal Pressman, Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar,
	Jakub Kicinski, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Mel Gorman,
	Peter Zijlstra, Rasmus Villemoes, Saeed Mahameed, Steven Rostedt,
	Tariq Toukan, Tony Luck, Vincent Guittot, linux-crypto, netdev,
	linux-rdma



On 11/17/2022 2:23 PM, Valentin Schneider wrote:
> On 15/11/22 10:32, Yury Norov wrote:
>> On Tue, Nov 15, 2022 at 05:24:56PM +0000, Valentin Schneider wrote:
>>>
>>> Is this meant as a replacement for [1]?
>>
>> No. Your series adds an iterator, and in my experience the code that
>> uses iterators of that sort is almost always better and easier to
>> understand than cpumask_nth() or cpumask_next()-like users.
>>
>> The only advantage of my series is that it keeps the existing codebase
>> untouched.
>>
> 
> Right
> 
>>> I like that this is changing an existing interface so that all current
>>> users directly benefit from the change. Now, about half of the users of
>>> cpumask_local_spread() use it in a loop with incremental @i parameter,
>>> which makes the repeated bsearch a bit of a shame, but then I'm tempted to
>>> say the first point makes it worth it.
>>>
>>> [1]: https://lore.kernel.org/all/20221028164959.1367250-1-vschneid@redhat.com/
>>
>> In terms of the very common case of sequential invocation of local_spread()
>> for cpus from 0 to nr_cpu_ids, the complexity of my approach is O(n * log n),
>> and your approach is amortized O(n), which is better. Not a big deal _now_,
>> as you mentioned in the other email. But we never know how things will
>> evolve, right?
>>
>> So, I would take both and maybe in a comment to cpumask_local_spread()
>> mention that there's a better alternative for those who call the
>> function for all CPUs incrementally.
>>
> 
> Ack, sounds good.
> 

Good.
Is a respin needed, to add the comment mentioned above?


* Re: [PATCH v2 0/4] cpumask: improve on cpumask_local_spread() locality
  2022-11-28  6:39       ` Tariq Toukan
@ 2022-11-30  1:47         ` Yury Norov
  2022-12-07 12:53           ` Tariq Toukan
  0 siblings, 1 reply; 16+ messages in thread
From: Yury Norov @ 2022-11-30  1:47 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Valentin Schneider, linux-kernel, David S. Miller,
	Andy Shevchenko, Barry Song, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Gal Pressman,
	Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar, Jakub Kicinski,
	Jason Gunthorpe, Jesse Brandeburg, Jonathan Cameron, Juri Lelli,
	Leon Romanovsky, Mel Gorman, Peter Zijlstra, Rasmus Villemoes,
	Saeed Mahameed, Steven Rostedt, Tariq Toukan, Tony Luck,
	Vincent Guittot, linux-crypto, netdev, linux-rdma

On Mon, Nov 28, 2022 at 08:39:24AM +0200, Tariq Toukan wrote:
> 
> 
> On 11/17/2022 2:23 PM, Valentin Schneider wrote:
> > On 15/11/22 10:32, Yury Norov wrote:
> > > On Tue, Nov 15, 2022 at 05:24:56PM +0000, Valentin Schneider wrote:
> > > > 
> > > > Is this meant as a replacement for [1]?
> > > 
> > > No. Your series adds an iterator, and in my experience the code that
> > > uses iterators of that sort is almost always better and easier to
> > > understand than cpumask_nth() or cpumask_next()-like users.
> > > 
> > > The only advantage of my series is that it keeps the existing codebase
> > > untouched.
> > > 
> > 
> > Right
> > 
> > > > I like that this is changing an existing interface so that all current
> > > > users directly benefit from the change. Now, about half of the users of
> > > > cpumask_local_spread() use it in a loop with incremental @i parameter,
> > > > which makes the repeated bsearch a bit of a shame, but then I'm tempted to
> > > > say the first point makes it worth it.
> > > > 
> > > > [1]: https://lore.kernel.org/all/20221028164959.1367250-1-vschneid@redhat.com/
> > > 
> > > In terms of the very common case of sequential invocation of local_spread()
> > > for cpus from 0 to nr_cpu_ids, the complexity of my approach is O(n * log n),
> > > and your approach is amortized O(n), which is better. Not a big deal _now_,
> > > as you mentioned in the other email. But we never know how things will
> > > evolve, right?
> > > 
> > > So, I would take both and maybe in a comment to cpumask_local_spread()
> > > mention that there's a better alternative for those who call the
> > > function for all CPUs incrementally.
> > > 
> > 
> > Ack, sounds good.
> > 
> 
> Good.
> Is a respin needed, to add the comment mentioned above?

If you think it's worth the effort.


* Re: [PATCH v2 0/4] cpumask: improve on cpumask_local_spread() locality
  2022-11-30  1:47         ` Yury Norov
@ 2022-12-07 12:53           ` Tariq Toukan
  2022-12-07 20:45             ` Yury Norov
  0 siblings, 1 reply; 16+ messages in thread
From: Tariq Toukan @ 2022-12-07 12:53 UTC (permalink / raw)
  To: Yury Norov
  Cc: Valentin Schneider, linux-kernel, David S. Miller,
	Andy Shevchenko, Barry Song, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Gal Pressman,
	Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar, Jakub Kicinski,
	Jason Gunthorpe, Jesse Brandeburg, Jonathan Cameron, Juri Lelli,
	Leon Romanovsky, Mel Gorman, Peter Zijlstra, Rasmus Villemoes,
	Saeed Mahameed, Steven Rostedt, Tariq Toukan, Tony Luck,
	Vincent Guittot, linux-crypto, netdev, linux-rdma



On 11/30/2022 3:47 AM, Yury Norov wrote:
> On Mon, Nov 28, 2022 at 08:39:24AM +0200, Tariq Toukan wrote:
>>
>>
>> On 11/17/2022 2:23 PM, Valentin Schneider wrote:
>>> On 15/11/22 10:32, Yury Norov wrote:
>>>> On Tue, Nov 15, 2022 at 05:24:56PM +0000, Valentin Schneider wrote:
>>>>>
>>>>> Is this meant as a replacement for [1]?
>>>>
>>>> No. Your series adds an iterator, and in my experience the code that
>>>> uses iterators of that sort is almost always better and easier to
>>>> understand than cpumask_nth() or cpumask_next()-like users.
>>>>
>>>> The only advantage of my series is that it keeps the existing codebase
>>>> untouched.
>>>>
>>>
>>> Right
>>>
>>>>> I like that this is changing an existing interface so that all current
>>>>> users directly benefit from the change. Now, about half of the users of
>>>>> cpumask_local_spread() use it in a loop with incremental @i parameter,
>>>>> which makes the repeated bsearch a bit of a shame, but then I'm tempted to
>>>>> say the first point makes it worth it.
>>>>>
>>>>> [1]: https://lore.kernel.org/all/20221028164959.1367250-1-vschneid@redhat.com/
>>>>
>>>> In terms of the very common case of sequential invocation of local_spread()
>>>> for cpus from 0 to nr_cpu_ids, the complexity of my approach is O(n * log n),
>>>> and your approach is amortized O(n), which is better. Not a big deal _now_,
>>>> as you mentioned in the other email. But we never know how things will
>>>> evolve, right?
>>>>
>>>> So, I would take both and maybe in a comment to cpumask_local_spread()
>>>> mention that there's a better alternative for those who call the
>>>> function for all CPUs incrementally.
>>>>
>>>
>>> Ack, sounds good.
>>>
>>
>> Good.
>> Is a respin needed, to add the comment mentioned above?
> 
> If you think it's worth the effort.

No, not sure it is...

I asked because this mail thread was inactive for a while, with the 
patches not accepted to the kernel yet.

If everyone is happy with it, let's get it into this kernel while possible.

To which tree should it go?

Thanks,
Tariq


* Re: [PATCH v2 0/4] cpumask: improve on cpumask_local_spread() locality
  2022-12-07 12:53           ` Tariq Toukan
@ 2022-12-07 20:45             ` Yury Norov
  0 siblings, 0 replies; 16+ messages in thread
From: Yury Norov @ 2022-12-07 20:45 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Valentin Schneider, linux-kernel, David S. Miller,
	Andy Shevchenko, Barry Song, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Gal Pressman,
	Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar, Jakub Kicinski,
	Jason Gunthorpe, Jesse Brandeburg, Jonathan Cameron, Juri Lelli,
	Leon Romanovsky, Mel Gorman, Peter Zijlstra, Rasmus Villemoes,
	Saeed Mahameed, Steven Rostedt, Tariq Toukan, Tony Luck,
	Vincent Guittot, linux-crypto, netdev, linux-rdma

On Wed, Dec 07, 2022 at 02:53:58PM +0200, Tariq Toukan wrote:
> 
> 
> On 11/30/2022 3:47 AM, Yury Norov wrote:
> > On Mon, Nov 28, 2022 at 08:39:24AM +0200, Tariq Toukan wrote:
> > > 
> > > 
> > > On 11/17/2022 2:23 PM, Valentin Schneider wrote:
> > > > On 15/11/22 10:32, Yury Norov wrote:
> > > > > On Tue, Nov 15, 2022 at 05:24:56PM +0000, Valentin Schneider wrote:
> > > > > > 
> > > > > > Is this meant as a replacement for [1]?
> > > > > 
> > > > > No. Your series adds an iterator, and in my experience the code that
> > > > > uses iterators of that sort is almost always better and easier to
> > > > > understand than cpumask_nth() or cpumask_next()-like users.
> > > > > 
> > > > > The only advantage of my series is that it keeps the existing codebase
> > > > > untouched.
> > > > > 
> > > > 
> > > > Right
> > > > 
> > > > > > I like that this is changing an existing interface so that all current
> > > > > > users directly benefit from the change. Now, about half of the users of
> > > > > > cpumask_local_spread() use it in a loop with incremental @i parameter,
> > > > > > which makes the repeated bsearch a bit of a shame, but then I'm tempted to
> > > > > > say the first point makes it worth it.
> > > > > > 
> > > > > > [1]: https://lore.kernel.org/all/20221028164959.1367250-1-vschneid@redhat.com/
> > > > > 
> > > > > In terms of the very common case of sequential invocation of local_spread()
> > > > > for cpus from 0 to nr_cpu_ids, the complexity of my approach is O(n * log n),
> > > > > and your approach is amortized O(n), which is better. Not a big deal _now_,
> > > > > as you mentioned in the other email. But we never know how things will
> > > > > evolve, right?
> > > > > 
> > > > > So, I would take both and maybe in a comment to cpumask_local_spread()
> > > > > mention that there's a better alternative for those who call the
> > > > > function for all CPUs incrementally.
> > > > > 
> > > > 
> > > > Ack, sounds good.
> > > > 
> > > 
> > > Good.
> > > Is a respin needed, to add the comment mentioned above?
> > 
> > If you think it's worth the effort.
> 
> No, not sure it is...
> 
> I asked because this mail thread was inactive for a while, with the patches
> not accepted to the kernel yet.
> 
> If everyone is happy with it, let's get it into this kernel while possible.
> 
> To which tree should it go?

I've got the bitmap tree and can move it there, but this series is related
to the scheduler and NUMA as well, and I'd prefer to move it through those
trees.

If it moves through bitmap, I'd like to collect more reviews and testing.

Thanks,
Yury


* Re: [PATCH v2 3/4] sched: add sched_numa_find_nth_cpu()
  2022-11-14 14:32   ` Andy Shevchenko
  2022-11-14 15:02     ` Andy Shevchenko
@ 2022-12-08  2:55     ` Yury Norov
  1 sibling, 0 replies; 16+ messages in thread
From: Yury Norov @ 2022-12-08  2:55 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: linux-kernel, David S. Miller, Barry Song, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Gal Pressman,
	Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar, Jakub Kicinski,
	Jason Gunthorpe, Jesse Brandeburg, Jonathan Cameron, Juri Lelli,
	Leon Romanovsky, Mel Gorman, Peter Zijlstra, Rasmus Villemoes,
	Saeed Mahameed, Steven Rostedt, Tariq Toukan, Tariq Toukan,
	Tony Luck, Valentin Schneider, Vincent Guittot, linux-crypto,
	netdev, linux-rdma

On Mon, Nov 14, 2022 at 04:32:09PM +0200, Andy Shevchenko wrote:
> On Sat, Nov 12, 2022 at 11:09:45AM -0800, Yury Norov wrote:
> > The function finds the Nth set CPU in a given cpumask, starting from a given
> > node.
> > 
> > Leveraging the fact that each hop in sched_domains_numa_masks includes the
> > same or greater number of CPUs than the previous one, we can use binary
> > search on hops instead of a linear walk, which makes the overall complexity
> > O(log n) in terms of the number of cpumask_weight() calls.
> 
> ...
> 
> > +struct __cmp_key {
> > +	const struct cpumask *cpus;
> > +	struct cpumask ***masks;
> > +	int node;
> > +	int cpu;
> > +	int w;
> > +};
> > +
> > +static int cmp(const void *a, const void *b)
> 
> Calling them key and pivot (as in the caller) would make more sense.

I think they are intentionally named opaquely, so that the user (me) would
cast them to proper data structures and give them meaningful names. So I did.
 
> > +{
> 
> What about
> 
> 	const (?) struct cpumask ***masks = (...)pivot;
> 
> > +	struct cpumask **prev_hop = *((struct cpumask ***)b - 1);
> 
> 	= masks[-1];
> 
> > +	struct cpumask **cur_hop = *(struct cpumask ***)b;
> 
> 	= masks[0];
> 
> ?

It would work as well. Neither better nor worse.

> > +	struct __cmp_key *k = (struct __cmp_key *)a;
> 
> > +	if (cpumask_weight_and(k->cpus, cur_hop[k->node]) <= k->cpu)
> > +		return 1;
> 
> > +	k->w = (b == k->masks) ? 0 : cpumask_weight_and(k->cpus, prev_hop[k->node]);
> > +	if (k->w <= k->cpu)
> > +		return 0;
> 
> Can k->cpu be negative?

A user may pass a negative value. Currently cpumask_local_spread() will
return nr_cpu_ids.

After the rework, bsearch() will return hop #0. After that, cpumask_nth_and()
will cast the negative cpu to unsigned long, and because that's a huge number,
it will again return nr_cpu_ids.

> If no, we can rewrite above as
> 
> 	k->w = 0;
> 	if (b == k->masks)
> 		return 0;
> 
> 	k->w = cpumask_weight_and(k->cpus, prev_hop[k->node]);

Here we still need to compare the weight of prev_hop against k->cpu.
Returning -1 unconditionally is wrong.
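I.e., the tail of your version would need to be something like (sketch):

	k->w = cpumask_weight_and(k->cpus, prev_hop[k->node]);
	return k->w <= k->cpu ? 0 : -1;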

> > +	return -1;
> > +}
> 
> ...
> 
> > +int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
> > +{
> > +	struct __cmp_key k = { cpus, NULL, node, cpu, 0 };
> 
> You can drop NULL and 0 while using C99 assignments.
> 
> > +	int hop, ret = nr_cpu_ids;
> 
> > +	rcu_read_lock();
> 
> + Blank line?
> 
> > +	k.masks = rcu_dereference(sched_domains_numa_masks);
> > +	if (!k.masks)
> > +		goto unlock;
> 
> > +	hop = (struct cpumask ***)
> > +		bsearch(&k, k.masks, sched_domains_numa_levels, sizeof(k.masks[0]), cmp) - k.masks;
> 
> Strange indentation. I would rather see the split on parameters and
> maybe '-' operator.
> 
> sizeof(*k.masks) is a bit shorter, right?
> 
> Also we may go with
> 
> 
> 	struct cpumask ***masks;
> 	struct __cmp_key k = { .cpus = cpus, .node = node, .cpu = cpu };
> 
> 
> 
> > +	ret = hop ?
> > +		cpumask_nth_and_andnot(cpu - k.w, cpus, k.masks[hop][node], k.masks[hop-1][node]) :
> > +		cpumask_nth_and(cpu - k.w, cpus, k.masks[0][node]);
> 
> > +unlock:
> 
> out_unlock: shows the intention more clearly, no?

No

> > +	rcu_read_unlock();
> > +	return ret;
> > +}
> 
> -- 
> With Best Regards,
> Andy Shevchenko
> 


end of thread, other threads:[~2022-12-08  2:55 UTC | newest]

Thread overview: 16+ messages
2022-11-12 19:09 [PATCH v2 0/4] cpumask: improve on cpumask_local_spread() locality Yury Norov
2022-11-12 19:09 ` [PATCH v2 1/4] lib/find: introduce find_nth_and_andnot_bit Yury Norov
2022-11-12 19:09 ` [PATCH v2 2/4] cpumask: introduce cpumask_nth_and_andnot Yury Norov
2022-11-12 19:09 ` [PATCH v2 3/4] sched: add sched_numa_find_nth_cpu() Yury Norov
2022-11-14 14:32   ` Andy Shevchenko
2022-11-14 15:02     ` Andy Shevchenko
2022-12-08  2:55     ` Yury Norov
2022-11-15 17:25   ` Valentin Schneider
2022-11-12 19:09 ` [PATCH v2 4/4] cpumask: improve on cpumask_local_spread() locality Yury Norov
2022-11-15 17:24 ` [PATCH v2 0/4] " Valentin Schneider
2022-11-15 18:32   ` Yury Norov
2022-11-17 12:23     ` Valentin Schneider
2022-11-28  6:39       ` Tariq Toukan
2022-11-30  1:47         ` Yury Norov
2022-12-07 12:53           ` Tariq Toukan
2022-12-07 20:45             ` Yury Norov
