linux-rdma.vger.kernel.org archive mirror
* [PATCH 0/4] cpumask: improve on cpumask_local_spread() locality
@ 2022-11-11  4:00 Yury Norov
  2022-11-11  4:00 ` [PATCH 1/4] lib/find: introduce find_nth_and_andnot_bit Yury Norov
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: Yury Norov @ 2022-11-11  4:00 UTC (permalink / raw)
  To: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Gal Pressman, Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar,
	Jakub Kicinski, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Mel Gorman,
	Peter Zijlstra, Rasmus Villemoes, Saeed Mahameed, Steven Rostedt,
	Tariq Toukan, Tariq Toukan, Tony Luck, Valentin Schneider,
	Vincent Guittot
  Cc: Yury Norov, linux-crypto, netdev, linux-rdma

cpumask_local_spread() currently checks the local node for the presence of
the i'th CPU, and if it finds nothing, falls back to a flat search among all
non-local CPUs. We can do better by traversing CPUs in order of increasing
NUMA hop distance.

This series is inspired by Valentin Schneider's "net/mlx5e: Improve remote
NUMA preferences used for the IRQ affinity hints"

https://patchwork.kernel.org/project/netdevbpf/patch/20220728191203.4055-3-tariqt@nvidia.com/

According to Valentin's measurements, for mlx5e:

	Bottleneck in RX side is released, reached linerate (~1.8x speedup).
	~30% less cpu util on TX.

This series makes cpumask_local_spread() traverse CPUs based on NUMA
distance in the same way, and I expect a comparable improvement for its
users, as in Valentin's case.

I tested it on my VM with the following NUMA configuration:

root@debian:~# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3
node 0 size: 3869 MB
node 0 free: 3740 MB
node 1 cpus: 4 5
node 1 size: 1969 MB
node 1 free: 1937 MB
node 2 cpus: 6 7
node 2 size: 1967 MB
node 2 free: 1873 MB
node 3 cpus: 8 9 10 11 12 13 14 15
node 3 size: 7842 MB
node 3 free: 7723 MB
node distances:
node   0   1   2   3
  0:  10  50  30  70
  1:  50  10  70  30
  2:  30  70  10  50
  3:  70  30  50  10

And cpumask_local_spread() traversal for each node and offset looks like
this:

node 0:   0   1   2   3   6   7   4   5   8   9  10  11  12  13  14  15
node 1:   4   5   8   9  10  11  12  13  14  15   0   1   2   3   6   7
node 2:   6   7   0   1   2   3   8   9  10  11  12  13  14  15   4   5
node 3:   8   9  10  11  12  13  14  15   4   5   6   7   0   1   2   3
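
The orderings above fall out of the distance matrix alone. Below is a
userspace sketch (not kernel code; the node-to-CPU layout is hard-coded from
the numactl output above, and spread_nth() is a hypothetical helper) of the
traversal rule "all CPUs of the nearest node first, then the next nearest,
and so on":

```c
#include <assert.h>

#define NR_NODES	4

/* Distance matrix and per-node CPU ranges from the numactl output above */
static const int dist[NR_NODES][NR_NODES] = {
	{ 10, 50, 30, 70 },
	{ 50, 10, 70, 30 },
	{ 30, 70, 10, 50 },
	{ 70, 30, 50, 10 },
};
static const int node_first[NR_NODES] = { 0, 4, 6, 8 };
static const int node_ncpus[NR_NODES] = { 4, 2, 2, 8 };

/* i'th CPU as seen from @node: nearest nodes first, CPU id order within */
static int spread_nth(int node, int i)
{
	int nodes[NR_NODES] = { 0, 1, 2, 3 };

	/* stable insertion sort of nodes by distance from @node */
	for (int a = 1; a < NR_NODES; a++)
		for (int b = a; b > 0 &&
		     dist[node][nodes[b]] < dist[node][nodes[b - 1]]; b--) {
			int t = nodes[b];

			nodes[b] = nodes[b - 1];
			nodes[b - 1] = t;
		}

	for (int h = 0; h < NR_NODES; h++) {
		if (i < node_ncpus[nodes[h]])
			return node_first[nodes[h]] + i;
		i -= node_ncpus[nodes[h]];
	}
	return -1;	/* i out of range */
}
```

Calling spread_nth(n, 0..15) reproduces exactly the four rows above.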

Yury Norov (4):
  lib/find: introduce find_nth_and_andnot_bit
  cpumask: introduce cpumask_nth_and_andnot
  sched: add sched_numa_find_nth_cpu()
  cpumask: improve on cpumask_local_spread() locality

 include/linux/cpumask.h  | 20 +++++++++++++++++++
 include/linux/find.h     | 33 +++++++++++++++++++++++++++++++
 include/linux/topology.h |  8 ++++++++
 kernel/sched/topology.c  | 42 ++++++++++++++++++++++++++++++++++++++++
 lib/cpumask.c            | 12 ++----------
 lib/find_bit.c           |  9 +++++++++
 6 files changed, 114 insertions(+), 10 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/4] lib/find: introduce find_nth_and_andnot_bit
  2022-11-11  4:00 [PATCH 0/4] cpumask: improve on cpumask_local_spread() locality Yury Norov
@ 2022-11-11  4:00 ` Yury Norov
  2022-11-11  4:00 ` [PATCH 2/4] cpumask: introduce cpumask_nth_and_andnot() Yury Norov
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Yury Norov @ 2022-11-11  4:00 UTC (permalink / raw)
  To: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Gal Pressman, Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar,
	Jakub Kicinski, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Mel Gorman,
	Peter Zijlstra, Rasmus Villemoes, Saeed Mahameed, Steven Rostedt,
	Tariq Toukan, Tariq Toukan, Tony Luck, Valentin Schneider,
	Vincent Guittot
  Cc: Yury Norov, linux-crypto, netdev, linux-rdma

The following patches use this function to implement in-place bitmap
traversal without storing intermediate results in temporary bitmaps.
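
In single-word terms, the intended semantics are simply "the n'th set bit of
(addr1 & addr2 & ~addr3)". A userspace model follows (nth_and_andnot() is a
hypothetical stand-in; the kernel helper handles multi-word bitmaps via
FIND_NTH_BIT()):

```c
#include <assert.h>

/*
 * One-word model of find_nth_and_andnot_bit(): the n'th set bit of
 * (a & b & ~c), or @size when fewer than n+1 such bits exist.
 */
static unsigned long nth_and_andnot(unsigned long a, unsigned long b,
				    unsigned long c, unsigned long size,
				    unsigned long n)
{
	unsigned long word = a & b & ~c;
	unsigned long i;

	for (i = 0; i < size; i++)
		if ((word >> i) & 1 && n-- == 0)
			return i;
	return size;
}
```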

Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 include/linux/find.h | 33 +++++++++++++++++++++++++++++++++
 lib/find_bit.c       |  9 +++++++++
 2 files changed, 42 insertions(+)

diff --git a/include/linux/find.h b/include/linux/find.h
index ccaf61a0f5fd..7a97b492b447 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -22,6 +22,9 @@ unsigned long __find_nth_and_bit(const unsigned long *addr1, const unsigned long
 				unsigned long size, unsigned long n);
 unsigned long __find_nth_andnot_bit(const unsigned long *addr1, const unsigned long *addr2,
 					unsigned long size, unsigned long n);
+unsigned long __find_nth_and_andnot_bit(const unsigned long *addr1, const unsigned long *addr2,
+					const unsigned long *addr3, unsigned long size,
+					unsigned long n);
 extern unsigned long _find_first_and_bit(const unsigned long *addr1,
 					 const unsigned long *addr2, unsigned long size);
 extern unsigned long _find_first_zero_bit(const unsigned long *addr, unsigned long size);
@@ -255,6 +258,36 @@ unsigned long find_nth_andnot_bit(const unsigned long *addr1, const unsigned lon
 	return __find_nth_andnot_bit(addr1, addr2, size, n);
 }
 
+/**
+ * find_nth_and_andnot_bit - find N'th set bit in 2 memory regions,
+ *			     excluding those set in 3rd region
+ * @addr1: The 1st address to start the search at
+ * @addr2: The 2nd address to start the search at
+ * @addr3: The 3rd address to start the search at
+ * @size: The maximum number of bits to search
+ * @n: The index of the set bit to find, counting from 0
+ *
+ * Returns the bit number of the N'th set bit.
+ * If no such bit exists, returns @size.
+ */
+static inline
+unsigned long find_nth_and_andnot_bit(const unsigned long *addr1,
+					const unsigned long *addr2,
+					const unsigned long *addr3,
+					unsigned long size, unsigned long n)
+{
+	if (n >= size)
+		return size;
+
+	if (small_const_nbits(size)) {
+		unsigned long val = *addr1 & *addr2 & (~*addr3) & GENMASK(size - 1, 0);
+
+		return val ? fns(val, n) : size;
+	}
+
+	return __find_nth_and_andnot_bit(addr1, addr2, addr3, size, n);
+}
+
 #ifndef find_first_and_bit
 /**
  * find_first_and_bit - find the first set bit in both memory regions
diff --git a/lib/find_bit.c b/lib/find_bit.c
index 18bc0a7ac8ee..c10920e66788 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -155,6 +155,15 @@ unsigned long __find_nth_andnot_bit(const unsigned long *addr1, const unsigned l
 }
 EXPORT_SYMBOL(__find_nth_andnot_bit);
 
+unsigned long __find_nth_and_andnot_bit(const unsigned long *addr1,
+					const unsigned long *addr2,
+					const unsigned long *addr3,
+					unsigned long size, unsigned long n)
+{
+	return FIND_NTH_BIT(addr1[idx] & addr2[idx] & ~addr3[idx], size, n);
+}
+EXPORT_SYMBOL(__find_nth_and_andnot_bit);
+
 #ifndef find_next_and_bit
 unsigned long _find_next_and_bit(const unsigned long *addr1, const unsigned long *addr2,
 					unsigned long nbits, unsigned long start)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/4] cpumask: introduce cpumask_nth_and_andnot()
  2022-11-11  4:00 [PATCH 0/4] cpumask: improve on cpumask_local_spread() locality Yury Norov
  2022-11-11  4:00 ` [PATCH 1/4] lib/find: introduce find_nth_and_andnot_bit Yury Norov
@ 2022-11-11  4:00 ` Yury Norov
  2022-11-11  4:00 ` [PATCH 3/4] sched: add sched_numa_find_nth_cpu() Yury Norov
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Yury Norov @ 2022-11-11  4:00 UTC (permalink / raw)
  To: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Gal Pressman, Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar,
	Jakub Kicinski, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Mel Gorman,
	Peter Zijlstra, Rasmus Villemoes, Saeed Mahameed, Steven Rostedt,
	Tariq Toukan, Tariq Toukan, Tony Luck, Valentin Schneider,
	Vincent Guittot
  Cc: Yury Norov, linux-crypto, netdev, linux-rdma

Introduce cpumask_nth_and_andnot(), based on find_nth_and_andnot_bit().
It's used in the following patch to traverse cpumasks without storing
intermediate results in a temporary cpumask.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 include/linux/cpumask.h | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index c2aa0aa26b45..debfa2261569 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -391,6 +391,26 @@ unsigned int cpumask_nth_andnot(unsigned int cpu, const struct cpumask *srcp1,
 				nr_cpumask_bits, cpumask_check(cpu));
 }
 
+/**
+ * cpumask_nth_and_andnot - get the N'th cpu set in the 1st and 2nd cpumasks, and clear in the 3rd.
+ * @cpu: the N'th cpu to find, counting from 0
+ * @srcp1: the first cpumask pointer
+ * @srcp2: the second cpumask pointer
+ * @srcp3: the third cpumask pointer (cpus set here are excluded)
+ *
+ * Returns >= nr_cpu_ids if no such cpu exists.
+ */
+static inline
+unsigned int cpumask_nth_and_andnot(unsigned int cpu, const struct cpumask *srcp1,
+							const struct cpumask *srcp2,
+							const struct cpumask *srcp3)
+{
+	return find_nth_and_andnot_bit(cpumask_bits(srcp1),
+					cpumask_bits(srcp2),
+					cpumask_bits(srcp3),
+					nr_cpumask_bits, cpumask_check(cpu));
+}
+
 #define CPU_BITS_NONE						\
 {								\
 	[0 ... BITS_TO_LONGS(NR_CPUS)-1] = 0UL			\
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 3/4] sched: add sched_numa_find_nth_cpu()
  2022-11-11  4:00 [PATCH 0/4] cpumask: improve on cpumask_local_spread() locality Yury Norov
  2022-11-11  4:00 ` [PATCH 1/4] lib/find: introduce find_nth_and_andnot_bit Yury Norov
  2022-11-11  4:00 ` [PATCH 2/4] cpumask: introduce cpumask_nth_and_andnot() Yury Norov
@ 2022-11-11  4:00 ` Yury Norov
  2022-11-11  4:11   ` Yury Norov
  2022-11-11 11:42   ` Andy Shevchenko
  2022-11-11  4:00 ` [PATCH 4/4] cpumask: improve on cpumask_local_spread() locality Yury Norov
  2022-11-11 16:25 ` [PATCH 0/4] " Jakub Kicinski
  4 siblings, 2 replies; 13+ messages in thread
From: Yury Norov @ 2022-11-11  4:00 UTC (permalink / raw)
  To: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Gal Pressman, Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar,
	Jakub Kicinski, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Mel Gorman,
	Peter Zijlstra, Rasmus Villemoes, Saeed Mahameed, Steven Rostedt,
	Tariq Toukan, Tariq Toukan, Tony Luck, Valentin Schneider,
	Vincent Guittot
  Cc: Yury Norov, linux-crypto, netdev, linux-rdma

The function finds the Nth set CPU in a given cpumask, starting from a given
node.

Leveraging the fact that each hop in sched_domains_numa_masks includes the
same or greater number of CPUs than the previous one, we can use binary
search on hops instead of a linear walk, which makes the overall complexity
O(log n) in terms of the number of cpumask_weight() calls.
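
To illustrate the search, here is a userspace model for the example topology
from the cover letter, with cpumasks shrunk to a 16-bit word and hop masks
hand-derived for node 0 (all names and values here are illustrative, not
kernel code):

```c
#include <assert.h>
#include <stdint.h>

/* Hop masks for node 0 (bit i set = CPU i reachable within this hop) */
static const uint16_t hop_masks[] = {
	0x000f,	/* hop 0: node 0   -> CPUs 0-3     */
	0x00cf,	/* hop 1: + node 2 -> CPUs 0-3,6,7 */
	0x00ff,	/* hop 2: + node 1 -> CPUs 0-7     */
	0xffff,	/* hop 3: + node 3 -> all 16 CPUs  */
};

static int nth_bit(uint16_t mask, int n)	/* n'th set bit, or -1 */
{
	for (int i = 0; i < 16; i++)
		if (mask & (1u << i) && n-- == 0)
			return i;
	return -1;
}

/*
 * Mirror of the patch's binary search: find the hop whose cumulative
 * weight first exceeds @n, then take the (n - w)'th CPU among the CPUs
 * that hop adds on top of the previous one.
 */
static int find_nth_cpu(uint16_t cpus, int n)
{
	int first = 0, last = 3, mid = 0, w = 0;
	uint16_t prev;

	while (last >= first) {
		mid = (first + last) / 2;
		if (__builtin_popcount(cpus & hop_masks[mid]) <= n) {
			first = mid + 1;
			continue;
		}
		w = mid ? __builtin_popcount(cpus & hop_masks[mid - 1]) : 0;
		if (w <= n)
			break;
		last = mid - 1;
	}
	prev = mid ? hop_masks[mid - 1] : 0;
	return nth_bit(cpus & hop_masks[mid] & ~prev, n - w);
}
```

With cpus = 0xffff, find_nth_cpu() walks node 0's row from the cover letter:
0 1 2 3 6 7 4 5 8 ... 15.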

Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 include/linux/topology.h |  8 ++++++++
 kernel/sched/topology.c  | 42 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 4564faafd0e1..63048ac3207c 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -245,5 +245,13 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
 	return cpumask_of_node(cpu_to_node(cpu));
 }
 
+#ifdef CONFIG_NUMA
+int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node);
+#else
+int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
+{
+	return cpumask_nth(cpu, cpus);
+}
+#endif	/* CONFIG_NUMA */
 
 #endif /* _LINUX_TOPOLOGY_H */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8739c2a5a54e..c8f56287de46 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2067,6 +2067,48 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
 	return found;
 }
 
+/*
+ * sched_numa_find_nth_cpu() - given the NUMA topology, find the Nth next cpu
+ *                             closest to @node from @cpus.
+ * @cpus: cpumask to find a cpu from
+ * @cpu: Nth cpu to find
+ * @node: the node to search from
+ * Returns: cpu, or >= nr_cpu_ids when nothing found.
+ */
+int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
+{
+	unsigned int first = 0, mid, last = sched_domains_numa_levels;
+	struct cpumask ***masks;
+	int w, ret = nr_cpu_ids;
+
+	rcu_read_lock();
+	masks = rcu_dereference(sched_domains_numa_masks);
+	if (!masks)
+		goto out;
+
+	while (last >= first) {
+		mid = (last + first) / 2;
+
+		if (cpumask_weight_and(cpus, masks[mid][node]) <= cpu) {
+			first = mid + 1;
+			continue;
+		}
+
+		w = (mid == 0) ? 0 : cpumask_weight_and(cpus, masks[mid - 1][node]);
+		if (w <= cpu)
+			break;
+
+		last = mid - 1;
+	}
+
+	ret = (mid == 0) ?
+		cpumask_nth_and(cpu - w, cpus, masks[mid][node]) :
+		cpumask_nth_and_andnot(cpu - w, cpus, masks[mid][node], masks[mid - 1][node]);
+out:
+	rcu_read_unlock();
+	return ret;
+}
+EXPORT_SYMBOL_GPL(sched_numa_find_nth_cpu);
 #endif /* CONFIG_NUMA */
 
 static int __sdt_alloc(const struct cpumask *cpu_map)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 4/4] cpumask: improve on cpumask_local_spread() locality
  2022-11-11  4:00 [PATCH 0/4] cpumask: improve on cpumask_local_spread() locality Yury Norov
                   ` (2 preceding siblings ...)
  2022-11-11  4:00 ` [PATCH 3/4] sched: add sched_numa_find_nth_cpu() Yury Norov
@ 2022-11-11  4:00 ` Yury Norov
  2022-11-11 16:25 ` [PATCH 0/4] " Jakub Kicinski
  4 siblings, 0 replies; 13+ messages in thread
From: Yury Norov @ 2022-11-11  4:00 UTC (permalink / raw)
  To: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Gal Pressman, Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar,
	Jakub Kicinski, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Mel Gorman,
	Peter Zijlstra, Rasmus Villemoes, Saeed Mahameed, Steven Rostedt,
	Tariq Toukan, Tariq Toukan, Tony Luck, Valentin Schneider,
	Vincent Guittot
  Cc: Yury Norov, linux-crypto, netdev, linux-rdma

Switch cpumask_local_spread() to newly added sched_numa_find_nth_cpu(),
which takes into account distances to each node in the system.

For the following NUMA configuration:

root@debian:~# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3
node 0 size: 3869 MB
node 0 free: 3740 MB
node 1 cpus: 4 5
node 1 size: 1969 MB
node 1 free: 1937 MB
node 2 cpus: 6 7
node 2 size: 1967 MB
node 2 free: 1873 MB
node 3 cpus: 8 9 10 11 12 13 14 15
node 3 size: 7842 MB
node 3 free: 7723 MB
node distances:
node   0   1   2   3
  0:  10  50  30  70
  1:  50  10  70  30
  2:  30  70  10  50
  3:  70  30  50  10

The new cpumask_local_spread() traverses cpus for each node like this:

node 0:   0   1   2   3   6   7   4   5   8   9  10  11  12  13  14  15
node 1:   4   5   8   9  10  11  12  13  14  15   0   1   2   3   6   7
node 2:   6   7   0   1   2   3   8   9  10  11  12  13  14  15   4   5
node 3:   8   9  10  11  12  13  14  15   4   5   6   7   0   1   2   3

Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 lib/cpumask.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/lib/cpumask.c b/lib/cpumask.c
index c7c392514fd3..255974cd6734 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -110,7 +110,7 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
 #endif
 
 /**
- * cpumask_local_spread - select the i'th cpu with local numa cpu's first
+ * cpumask_local_spread - select the i'th cpu based on NUMA distances
  * @i: index number
  * @node: local numa_node
  *
@@ -132,15 +132,7 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
 		if (cpu < nr_cpu_ids)
 			return cpu;
 	} else {
-		/* NUMA first. */
-		cpu = cpumask_nth_and(i, cpu_online_mask, cpumask_of_node(node));
-		if (cpu < nr_cpu_ids)
-			return cpu;
-
-		i -= cpumask_weight_and(cpu_online_mask, cpumask_of_node(node));
-
-		/* Skip NUMA nodes, done above. */
-		cpu = cpumask_nth_andnot(i, cpu_online_mask, cpumask_of_node(node));
+		cpu = sched_numa_find_nth_cpu(cpu_online_mask, i, node);
 		if (cpu < nr_cpu_ids)
 			return cpu;
 	}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/4] sched: add sched_numa_find_nth_cpu()
  2022-11-11  4:00 ` [PATCH 3/4] sched: add sched_numa_find_nth_cpu() Yury Norov
@ 2022-11-11  4:11   ` Yury Norov
  2022-11-11 11:42   ` Andy Shevchenko
  1 sibling, 0 replies; 13+ messages in thread
From: Yury Norov @ 2022-11-11  4:11 UTC (permalink / raw)
  To: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Gal Pressman, Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar,
	Jakub Kicinski, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Mel Gorman,
	Peter Zijlstra, Rasmus Villemoes, Saeed Mahameed, Steven Rostedt,
	Tariq Toukan, Tariq Toukan, Tony Luck, Valentin Schneider,
	Vincent Guittot
  Cc: linux-crypto, netdev, linux-rdma

On Thu, Nov 10, 2022 at 08:00:26PM -0800, Yury Norov wrote:
> The function finds the Nth set CPU in a given cpumask, starting from a given
> node.
> 
> Leveraging the fact that each hop in sched_domains_numa_masks includes the
> same or greater number of CPUs than the previous one, we can use binary
> search on hops instead of a linear walk, which makes the overall complexity
> O(log n) in terms of the number of cpumask_weight() calls.
> 
> Signed-off-by: Yury Norov <yury.norov@gmail.com>
> ---
>  include/linux/topology.h |  8 ++++++++
>  kernel/sched/topology.c  | 42 ++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 50 insertions(+)
> 
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index 4564faafd0e1..63048ac3207c 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -245,5 +245,13 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
>  	return cpumask_of_node(cpu_to_node(cpu));
>  }
>  
> +#ifdef CONFIG_NUMA
> +int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node);
> +#else
> +int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)

Ah, this should be static of course.

> +{
> +	return cpumask_nth(cpu, cpus);
> +}
> +#endif	/* CONFIG_NUMA */
>  
>  #endif /* _LINUX_TOPOLOGY_H */
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 8739c2a5a54e..c8f56287de46 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2067,6 +2067,48 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
>  	return found;
>  }
>  
> +/*
> + * sched_numa_find_nth_cpu() - given the NUMA topology, find the Nth next cpu
> + *                             closest to @cpu from @cpumask.
> + * cpumask: cpumask to find a cpu from
> + * cpu: Nth cpu to find
> + *
> + * returns: cpu, or >= nr_cpu_ids when nothing found.
> + */
> +int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
> +{
> +	unsigned int first = 0, mid, last = sched_domains_numa_levels;
> +	struct cpumask ***masks;
> +	int w, ret = nr_cpu_ids;
> +
> +	rcu_read_lock();
> +	masks = rcu_dereference(sched_domains_numa_masks);
> +	if (!masks)
> +		goto out;
> +
> +	while (last >= first) {
> +		mid = (last + first) / 2;
> +
> +		if (cpumask_weight_and(cpus, masks[mid][node]) <= cpu) {
> +			first = mid + 1;
> +			continue;
> +		}
> +
> +		w = (mid == 0) ? 0 : cpumask_weight_and(cpus, masks[mid - 1][node]);
> +		if (w <= cpu)
> +			break;
> +
> +		last = mid - 1;
> +	}
> +
> +	ret = (mid == 0) ?
> +		cpumask_nth_and(cpu - w, cpus, masks[mid][node]) :
> +		cpumask_nth_and_andnot(cpu - w, cpus, masks[mid][node], masks[mid - 1][node]);
> +out:
> +	rcu_read_unlock();
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(sched_numa_find_nth_cpu);
>  #endif /* CONFIG_NUMA */
>  
>  static int __sdt_alloc(const struct cpumask *cpu_map)
> -- 
> 2.34.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/4] sched: add sched_numa_find_nth_cpu()
  2022-11-11  4:00 ` [PATCH 3/4] sched: add sched_numa_find_nth_cpu() Yury Norov
  2022-11-11  4:11   ` Yury Norov
@ 2022-11-11 11:42   ` Andy Shevchenko
  2022-11-11 17:07     ` Yury Norov
  1 sibling, 1 reply; 13+ messages in thread
From: Andy Shevchenko @ 2022-11-11 11:42 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, David S. Miller, Barry Song, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Gal Pressman,
	Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar, Jakub Kicinski,
	Jason Gunthorpe, Jesse Brandeburg, Jonathan Cameron, Juri Lelli,
	Leon Romanovsky, Mel Gorman, Peter Zijlstra, Rasmus Villemoes,
	Saeed Mahameed, Steven Rostedt, Tariq Toukan, Tariq Toukan,
	Tony Luck, Valentin Schneider, Vincent Guittot, linux-crypto,
	netdev, linux-rdma

On Thu, Nov 10, 2022 at 08:00:26PM -0800, Yury Norov wrote:
> The function finds the Nth set CPU in a given cpumask, starting from a given
> node.
> 
> Leveraging the fact that each hop in sched_domains_numa_masks includes the
> same or greater number of CPUs than the previous one, we can use binary
> search on hops instead of a linear walk, which makes the overall complexity
> O(log n) in terms of the number of cpumask_weight() calls.

...

> +int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
> +{
> +	unsigned int first = 0, mid, last = sched_domains_numa_levels;
> +	struct cpumask ***masks;

*** ?
Hmm... Do we really need such deep indirection?

> +	int w, ret = nr_cpu_ids;
> +
> +	rcu_read_lock();
> +	masks = rcu_dereference(sched_domains_numa_masks);
> +	if (!masks)
> +		goto out;
> +
> +	while (last >= first) {
> +		mid = (last + first) / 2;
> +
> +		if (cpumask_weight_and(cpus, masks[mid][node]) <= cpu) {
> +			first = mid + 1;
> +			continue;
> +		}
> +
> +		w = (mid == 0) ? 0 : cpumask_weight_and(cpus, masks[mid - 1][node]);

See below.

> +		if (w <= cpu)
> +			break;
> +
> +		last = mid - 1;
> +	}

We have lib/bsearch.h. I haven't really looked deeply into the above, but my
gut feeling is that it might be useful here. Can you check that?

> +	ret = (mid == 0) ?
> +		cpumask_nth_and(cpu - w, cpus, masks[mid][node]) :
> +		cpumask_nth_and_andnot(cpu - w, cpus, masks[mid][node], masks[mid - 1][node]);

You can also shorten this by inversing the conditional:

	ret = mid ? ...not 0... : ...for 0...;

> +out:

out_unlock: ?

> +	rcu_read_unlock();
> +	return ret;
> +}

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/4] cpumask: improve on cpumask_local_spread() locality
  2022-11-11  4:00 [PATCH 0/4] cpumask: improve on cpumask_local_spread() locality Yury Norov
                   ` (3 preceding siblings ...)
  2022-11-11  4:00 ` [PATCH 4/4] cpumask: improve on cpumask_local_spread() locality Yury Norov
@ 2022-11-11 16:25 ` Jakub Kicinski
       [not found]   ` <CAAH8bW9jG5US0Ymn1wax9tNK3MgZpcWfQsYgu-Km_E+WZw3yiA@mail.gmail.com>
  4 siblings, 1 reply; 13+ messages in thread
From: Jakub Kicinski @ 2022-11-11 16:25 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Gal Pressman, Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar,
	Jason Gunthorpe, Jesse Brandeburg, Jonathan Cameron, Juri Lelli,
	Leon Romanovsky, Mel Gorman, Peter Zijlstra, Rasmus Villemoes,
	Saeed Mahameed, Steven Rostedt, Tariq Toukan, Tariq Toukan,
	Tony Luck, Valentin Schneider, Vincent Guittot, linux-crypto,
	netdev, linux-rdma

On Thu, 10 Nov 2022 20:00:23 -0800 Yury Norov wrote:
> cpumask_local_spread() currently checks the local node for the presence of
> the i'th CPU, and if it finds nothing, falls back to a flat search among all
> non-local CPUs. We can do better by traversing CPUs in order of increasing
> NUMA hop distance.

Nice.

> This series is inspired by Valentin Schneider's "net/mlx5e: Improve remote
> NUMA preferences used for the IRQ affinity hints"
> 
> https://patchwork.kernel.org/project/netdevbpf/patch/20220728191203.4055-3-tariqt@nvidia.com/
> 
> According to Valentin's measurements, for mlx5e:
> 
> 	Bottleneck in RX side is released, reached linerate (~1.8x speedup).
> 	~30% less cpu util on TX.
> 
> This series makes cpumask_local_spread() traverse CPUs based on NUMA
> distance in the same way, and I expect a comparable improvement for its
> users, as in Valentin's case.
> 
> I tested it on my VM with the following NUMA configuration:

nit: the authorship is a bit more complicated, it'd be good to mention
Tariq. Both for the code and attribution of the testing / measurements.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/4] sched: add sched_numa_find_nth_cpu()
  2022-11-11 11:42   ` Andy Shevchenko
@ 2022-11-11 17:07     ` Yury Norov
  2022-11-11 18:14       ` Andy Shevchenko
  2022-11-12 18:14       ` Yury Norov
  0 siblings, 2 replies; 13+ messages in thread
From: Yury Norov @ 2022-11-11 17:07 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: linux-kernel, David S. Miller, Barry Song, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Gal Pressman,
	Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar, Jakub Kicinski,
	Jason Gunthorpe, Jesse Brandeburg, Jonathan Cameron, Juri Lelli,
	Leon Romanovsky, Mel Gorman, Peter Zijlstra, Rasmus Villemoes,
	Saeed Mahameed, Steven Rostedt, Tariq Toukan, Tariq Toukan,
	Tony Luck, Valentin Schneider, Vincent Guittot, linux-crypto,
	netdev, linux-rdma

On Fri, Nov 11, 2022 at 01:42:29PM +0200, Andy Shevchenko wrote:
> On Thu, Nov 10, 2022 at 08:00:26PM -0800, Yury Norov wrote:
> > The function finds the Nth set CPU in a given cpumask, starting from a given
> > node.
> > 
> > Leveraging the fact that each hop in sched_domains_numa_masks includes the
> > same or greater number of CPUs than the previous one, we can use binary
> > search on hops instead of a linear walk, which makes the overall complexity
> > O(log n) in terms of the number of cpumask_weight() calls.
> 
> ...
> 
> > +int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
> > +{
> > +	unsigned int first = 0, mid, last = sched_domains_numa_levels;
> > +	struct cpumask ***masks;
> 
> *** ?
> Hmm... Do we really need such deep indirection?

It's a 2D array of pointers, so - yes.
 
> > +	int w, ret = nr_cpu_ids;
> > +
> > +	rcu_read_lock();
> > +	masks = rcu_dereference(sched_domains_numa_masks);
> > +	if (!masks)
> > +		goto out;
> > +
> > +	while (last >= first) {
> > +		mid = (last + first) / 2;
> > +
> > +		if (cpumask_weight_and(cpus, masks[mid][node]) <= cpu) {
> > +			first = mid + 1;
> > +			continue;
> > +		}
> > +
> > +		w = (mid == 0) ? 0 : cpumask_weight_and(cpus, masks[mid - 1][node]);
> 
> See below.
> 
> > +		if (w <= cpu)
> > +			break;
> > +
> > +		last = mid - 1;
> > +	}
> 
> We have lib/bsearch.h. I haven't really looked deeply into the above, but my
> gut feeling is that it might be useful here. Can you check that?

Yes we do. I tried it, and it didn't work because the node arrays are
allocated dynamically, and the distance between different pairs of hops
for a given node is not constant, which is a requirement for
bsearch.

However, the distance between hop pointers in the 1st-level array should be
constant, so we can try feeding bsearch with that. I'll experiment with
bsearch some more.

> > +	ret = (mid == 0) ?
> > +		cpumask_nth_and(cpu - w, cpus, masks[mid][node]) :
> > +		cpumask_nth_and_andnot(cpu - w, cpus, masks[mid][node], masks[mid - 1][node]);
> 
> You can also shorten this by inversing the conditional:
> 
> 	ret = mid ? ...not 0... : ...for 0...;

Yep, why not.

> > +out:
> 
> out_unlock: ?

Do you think it's better?

> > +	rcu_read_unlock();
> > +	return ret;
> > +}
> 
> -- 
> With Best Regards,
> Andy Shevchenko
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/4] sched: add sched_numa_find_nth_cpu()
  2022-11-11 17:07     ` Yury Norov
@ 2022-11-11 18:14       ` Andy Shevchenko
  2022-11-12 18:14       ` Yury Norov
  1 sibling, 0 replies; 13+ messages in thread
From: Andy Shevchenko @ 2022-11-11 18:14 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, David S. Miller, Barry Song, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Gal Pressman,
	Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar, Jakub Kicinski,
	Jason Gunthorpe, Jesse Brandeburg, Jonathan Cameron, Juri Lelli,
	Leon Romanovsky, Mel Gorman, Peter Zijlstra, Rasmus Villemoes,
	Saeed Mahameed, Steven Rostedt, Tariq Toukan, Tariq Toukan,
	Tony Luck, Valentin Schneider, Vincent Guittot, linux-crypto,
	netdev, linux-rdma

On Fri, Nov 11, 2022 at 09:07:15AM -0800, Yury Norov wrote:
> On Fri, Nov 11, 2022 at 01:42:29PM +0200, Andy Shevchenko wrote:
> > On Thu, Nov 10, 2022 at 08:00:26PM -0800, Yury Norov wrote:

...

> > > +out:
> > 
> > out_unlock: ?
> 
> Do you think it's better?

Yes. It shows what will happen at goto.

So when one reads the "goto out;" it's something like "return ret;".
But "goto out_unlock;" immediately pictures "unlock; return ret;".

P.S. That's basically the way we name labels.

> > > +	rcu_read_unlock();
> > > +	return ret;

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/4] sched: add sched_numa_find_nth_cpu()
  2022-11-11 17:07     ` Yury Norov
  2022-11-11 18:14       ` Andy Shevchenko
@ 2022-11-12 18:14       ` Yury Norov
  1 sibling, 0 replies; 13+ messages in thread
From: Yury Norov @ 2022-11-12 18:14 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: linux-kernel, David S. Miller, Barry Song, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Gal Pressman,
	Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar, Jakub Kicinski,
	Jason Gunthorpe, Jesse Brandeburg, Jonathan Cameron, Juri Lelli,
	Leon Romanovsky, Mel Gorman, Peter Zijlstra, Rasmus Villemoes,
	Saeed Mahameed, Steven Rostedt, Tariq Toukan, Tariq Toukan,
	Tony Luck, Valentin Schneider, Vincent Guittot, linux-crypto,
	netdev, linux-rdma

On Fri, Nov 11, 2022 at 09:07:17AM -0800, Yury Norov wrote:
> On Fri, Nov 11, 2022 at 01:42:29PM +0200, Andy Shevchenko wrote:
> > On Thu, Nov 10, 2022 at 08:00:26PM -0800, Yury Norov wrote:
> > > +	int w, ret = nr_cpu_ids;
> > > +
> > > +	rcu_read_lock();
> > > +	masks = rcu_dereference(sched_domains_numa_masks);
> > > +	if (!masks)
> > > +		goto out;
> > > +
> > > +	while (last >= first) {
> > > +		mid = (last + first) / 2;
> > > +
> > > +		if (cpumask_weight_and(cpus, masks[mid][node]) <= cpu) {
> > > +			first = mid + 1;
> > > +			continue;
> > > +		}
> > > +
> > > +		w = (mid == 0) ? 0 : cpumask_weight_and(cpus, masks[mid - 1][node]);
> > 
> > See below.
> > 
> > > +		if (w <= cpu)
> > > +			break;
> > > +
> > > +		last = mid - 1;
> > > +	}
> > 
> > We have lib/bsearch.h. I haven't really looked deeply into the above, but my
> > gut feeling is that it might be useful here. Can you check that?
> 
> Yes we do. I tried it, and it didn't work because the node arrays are
> allocated dynamically, and the distance between different pairs of hops
> for a given node is not constant, which is a requirement for
> bsearch.
> 
> However, the distance between hop pointers in the 1st-level array should be
> constant, so we can try feeding bsearch with that. I'll experiment with
> bsearch some more.

OK, I tried bsearch on the array of hops, and it works. But it requires
adding some black pointer magic. I'll send v2 based on bsearch soon.
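
For the record, one way such a comparator could look (purely a userspace
sketch, not the actual v2; hop_cmp() peeks at the previous element through
pointer arithmetic, which is the "magic" in question, and the cumulative
weights are hypothetical values for node 0 of the example topology):

```c
#include <assert.h>
#include <stdlib.h>

/* Cumulative CPU counts per hop: 4, 6, 8 and 16 CPUs within hops 0..3 */
static const int hop_weight[] = { 4, 6, 8, 16 };

/*
 * Steer bsearch() to the hop containing the n'th CPU:
 * "match" when weight[hop - 1] <= n < weight[hop].
 */
static int hop_cmp(const void *key, const void *elem)
{
	const int *w = elem;
	int n = *(const int *)key;
	int prev = (w == hop_weight) ? 0 : w[-1];	/* peek at previous hop */

	if (n < prev)
		return -1;
	return n >= *w ? 1 : 0;
}

static int find_hop(int n)
{
	const int *w = bsearch(&n, hop_weight, 4, sizeof(*hop_weight), hop_cmp);

	return w ? (int)(w - hop_weight) : -1;
}
```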

Thanks,
Yury

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/4] cpumask: improve on cpumask_local_spread() locality
       [not found]   ` <CAAH8bW9jG5US0Ymn1wax9tNK3MgZpcWfQsYgu-Km_E+WZw3yiA@mail.gmail.com>
@ 2022-11-13  7:37     ` Tariq Toukan
  2022-11-13 12:29       ` Andy Shevchenko
  0 siblings, 1 reply; 13+ messages in thread
From: Tariq Toukan @ 2022-11-13  7:37 UTC (permalink / raw)
  To: Yury Norov, Jakub Kicinski
  Cc: Linux Kernel Mailing List, David S. Miller, Andy Shevchenko,
	Barry Song, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Gal Pressman, Greg Kroah-Hartman,
	Heiko Carstens, Ingo Molnar, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Mel Gorman,
	Peter Zijlstra, Rasmus Villemoes, Saeed Mahameed, Steven Rostedt,
	Tariq Toukan, Tony Luck, Valentin Schneider, Vincent Guittot,
	linux-crypto, Netdev, open list:HFI1 DRIVER



On 11/11/2022 6:47 PM, Yury Norov wrote:
> 
> 
> On Fri, Nov 11, 2022, 10:25 AM Jakub Kicinski <kuba@kernel.org> wrote:
> 
>     On Thu, 10 Nov 2022 20:00:23 -0800 Yury Norov wrote:
>      > cpumask_local_spread() currently checks local node for presence of
>      > i'th CPU, and then if it finds nothing makes a flat search among all
>      > non-local CPUs. We can do it better by checking CPUs per NUMA hops.
> 
>     Nice.
> 

Thanks for your series.
This improves them all, with no changes required to the network device 
drivers.

>      > This series is inspired by Valentin Schneider's "net/mlx5e: Improve
>      > remote NUMA preferences used for the IRQ affinity hints"
>      >
>      > https://patchwork.kernel.org/project/netdevbpf/patch/20220728191203.4055-3-tariqt@nvidia.com/


Find my very first version here, including the perf testing results:
https://patchwork.kernel.org/project/netdevbpf/list/?series=660413&state=*


>      >
>      > According to Valentin's measurements, for mlx5e:
>      >
>      >       Bottleneck in RX side is released, reached linerate (~1.8x speedup).
>      >       ~30% less cpu util on TX.
>      >
>      > This patch makes cpumask_local_spread() traverse CPUs based on NUMA
>      > distance in the same way, and I expect comparable improvement for
>      > its users, as in Valentin's case.
>      >

Right.

>      > I tested it on my VM with the following NUMA configuration:
> 
>     nit: the authorship is a bit more complicated, it'd be good to mention
>     Tariq. Both for the code and attribution of the testing / measurements.
> 
> 
> Sure. Tariq and Valentin, please send your tags as appropriate.
> 

I wonder what fits best here?

As the contribution is based upon previous work that I developed,
probably:
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>

Thanks,
Tariq

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/4] cpumask: improve on cpumask_local_spread() locality
  2022-11-13  7:37     ` Tariq Toukan
@ 2022-11-13 12:29       ` Andy Shevchenko
  0 siblings, 0 replies; 13+ messages in thread
From: Andy Shevchenko @ 2022-11-13 12:29 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Yury Norov, Jakub Kicinski, Linux Kernel Mailing List,
	David S. Miller, Barry Song, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Gal Pressman,
	Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar, Jason Gunthorpe,
	Jesse Brandeburg, Jonathan Cameron, Juri Lelli, Leon Romanovsky,
	Mel Gorman, Peter Zijlstra, Rasmus Villemoes, Saeed Mahameed,
	Steven Rostedt, Tariq Toukan, Tony Luck, Valentin Schneider,
	Vincent Guittot, linux-crypto, Netdev, open list:HFI1 DRIVER

On Sun, Nov 13, 2022 at 09:37:59AM +0200, Tariq Toukan wrote:
> On 11/11/2022 6:47 PM, Yury Norov wrote:
> > On Fri, Nov 11, 2022, 10:25 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >     On Thu, 10 Nov 2022 20:00:23 -0800 Yury Norov wrote:

..

> > Sure. Tariq and Valentine please send your tags as appropriate.
> 
> I wonder what fits best here?
> 
> As the contribution is based upon previous work that I developed, then
> probably:
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>

Then it probably means that one of you (you or Yury) should also have a
Co-developed-by.
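
For reference, the kernel process documentation has Co-developed-by
immediately followed by the co-author's Signed-off-by, with the patch
submitter's Signed-off-by last. A sketch of how the trailer block might
look here (the submitter's address is elided):

```
Co-developed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Yury Norov <...>
```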

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2022-11-13 12:29 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-11  4:00 [PATCH 0/4] cpumask: improve on cpumask_local_spread() locality Yury Norov
2022-11-11  4:00 ` [PATCH 1/4] lib/find: introduce find_nth_and_andnot_bit Yury Norov
2022-11-11  4:00 ` [PATCH 2/4] cpumask: introduce cpumask_nth_and_andnot() Yury Norov
2022-11-11  4:00 ` [PATCH 3/4] sched: add sched_numa_find_nth_cpu() Yury Norov
2022-11-11  4:11   ` Yury Norov
2022-11-11 11:42   ` Andy Shevchenko
2022-11-11 17:07     ` Yury Norov
2022-11-11 18:14       ` Andy Shevchenko
2022-11-12 18:14       ` Yury Norov
2022-11-11  4:00 ` [PATCH 4/4] cpumask: improve on cpumask_local_spread() locality Yury Norov
2022-11-11 16:25 ` [PATCH 0/4] " Jakub Kicinski
     [not found]   ` <CAAH8bW9jG5US0Ymn1wax9tNK3MgZpcWfQsYgu-Km_E+WZw3yiA@mail.gmail.com>
2022-11-13  7:37     ` Tariq Toukan
2022-11-13 12:29       ` Andy Shevchenko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).