linux-kernel.vger.kernel.org archive mirror
* [PATCH net-next V4 0/3] Introduce and use NUMA distance metrics
@ 2022-07-28 19:12 Tariq Toukan
  2022-07-28 19:12 ` [PATCH net-next V4 1/3] sched/topology: Add NUMA-based CPUs spread API Tariq Toukan
                   ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Tariq Toukan @ 2022-07-28 19:12 UTC
  To: David S. Miller, Saeed Mahameed, Jakub Kicinski, Ingo Molnar,
	Peter Zijlstra, Juri Lelli
  Cc: Eric Dumazet, Paolo Abeni, netdev, Gal Pressman, Vincent Guittot,
	linux-kernel, Tariq Toukan

Hi,

Implement and expose a CPU spread API based on the scheduler's
sched_numa_find_closest().  Use it in the mlx5 and enic device drivers.
This replaces the binary NUMA preference (local / remote) with an
improved one that takes the actual distances into account, so that
remote NUMA nodes at a short distance are preferred over farther ones.

This has significant performance implications when using NUMA-aware
memory allocations, improving both throughput and CPU utilization.
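
For illustration, here is a minimal driver-side sketch of the new API
('dev', 'nvecs' and 'irqs' are placeholder names; the API itself is
introduced in patch 1/3):

	u16 *cpus;
	int i;

	cpus = kcalloc(nvecs, sizeof(*cpus), GFP_KERNEL);
	if (!cpus)
		return -ENOMEM;

	/* Fill cpus[] nearest-first, starting from the device's NUMA node */
	sched_cpus_set_spread(dev_to_node(dev), cpus, nvecs);

	for (i = 0; i < nvecs; i++)
		irq_set_affinity_hint(irqs[i], cpumask_of(cpus[i]));

	kfree(cpus);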

Regards,
Tariq

v4:
- memset() the cpus array to zero in the !CONFIG_SMP case.

v3:
- Introduce the logic as a common API instead of keeping it mlx5-specific.
- Add an implementation to the enic device driver.
- Use the non-atomic __cpumask_clear_cpu().

v2:
- Replace EXPORT_SYMBOL with EXPORT_SYMBOL_GPL, per Peter's comment.
- Separate the set_cpu operation into two functions, per Saeed's suggestion.
- Add Saeed's Acked-by signature.


Tariq Toukan (3):
  sched/topology: Add NUMA-based CPUs spread API
  net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity
    hints
  enic: Use NUMA distances logic when setting affinity hints

 drivers/net/ethernet/cisco/enic/enic_main.c  | 10 +++-
 drivers/net/ethernet/mellanox/mlx5/core/eq.c |  5 +-
 include/linux/sched/topology.h               |  5 ++
 kernel/sched/topology.c                      | 49 ++++++++++++++++++++
 4 files changed, 65 insertions(+), 4 deletions(-)

-- 
2.21.0



* [PATCH net-next V4 1/3] sched/topology: Add NUMA-based CPUs spread API
  2022-07-28 19:12 [PATCH net-next V4 0/3] Introduce and use NUMA distance metrics Tariq Toukan
@ 2022-07-28 19:12 ` Tariq Toukan
  2022-07-30 17:29   ` Tariq Toukan
  2022-08-04 17:28   ` Valentin Schneider
  2022-07-28 19:12 ` [PATCH net-next V4 2/3] net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints Tariq Toukan
  2022-07-28 19:12 ` [PATCH net-next V4 3/3] enic: Use NUMA distances logic when setting " Tariq Toukan
  2 siblings, 2 replies; 27+ messages in thread
From: Tariq Toukan @ 2022-07-28 19:12 UTC
  To: David S. Miller, Saeed Mahameed, Jakub Kicinski, Ingo Molnar,
	Peter Zijlstra, Juri Lelli
  Cc: Eric Dumazet, Paolo Abeni, netdev, Gal Pressman, Vincent Guittot,
	linux-kernel, Tariq Toukan

Implement and expose an API that sets the spread of CPUs based on distance,
given a NUMA node.  Fall back to the legacy logic that uses
cpumask_local_spread.

This logic can be used by device drivers to prefer some remote CPUs over
others.
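
As an illustration only (not part of the patch): on the 16-node test
machine described in patch 2/3, node 14 is at distance 10 from itself,
11 from nodes 12/13/15, 12 from nodes 8-11, and 32 from nodes 0-7.
There, a call such as:

	u16 cpus[32];

	sched_cpus_set_spread(14, cpus, 32);

is expected to fill cpus[] with node 14's CPUs first, then the CPUs of
the distance-11 nodes, before any distance-12 or distance-32 node is
considered.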

Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 include/linux/sched/topology.h |  5 ++++
 kernel/sched/topology.c        | 49 ++++++++++++++++++++++++++++++++++
 2 files changed, 54 insertions(+)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 56cffe42abbc..a49167c2a0e5 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -210,6 +210,7 @@ extern void set_sched_topology(struct sched_domain_topology_level *tl);
 # define SD_INIT_NAME(type)
 #endif
 
+void sched_cpus_set_spread(int node, u16 *cpus, int ncpus);
 #else /* CONFIG_SMP */
 
 struct sched_domain_attr;
@@ -231,6 +232,10 @@ static inline bool cpus_share_cache(int this_cpu, int that_cpu)
 	return true;
 }
 
+static inline void sched_cpus_set_spread(int node, u16 *cpus, int ncpus)
+{
+	memset(cpus, 0, ncpus * sizeof(*cpus));
+}
 #endif	/* !CONFIG_SMP */
 
 #if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 05b6c2ad90b9..157aef862c04 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2067,8 +2067,57 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
 	return found;
 }
 
+static bool sched_cpus_spread_by_distance(int node, u16 *cpus, int ncpus)
+{
+	cpumask_var_t cpumask;
+	int first, i;
+
+	if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))
+		return false;
+
+	cpumask_copy(cpumask, cpu_online_mask);
+
+	first = cpumask_first(cpumask_of_node(node));
+
+	for (i = 0; i < ncpus; i++) {
+		int cpu;
+
+		cpu = sched_numa_find_closest(cpumask, first);
+		if (cpu >= nr_cpu_ids) {
+			free_cpumask_var(cpumask);
+			return false;
+		}
+		cpus[i] = cpu;
+		__cpumask_clear_cpu(cpu, cpumask);
+	}
+
+	free_cpumask_var(cpumask);
+	return true;
+}
+#else
+static bool sched_cpus_spread_by_distance(int node, u16 *cpus, int ncpus)
+{
+	return false;
+}
 #endif /* CONFIG_NUMA */
 
+static void sched_cpus_by_local_spread(int node, u16 *cpus, int ncpus)
+{
+	int i;
+
+	for (i = 0; i < ncpus; i++)
+		cpus[i] = cpumask_local_spread(i, node);
+}
+
+void sched_cpus_set_spread(int node, u16 *cpus, int ncpus)
+{
+	bool success = sched_cpus_spread_by_distance(node, cpus, ncpus);
+
+	if (!success)
+		sched_cpus_by_local_spread(node, cpus, ncpus);
+}
+EXPORT_SYMBOL_GPL(sched_cpus_set_spread);
+
 static int __sdt_alloc(const struct cpumask *cpu_map)
 {
 	struct sched_domain_topology_level *tl;
-- 
2.21.0



* [PATCH net-next V4 2/3] net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints
  2022-07-28 19:12 [PATCH net-next V4 0/3] Introduce and use NUMA distance metrics Tariq Toukan
  2022-07-28 19:12 ` [PATCH net-next V4 1/3] sched/topology: Add NUMA-based CPUs spread API Tariq Toukan
@ 2022-07-28 19:12 ` Tariq Toukan
  2022-07-28 19:12 ` [PATCH net-next V4 3/3] enic: Use NUMA distances logic when setting " Tariq Toukan
  2 siblings, 0 replies; 27+ messages in thread
From: Tariq Toukan @ 2022-07-28 19:12 UTC
  To: David S. Miller, Saeed Mahameed, Jakub Kicinski, Ingo Molnar,
	Peter Zijlstra, Juri Lelli
  Cc: Eric Dumazet, Paolo Abeni, netdev, Gal Pressman, Vincent Guittot,
	linux-kernel, Tariq Toukan

In the IRQ affinity hints, replace the binary NUMA preference (local /
remote) with an improved API that takes the actual distances into
account, so that remote NUMA nodes at a short distance are preferred
over farther ones.

This has significant performance implications when using NUMA-aware
memory allocations (see [1] and its derivatives, for example).

[1]
drivers/net/ethernet/mellanox/mlx5/core/en_main.c :: mlx5e_open_channel()
   int cpu = cpumask_first(mlx5_comp_irq_get_affinity_mask(priv->mdev, ix));
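
For context, a simplified sketch of the pattern in [1]: the first CPU of
the IRQ affinity mask selects the node for NUMA-aware allocations, so a
closer hinted CPU translates into closer memory:

	int cpu = cpumask_first(mlx5_comp_irq_get_affinity_mask(priv->mdev, ix));
	struct mlx5e_channel *c;

	/* Allocate the channel object on the node of the hinted CPU */
	c = kvzalloc_node(sizeof(*c), GFP_KERNEL, cpu_to_node(cpu));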

Performance tests:

TCP multi-stream, using 16 iperf3 instances pinned to 16 cores (with aRFS on).
Active cores: 64,65,72,73,80,81,88,89,96,97,104,105,112,113,120,121

+-------------------------+-----------+------------------+------------------+
|                         | BW (Gbps) | TX side CPU util | RX side CPU util |
+-------------------------+-----------+------------------+------------------+
| Baseline                | 52.3      | 6.4 %            | 17.9 %           |
+-------------------------+-----------+------------------+------------------+
| Applied on TX side only | 52.6      | 5.2 %            | 18.5 %           |
+-------------------------+-----------+------------------+------------------+
| Applied on RX side only | 94.9      | 11.9 %           | 27.2 %           |
+-------------------------+-----------+------------------+------------------+
| Applied on both sides   | 95.1      | 8.4 %            | 27.3 %           |
+-------------------------+-----------+------------------+------------------+

The bottleneck on the RX side is released, reaching line rate (~1.8x
speedup), with ~30% less CPU utilization on the TX side.

* CPU utilization is measured on active cores only.

Setup details (similar for both sides):

NIC: ConnectX-6 Dx dual port, 100 Gbps each.
A single port was used in the tests.

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              256
On-line CPU(s) list: 0-255
Thread(s) per core:  2
Core(s) per socket:  64
Socket(s):           2
NUMA node(s):        16
Vendor ID:           AuthenticAMD
CPU family:          25
Model:               1
Model name:          AMD EPYC 7763 64-Core Processor
Stepping:            1
CPU MHz:             2594.804
BogoMIPS:            4890.73
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            32768K
NUMA node0 CPU(s):   0-7,128-135
NUMA node1 CPU(s):   8-15,136-143
NUMA node2 CPU(s):   16-23,144-151
NUMA node3 CPU(s):   24-31,152-159
NUMA node4 CPU(s):   32-39,160-167
NUMA node5 CPU(s):   40-47,168-175
NUMA node6 CPU(s):   48-55,176-183
NUMA node7 CPU(s):   56-63,184-191
NUMA node8 CPU(s):   64-71,192-199
NUMA node9 CPU(s):   72-79,200-207
NUMA node10 CPU(s):  80-87,208-215
NUMA node11 CPU(s):  88-95,216-223
NUMA node12 CPU(s):  96-103,224-231
NUMA node13 CPU(s):  104-111,232-239
NUMA node14 CPU(s):  112-119,240-247
NUMA node15 CPU(s):  120-127,248-255
..

$ numactl -H
..
node distances:
node   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
  0:  10  11  11  11  12  12  12  12  32  32  32  32  32  32  32  32
  1:  11  10  11  11  12  12  12  12  32  32  32  32  32  32  32  32
  2:  11  11  10  11  12  12  12  12  32  32  32  32  32  32  32  32
  3:  11  11  11  10  12  12  12  12  32  32  32  32  32  32  32  32
  4:  12  12  12  12  10  11  11  11  32  32  32  32  32  32  32  32
  5:  12  12  12  12  11  10  11  11  32  32  32  32  32  32  32  32
  6:  12  12  12  12  11  11  10  11  32  32  32  32  32  32  32  32
  7:  12  12  12  12  11  11  11  10  32  32  32  32  32  32  32  32
  8:  32  32  32  32  32  32  32  32  10  11  11  11  12  12  12  12
  9:  32  32  32  32  32  32  32  32  11  10  11  11  12  12  12  12
 10:  32  32  32  32  32  32  32  32  11  11  10  11  12  12  12  12
 11:  32  32  32  32  32  32  32  32  11  11  11  10  12  12  12  12
 12:  32  32  32  32  32  32  32  32  12  12  12  12  10  11  11  11
 13:  32  32  32  32  32  32  32  32  12  12  12  12  11  10  11  11
 14:  32  32  32  32  32  32  32  32  12  12  12  12  11  11  10  11
 15:  32  32  32  32  32  32  32  32  12  12  12  12  11  11  11  10

$ cat /sys/class/net/ens5f0/device/numa_node
14

Affinity hints (127 IRQs; each mask is printed as comma-separated 32-bit
hex words, most-significant CPUs first):
Before:
331: 00000000,00000000,00000000,00000000,00010000,00000000,00000000,00000000
332: 00000000,00000000,00000000,00000000,00020000,00000000,00000000,00000000
333: 00000000,00000000,00000000,00000000,00040000,00000000,00000000,00000000
334: 00000000,00000000,00000000,00000000,00080000,00000000,00000000,00000000
335: 00000000,00000000,00000000,00000000,00100000,00000000,00000000,00000000
336: 00000000,00000000,00000000,00000000,00200000,00000000,00000000,00000000
337: 00000000,00000000,00000000,00000000,00400000,00000000,00000000,00000000
338: 00000000,00000000,00000000,00000000,00800000,00000000,00000000,00000000
339: 00010000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
340: 00020000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
341: 00040000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
342: 00080000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
343: 00100000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
344: 00200000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
345: 00400000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
346: 00800000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
347: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
348: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000002
349: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000004
350: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000008
351: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000010
352: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000020
353: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000040
354: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000080
355: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000100
356: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000200
357: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000400
358: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000800
359: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00001000
360: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00002000
361: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00004000
362: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00008000
363: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00010000
364: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00020000
365: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00040000
366: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00080000
367: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00100000
368: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00200000
369: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00400000
370: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00800000
371: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,01000000
372: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,02000000
373: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,04000000
374: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,08000000
375: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,10000000
376: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,20000000
377: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,40000000
378: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,80000000
379: 00000000,00000000,00000000,00000000,00000000,00000000,00000001,00000000
380: 00000000,00000000,00000000,00000000,00000000,00000000,00000002,00000000
381: 00000000,00000000,00000000,00000000,00000000,00000000,00000004,00000000
382: 00000000,00000000,00000000,00000000,00000000,00000000,00000008,00000000
383: 00000000,00000000,00000000,00000000,00000000,00000000,00000010,00000000
384: 00000000,00000000,00000000,00000000,00000000,00000000,00000020,00000000
385: 00000000,00000000,00000000,00000000,00000000,00000000,00000040,00000000
386: 00000000,00000000,00000000,00000000,00000000,00000000,00000080,00000000
387: 00000000,00000000,00000000,00000000,00000000,00000000,00000100,00000000
388: 00000000,00000000,00000000,00000000,00000000,00000000,00000200,00000000
389: 00000000,00000000,00000000,00000000,00000000,00000000,00000400,00000000
390: 00000000,00000000,00000000,00000000,00000000,00000000,00000800,00000000
391: 00000000,00000000,00000000,00000000,00000000,00000000,00001000,00000000
392: 00000000,00000000,00000000,00000000,00000000,00000000,00002000,00000000
393: 00000000,00000000,00000000,00000000,00000000,00000000,00004000,00000000
394: 00000000,00000000,00000000,00000000,00000000,00000000,00008000,00000000
395: 00000000,00000000,00000000,00000000,00000000,00000000,00010000,00000000
396: 00000000,00000000,00000000,00000000,00000000,00000000,00020000,00000000
397: 00000000,00000000,00000000,00000000,00000000,00000000,00040000,00000000
398: 00000000,00000000,00000000,00000000,00000000,00000000,00080000,00000000
399: 00000000,00000000,00000000,00000000,00000000,00000000,00100000,00000000
400: 00000000,00000000,00000000,00000000,00000000,00000000,00200000,00000000
401: 00000000,00000000,00000000,00000000,00000000,00000000,00400000,00000000
402: 00000000,00000000,00000000,00000000,00000000,00000000,00800000,00000000
403: 00000000,00000000,00000000,00000000,00000000,00000000,01000000,00000000
404: 00000000,00000000,00000000,00000000,00000000,00000000,02000000,00000000
405: 00000000,00000000,00000000,00000000,00000000,00000000,04000000,00000000
406: 00000000,00000000,00000000,00000000,00000000,00000000,08000000,00000000
407: 00000000,00000000,00000000,00000000,00000000,00000000,10000000,00000000
408: 00000000,00000000,00000000,00000000,00000000,00000000,20000000,00000000
409: 00000000,00000000,00000000,00000000,00000000,00000000,40000000,00000000
410: 00000000,00000000,00000000,00000000,00000000,00000000,80000000,00000000
411: 00000000,00000000,00000000,00000000,00000000,00000001,00000000,00000000
412: 00000000,00000000,00000000,00000000,00000000,00000002,00000000,00000000
413: 00000000,00000000,00000000,00000000,00000000,00000004,00000000,00000000
414: 00000000,00000000,00000000,00000000,00000000,00000008,00000000,00000000
415: 00000000,00000000,00000000,00000000,00000000,00000010,00000000,00000000
416: 00000000,00000000,00000000,00000000,00000000,00000020,00000000,00000000
417: 00000000,00000000,00000000,00000000,00000000,00000040,00000000,00000000
418: 00000000,00000000,00000000,00000000,00000000,00000080,00000000,00000000
419: 00000000,00000000,00000000,00000000,00000000,00000100,00000000,00000000
420: 00000000,00000000,00000000,00000000,00000000,00000200,00000000,00000000
421: 00000000,00000000,00000000,00000000,00000000,00000400,00000000,00000000
422: 00000000,00000000,00000000,00000000,00000000,00000800,00000000,00000000
423: 00000000,00000000,00000000,00000000,00000000,00001000,00000000,00000000
424: 00000000,00000000,00000000,00000000,00000000,00002000,00000000,00000000
425: 00000000,00000000,00000000,00000000,00000000,00004000,00000000,00000000
426: 00000000,00000000,00000000,00000000,00000000,00008000,00000000,00000000
427: 00000000,00000000,00000000,00000000,00000000,00010000,00000000,00000000
428: 00000000,00000000,00000000,00000000,00000000,00020000,00000000,00000000
429: 00000000,00000000,00000000,00000000,00000000,00040000,00000000,00000000
430: 00000000,00000000,00000000,00000000,00000000,00080000,00000000,00000000
431: 00000000,00000000,00000000,00000000,00000000,00100000,00000000,00000000
432: 00000000,00000000,00000000,00000000,00000000,00200000,00000000,00000000
433: 00000000,00000000,00000000,00000000,00000000,00400000,00000000,00000000
434: 00000000,00000000,00000000,00000000,00000000,00800000,00000000,00000000
435: 00000000,00000000,00000000,00000000,00000000,01000000,00000000,00000000
436: 00000000,00000000,00000000,00000000,00000000,02000000,00000000,00000000
437: 00000000,00000000,00000000,00000000,00000000,04000000,00000000,00000000
438: 00000000,00000000,00000000,00000000,00000000,08000000,00000000,00000000
439: 00000000,00000000,00000000,00000000,00000000,10000000,00000000,00000000
440: 00000000,00000000,00000000,00000000,00000000,20000000,00000000,00000000
441: 00000000,00000000,00000000,00000000,00000000,40000000,00000000,00000000
442: 00000000,00000000,00000000,00000000,00000000,80000000,00000000,00000000
443: 00000000,00000000,00000000,00000000,00000001,00000000,00000000,00000000
444: 00000000,00000000,00000000,00000000,00000002,00000000,00000000,00000000
445: 00000000,00000000,00000000,00000000,00000004,00000000,00000000,00000000
446: 00000000,00000000,00000000,00000000,00000008,00000000,00000000,00000000
447: 00000000,00000000,00000000,00000000,00000010,00000000,00000000,00000000
448: 00000000,00000000,00000000,00000000,00000020,00000000,00000000,00000000
449: 00000000,00000000,00000000,00000000,00000040,00000000,00000000,00000000
450: 00000000,00000000,00000000,00000000,00000080,00000000,00000000,00000000
451: 00000000,00000000,00000000,00000000,00000100,00000000,00000000,00000000
452: 00000000,00000000,00000000,00000000,00000200,00000000,00000000,00000000
453: 00000000,00000000,00000000,00000000,00000400,00000000,00000000,00000000
454: 00000000,00000000,00000000,00000000,00000800,00000000,00000000,00000000
455: 00000000,00000000,00000000,00000000,00001000,00000000,00000000,00000000
456: 00000000,00000000,00000000,00000000,00002000,00000000,00000000,00000000
457: 00000000,00000000,00000000,00000000,00004000,00000000,00000000,00000000

After:
331: 00000000,00000000,00000000,00000000,00010000,00000000,00000000,00000000
332: 00000000,00000000,00000000,00000000,00020000,00000000,00000000,00000000
333: 00000000,00000000,00000000,00000000,00040000,00000000,00000000,00000000
334: 00000000,00000000,00000000,00000000,00080000,00000000,00000000,00000000
335: 00000000,00000000,00000000,00000000,00100000,00000000,00000000,00000000
336: 00000000,00000000,00000000,00000000,00200000,00000000,00000000,00000000
337: 00000000,00000000,00000000,00000000,00400000,00000000,00000000,00000000
338: 00000000,00000000,00000000,00000000,00800000,00000000,00000000,00000000
339: 00010000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
340: 00020000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
341: 00040000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
342: 00080000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
343: 00100000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
344: 00200000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
345: 00400000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
346: 00800000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
347: 00000000,00000000,00000000,00000000,00000001,00000000,00000000,00000000
348: 00000000,00000000,00000000,00000000,00000002,00000000,00000000,00000000
349: 00000000,00000000,00000000,00000000,00000004,00000000,00000000,00000000
350: 00000000,00000000,00000000,00000000,00000008,00000000,00000000,00000000
351: 00000000,00000000,00000000,00000000,00000010,00000000,00000000,00000000
352: 00000000,00000000,00000000,00000000,00000020,00000000,00000000,00000000
353: 00000000,00000000,00000000,00000000,00000040,00000000,00000000,00000000
354: 00000000,00000000,00000000,00000000,00000080,00000000,00000000,00000000
355: 00000000,00000000,00000000,00000000,00000100,00000000,00000000,00000000
356: 00000000,00000000,00000000,00000000,00000200,00000000,00000000,00000000
357: 00000000,00000000,00000000,00000000,00000400,00000000,00000000,00000000
358: 00000000,00000000,00000000,00000000,00000800,00000000,00000000,00000000
359: 00000000,00000000,00000000,00000000,00001000,00000000,00000000,00000000
360: 00000000,00000000,00000000,00000000,00002000,00000000,00000000,00000000
361: 00000000,00000000,00000000,00000000,00004000,00000000,00000000,00000000
362: 00000000,00000000,00000000,00000000,00008000,00000000,00000000,00000000
363: 00000000,00000000,00000000,00000000,01000000,00000000,00000000,00000000
364: 00000000,00000000,00000000,00000000,02000000,00000000,00000000,00000000
365: 00000000,00000000,00000000,00000000,04000000,00000000,00000000,00000000
366: 00000000,00000000,00000000,00000000,08000000,00000000,00000000,00000000
367: 00000000,00000000,00000000,00000000,10000000,00000000,00000000,00000000
368: 00000000,00000000,00000000,00000000,20000000,00000000,00000000,00000000
369: 00000000,00000000,00000000,00000000,40000000,00000000,00000000,00000000
370: 00000000,00000000,00000000,00000000,80000000,00000000,00000000,00000000
371: 00000001,00000000,00000000,00000000,00000000,00000000,00000000,00000000
372: 00000002,00000000,00000000,00000000,00000000,00000000,00000000,00000000
373: 00000004,00000000,00000000,00000000,00000000,00000000,00000000,00000000
374: 00000008,00000000,00000000,00000000,00000000,00000000,00000000,00000000
375: 00000010,00000000,00000000,00000000,00000000,00000000,00000000,00000000
376: 00000020,00000000,00000000,00000000,00000000,00000000,00000000,00000000
377: 00000040,00000000,00000000,00000000,00000000,00000000,00000000,00000000
378: 00000080,00000000,00000000,00000000,00000000,00000000,00000000,00000000
379: 00000100,00000000,00000000,00000000,00000000,00000000,00000000,00000000
380: 00000200,00000000,00000000,00000000,00000000,00000000,00000000,00000000
381: 00000400,00000000,00000000,00000000,00000000,00000000,00000000,00000000
382: 00000800,00000000,00000000,00000000,00000000,00000000,00000000,00000000
383: 00001000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
384: 00002000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
385: 00004000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
386: 00008000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
387: 01000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
388: 02000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
389: 04000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
390: 08000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
391: 10000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
392: 20000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
393: 40000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
394: 80000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
395: 00000000,00000000,00000000,00000000,00000000,00000001,00000000,00000000
396: 00000000,00000000,00000000,00000000,00000000,00000002,00000000,00000000
397: 00000000,00000000,00000000,00000000,00000000,00000004,00000000,00000000
398: 00000000,00000000,00000000,00000000,00000000,00000008,00000000,00000000
399: 00000000,00000000,00000000,00000000,00000000,00000010,00000000,00000000
400: 00000000,00000000,00000000,00000000,00000000,00000020,00000000,00000000
401: 00000000,00000000,00000000,00000000,00000000,00000040,00000000,00000000
402: 00000000,00000000,00000000,00000000,00000000,00000080,00000000,00000000
403: 00000000,00000000,00000000,00000000,00000000,00000100,00000000,00000000
404: 00000000,00000000,00000000,00000000,00000000,00000200,00000000,00000000
405: 00000000,00000000,00000000,00000000,00000000,00000400,00000000,00000000
406: 00000000,00000000,00000000,00000000,00000000,00000800,00000000,00000000
407: 00000000,00000000,00000000,00000000,00000000,00001000,00000000,00000000
408: 00000000,00000000,00000000,00000000,00000000,00002000,00000000,00000000
409: 00000000,00000000,00000000,00000000,00000000,00004000,00000000,00000000
410: 00000000,00000000,00000000,00000000,00000000,00008000,00000000,00000000
411: 00000000,00000000,00000000,00000000,00000000,00010000,00000000,00000000
412: 00000000,00000000,00000000,00000000,00000000,00020000,00000000,00000000
413: 00000000,00000000,00000000,00000000,00000000,00040000,00000000,00000000
414: 00000000,00000000,00000000,00000000,00000000,00080000,00000000,00000000
415: 00000000,00000000,00000000,00000000,00000000,00100000,00000000,00000000
416: 00000000,00000000,00000000,00000000,00000000,00200000,00000000,00000000
417: 00000000,00000000,00000000,00000000,00000000,00400000,00000000,00000000
418: 00000000,00000000,00000000,00000000,00000000,00800000,00000000,00000000
419: 00000000,00000000,00000000,00000000,00000000,01000000,00000000,00000000
420: 00000000,00000000,00000000,00000000,00000000,02000000,00000000,00000000
421: 00000000,00000000,00000000,00000000,00000000,04000000,00000000,00000000
422: 00000000,00000000,00000000,00000000,00000000,08000000,00000000,00000000
423: 00000000,00000000,00000000,00000000,00000000,10000000,00000000,00000000
424: 00000000,00000000,00000000,00000000,00000000,20000000,00000000,00000000
425: 00000000,00000000,00000000,00000000,00000000,40000000,00000000,00000000
426: 00000000,00000000,00000000,00000000,00000000,80000000,00000000,00000000
427: 00000000,00000001,00000000,00000000,00000000,00000000,00000000,00000000
428: 00000000,00000002,00000000,00000000,00000000,00000000,00000000,00000000
429: 00000000,00000004,00000000,00000000,00000000,00000000,00000000,00000000
430: 00000000,00000008,00000000,00000000,00000000,00000000,00000000,00000000
431: 00000000,00000010,00000000,00000000,00000000,00000000,00000000,00000000
432: 00000000,00000020,00000000,00000000,00000000,00000000,00000000,00000000
433: 00000000,00000040,00000000,00000000,00000000,00000000,00000000,00000000
434: 00000000,00000080,00000000,00000000,00000000,00000000,00000000,00000000
435: 00000000,00000100,00000000,00000000,00000000,00000000,00000000,00000000
436: 00000000,00000200,00000000,00000000,00000000,00000000,00000000,00000000
437: 00000000,00000400,00000000,00000000,00000000,00000000,00000000,00000000
438: 00000000,00000800,00000000,00000000,00000000,00000000,00000000,00000000
439: 00000000,00001000,00000000,00000000,00000000,00000000,00000000,00000000
440: 00000000,00002000,00000000,00000000,00000000,00000000,00000000,00000000
441: 00000000,00004000,00000000,00000000,00000000,00000000,00000000,00000000
442: 00000000,00008000,00000000,00000000,00000000,00000000,00000000,00000000
443: 00000000,00010000,00000000,00000000,00000000,00000000,00000000,00000000
444: 00000000,00020000,00000000,00000000,00000000,00000000,00000000,00000000
445: 00000000,00040000,00000000,00000000,00000000,00000000,00000000,00000000
446: 00000000,00080000,00000000,00000000,00000000,00000000,00000000,00000000
447: 00000000,00100000,00000000,00000000,00000000,00000000,00000000,00000000
448: 00000000,00200000,00000000,00000000,00000000,00000000,00000000,00000000
449: 00000000,00400000,00000000,00000000,00000000,00000000,00000000,00000000
450: 00000000,00800000,00000000,00000000,00000000,00000000,00000000,00000000
451: 00000000,01000000,00000000,00000000,00000000,00000000,00000000,00000000
452: 00000000,02000000,00000000,00000000,00000000,00000000,00000000,00000000
453: 00000000,04000000,00000000,00000000,00000000,00000000,00000000,00000000
454: 00000000,08000000,00000000,00000000,00000000,00000000,00000000,00000000
455: 00000000,10000000,00000000,00000000,00000000,00000000,00000000,00000000
456: 00000000,20000000,00000000,00000000,00000000,00000000,00000000,00000000
457: 00000000,40000000,00000000,00000000,00000000,00000000,00000000,00000000

Reviewed-by: Gal Pressman <gal@nvidia.com>
Acked-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/eq.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 229728c80233..e78fb82d5be8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -11,6 +11,7 @@
 #ifdef CONFIG_RFS_ACCEL
 #include <linux/cpu_rmap.h>
 #endif
+#include <linux/sched/topology.h>
 #include "mlx5_core.h"
 #include "lib/eq.h"
 #include "fpga/core.h"
@@ -812,7 +813,6 @@ static int comp_irqs_request(struct mlx5_core_dev *dev)
 	int ncomp_eqs = table->num_comp_eqs;
 	u16 *cpus;
 	int ret;
-	int i;
 
 	ncomp_eqs = table->num_comp_eqs;
 	table->comp_irqs = kcalloc(ncomp_eqs, sizeof(*table->comp_irqs), GFP_KERNEL);
@@ -830,8 +830,7 @@ static int comp_irqs_request(struct mlx5_core_dev *dev)
 		ret = -ENOMEM;
 		goto free_irqs;
 	}
-	for (i = 0; i < ncomp_eqs; i++)
-		cpus[i] = cpumask_local_spread(i, dev->priv.numa_node);
+	sched_cpus_set_spread(dev->priv.numa_node, cpus, ncomp_eqs);
 	ret = mlx5_irqs_request_vectors(dev, cpus, ncomp_eqs, table->comp_irqs);
 	kfree(cpus);
 	if (ret < 0)
-- 
2.21.0



* [PATCH net-next V4 3/3] enic: Use NUMA distances logic when setting affinity hints
  2022-07-28 19:12 [PATCH net-next V4 0/3] Introduce and use NUMA distance metrics Tariq Toukan
  2022-07-28 19:12 ` [PATCH net-next V4 1/3] sched/topology: Add NUMA-based CPUs spread API Tariq Toukan
  2022-07-28 19:12 ` [PATCH net-next V4 2/3] net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints Tariq Toukan
@ 2022-07-28 19:12 ` Tariq Toukan
  2 siblings, 0 replies; 27+ messages in thread
From: Tariq Toukan @ 2022-07-28 19:12 UTC
  To: David S. Miller, Saeed Mahameed, Jakub Kicinski, Ingo Molnar,
	Peter Zijlstra, Juri Lelli
  Cc: Eric Dumazet, Paolo Abeni, netdev, Gal Pressman, Vincent Guittot,
	linux-kernel, Tariq Toukan, Christian Benvenuti,
	Govindarajulu Varadarajan

Use the new CPU spread API so that the CPUs of remote NUMA nodes are
preferred in order of their distance.

Cc: Christian Benvenuti <benve@cisco.com>
Cc: Govindarajulu Varadarajan <_govind@gmx.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 drivers/net/ethernet/cisco/enic/enic_main.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/cisco/enic/enic_main.c b/drivers/net/ethernet/cisco/enic/enic_main.c
index 372fb7b3a282..9de3c3ffa1e3 100644
--- a/drivers/net/ethernet/cisco/enic/enic_main.c
+++ b/drivers/net/ethernet/cisco/enic/enic_main.c
@@ -44,6 +44,7 @@
 #include <linux/cpu_rmap.h>
 #endif
 #include <linux/crash_dump.h>
+#include <linux/sched/topology.h>
 #include <net/busy_poll.h>
 #include <net/vxlan.h>
 
@@ -114,8 +115,14 @@ static struct enic_intr_mod_range mod_range[ENIC_MAX_LINK_SPEEDS] = {
 static void enic_init_affinity_hint(struct enic *enic)
 {
 	int numa_node = dev_to_node(&enic->pdev->dev);
+	u16 *cpus;
 	int i;
 
+	cpus = kcalloc(enic->intr_count, sizeof(*cpus), GFP_KERNEL);
+	if (!cpus)
+		return;
+
+	sched_cpus_set_spread(numa_node, cpus, enic->intr_count);
 	for (i = 0; i < enic->intr_count; i++) {
 		if (enic_is_err_intr(enic, i) || enic_is_notify_intr(enic, i) ||
 		    (cpumask_available(enic->msix[i].affinity_mask) &&
@@ -123,9 +130,10 @@ static void enic_init_affinity_hint(struct enic *enic)
 			continue;
 		if (zalloc_cpumask_var(&enic->msix[i].affinity_mask,
 				       GFP_KERNEL))
-			cpumask_set_cpu(cpumask_local_spread(i, numa_node),
+			cpumask_set_cpu(cpus[i],
 					enic->msix[i].affinity_mask);
 	}
+	kfree(cpus);
 }
 
 static void enic_free_affinity_hint(struct enic *enic)
-- 
2.21.0



* Re: [PATCH net-next V4 1/3] sched/topology: Add NUMA-based CPUs spread API
  2022-07-28 19:12 ` [PATCH net-next V4 1/3] sched/topology: Add NUMA-based CPUs spread API Tariq Toukan
@ 2022-07-30 17:29   ` Tariq Toukan
  2022-08-02  6:40     ` Tariq Toukan
  2022-08-04 17:28   ` Valentin Schneider
  1 sibling, 1 reply; 27+ messages in thread
From: Tariq Toukan @ 2022-07-30 17:29 UTC
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Valentin Schneider,
	Daniel Bristot de Oliveira, Mel Gorman, Dietmar Eggemann,
	Vincent Guittot, Ben Segall, Steven Rostedt, Juri Lelli
  Cc: Eric Dumazet, Paolo Abeni, netdev, Gal Pressman, linux-kernel,
	Tariq Toukan, David S. Miller, Saeed Mahameed, Jakub Kicinski



On 7/28/2022 10:12 PM, Tariq Toukan wrote:
> Implement and expose an API that sets the spread of CPUs based on distance,
> given a NUMA node.  Fall back to the legacy logic that uses
> cpumask_local_spread.
> 
> This logic can be used by device drivers to prefer some remote CPUs over
> others.
> 
> Reviewed-by: Gal Pressman <gal@nvidia.com>
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> ---
>   include/linux/sched/topology.h |  5 ++++
>   kernel/sched/topology.c        | 49 ++++++++++++++++++++++++++++++++++
>   2 files changed, 54 insertions(+)
> 

++

Dear SCHEDULER maintainers,

V3 of my series was submitted ~12 days ago and contained significant changes.
I'd appreciate your review of this patch, so we can make it into the
upcoming kernel.

Regards,
Tariq


* Re: [PATCH net-next V4 1/3] sched/topology: Add NUMA-based CPUs spread API
  2022-07-30 17:29   ` Tariq Toukan
@ 2022-08-02  6:40     ` Tariq Toukan
  2022-08-02  9:38       ` Valentin Schneider
  0 siblings, 1 reply; 27+ messages in thread
From: Tariq Toukan @ 2022-08-02  6:40 UTC
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Valentin Schneider,
	Daniel Bristot de Oliveira, Mel Gorman, Dietmar Eggemann,
	Vincent Guittot, Ben Segall, Steven Rostedt
  Cc: Eric Dumazet, Paolo Abeni, netdev, Gal Pressman, linux-kernel,
	Tariq Toukan, David S. Miller, Saeed Mahameed, Jakub Kicinski



On 7/30/2022 8:29 PM, Tariq Toukan wrote:
> 
> 
> On 7/28/2022 10:12 PM, Tariq Toukan wrote:
>> Implement and expose an API that sets the spread of CPUs based on distance,
>> given a NUMA node.  Fall back to the legacy logic that uses
>> cpumask_local_spread.
>>
>> This logic can be used by device drivers to prefer some remote CPUs over
>> others.
>>
>> Reviewed-by: Gal Pressman <gal@nvidia.com>
>> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
>> ---
>>   include/linux/sched/topology.h |  5 ++++
>>   kernel/sched/topology.c        | 49 ++++++++++++++++++++++++++++++++++
>>   2 files changed, 54 insertions(+)
>>
> 
> ++
> 
> Dear SCHEDULER maintainers,
> 
> V3 of my series was submitted ~12 days ago and contained significant changes.
> I'd appreciate your review of this patch, so we can make it into the
> upcoming kernel.
> 
> Regards,
> Tariq

Hi,
Another reminder.
Do you have any comments on this patch?
If not, please provide your Ack.


* Re: [PATCH net-next V4 1/3] sched/topology: Add NUMA-based CPUs spread API
  2022-08-02  6:40     ` Tariq Toukan
@ 2022-08-02  9:38       ` Valentin Schneider
  2022-08-02 16:05         ` Jakub Kicinski
  0 siblings, 1 reply; 27+ messages in thread
From: Valentin Schneider @ 2022-08-02  9:38 UTC
  To: Tariq Toukan, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Daniel Bristot de Oliveira, Mel Gorman, Dietmar Eggemann,
	Vincent Guittot, Ben Segall, Steven Rostedt
  Cc: Eric Dumazet, Paolo Abeni, netdev, Gal Pressman, linux-kernel,
	Tariq Toukan, David S. Miller, Saeed Mahameed, Jakub Kicinski

On 02/08/22 09:40, Tariq Toukan wrote:
> On 7/30/2022 8:29 PM, Tariq Toukan wrote:
>>
>>
>> On 7/28/2022 10:12 PM, Tariq Toukan wrote:
>>> Implement and expose an API that sets the spread of CPUs based on distance,
>>> given a NUMA node.  Fall back to the legacy logic that uses
>>> cpumask_local_spread.
>>>
>>> This logic can be used by device drivers to prefer some remote CPUs over
>>> others.
>>>
>>> Reviewed-by: Gal Pressman <gal@nvidia.com>
>>> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
>>> ---
>>>   include/linux/sched/topology.h |  5 ++++
>>>   kernel/sched/topology.c        | 49 ++++++++++++++++++++++++++++++++++
>>>   2 files changed, 54 insertions(+)
>>>
>>
>> ++
>>
>> Dear SCHEDULER maintainers,
>>
>> V3 of my series was submitted ~12 days ago and contained significant changes.
>> I'd appreciate your review of this patch, so we can make it into the
>> upcoming kernel.
>>
>> Regards,
>> Tariq
>
> Hi,
> Another reminder.
> Do you have any comments on this patch?
> If not, please provide your Ack.

It's not even been a week since you submitted v4 (and ~3 days since you
last pinged this thread), and not all of us are limitless reviewing
machines :-)

This is already in my todo-list, but isn't the topmost item yet.



* Re: [PATCH net-next V4 1/3] sched/topology: Add NUMA-based CPUs spread API
  2022-08-02  9:38       ` Valentin Schneider
@ 2022-08-02 16:05         ` Jakub Kicinski
  0 siblings, 0 replies; 27+ messages in thread
From: Jakub Kicinski @ 2022-08-02 16:05 UTC
  To: Valentin Schneider
  Cc: Tariq Toukan, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Daniel Bristot de Oliveira, Mel Gorman, Dietmar Eggemann,
	Vincent Guittot, Ben Segall, Steven Rostedt, Eric Dumazet,
	Paolo Abeni, netdev, Gal Pressman, linux-kernel, Tariq Toukan,
	David S. Miller, Saeed Mahameed

On Tue, 02 Aug 2022 10:38:08 +0100 Valentin Schneider wrote:
> It's not even been a week since you submitted v4 (and ~3 days since you
> last pinged this thread), and not all of us are limitless reviewing
> machines :-)
> 
> This is already in my todo-list, but isn't the topmost item yet.

I'd appreciate a review on this one soonish (as a favor; don't read this
as a passive-aggressive reprimand).

Tariq rather promptly got a review on a trivial export patch which put
all the logic in the driver instead. I asked them to go the extra mile
and move the code to the core so other drivers can benefit.

If this doesn't get into 6.0 it'll be a clear sign for driver
maintainers that building shared infrastructure is arduous and should
be avoided at all costs :(


* Re: [PATCH net-next V4 1/3] sched/topology: Add NUMA-based CPUs spread API
  2022-07-28 19:12 ` [PATCH net-next V4 1/3] sched/topology: Add NUMA-based CPUs spread API Tariq Toukan
  2022-07-30 17:29   ` Tariq Toukan
@ 2022-08-04 17:28   ` Valentin Schneider
  2022-08-08 14:39     ` Tariq Toukan
  1 sibling, 1 reply; 27+ messages in thread
From: Valentin Schneider @ 2022-08-04 17:28 UTC
  To: Tariq Toukan, David S. Miller, Saeed Mahameed, Jakub Kicinski,
	Ingo Molnar, Peter Zijlstra, Juri Lelli
  Cc: Eric Dumazet, Paolo Abeni, netdev, Gal Pressman, Vincent Guittot,
	linux-kernel, Tariq Toukan

On 28/07/22 22:12, Tariq Toukan wrote:
> Implement and expose an API that sets the spread of CPUs based on distance,
> given a NUMA node.  Fall back to the legacy logic that uses
> cpumask_local_spread.
>
> This logic can be used by device drivers to prefer some remote CPUs over
> others.
>
> Reviewed-by: Gal Pressman <gal@nvidia.com>
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>

IIUC you want a multi-CPU version of sched_numa_find_closest(). I'm OK with
the need (and you have the numbers to back it up), but I have some qualms
with the implementation, see more below.

> ---
>  include/linux/sched/topology.h |  5 ++++
>  kernel/sched/topology.c        | 49 ++++++++++++++++++++++++++++++++++
>  2 files changed, 54 insertions(+)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 56cffe42abbc..a49167c2a0e5 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -210,6 +210,7 @@ extern void set_sched_topology(struct sched_domain_topology_level *tl);
>  # define SD_INIT_NAME(type)
>  #endif
>
> +void sched_cpus_set_spread(int node, u16 *cpus, int ncpus);
>  #else /* CONFIG_SMP */
>
>  struct sched_domain_attr;
> @@ -231,6 +232,10 @@ static inline bool cpus_share_cache(int this_cpu, int that_cpu)
>       return true;
>  }
>
> +static inline void sched_cpus_set_spread(int node, u16 *cpus, int ncpus)
> +{
> +	memset(cpus, 0, ncpus * sizeof(*cpus));
> +}
>  #endif	/* !CONFIG_SMP */
>
>  #if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 05b6c2ad90b9..157aef862c04 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2067,8 +2067,57 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
>       return found;
>  }
>
> +static bool sched_cpus_spread_by_distance(int node, u16 *cpus, int ncpus)
                                                       ^^^^^^^^^
That should be a struct cpumask *.

> +{
> +	cpumask_var_t cpumask;
> +	int first, i;
> +
> +	if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))
> +		return false;
> +
> +	cpumask_copy(cpumask, cpu_online_mask);
> +
> +	first = cpumask_first(cpumask_of_node(node));
> +
> +	for (i = 0; i < ncpus; i++) {
> +		int cpu;
> +
> +		cpu = sched_numa_find_closest(cpumask, first);
> +		if (cpu >= nr_cpu_ids) {
> +			free_cpumask_var(cpumask);
> +			return false;
> +		}
> +		cpus[i] = cpu;
> +		__cpumask_clear_cpu(cpu, cpumask);
> +	}
> +
> +	free_cpumask_var(cpumask);
> +	return true;
> +}

This will fail if ncpus > nr_cpu_ids, which shouldn't be a problem. It
would make more sense to set *up to* ncpus; the calling code can then
decide if getting fewer than requested is OK or not.

I also don't get the fallback to cpumask_local_spread(); is that for when
the NUMA topology hasn't been initialized yet? It feels like most users of
this would invoke it late enough (i.e. anything after early initcalls) to
have the backing data available.

Finally, I think iterating only once per NUMA level would make more sense.

I've scribbled something together from those thoughts, see below. This has
just the mlx5 bits touched to show what I mean, but that's just compile
tested.
---
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 229728c80233..2d010d8d670c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -810,7 +810,7 @@ static int comp_irqs_request(struct mlx5_core_dev *dev)
 {
 	struct mlx5_eq_table *table = dev->priv.eq_table;
 	int ncomp_eqs = table->num_comp_eqs;
-	u16 *cpus;
+	cpumask_var_t cpus;
 	int ret;
 	int i;
 
@@ -825,15 +825,14 @@ static int comp_irqs_request(struct mlx5_core_dev *dev)
 		return ret;
 	}
 
-	cpus = kcalloc(ncomp_eqs, sizeof(*cpus), GFP_KERNEL);
-	if (!cpus) {
+	if (!zalloc_cpumask_var(&cpus, GFP_KERNEL)) {
 		ret = -ENOMEM;
 		goto free_irqs;
 	}
-	for (i = 0; i < ncomp_eqs; i++)
-		cpus[i] = cpumask_local_spread(i, dev->priv.numa_node);
+
+	sched_numa_find_n_closest(cpus, dev->priv.numa_node, ncomp_eqs);
 	ret = mlx5_irqs_request_vectors(dev, cpus, ncomp_eqs, table->comp_irqs);
-	kfree(cpus);
+	free_cpumask_var(cpus);
 	if (ret < 0)
 		goto free_irqs;
 	return ret;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
index 662f1d55e30e..2330f81aeede 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
@@ -448,7 +448,7 @@ void mlx5_irqs_release_vectors(struct mlx5_irq **irqs, int nirqs)
 /**
  * mlx5_irqs_request_vectors - request one or more IRQs for mlx5 device.
  * @dev: mlx5 device that is requesting the IRQs.
- * @cpus: CPUs array for binding the IRQs
+ * @cpus: cpumask for binding the IRQs
  * @nirqs: number of IRQs to request.
  * @irqs: an output array of IRQs pointers.
  *
@@ -458,25 +458,22 @@ void mlx5_irqs_release_vectors(struct mlx5_irq **irqs, int nirqs)
  * This function returns the number of IRQs requested, (which might be smaller than
  * @nirqs), if successful, or a negative error code in case of an error.
  */
-int mlx5_irqs_request_vectors(struct mlx5_core_dev *dev, u16 *cpus, int nirqs,
+int mlx5_irqs_request_vectors(struct mlx5_core_dev *dev,
+			      const struct cpumask *cpus,
+			      int nirqs,
 			      struct mlx5_irq **irqs)
 {
-	cpumask_var_t req_mask;
+	int cpu = cpumask_first(cpus);
 	struct mlx5_irq *irq;
 	int i;
 
-	if (!zalloc_cpumask_var(&req_mask, GFP_KERNEL))
-		return -ENOMEM;
-	for (i = 0; i < nirqs; i++) {
-		cpumask_set_cpu(cpus[i], req_mask);
-		irq = mlx5_irq_request(dev, i, req_mask);
+	for (i = 0; i < nirqs && cpu < nr_cpu_ids; i++) {
+		irq = mlx5_irq_request(dev, i, cpumask_of(cpu));
 		if (IS_ERR(irq))
 			break;
-		cpumask_clear(req_mask);
 		irqs[i] = irq;
+		cpu = cpumask_next(cpu, cpus);
 	}
 
-	free_cpumask_var(req_mask);
 	return i ? i : PTR_ERR(irq);
 }
 
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 4564faafd0e1..bdc9c5df84cd 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -245,5 +245,13 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
 	return cpumask_of_node(cpu_to_node(cpu));
 }
 
+#ifdef CONFIG_NUMA
+extern int sched_numa_find_n_closest(struct cpumask *cpus, int node, int ncpus);
+#else
+static inline int sched_numa_find_n_closest(struct cpumask *cpus, int node, int ncpus)
+{
+	return -ENOTSUPP;
+}
+#endif
 
 #endif /* _LINUX_TOPOLOGY_H */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8739c2a5a54e..499f6ef611fa 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2067,6 +2067,56 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
 	return found;
 }
 
+/**
+ * sched_numa_find_n_closest - Find the 'n' closest cpus to a given node
+ * @cpus: The cpumask to fill in with CPUs
+ * @ncpus: How many CPUs to look for
+ * @node: The node to start the search from
+ *
+ * This will fill *up to* @ncpus in @cpus, using the closest (in NUMA distance)
+ * first and expanding outside the node if more CPUs are required.
+ *
+ * Return: Number of found CPUs, negative value on error.
+ */
+int sched_numa_find_n_closest(struct cpumask *cpus, int node, int ncpus)
+{
+	struct cpumask ***masks;
+	int cpu, lvl, ntofind = ncpus;
+
+	if (node >= nr_node_ids)
+		return -EINVAL;
+
+	rcu_read_lock();
+
+	masks = rcu_dereference(sched_domains_numa_masks);
+	if (!masks)
+		goto unlock;
+
+	/*
+	 * Walk up the level masks; the first mask should be CPUs LOCAL_DISTANCE
+	 * away (aka the local node), and we incrementally grow the search
+	 * beyond that.
+	 */
+	for (lvl = 0; lvl < sched_domains_numa_levels; lvl++) {
+		if (!masks[lvl][node])
+			goto unlock;
+
+		/* XXX: could be neater with for_each_cpu_andnot() */
+		for_each_cpu(cpu, masks[lvl][node]) {
+			if (cpumask_test_cpu(cpu, cpus))
+				continue;
+
+			__cpumask_set_cpu(cpu, cpus);
+			if (--ntofind == 0)
+				goto unlock;
+		}
+	}
+unlock:
+	rcu_read_unlock();
+	return ncpus - ntofind;
+}
+EXPORT_SYMBOL_GPL(sched_numa_find_n_closest);
+
 #endif /* CONFIG_NUMA */
 
 static int __sdt_alloc(const struct cpumask *cpu_map)



* Re: [PATCH net-next V4 1/3] sched/topology: Add NUMA-based CPUs spread API
  2022-08-04 17:28   ` Valentin Schneider
@ 2022-08-08 14:39     ` Tariq Toukan
  2022-08-09 10:02       ` Valentin Schneider
  0 siblings, 1 reply; 27+ messages in thread
From: Tariq Toukan @ 2022-08-08 14:39 UTC
  To: Valentin Schneider, Tariq Toukan, David S. Miller,
	Saeed Mahameed, Jakub Kicinski, Ingo Molnar, Peter Zijlstra,
	Juri Lelli
  Cc: Eric Dumazet, Paolo Abeni, netdev, Gal Pressman, Vincent Guittot,
	linux-kernel



On 8/4/2022 8:28 PM, Valentin Schneider wrote:
> On 28/07/22 22:12, Tariq Toukan wrote:
>> Implement and expose an API that sets the spread of CPUs based on distance,
>> given a NUMA node.  Fall back to the legacy logic that uses
>> cpumask_local_spread.
>>
>> This logic can be used by device drivers to prefer some remote CPUs over
>> others.
>>
>> Reviewed-by: Gal Pressman <gal@nvidia.com>
>> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> 
> IIUC you want a multi-CPU version of sched_numa_find_closest(). I'm OK with
> the need (and you have the numbers to back it up), but I have some qualms
> with the implementation, see more below.
> 

I want a sorted multi-CPU version.

>> ---
>>   include/linux/sched/topology.h |  5 ++++
>>   kernel/sched/topology.c        | 49 ++++++++++++++++++++++++++++++++++
>>   2 files changed, 54 insertions(+)
>>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index 56cffe42abbc..a49167c2a0e5 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -210,6 +210,7 @@ extern void set_sched_topology(struct sched_domain_topology_level *tl);
>>   # define SD_INIT_NAME(type)
>>   #endif
>>
>> +void sched_cpus_set_spread(int node, u16 *cpus, int ncpus);
>>   #else /* CONFIG_SMP */
>>
>>   struct sched_domain_attr;
>> @@ -231,6 +232,10 @@ static inline bool cpus_share_cache(int this_cpu, int that_cpu)
>>        return true;
>>   }
>>
>> +static inline void sched_cpus_set_spread(int node, u16 *cpus, int ncpus)
>> +{
>> +	memset(cpus, 0, ncpus * sizeof(*cpus));
>> +}
>>   #endif	/* !CONFIG_SMP */
>>
>>   #if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index 05b6c2ad90b9..157aef862c04 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -2067,8 +2067,57 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
>>        return found;
>>   }
>>
>> +static bool sched_cpus_spread_by_distance(int node, u16 *cpus, int ncpus)
>                                                         ^^^^^^^^^
> That should be a struct cpumask *.

With cpumask, we'll lose the order: iterating a cpumask visits CPUs in
ascending id order, not in distance order.

> 
>> +{
>> +	cpumask_var_t cpumask;
>> +	int first, i;
>> +
>> +	if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))
>> +		return false;
>> +
>> +	cpumask_copy(cpumask, cpu_online_mask);
>> +
>> +	first = cpumask_first(cpumask_of_node(node));
>> +
>> +	for (i = 0; i < ncpus; i++) {
>> +		int cpu;
>> +
>> +		cpu = sched_numa_find_closest(cpumask, first);
>> +		if (cpu >= nr_cpu_ids) {
>> +			free_cpumask_var(cpumask);
>> +			return false;
>> +		}
>> +		cpus[i] = cpu;
>> +		__cpumask_clear_cpu(cpu, cpumask);
>> +	}
>> +
>> +	free_cpumask_var(cpumask);
>> +	return true;
>> +}
> 
> This will fail if ncpus > nr_cpu_ids, which shouldn't be a problem. It
> would make more sense to set *up to* ncpus; the calling code can then
> decide if getting fewer than requested is OK or not.
> 
> I also don't get the fallback to cpumask_local_spread(); is that for when
> the NUMA topology hasn't been initialized yet? It feels like most users of
> this would invoke it late enough (i.e. anything after early initcalls) to
> have the backing data available.

I don't expect this to fail, as we invoke it late enough. The fallback is
there just in case, to preserve the old behavior instead of breaking
completely.

> 
> Finally, I think iterating only once per NUMA level would make more sense.

Agreed, although it's just a setup stage.
I'll check whether it can work for me, based on the reference code below.

> 
> I've scribbled something together from those thoughts, see below. This has
> just the mlx5 bits touched to show what I mean, but that's just compile
> tested.

My function returns a *sorted* list of the N closest CPUs.
That is important. In many cases, drivers do not need all N IRQs, but
only a portion of them, so they want to use the closest subset of CPUs.

IIUC, the code below relaxes this and returns the set of N closest CPUs,
unsorted.
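
For example (an illustrative sketch; 'k' and use_cpu() are hypothetical):

	int i;

	/* cpus[] is sorted nearest-first, so for any k <= n the first k
	 * entries are exactly the k closest CPUs.
	 */
	for (i = 0; i < k; i++)
		use_cpu(cpus[i]);

With an unsorted cpumask, iteration visits CPUs in ascending id order,
so the first k CPUs visited are not necessarily the closest ones.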


> ---
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> index 229728c80233..2d010d8d670c 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> @@ -810,7 +810,7 @@ static int comp_irqs_request(struct mlx5_core_dev *dev)
>   {
>   	struct mlx5_eq_table *table = dev->priv.eq_table;
>   	int ncomp_eqs = table->num_comp_eqs;
> -	u16 *cpus;
> +	cpumask_var_t cpus;
>   	int ret;
>   	int i;
>   
> @@ -825,15 +825,14 @@ static int comp_irqs_request(struct mlx5_core_dev *dev)
>   		return ret;
>   	}
>   
> -	cpus = kcalloc(ncomp_eqs, sizeof(*cpus), GFP_KERNEL);
> -	if (!cpus) {
> +	if (!zalloc_cpumask_var(&cpus, GFP_KERNEL)) {
>   		ret = -ENOMEM;
>   		goto free_irqs;
>   	}
> -	for (i = 0; i < ncomp_eqs; i++)
> -		cpus[i] = cpumask_local_spread(i, dev->priv.numa_node);
> +
> +	sched_numa_find_n_closest(cpus, dev->priv.numa_node, ncomp_eqs);
>   	ret = mlx5_irqs_request_vectors(dev, cpus, ncomp_eqs, table->comp_irqs);
> -	kfree(cpus);
> +	free_cpumask_var(cpus);
>   	if (ret < 0)
>   		goto free_irqs;
>   	return ret;
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
> index 662f1d55e30e..2330f81aeede 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
> @@ -448,7 +448,7 @@ void mlx5_irqs_release_vectors(struct mlx5_irq **irqs, int nirqs)
>   /**
>    * mlx5_irqs_request_vectors - request one or more IRQs for mlx5 device.
>    * @dev: mlx5 device that is requesting the IRQs.
> - * @cpus: CPUs array for binding the IRQs
> + * @cpus: cpumask for binding the IRQs
>    * @nirqs: number of IRQs to request.
>    * @irqs: an output array of IRQs pointers.
>    *
> @@ -458,25 +458,22 @@ void mlx5_irqs_release_vectors(struct mlx5_irq **irqs, int nirqs)
>    * This function returns the number of IRQs requested, (which might be smaller than
>    * @nirqs), if successful, or a negative error code in case of an error.
>    */
> -int mlx5_irqs_request_vectors(struct mlx5_core_dev *dev, u16 *cpus, int nirqs,
> +int mlx5_irqs_request_vectors(struct mlx5_core_dev *dev,
> +			      const struct cpumask *cpus,
> +			      int nirqs,
>   			      struct mlx5_irq **irqs)
>   {
> -	cpumask_var_t req_mask;
> +	int cpu = cpumask_first(cpus);
>   	struct mlx5_irq *irq;
>   	int i;
>   
> -	if (!zalloc_cpumask_var(&req_mask, GFP_KERNEL))
> -		return -ENOMEM;
> -	for (i = 0; i < nirqs; i++) {
> -		cpumask_set_cpu(cpus[i], req_mask);
> -		irq = mlx5_irq_request(dev, i, req_mask);
> +	for (i = 0; i < nirqs && cpu < nr_cpu_ids; i++) {
> +		irq = mlx5_irq_request(dev, i, cpumask_of(cpu));
>   		if (IS_ERR(irq))
>   			break;
> -		cpumask_clear(req_mask);
>   		irqs[i] = irq;
> +		cpu = cpumask_next(cpu, cpus);
>   	}
>   
> -	free_cpumask_var(req_mask);
>   	return i ? i : PTR_ERR(irq);
>   }
>   
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index 4564faafd0e1..bdc9c5df84cd 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -245,5 +245,13 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
>   	return cpumask_of_node(cpu_to_node(cpu));
>   }
>   
> +#ifdef CONFIG_NUMA
> +extern int sched_numa_find_n_closest(struct cpumask *cpus, int node, int ncpus);
> +#else
> +static inline int sched_numa_find_n_closest(struct cpumask *cpus, int node, int ncpus)
> +{
> +	return -ENOTSUPP;
> +}
> +#endif
>   
>   #endif /* _LINUX_TOPOLOGY_H */
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 8739c2a5a54e..499f6ef611fa 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2067,6 +2067,56 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
>   	return found;
>   }
>   
> +/**
> + * sched_numa_find_n_closest - Find the 'n' closest cpus to a given node
> + * @cpus: The cpumask to fill in with CPUs
> + * @ncpus: How many CPUs to look for
> + * @node: The node to start the search from
> + *
> + * This will fill *up to* @ncpus in @cpus, using the closest (in NUMA distance)
> + * first and expanding outside the node if more CPUs are required.
> + *
> + * Return: Number of found CPUs, negative value on error.
> + */
> +int sched_numa_find_n_closest(struct cpumask *cpus, int node, int ncpus)
> +{
> +	struct cpumask ***masks;
> +	int cpu, lvl, ntofind = ncpus;
> +
> +	if (node >= nr_node_ids)
> +		return -EINVAL;
> +
> +	rcu_read_lock();
> +
> +	masks = rcu_dereference(sched_domains_numa_masks);
> +	if (!masks)
> +		goto unlock;
> +
> +	/*
> +	 * Walk up the level masks; the first mask should be CPUs LOCAL_DISTANCE
> +	 * away (aka the local node), and we incrementally grow the search
> +	 * beyond that.
> +	 */
> +	for (lvl = 0; lvl < sched_domains_numa_levels; lvl++) {
> +		if (!masks[lvl][node])
> +			goto unlock;
> +
> +		/* XXX: could be neater with for_each_cpu_andnot() */
> +		for_each_cpu(cpu, masks[lvl][node]) {
> +			if (cpumask_test_cpu(cpu, cpus))
> +				continue;
> +
> +			__cpumask_set_cpu(cpu, cpus);
> +			if (--ntofind == 0)
> +				goto unlock;
> +		}
> +	}
> +unlock:
> +	rcu_read_unlock();
> +	return ncpus - ntofind;
> +}
> +EXPORT_SYMBOL_GPL(sched_numa_find_n_closest);
> +
>   #endif /* CONFIG_NUMA */
>   
>   static int __sdt_alloc(const struct cpumask *cpu_map)
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next V4 1/3] sched/topology: Add NUMA-based CPUs spread API
  2022-08-08 14:39     ` Tariq Toukan
@ 2022-08-09 10:02       ` Valentin Schneider
  2022-08-09 10:18         ` Tariq Toukan
  0 siblings, 1 reply; 27+ messages in thread
From: Valentin Schneider @ 2022-08-09 10:02 UTC (permalink / raw)
  To: Tariq Toukan, Tariq Toukan, David S. Miller, Saeed Mahameed,
	Jakub Kicinski, Ingo Molnar, Peter Zijlstra, Juri Lelli
  Cc: Eric Dumazet, Paolo Abeni, netdev, Gal Pressman, Vincent Guittot,
	linux-kernel

On 08/08/22 17:39, Tariq Toukan wrote:
> On 8/4/2022 8:28 PM, Valentin Schneider wrote:
>> On 28/07/22 22:12, Tariq Toukan wrote:
>>> +static bool sched_cpus_spread_by_distance(int node, u16 *cpus, int ncpus)
>>                                                         ^^^^^^^^^
>> That should be a struct cpumask *.
>
> With cpumask, we'll lose the order.
>

Right, I didn't get that from the changelog.

>>
>>> +{
>>> +	cpumask_var_t cpumask;
>>> +	int first, i;
>>> +
>>> +	if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))
>>> +		return false;
>>> +
>>> +	cpumask_copy(cpumask, cpu_online_mask);
>>> +
>>> +	first = cpumask_first(cpumask_of_node(node));
>>> +
>>> +	for (i = 0; i < ncpus; i++) {
>>> +		int cpu;
>>> +
>>> +		cpu = sched_numa_find_closest(cpumask, first);
>>> +		if (cpu >= nr_cpu_ids) {
>>> +			free_cpumask_var(cpumask);
>>> +			return false;
>>> +		}
>>> +		cpus[i] = cpu;
>>> +		__cpumask_clear_cpu(cpu, cpumask);
>>> +	}
>>> +
>>> +	free_cpumask_var(cpumask);
>>> +	return true;
>>> +}
>>
>> This will fail if ncpus > nr_cpu_ids, which shouldn't be a problem. It
>> would make more sense to set *up to* ncpus, the calling code can then
>> decide if getting fewer than requested is OK or not.
>>
>> I also don't get the fallback to cpumask_local_spread(), is that if the
>> NUMA topology hasn't been initialized yet? It feels like most users of this
>> would invoke it late enough (i.e. anything after early initcalls) to have
>> the backing data available.
>
> I don't expect this to fail, as we invoke it late enough. Fallback is
> there just in case, to preserve the old behavior instead of getting
> totally broken.
>

Then there shouldn't be a fallback method - the main method is expected to
work.

>>
>> Finally, I think iterating only once per NUMA level would make more sense.
>
> Agree, although it's just a setup stage.
> I'll check if it can work for me, based on the reference code below.
>
>>
>> I've scribbled something together from those thoughts, see below. This has
>> just the mlx5 bits touched to show what I mean, but that's just compile
>> tested.
>
> My function returns a *sorted* list of the N closest cpus.
> That is important. In many cases, drivers do not need all N irqs, but
> only a portion of them, so they want to use the closest subset of cpus.
>

Are there cases where we can't figure this out in advance? From what I grok
out of the two callsites you patched, all vectors will be used unless some
error happens, so compressing the CPUs in a single cpumask seemed
sufficient.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next V4 1/3] sched/topology: Add NUMA-based CPUs spread API
  2022-08-09 10:02       ` Valentin Schneider
@ 2022-08-09 10:18         ` Tariq Toukan
  2022-08-09 12:52           ` Valentin Schneider
  0 siblings, 1 reply; 27+ messages in thread
From: Tariq Toukan @ 2022-08-09 10:18 UTC (permalink / raw)
  To: Valentin Schneider, Tariq Toukan, David S. Miller,
	Saeed Mahameed, Jakub Kicinski, Ingo Molnar, Peter Zijlstra,
	Juri Lelli
  Cc: Eric Dumazet, Paolo Abeni, netdev, Gal Pressman, Vincent Guittot,
	linux-kernel



On 8/9/2022 1:02 PM, Valentin Schneider wrote:
> On 08/08/22 17:39, Tariq Toukan wrote:
>> On 8/4/2022 8:28 PM, Valentin Schneider wrote:
>>> On 28/07/22 22:12, Tariq Toukan wrote:
>>>> +static bool sched_cpus_spread_by_distance(int node, u16 *cpus, int ncpus)
>>>                                                          ^^^^^^^^^
>>> That should be a struct cpumask *.
>>
>> With cpumask, we'll lose the order.
>>
> 
> Right, I didn't get that from the changelog.

I'll make that clear in the respin.

> 
>>>
>>>> +{
>>>> +	cpumask_var_t cpumask;
>>>> +	int first, i;
>>>> +
>>>> +	if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))
>>>> +		return false;
>>>> +
>>>> +	cpumask_copy(cpumask, cpu_online_mask);
>>>> +
>>>> +	first = cpumask_first(cpumask_of_node(node));
>>>> +
>>>> +	for (i = 0; i < ncpus; i++) {
>>>> +		int cpu;
>>>> +
>>>> +		cpu = sched_numa_find_closest(cpumask, first);
>>>> +		if (cpu >= nr_cpu_ids) {
>>>> +			free_cpumask_var(cpumask);
>>>> +			return false;
>>>> +		}
>>>> +		cpus[i] = cpu;
>>>> +		__cpumask_clear_cpu(cpu, cpumask);
>>>> +	}
>>>> +
>>>> +	free_cpumask_var(cpumask);
>>>> +	return true;
>>>> +}
>>>
>>> This will fail if ncpus > nr_cpu_ids, which shouldn't be a problem. It
>>> would make more sense to set *up to* ncpus, the calling code can then
>>> decide if getting fewer than requested is OK or not.
>>>
>>> I also don't get the fallback to cpumask_local_spread(), is that if the
>>> NUMA topology hasn't been initialized yet? It feels like most users of this
>>> would invoke it late enough (i.e. anything after early initcalls) to have
>>> the backing data available.
>>
>> I don't expect this to fail, as we invoke it late enough. Fallback is
>> there just in case, to preserve the old behavior instead of getting
>> totally broken.
>>
> 
> Then there shouldn't be a fallback method - the main method is expected to
> work.
> 

I'll drop the fallback then.

>>>
>>> Finally, I think iterating only once per NUMA level would make more sense.
>>
>> Agree, although it's just a setup stage.
>> I'll check if it can work for me, based on the reference code below.
>>
>>>
>>> I've scribbled something together from those thoughts, see below. This has
>>> just the mlx5 bits touched to show what I mean, but that's just compile
>>> tested.
>>
>> My function returns a *sorted* list of the N closest cpus.
>> That is important. In many cases, drivers do not need all N irqs, but
>> only a portion of them, so they want to use the closest subset of cpus.
>>
> 
> Are there cases where we can't figure this out in advance? From what I grok
> out of the two callsites you patched, all vectors will be used unless some
> error happens, so compressing the CPUs in a single cpumask seemed
> sufficient.
> 

All vectors will be initialized to support the maximum number of traffic 
rings. However, the actual number of traffic rings can be controlled and 
set to a lower number N_actual < N. In this case, we'll be using only 
N_actual instances and we want them to be the first/closest.
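
As a sketch (simplified; irqs[] and n_actual are made-up names), the
driver side then boils down to:

  /* cpus[] is sorted by NUMA distance, closest first */
  for (i = 0; i < n_actual; i++)          /* n_actual <= N */
          irq_set_affinity_hint(irqs[i], cpumask_of(cpus[i]));

so only the first (closest) n_actual entries are ever used.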

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next V4 1/3] sched/topology: Add NUMA-based CPUs spread API
  2022-08-09 10:18         ` Tariq Toukan
@ 2022-08-09 12:52           ` Valentin Schneider
  2022-08-09 14:04             ` Tariq Toukan
  0 siblings, 1 reply; 27+ messages in thread
From: Valentin Schneider @ 2022-08-09 12:52 UTC (permalink / raw)
  To: Tariq Toukan, Tariq Toukan, David S. Miller, Saeed Mahameed,
	Jakub Kicinski, Ingo Molnar, Peter Zijlstra, Juri Lelli
  Cc: Eric Dumazet, Paolo Abeni, netdev, Gal Pressman, Vincent Guittot,
	linux-kernel

On 09/08/22 13:18, Tariq Toukan wrote:
> On 8/9/2022 1:02 PM, Valentin Schneider wrote:
>>
>> Are there cases where we can't figure this out in advance? From what I grok
>> out of the two callsites you patched, all vectors will be used unless some
>> error happens, so compressing the CPUs in a single cpumask seemed
>> sufficient.
>>
>
> All vectors will be initialized to support the maximum number of traffic
> rings. However, the actual number of traffic rings can be controlled and
> set to a lower number N_actual < N. In this case, we'll be using only
> N_actual instances and we want them to be the first/closest.

Ok, that makes sense, thank you.

In that case I wonder if we'd want a public-facing iterator for
sched_domains_numa_masks[%i][node], rather than copy a portion of
it. Something like the below (naming and implementation haven't been
thought about too much).

  const struct cpumask *sched_numa_level_mask(int node, int level)
  {
          struct cpumask ***masks = rcu_dereference(sched_domains_numa_masks);

          if (node >= nr_node_ids || level >= sched_domains_numa_levels)
                  return NULL;

          if (!masks)
                  return NULL;

          return masks[level][node];
  }
  EXPORT_SYMBOL_GPL(sched_numa_level_mask);

  #define for_each_numa_level_mask(node, lvl, mask)	    \
          for (mask = sched_numa_level_mask(node, lvl); mask;	\
               mask = sched_numa_level_mask(node, ++lvl))

  void foo(int node, int cpus[], int ncpus)
  {
          const struct cpumask *mask;
          int lvl = 0;
          int i = 0;
          int cpu;

          rcu_read_lock();
          for_each_numa_level_mask(node, lvl, mask) {
                  for_each_cpu(cpu, mask) {
                          cpus[i] = cpu;
                          if (++i == ncpus)
                                  goto done;
                  }
          }
  done:
          rcu_read_unlock();
  }


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next V4 1/3] sched/topology: Add NUMA-based CPUs spread API
  2022-08-09 12:52           ` Valentin Schneider
@ 2022-08-09 14:04             ` Tariq Toukan
  2022-08-09 17:36               ` Valentin Schneider
  0 siblings, 1 reply; 27+ messages in thread
From: Tariq Toukan @ 2022-08-09 14:04 UTC (permalink / raw)
  To: Valentin Schneider, Tariq Toukan, David S. Miller,
	Saeed Mahameed, Jakub Kicinski, Ingo Molnar, Peter Zijlstra,
	Juri Lelli
  Cc: Eric Dumazet, Paolo Abeni, netdev, Gal Pressman, Vincent Guittot,
	linux-kernel



On 8/9/2022 3:52 PM, Valentin Schneider wrote:
> On 09/08/22 13:18, Tariq Toukan wrote:
>> On 8/9/2022 1:02 PM, Valentin Schneider wrote:
>>>
>>> Are there cases where we can't figure this out in advance? From what I grok
>>> out of the two callsites you patched, all vectors will be used unless some
>>> error happens, so compressing the CPUs in a single cpumask seemed
>>> sufficient.
>>>
>>
>> All vectors will be initialized to support the maximum number of traffic
>> rings. However, the actual number of traffic rings can be controlled and
>> set to a lower number N_actual < N. In this case, we'll be using only
>> N_actual instances and we want them to be the first/closest.
> 
> Ok, that makes sense, thank you.
> 
> In that case I wonder if we'd want a public-facing iterator for
> sched_domains_numa_masks[%i][node], rather than copy a portion of
> it. Something like the below (naming and implementation haven't been
> thought about too much).
> 
>    const struct cpumask *sched_numa_level_mask(int node, int level)
>    {
>            struct cpumask ***masks = rcu_dereference(sched_domains_numa_masks);
> 
>            if (node >= nr_node_ids || level >= sched_domains_numa_levels)
>                    return NULL;
> 
>            if (!masks)
>                    return NULL;
> 
>            return masks[level][node];
>    }
>    EXPORT_SYMBOL_GPL(sched_numa_level_mask);
> 

The above can be kept static, and expose only the foo() function below, 
similar to my sched_cpus_set_spread().

LGTM.
How do you suggest to proceed?
You want to formalize it? Or should I take it from here?


>    #define for_each_numa_level_mask(node, lvl, mask)	    \
>            for (mask = sched_numa_level_mask(node, lvl); mask;	\
>                 mask = sched_numa_level_mask(node, ++lvl))
> 
>    void foo(int node, int cpus[], int ncpus)
>    {
>            const struct cpumask *mask;
>            int lvl = 0;
>            int i = 0;
>            int cpu;
> 
>            rcu_read_lock();
>            for_each_numa_level_mask(node, lvl, mask) {
>                    for_each_cpu(cpu, mask) {
>                            cpus[i] = cpu;
>                            if (++i == ncpus)
>                                    goto done;
>                    }
>            }
>    done:
>            rcu_read_unlock();
>    }
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next V4 1/3] sched/topology: Add NUMA-based CPUs spread API
  2022-08-09 14:04             ` Tariq Toukan
@ 2022-08-09 17:36               ` Valentin Schneider
  2022-08-10 10:46                 ` Valentin Schneider
  0 siblings, 1 reply; 27+ messages in thread
From: Valentin Schneider @ 2022-08-09 17:36 UTC (permalink / raw)
  To: Tariq Toukan, Tariq Toukan, David S. Miller, Saeed Mahameed,
	Jakub Kicinski, Ingo Molnar, Peter Zijlstra, Juri Lelli
  Cc: Eric Dumazet, Paolo Abeni, netdev, Gal Pressman, Vincent Guittot,
	linux-kernel

On 09/08/22 17:04, Tariq Toukan wrote:
> On 8/9/2022 3:52 PM, Valentin Schneider wrote:
>> On 09/08/22 13:18, Tariq Toukan wrote:
>>> On 8/9/2022 1:02 PM, Valentin Schneider wrote:
>>>>
>>>> Are there cases where we can't figure this out in advance? From what I grok
>>>> out of the two callsites you patched, all vectors will be used unless some
>>>> error happens, so compressing the CPUs in a single cpumask seemed
>>>> sufficient.
>>>>
>>>
>>> All vectors will be initialized to support the maximum number of traffic
>>> rings. However, the actual number of traffic rings can be controlled and
>>> set to a lower number N_actual < N. In this case, we'll be using only
>>> N_actual instances and we want them to be the first/closest.
>>
>> Ok, that makes sense, thank you.
>>
>> In that case I wonder if we'd want a public-facing iterator for
>> sched_domains_numa_masks[%i][node], rather than copy a portion of
>> it. Something like the below (naming and implementation haven't been
>> thought about too much).
>>
>>    const struct cpumask *sched_numa_level_mask(int node, int level)
>>    {
>>            struct cpumask ***masks = rcu_dereference(sched_domains_numa_masks);
>>
>>            if (node >= nr_node_ids || level >= sched_domains_numa_levels)
>>                    return NULL;
>>
>>            if (!masks)
>>                    return NULL;
>>
>>            return masks[level][node];
>>    }
>>    EXPORT_SYMBOL_GPL(sched_numa_level_mask);
>>
>
> The above can be kept static, and expose only the foo() function below,
> similar to my sched_cpus_set_spread().
>

So what I was thinking with this was to only have to export the
sched_numa_level_mask() thing and the iterator, and then consumers would be
free to use whatever storage form they want (cpumask, array, list...).

Right now I believe sched_domains_numa_masks has the right shape for the
interface (for a given node, you get a cpumask per distance level) and I
don't want to impose an interface that uses just an array, but perhaps I'm
being silly and the array isn't so bad and simpler to use - that said we
could always build an array-based helper on top of the array of cpumasks
thing.

Clearly I need to scratch my head a bit longer :-)

> LGTM.
> How do you suggest to proceed?
> You want to formalize it? Or should I take it from here?
>

I need to have a think (feel free to ponder and share your thoughts as
well) - I'm happy to push something if I get a brain-wave, but don't let
that hold you back either.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH net-next V4 1/3] sched/topology: Add NUMA-based CPUs spread API
  2022-08-09 17:36               ` Valentin Schneider
@ 2022-08-10 10:46                 ` Valentin Schneider
  2022-08-10 10:51                   ` [PATCH 1/2] sched/topology: Introduce sched_numa_hop_mask() Valentin Schneider
  0 siblings, 1 reply; 27+ messages in thread
From: Valentin Schneider @ 2022-08-10 10:46 UTC (permalink / raw)
  To: Tariq Toukan, Tariq Toukan, David S. Miller, Saeed Mahameed,
	Jakub Kicinski, Ingo Molnar, Peter Zijlstra, Juri Lelli
  Cc: Eric Dumazet, Paolo Abeni, netdev, Gal Pressman, Vincent Guittot,
	linux-kernel

On 09/08/22 18:36, Valentin Schneider wrote:
> On 09/08/22 17:04, Tariq Toukan wrote:
>> LGTM.
>> How do you suggest to proceed?
>> You want to formalize it? Or should I take it from here?
>>
>
> I need to have a think (feel free to ponder and share your thoughts as
> well) - I'm happy to push something if I get a brain-wave, but don't let
> that hold you back either.

This is the form that I hate the least, what do you think?


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/2] sched/topology: Introduce sched_numa_hop_mask()
  2022-08-10 10:46                 ` Valentin Schneider
@ 2022-08-10 10:51                   ` Valentin Schneider
  2022-08-10 10:51                     ` [PATCH 2/2] net/mlx5e: Leverage sched_numa_hop_mask() Valentin Schneider
  2022-08-10 12:42                     ` [PATCH 1/2] sched/topology: Introduce sched_numa_hop_mask() Tariq Toukan
  0 siblings, 2 replies; 27+ messages in thread
From: Valentin Schneider @ 2022-08-10 10:51 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: Tariq Toukan, Tariq Toukan, David S. Miller, Saeed Mahameed,
	Jakub Kicinski, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Eric Dumazet, Paolo Abeni, Gal Pressman, Vincent Guittot

Tariq has pointed out that drivers allocating IRQ vectors would benefit
from having smarter NUMA-awareness - cpumask_local_spread() only knows
about the local node and everything outside is in the same bucket.

sched_domains_numa_masks is pretty much what we want to hand out (a cpumask
of CPUs reachable within a given distance budget), introduce
sched_numa_hop_mask() to export those cpumasks. Add in an iteration helper
to iterate over CPUs at an incremental distance from a given node.

Link: http://lore.kernel.org/r/20220728191203.4055-1-tariqt@nvidia.com
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 include/linux/topology.h | 12 ++++++++++++
 kernel/sched/topology.c  | 28 ++++++++++++++++++++++++++++
 2 files changed, 40 insertions(+)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 4564faafd0e1..d66e3cf40823 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -245,5 +245,17 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
 	return cpumask_of_node(cpu_to_node(cpu));
 }
 
+#ifdef CONFIG_NUMA
+extern const struct cpumask *sched_numa_hop_mask(int node, int hops);
+#else
+static inline const struct cpumask *sched_numa_hop_mask(int node, int hops)
+{
+	return -ENOTSUPP;
+}
+#endif	/* CONFIG_NUMA */
+
+#define for_each_numa_hop_mask(node, hops, mask)			\
+	for (mask = sched_numa_hop_mask(node, hops); !IS_ERR_OR_NULL(mask); \
+	     mask = sched_numa_hop_mask(node, ++hops))
 
 #endif /* _LINUX_TOPOLOGY_H */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8739c2a5a54e..f0236a0ae65c 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2067,6 +2067,34 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
 	return found;
 }
 
+/**
+ * sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops away.
+ * @node: The node to count hops from.
+ * @hops: Include CPUs up to that many hops away. 0 means local node.
+ *
+ * Requires rcu_lock to be held. Returned cpumask is only valid within that
+ * read-side section, copy it if required beyond that.
+ *
+ * Note that not all hops are equal in size; see sched_init_numa() for how
+ * distances and masks are handled.
+ *
+ * Also note that this is a reflection of sched_domains_numa_masks, which may change
+ * during the lifetime of the system (offline nodes are taken out of the masks).
+ */
+const struct cpumask *sched_numa_hop_mask(int node, int hops)
+{
+	struct cpumask ***masks = rcu_dereference(sched_domains_numa_masks);
+
+	if (node >= nr_node_ids || hops >= sched_domains_numa_levels)
+		return ERR_PTR(-EINVAL);
+
+	if (!masks)
+		return NULL;
+
+	return masks[hops][node];
+}
+EXPORT_SYMBOL_GPL(sched_numa_hop_mask);
+
 #endif /* CONFIG_NUMA */
 
 static int __sdt_alloc(const struct cpumask *cpu_map)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 2/2] net/mlx5e: Leverage sched_numa_hop_mask()
  2022-08-10 10:51                   ` [PATCH 1/2] sched/topology: Introduce sched_numa_hop_mask() Valentin Schneider
@ 2022-08-10 10:51                     ` Valentin Schneider
  2022-08-10 12:57                       ` Tariq Toukan
  2022-08-10 12:42                     ` [PATCH 1/2] sched/topology: Introduce sched_numa_hop_mask() Tariq Toukan
  1 sibling, 1 reply; 27+ messages in thread
From: Valentin Schneider @ 2022-08-10 10:51 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: Tariq Toukan, Tariq Toukan, David S. Miller, Saeed Mahameed,
	Jakub Kicinski, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Eric Dumazet, Paolo Abeni, Gal Pressman, Vincent Guittot

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/eq.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 229728c80233..2eb4ffd96a95 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -809,9 +809,12 @@ static void comp_irqs_release(struct mlx5_core_dev *dev)
 static int comp_irqs_request(struct mlx5_core_dev *dev)
 {
 	struct mlx5_eq_table *table = dev->priv.eq_table;
+	const struct cpumask *mask;
 	int ncomp_eqs = table->num_comp_eqs;
+	int hops = 0;
 	u16 *cpus;
 	int ret;
+	int cpu;
-	int i;
+	int i = 0;
 
 	ncomp_eqs = table->num_comp_eqs;
@@ -830,8 +833,17 @@ static int comp_irqs_request(struct mlx5_core_dev *dev)
 		ret = -ENOMEM;
 		goto free_irqs;
 	}
-	for (i = 0; i < ncomp_eqs; i++)
-		cpus[i] = cpumask_local_spread(i, dev->priv.numa_node);
+
+	rcu_read_lock();
+	for_each_numa_hop_mask(dev->priv.numa_node, hops, mask) {
+		for_each_cpu(cpu, mask) {
+			cpus[i] = cpu;
+			if (++i == ncomp_eqs)
+				goto spread_done;
+		}
+	}
+spread_done:
+	rcu_read_unlock();
 	ret = mlx5_irqs_request_vectors(dev, cpus, ncomp_eqs, table->comp_irqs);
 	kfree(cpus);
 	if (ret < 0)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] sched/topology: Introduce sched_numa_hop_mask()
  2022-08-10 10:51                   ` [PATCH 1/2] sched/topology: Introduce sched_numa_hop_mask() Valentin Schneider
  2022-08-10 10:51                     ` [PATCH 2/2] net/mlx5e: Leverage sched_numa_hop_mask() Valentin Schneider
@ 2022-08-10 12:42                     ` Tariq Toukan
  2022-08-10 12:57                       ` Tariq Toukan
  1 sibling, 1 reply; 27+ messages in thread
From: Tariq Toukan @ 2022-08-10 12:42 UTC (permalink / raw)
  To: Valentin Schneider, netdev, linux-kernel
  Cc: Tariq Toukan, David S. Miller, Saeed Mahameed, Jakub Kicinski,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Eric Dumazet,
	Paolo Abeni, Gal Pressman, Vincent Guittot



On 8/10/2022 1:51 PM, Valentin Schneider wrote:
> Tariq has pointed out that drivers allocating IRQ vectors would benefit
> from having smarter NUMA-awareness - cpumask_local_spread() only knows
> about the local node and everything outside is in the same bucket.
> 
> sched_domains_numa_masks is pretty much what we want to hand out (a cpumask
> of CPUs reachable within a given distance budget), introduce
> sched_numa_hop_mask() to export those cpumasks. Add in an iteration helper
> to iterate over CPUs at an incremental distance from a given node.
> 
> Link: http://lore.kernel.org/r/20220728191203.4055-1-tariqt@nvidia.com
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
>   include/linux/topology.h | 12 ++++++++++++
>   kernel/sched/topology.c  | 28 ++++++++++++++++++++++++++++
>   2 files changed, 40 insertions(+)
> 
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index 4564faafd0e1..d66e3cf40823 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -245,5 +245,17 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
>   	return cpumask_of_node(cpu_to_node(cpu));
>   }
>   
> +#ifdef CONFIG_NUMA
> +extern const struct cpumask *sched_numa_hop_mask(int node, int hops);
> +#else
> +static inline const struct cpumask *sched_numa_hop_mask(int node, int hops)
> +{
> +	return -ENOTSUPP;

missing ERR_PTR()

> +}
> +#endif	/* CONFIG_NUMA */
> +
> +#define for_each_numa_hop_mask(node, hops, mask)			\
> +	for (mask = sched_numa_hop_mask(node, hops); !IS_ERR_OR_NULL(mask); \
> +	     mask = sched_numa_hop_mask(node, ++hops))
>   
>   #endif /* _LINUX_TOPOLOGY_H */
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 8739c2a5a54e..f0236a0ae65c 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2067,6 +2067,34 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
>   	return found;
>   }
>   
> +/**
> + * sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops away.
> + * @node: The node to count hops from.
> + * @hops: Include CPUs up to that many hops away. 0 means local node.
> + *
> + * Requires rcu_lock to be held. Returned cpumask is only valid within that
> + * read-side section, copy it if required beyond that.
> + *
> + * Note that not all hops are equal in size; see sched_init_numa() for how
> + * distances and masks are handled.
> + *
> + * Also note that this is a reflection of sched_domains_numa_masks, which may change
> + * during the lifetime of the system (offline nodes are taken out of the masks).
> + */
> +const struct cpumask *sched_numa_hop_mask(int node, int hops)
> +{
> +	struct cpumask ***masks = rcu_dereference(sched_domains_numa_masks);
> +
> +	if (node >= nr_node_ids || hops >= sched_domains_numa_levels)
> +		return ERR_PTR(-EINVAL);
> +
> +	if (!masks)
> +		return NULL;
> +
> +	return masks[hops][node];
> +}
> +EXPORT_SYMBOL_GPL(sched_numa_hop_mask);
> +
>   #endif /* CONFIG_NUMA */
>   
>   static int __sdt_alloc(const struct cpumask *cpu_map)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] net/mlx5e: Leverage sched_numa_hop_mask()
  2022-08-10 10:51                     ` [PATCH 2/2] net/mlx5e: Leverage sched_numa_hop_mask() Valentin Schneider
@ 2022-08-10 12:57                       ` Tariq Toukan
  2022-08-10 17:42                         ` Jakub Kicinski
  2022-08-11 14:26                         ` Valentin Schneider
  0 siblings, 2 replies; 27+ messages in thread
From: Tariq Toukan @ 2022-08-10 12:57 UTC (permalink / raw)
  To: Valentin Schneider, netdev, linux-kernel, Jakub Kicinski
  Cc: Tariq Toukan, David S. Miller, Saeed Mahameed, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Eric Dumazet, Paolo Abeni,
	Gal Pressman, Vincent Guittot



On 8/10/2022 1:51 PM, Valentin Schneider wrote:
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
>   drivers/net/ethernet/mellanox/mlx5/core/eq.c | 16 ++++++++++++++--
>   1 file changed, 14 insertions(+), 2 deletions(-)
> 

Missing description.

I had a very detailed description with performance numbers and an 
affinity hints example with before/after tables. I don't want them to
get lost.


> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> index 229728c80233..2eb4ffd96a95 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> @@ -809,9 +809,12 @@ static void comp_irqs_release(struct mlx5_core_dev *dev)
>   static int comp_irqs_request(struct mlx5_core_dev *dev)
>   {
>   	struct mlx5_eq_table *table = dev->priv.eq_table;
> +	const struct cpumask *mask;
>   	int ncomp_eqs = table->num_comp_eqs;
> +	int hops = 0;
>   	u16 *cpus;
>   	int ret;
> +	int cpu;
>   	int i;
> -	int i;
> +	int i = 0;
>   	ncomp_eqs = table->num_comp_eqs;
> @@ -830,8 +833,17 @@ static int comp_irqs_request(struct mlx5_core_dev *dev)
>   		ret = -ENOMEM;
>   		goto free_irqs;
>   	}
> -	for (i = 0; i < ncomp_eqs; i++)
> -		cpus[i] = cpumask_local_spread(i, dev->priv.numa_node);
> +
> +	rcu_read_lock();
> +	for_each_numa_hop_mask(dev->priv.numa_node, hops, mask) {

We don't really use this 'hops' iterator. We always pass 0 (not a 
valuable input...), and we do not care about its final value. Probably 
it's best to hide it from the user inside the macro.
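
Something like this, perhaps (untested sketch, assuming we're fine with
a loop-scope variable):

  #define for_each_numa_hop_mask(node, mask)                            \
          for (int __hops = 0;                                          \
               mask = sched_numa_hop_mask(node, __hops),                \
               !IS_ERR_OR_NULL(mask);                                   \
               __hops++)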

> +		for_each_cpu(cpu, mask) {
> +			cpus[i] = cpu;
> +			if (++i == ncomp_eqs)
> +				goto spread_done;
> +		}
> +	}
> +spread_done:
> +	rcu_read_unlock();
>   	ret = mlx5_irqs_request_vectors(dev, cpus, ncomp_eqs, table->comp_irqs);
>   	kfree(cpus);
>   	if (ret < 0)

This logic is typical. Other drivers would also want to use it.
It must be introduced as a service/API function, if not by the sched 
topology, then at least by the networking subsystem.
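
Even just a prototype along these lines would do (rough sketch, name
made up):

  /**
   * netif_spread_cpus - fill @cpus with up to @ncpus CPU ids, sorted by
   *                     increasing NUMA distance from @node.
   */
  int netif_spread_cpus(u16 *cpus, int ncpus, int node);
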
Jakub, WDYT?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] sched/topology: Introduce sched_numa_hop_mask()
  2022-08-10 12:42                     ` [PATCH 1/2] sched/topology: Introduce sched_numa_hop_mask() Tariq Toukan
@ 2022-08-10 12:57                       ` Tariq Toukan
  2022-08-11 14:26                         ` Valentin Schneider
  0 siblings, 1 reply; 27+ messages in thread
From: Tariq Toukan @ 2022-08-10 12:57 UTC (permalink / raw)
  To: Valentin Schneider, netdev, linux-kernel
  Cc: Tariq Toukan, David S. Miller, Saeed Mahameed, Jakub Kicinski,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Eric Dumazet,
	Paolo Abeni, Gal Pressman, Vincent Guittot



On 8/10/2022 3:42 PM, Tariq Toukan wrote:
> 
> 
> On 8/10/2022 1:51 PM, Valentin Schneider wrote:
>> Tariq has pointed out that drivers allocating IRQ vectors would benefit
>> from having smarter NUMA-awareness - cpumask_local_spread() only knows
>> about the local node and everything outside is in the same bucket.
>>
>> sched_domains_numa_masks is pretty much what we want to hand out (a 
>> cpumask
>> of CPUs reachable within a given distance budget), introduce
>> sched_numa_hop_mask() to export those cpumasks. Add in an iteration 
>> helper
>> to iterate over CPUs at an incremental distance from a given node.
>>
>> Link: http://lore.kernel.org/r/20220728191203.4055-1-tariqt@nvidia.com
>> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
>> ---
>>   include/linux/topology.h | 12 ++++++++++++
>>   kernel/sched/topology.c  | 28 ++++++++++++++++++++++++++++
>>   2 files changed, 40 insertions(+)
>>
>> diff --git a/include/linux/topology.h b/include/linux/topology.h
>> index 4564faafd0e1..d66e3cf40823 100644
>> --- a/include/linux/topology.h
>> +++ b/include/linux/topology.h
>> @@ -245,5 +245,17 @@ static inline const struct cpumask 
>> *cpu_cpu_mask(int cpu)
>>       return cpumask_of_node(cpu_to_node(cpu));
>>   }
>> +#ifdef CONFIG_NUMA
>> +extern const struct cpumask *sched_numa_hop_mask(int node, int hops);
>> +#else
>> +static inline const struct cpumask *sched_numa_hop_mask(int node, int 
>> hops)
>> +{
>> +    return -ENOTSUPP;
> 
> missing ERR_PTR()
> 
>> +}
>> +#endif    /* CONFIG_NUMA */
>> +
>> +#define for_each_numa_hop_mask(node, hops, mask)            \
>> +    for (mask = sched_numa_hop_mask(node, hops); 
>> !IS_ERR_OR_NULL(mask); \
>> +         mask = sched_numa_hop_mask(node, ++hops))
>>   #endif /* _LINUX_TOPOLOGY_H */
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index 8739c2a5a54e..f0236a0ae65c 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -2067,6 +2067,34 @@ int sched_numa_find_closest(const struct 
>> cpumask *cpus, int cpu)
>>       return found;
>>   }
>> +/**
>> + * sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops 
>> away.
>> + * @node: The node to count hops from.
>> + * @hops: Include CPUs up to that many hops away. 0 means local node.

AFAIU, here you work with a specific level/number of hops, so the
description is not accurate.


>> + *
>> + * Requires rcu_lock to be held. Returned cpumask is only valid 
>> within that
>> + * read-side section, copy it if required beyond that.
>> + *
>> + * Note that not all hops are equal in size; see sched_init_numa() 
>> for how
>> + * distances and masks are handled.
>> + *
>> + * Also note that this is a reflection of sched_domains_numa_masks, 
>> which may change
>> + * during the lifetime of the system (offline nodes are taken out of 
>> the masks).
>> + */
>> +const struct cpumask *sched_numa_hop_mask(int node, int hops)
>> +{
>> +    struct cpumask ***masks = rcu_dereference(sched_domains_numa_masks);
>> +
>> +    if (node >= nr_node_ids || hops >= sched_domains_numa_levels)
>> +        return ERR_PTR(-EINVAL);
>> +
>> +    if (!masks)
>> +        return NULL;
>> +
>> +    return masks[hops][node];
>> +}
>> +EXPORT_SYMBOL_GPL(sched_numa_hop_mask);
>> +
>>   #endif /* CONFIG_NUMA */
>>   static int __sdt_alloc(const struct cpumask *cpu_map)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] net/mlx5e: Leverage sched_numa_hop_mask()
  2022-08-10 12:57                       ` Tariq Toukan
@ 2022-08-10 17:42                         ` Jakub Kicinski
  2022-08-11 14:26                         ` Valentin Schneider
  1 sibling, 0 replies; 27+ messages in thread
From: Jakub Kicinski @ 2022-08-10 17:42 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Valentin Schneider, netdev, linux-kernel, Tariq Toukan,
	David S. Miller, Saeed Mahameed, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Eric Dumazet, Paolo Abeni, Gal Pressman,
	Vincent Guittot

On Wed, 10 Aug 2022 15:57:33 +0300 Tariq Toukan wrote:
> > +		for_each_cpu(cpu, mask) {
> > +			cpus[i] = cpu;
> > +			if (++i == ncomp_eqs)
> > +				goto spread_done;
> > +		}
> > +	}
> > +spread_done:
> > +	rcu_read_unlock();
> >   	ret = mlx5_irqs_request_vectors(dev, cpus, ncomp_eqs, table->comp_irqs);
> >   	kfree(cpus);
> >   	if (ret < 0)  
> 
> This logic is typical. Other drivers would also want to use it.
> It must be introduced as a service/API function, if not by the sched 
> topology, then at least by the networking subsystem.
> Jakub, WDYT?

Agreed, no preference where the helper would live tho.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] sched/topology: Introduce sched_numa_hop_mask()
  2022-08-10 12:57                       ` Tariq Toukan
@ 2022-08-11 14:26                         ` Valentin Schneider
  2022-08-14  8:19                           ` Tariq Toukan
  0 siblings, 1 reply; 27+ messages in thread
From: Valentin Schneider @ 2022-08-11 14:26 UTC (permalink / raw)
  To: Tariq Toukan, netdev, linux-kernel
  Cc: Tariq Toukan, David S. Miller, Saeed Mahameed, Jakub Kicinski,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Eric Dumazet,
	Paolo Abeni, Gal Pressman, Vincent Guittot

On 10/08/22 15:57, Tariq Toukan wrote:
> On 8/10/2022 3:42 PM, Tariq Toukan wrote:
>>
>>
>> On 8/10/2022 1:51 PM, Valentin Schneider wrote:
>>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>>> index 8739c2a5a54e..f0236a0ae65c 100644
>>> --- a/kernel/sched/topology.c
>>> +++ b/kernel/sched/topology.c
>>> @@ -2067,6 +2067,34 @@ int sched_numa_find_closest(const struct
>>> cpumask *cpus, int cpu)
>>>       return found;
>>>   }
>>> +/**
>>> + * sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops
>>> away.
>>> + * @node: The node to count hops from.
>>> + * @hops: Include CPUs up to that many hops away. 0 means local node.
>
> AFAIU, here you work with a specific level/number of hops, so the
> description is not accurate.
>

Hmph, unfortunately it's the other way around - the masks do include CPUs
*up to* a number of hops, but in my mlx5 example I've used it as if it only
included CPUs a specific distance away :/

As things stand we'd need a temporary cpumask to account for which CPUs we
have visited (which is what you had in your original submission), but with
a for_each_cpu_andnot() we don't need any of that.
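
e.g. for k > 0, the CPUs that first become reachable at hop k are
simply (sketch; the k == 0 case gets special-cased against
cpu_none_mask, as below):

  for_each_cpu_andnot(cpu, sched_numa_hop_mask(node, k),
                           sched_numa_hop_mask(node, k - 1))
          /* @cpu first becomes reachable at hop @k */;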

Below is what I ended up with. I've tested it on a range of NUMA topologies
and it behaves as I'd expect, and on the plus side the code required on the
driver side is even simpler than before.

If you don't have major gripes with it, I'll shape that into a proper
series and will let you handle the mlx5/enic bits.

---

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 229728c80233..0a5432903edd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -812,6 +812,7 @@ static int comp_irqs_request(struct mlx5_core_dev *dev)
 	int ncomp_eqs = table->num_comp_eqs;
 	u16 *cpus;
 	int ret;
+	int cpu;
-	int i;
+	int i = 0;
 
 	ncomp_eqs = table->num_comp_eqs;
@@ -830,8 +831,15 @@ static int comp_irqs_request(struct mlx5_core_dev *dev)
 		ret = -ENOMEM;
 		goto free_irqs;
 	}
-	for (i = 0; i < ncomp_eqs; i++)
-		cpus[i] = cpumask_local_spread(i, dev->priv.numa_node);
+
+	rcu_read_lock();
+	for_each_numa_hop_cpu(cpu, dev->priv.numa_node) {
+		cpus[i] = cpu;
+		if (++i == ncomp_eqs)
+			goto spread_done;
+	}
+spread_done:
+	rcu_read_unlock();
 	ret = mlx5_irqs_request_vectors(dev, cpus, ncomp_eqs, table->comp_irqs);
 	kfree(cpus);
 	if (ret < 0)
diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index fe29ac7cc469..ccd5d71aefef 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -157,6 +157,13 @@ static inline unsigned int cpumask_next_and(int n,
 	return n+1;
 }
 
+static inline unsigned int cpumask_next_andnot(int n,
+					    const struct cpumask *srcp,
+					    const struct cpumask *andp)
+{
+	return n+1;
+}
+
 static inline unsigned int cpumask_next_wrap(int n, const struct cpumask *mask,
 					     int start, bool wrap)
 {
@@ -194,6 +201,8 @@ static inline int cpumask_any_distribute(const struct cpumask *srcp)
 	for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask, (void)(start))
 #define for_each_cpu_and(cpu, mask1, mask2)	\
 	for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask1, (void)mask2)
+#define for_each_cpu_andnot(cpu, mask1, mask2)	\
+	for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask1, (void)mask2)
 #else
 /**
  * cpumask_first - get the first cpu in a cpumask
@@ -259,6 +268,7 @@ static inline unsigned int cpumask_next_zero(int n, const struct cpumask *srcp)
 }
 
 int __pure cpumask_next_and(int n, const struct cpumask *, const struct cpumask *);
+int __pure cpumask_next_andnot(int n, const struct cpumask *, const struct cpumask *);
 int __pure cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
 unsigned int cpumask_local_spread(unsigned int i, int node);
 int cpumask_any_and_distribute(const struct cpumask *src1p,
@@ -324,6 +334,26 @@ extern int cpumask_next_wrap(int n, const struct cpumask *mask, int start, bool
 	for ((cpu) = -1;						\
 		(cpu) = cpumask_next_and((cpu), (mask1), (mask2)),	\
 		(cpu) < nr_cpu_ids;)
+
+/**
+ * for_each_cpu_andnot - iterate over every cpu in one mask but not in another
+ * @cpu: the (optionally unsigned) integer iterator
+ * @mask1: the first cpumask pointer
+ * @mask2: the second cpumask pointer
+ *
+ * This saves a temporary CPU mask in many places.  It is equivalent to:
+ *	struct cpumask tmp;
+ *	cpumask_andnot(&tmp, &mask1, &mask2);
+ *	for_each_cpu(cpu, &tmp)
+ *		...
+ *
+ * After the loop, cpu is >= nr_cpu_ids.
+ */
+#define for_each_cpu_andnot(cpu, mask1, mask2)				\
+	for ((cpu) = -1;						\
+		(cpu) = cpumask_next_andnot((cpu), (mask1), (mask2)),	\
+		(cpu) < nr_cpu_ids;)
+
 #endif /* SMP */
 
 #define CPU_BITS_NONE						\
diff --git a/include/linux/find.h b/include/linux/find.h
index 424ef67d4a42..454cde69b30b 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -10,7 +10,8 @@
 
 extern unsigned long _find_next_bit(const unsigned long *addr1,
 		const unsigned long *addr2, unsigned long nbits,
-		unsigned long start, unsigned long invert, unsigned long le);
+		unsigned long start, unsigned long invert, unsigned long le,
+		bool negate);
 extern unsigned long _find_first_bit(const unsigned long *addr, unsigned long size);
 extern unsigned long _find_first_and_bit(const unsigned long *addr1,
 					 const unsigned long *addr2, unsigned long size);
@@ -41,7 +42,7 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
 		return val ? __ffs(val) : size;
 	}
 
-	return _find_next_bit(addr, NULL, size, offset, 0UL, 0);
+	return _find_next_bit(addr, NULL, size, offset, 0UL, 0, 0);
 }
 #endif
 
@@ -71,7 +72,38 @@ unsigned long find_next_and_bit(const unsigned long *addr1,
 		return val ? __ffs(val) : size;
 	}
 
-	return _find_next_bit(addr1, addr2, size, offset, 0UL, 0);
+	return _find_next_bit(addr1, addr2, size, offset, 0UL, 0, 0);
+}
+#endif
+
+#ifndef find_next_andnot_bit
+/**
+ * find_next_andnot_bit - find the next set bit in one memory region
+ *                        but not in the other
+ * @addr1: The first address to base the search on
+ * @addr2: The second address to base the search on
+ * @size: The bitmap size in bits
+ * @offset: The bitnumber to start searching at
+ *
+ * Returns the bit number for the next set bit
+ * If no bits are set, returns @size.
+ */
+static inline
+unsigned long find_next_andnot_bit(const unsigned long *addr1,
+		const unsigned long *addr2, unsigned long size,
+		unsigned long offset)
+{
+	if (small_const_nbits(size)) {
+		unsigned long val;
+
+		if (unlikely(offset >= size))
+			return size;
+
+		val = *addr1 & ~*addr2 & GENMASK(size - 1, offset);
+		return val ? __ffs(val) : size;
+	}
+
+	return _find_next_bit(addr1, addr2, size, offset, 0UL, 0, 1);
 }
 #endif
 
@@ -99,7 +131,7 @@ unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
 		return val == ~0UL ? size : ffz(val);
 	}
 
-	return _find_next_bit(addr, NULL, size, offset, ~0UL, 0);
+	return _find_next_bit(addr, NULL, size, offset, ~0UL, 0, 0);
 }
 #endif
 
@@ -247,7 +279,7 @@ unsigned long find_next_zero_bit_le(const void *addr, unsigned
 		return val == ~0UL ? size : ffz(val);
 	}
 
-	return _find_next_bit(addr, NULL, size, offset, ~0UL, 1);
+	return _find_next_bit(addr, NULL, size, offset, ~0UL, 1, 0);
 }
 #endif
 
@@ -266,7 +298,7 @@ unsigned long find_next_bit_le(const void *addr, unsigned
 		return val ? __ffs(val) : size;
 	}
 
-	return _find_next_bit(addr, NULL, size, offset, 0UL, 1);
+	return _find_next_bit(addr, NULL, size, offset, 0UL, 1, 0);
 }
 #endif
 
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 4564faafd0e1..41bed4b883d3 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -245,5 +245,50 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
 	return cpumask_of_node(cpu_to_node(cpu));
 }
 
+#ifdef CONFIG_NUMA
+extern const struct cpumask *sched_numa_hop_mask(int node, int hops);
+#else
+static inline const struct cpumask *sched_numa_hop_mask(int node, int hops)
+{
+	return ERR_PTR(-ENOTSUPP);
+}
+#endif	/* CONFIG_NUMA */
+
+/**
+ * for_each_numa_hop_cpu - iterate over CPUs by increasing NUMA distance,
+ *                         starting from a given node.
+ * @cpu: the iteration variable.
+ * @node: the NUMA node to start the search from.
+ *
+ * Requires rcu_lock to be held.
+ * Careful: this is a double loop, 'break' won't work as expected.
+ *
+ *
+ * Implementation notes:
+ *
+ * Providing it is valid, the mask returned by
+ *  sched_numa_hop_mask(node, hops+1)
+ * is a superset of the one returned by
+ *   sched_numa_hop_mask(node, hops)
+ * which may not be that useful for drivers that try to spread things out and
+ * want to visit a CPU not more than once.
+ *
+ * To accommodate that, we use for_each_cpu_andnot() to iterate over the cpus
+ * of sched_numa_hop_mask(node, hops+1) with the CPUs of
+ * sched_numa_hop_mask(node, hops) removed, IOW we only iterate over CPUs
+ * a given distance away (rather than *up to* a given distance).
+ *
+ * hops=0 forces us to play silly games and pass cpu_none_mask to
+ * for_each_cpu_andnot(), which turns it into for_each_cpu().
+ */
+#define for_each_numa_hop_cpu(cpu, node)				       \
+	for (struct { const struct cpumask *mask; int hops; } __v__ =	       \
+		     { sched_numa_hop_mask(node, 0), 0 };		       \
+	     !IS_ERR_OR_NULL(__v__.mask);				       \
+	     __v__.hops++, __v__.mask = sched_numa_hop_mask(node, __v__.hops)) \
+		for_each_cpu_andnot(cpu, __v__.mask,			       \
+				    __v__.hops ?			       \
+				    sched_numa_hop_mask(node, __v__.hops - 1) :\
+				    cpu_none_mask)
 
 #endif /* _LINUX_TOPOLOGY_H */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8739c2a5a54e..f0236a0ae65c 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2067,6 +2067,34 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
 	return found;
 }
 
+/**
+ * sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops away.
+ * @node: The node to count hops from.
+ * @hops: Include CPUs up to that many hops away. 0 means local node.
+ *
+ * Requires rcu_lock to be held. Returned cpumask is only valid within that
+ * read-side section, copy it if required beyond that.
+ *
+ * Note that not all hops are equal in size; see sched_init_numa() for how
+ * distances and masks are handled.
+ *
+ * Also note that this is a reflection of sched_domains_numa_masks, which may change
+ * during the lifetime of the system (offline nodes are taken out of the masks).
+ */
+const struct cpumask *sched_numa_hop_mask(int node, int hops)
+{
+	struct cpumask ***masks = rcu_dereference(sched_domains_numa_masks);
+
+	if (node >= nr_node_ids || hops >= sched_domains_numa_levels)
+		return ERR_PTR(-EINVAL);
+
+	if (!masks)
+		return NULL;
+
+	return masks[hops][node];
+}
+EXPORT_SYMBOL_GPL(sched_numa_hop_mask);
+
 #endif /* CONFIG_NUMA */
 
 static int __sdt_alloc(const struct cpumask *cpu_map)
diff --git a/lib/cpumask.c b/lib/cpumask.c
index a971a82d2f43..8bcf7e919193 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -42,6 +42,25 @@ int cpumask_next_and(int n, const struct cpumask *src1p,
 }
 EXPORT_SYMBOL(cpumask_next_and);
 
+/**
+ * cpumask_next_andnot - get the next cpu in *src1p & ~*src2p
+ * @n: the cpu prior to the place to search (ie. return will be > @n)
+ * @src1p: the first cpumask pointer
+ * @src2p: the second cpumask pointer
+ *
+ * Returns >= nr_cpu_ids if no further cpus set in both.
+ */
+int cpumask_next_andnot(int n, const struct cpumask *src1p,
+		     const struct cpumask *src2p)
+{
+	/* -1 is a legal arg here. */
+	if (n != -1)
+		cpumask_check(n);
+	return find_next_andnot_bit(cpumask_bits(src1p), cpumask_bits(src2p),
+		nr_cpumask_bits, n + 1);
+}
+EXPORT_SYMBOL(cpumask_next_andnot);
+
 /**
  * cpumask_any_but - return a "random" in a cpumask, but not this one.
  * @mask: the cpumask to search
diff --git a/lib/find_bit.c b/lib/find_bit.c
index 1b8e4b2a9cba..6e5f42c621a9 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -21,17 +21,19 @@
 
 #if !defined(find_next_bit) || !defined(find_next_zero_bit) ||			\
 	!defined(find_next_bit_le) || !defined(find_next_zero_bit_le) ||	\
-	!defined(find_next_and_bit)
+	!defined(find_next_and_bit) || !defined(find_next_andnot_bit)
 /*
  * This is a common helper function for find_next_bit, find_next_zero_bit, and
  * find_next_and_bit. The differences are:
  *  - The "invert" argument, which is XORed with each fetched word before
  *    searching it for one bits.
- *  - The optional "addr2", which is anded with "addr1" if present.
+ *  - The optional "addr2", negated if "negate" and ANDed with "addr1" if
+ *    present.
  */
 unsigned long _find_next_bit(const unsigned long *addr1,
 		const unsigned long *addr2, unsigned long nbits,
-		unsigned long start, unsigned long invert, unsigned long le)
+		unsigned long start, unsigned long invert, unsigned long le,
+		bool negate)
 {
 	unsigned long tmp, mask;
 
@@ -40,7 +42,9 @@ unsigned long _find_next_bit(const unsigned long *addr1,
 
 	tmp = addr1[start / BITS_PER_LONG];
 	if (addr2)
-		tmp &= addr2[start / BITS_PER_LONG];
+		tmp &= negate ?
+		       ~addr2[start / BITS_PER_LONG] :
+			addr2[start / BITS_PER_LONG];
 	tmp ^= invert;
 
 	/* Handle 1st word. */
@@ -59,7 +63,9 @@ unsigned long _find_next_bit(const unsigned long *addr1,
 
 		tmp = addr1[start / BITS_PER_LONG];
 		if (addr2)
-			tmp &= addr2[start / BITS_PER_LONG];
+			tmp &= negate ?
+			       ~addr2[start / BITS_PER_LONG] :
+				addr2[start / BITS_PER_LONG];
 		tmp ^= invert;
 	}
 


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] net/mlx5e: Leverage sched_numa_hop_mask()
  2022-08-10 12:57                       ` Tariq Toukan
  2022-08-10 17:42                         ` Jakub Kicinski
@ 2022-08-11 14:26                         ` Valentin Schneider
  1 sibling, 0 replies; 27+ messages in thread
From: Valentin Schneider @ 2022-08-11 14:26 UTC (permalink / raw)
  To: Tariq Toukan, netdev, linux-kernel, Jakub Kicinski
  Cc: Tariq Toukan, David S. Miller, Saeed Mahameed, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Eric Dumazet, Paolo Abeni,
	Gal Pressman, Vincent Guittot

On 10/08/22 15:57, Tariq Toukan wrote:
> On 8/10/2022 1:51 PM, Valentin Schneider wrote:
>> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
>> ---
>>   drivers/net/ethernet/mellanox/mlx5/core/eq.c | 16 ++++++++++++++--
>>   1 file changed, 14 insertions(+), 2 deletions(-)
>>
>
> Missing description.
>
> I had a very detailed description with performance numbers and an
> affinity hints example with before/after tables. I don't want them to
> get lost.
>

Me neither! This here is just a stand-in to show how the interface would be
used, I'd much rather have someone who actually knows the code and can
easily test it do it :)

>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
>> index 229728c80233..2eb4ffd96a95 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
>> @@ -809,9 +809,12 @@ static void comp_irqs_release(struct mlx5_core_dev *dev)
>>   static int comp_irqs_request(struct mlx5_core_dev *dev)
>>   {
>>      struct mlx5_eq_table *table = dev->priv.eq_table;
>> +	const struct cpumask *mask;
>>      int ncomp_eqs = table->num_comp_eqs;
>> +	int hops = 0;
>>      u16 *cpus;
>>      int ret;
>> +	int cpu;
>>      int i;
>> -	int i;
>> +	int i = 0;
>>      ncomp_eqs = table->num_comp_eqs;
>> @@ -830,8 +833,17 @@ static int comp_irqs_request(struct mlx5_core_dev *dev)
>>              ret = -ENOMEM;
>>              goto free_irqs;
>>      }
>> -	for (i = 0; i < ncomp_eqs; i++)
>> -		cpus[i] = cpumask_local_spread(i, dev->priv.numa_node);
>> +
>> +	rcu_read_lock();
>> +	for_each_numa_hop_mask(dev->priv.numa_node, hops, mask) {
>
> We don't really use this 'hops' iterator. We always pass 0 (not a
> valuable input...), and we do not care about its final value. Probably
> it's best to hide it from the user into the macro.
>

That's a very valid point. After a lot of mulling around, I've found some
way to hide it away in a macro, but it's not pretty :-) cf. other email.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] sched/topology: Introduce sched_numa_hop_mask()
  2022-08-11 14:26                         ` Valentin Schneider
@ 2022-08-14  8:19                           ` Tariq Toukan
  2022-08-14  8:26                             ` Tariq Toukan
  2022-08-15 14:20                             ` Valentin Schneider
  0 siblings, 2 replies; 27+ messages in thread
From: Tariq Toukan @ 2022-08-14  8:19 UTC (permalink / raw)
  To: Valentin Schneider, netdev, linux-kernel
  Cc: Tariq Toukan, David S. Miller, Saeed Mahameed, Jakub Kicinski,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Eric Dumazet,
	Paolo Abeni, Gal Pressman, Vincent Guittot



On 8/11/2022 5:26 PM, Valentin Schneider wrote:
> On 10/08/22 15:57, Tariq Toukan wrote:
>> On 8/10/2022 3:42 PM, Tariq Toukan wrote:
>>>
>>>
>>> On 8/10/2022 1:51 PM, Valentin Schneider wrote:
>>>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>>>> index 8739c2a5a54e..f0236a0ae65c 100644
>>>> --- a/kernel/sched/topology.c
>>>> +++ b/kernel/sched/topology.c
>>>> @@ -2067,6 +2067,34 @@ int sched_numa_find_closest(const struct
>>>> cpumask *cpus, int cpu)
>>>>        return found;
>>>>    }
>>>> +/**
>>>> + * sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops
>>>> away.
>>>> + * @node: The node to count hops from.
>>>> + * @hops: Include CPUs up to that many hops away. 0 means local node.
>>
>> AFAIU, here you work with a specific level/num of hops, description is
>> not accurate.
>>
> 
> Hmph, unfortunately it's the other way around - the masks do include CPUs
> *up to* a number of hops, but in my mlx5 example I've used it as if it only
> included CPUs a specific distance away :/
> 

Aha, got it. That makes it more challenging :)

> As things stand we'd need a temporary cpumask to account for which CPUs we
> have visited (which is what you had in your original submission), but with
> a for_each_cpu_andnot() we don't need any of that.
> 
> Below is what I ended up with. I've tested it on a range of NUMA topologies
> and it behaves as I'd expect, and on the plus side the code required in the
> driver side is even simpler than before.
> 
> If you don't have major gripes with it, I'll shape that into a proper
> series and will let you handle the mlx5/enic bits.
> 

The API is indeed easy to use, and the driver part looks straightforward.

I appreciate the tricks you used to make it work!
However, the implementation is relatively complicated, not easy to read 
or understand, and touches several files. I do understand what you did 
here, but I guess not all respective maintainers will like it. Let's see.

One alternative to consider, which would simplify things, is switching 
back to returning an array of cpus, ordered by their distance, up to a 
provided argument 'ncpus'.
This way, you would iterate over sched_numa_hop_mask() internally, easily 
maintaining the cpumask diffs between two hops, without the need to build 
them on-the-fly as part of an exposed for-loop macro.
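
Something along these lines, as a rough sketch (the helper name
sched_numa_spread_cpus, its signature, and the scratch-mask bookkeeping
below are all illustrative, not taken from any posted patch):

/*
 * Illustrative only: fill @cpus with up to @ncpus CPU ids, ordered by
 * increasing NUMA distance from @node, and return how many were written.
 * The hop-to-hop diff is computed internally with one scratch cpumask,
 * so callers need no for_each_cpu_andnot() machinery.
 */
static int sched_numa_spread_cpus(int node, u16 *cpus, int ncpus)
{
	const struct cpumask *mask, *prev = cpu_none_mask;
	cpumask_var_t diff;
	int hops = 0, n = 0, cpu;

	if (!alloc_cpumask_var(&diff, GFP_KERNEL))
		return -ENOMEM;

	rcu_read_lock();
	for (mask = sched_numa_hop_mask(node, 0);
	     !IS_ERR_OR_NULL(mask) && n < ncpus;
	     mask = sched_numa_hop_mask(node, ++hops)) {
		/* Keep only the CPUs that are new at this hop. */
		cpumask_andnot(diff, mask, prev);
		for_each_cpu(cpu, diff) {
			cpus[n++] = cpu;
			if (n == ncpus)
				break;
		}
		prev = mask;
	}
	rcu_read_unlock();
	free_cpumask_var(diff);
	return n;
}

A driver would then call it once, e.g.
sched_numa_spread_cpus(dev->priv.numa_node, cpus, ncomp_eqs), and just
index the result.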

> ---
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> index 229728c80233..0a5432903edd 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> @@ -812,6 +812,7 @@ static int comp_irqs_request(struct mlx5_core_dev *dev)
>   	int ncomp_eqs = table->num_comp_eqs;
>   	u16 *cpus;
>   	int ret;
> +	int cpu;
>   	int i;
>   
>   	ncomp_eqs = table->num_comp_eqs;
> @@ -830,8 +831,15 @@ static int comp_irqs_request(struct mlx5_core_dev *dev)
>   		ret = -ENOMEM;
>   		goto free_irqs;
>   	}
> -	for (i = 0; i < ncomp_eqs; i++)
> -		cpus[i] = cpumask_local_spread(i, dev->priv.numa_node);
> +	i = 0;
> +	rcu_read_lock();
> +	for_each_numa_hop_cpu(cpu, dev->priv.numa_node) {
> +		cpus[i] = cpu;
> +		if (++i == ncomp_eqs)
> +			goto spread_done;
> +	}
> +spread_done:
> +	rcu_read_unlock();
>   	ret = mlx5_irqs_request_vectors(dev, cpus, ncomp_eqs, table->comp_irqs);
>   	kfree(cpus);
>   	if (ret < 0)
> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> index fe29ac7cc469..ccd5d71aefef 100644
> --- a/include/linux/cpumask.h
> +++ b/include/linux/cpumask.h
> @@ -157,6 +157,13 @@ static inline unsigned int cpumask_next_and(int n,
>   	return n+1;
>   }
>   
> +static inline unsigned int cpumask_next_andnot(int n,
> +					    const struct cpumask *srcp,
> +					    const struct cpumask *andp)
> +{
> +	return n+1;
> +}
> +
>   static inline unsigned int cpumask_next_wrap(int n, const struct cpumask *mask,
>   					     int start, bool wrap)
>   {
> @@ -194,6 +201,8 @@ static inline int cpumask_any_distribute(const struct cpumask *srcp)
>   	for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask, (void)(start))
>   #define for_each_cpu_and(cpu, mask1, mask2)	\
>   	for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask1, (void)mask2)
> +#define for_each_cpu_andnot(cpu, mask1, mask2)	\
> +	for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask1, (void)mask2)
>   #else
>   /**
>    * cpumask_first - get the first cpu in a cpumask
> @@ -259,6 +268,7 @@ static inline unsigned int cpumask_next_zero(int n, const struct cpumask *srcp)
>   }
>   
>   int __pure cpumask_next_and(int n, const struct cpumask *, const struct cpumask *);
> +int __pure cpumask_next_andnot(int n, const struct cpumask *, const struct cpumask *);
>   int __pure cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
>   unsigned int cpumask_local_spread(unsigned int i, int node);
>   int cpumask_any_and_distribute(const struct cpumask *src1p,
> @@ -324,6 +334,26 @@ extern int cpumask_next_wrap(int n, const struct cpumask *mask, int start, bool
>   	for ((cpu) = -1;						\
>   		(cpu) = cpumask_next_and((cpu), (mask1), (mask2)),	\
>   		(cpu) < nr_cpu_ids;)
> +
> +/**
> + * for_each_cpu_andnot - iterate over every cpu in one mask but not in another
> + * @cpu: the (optionally unsigned) integer iterator
> + * @mask1: the first cpumask pointer
> + * @mask2: the second cpumask pointer
> + *
> + * This saves a temporary CPU mask in many places.  It is equivalent to:
> + *	struct cpumask tmp;
> + *	cpumask_andnot(&tmp, mask1, mask2);
> + *	for_each_cpu(cpu, &tmp)
> + *		...
> + *
> + * After the loop, cpu is >= nr_cpu_ids.
> + */
> +#define for_each_cpu_andnot(cpu, mask1, mask2)				\
> +	for ((cpu) = -1;						\
> +		(cpu) = cpumask_next_andnot((cpu), (mask1), (mask2)),	\
> +		(cpu) < nr_cpu_ids;)
> +
>   #endif /* SMP */
>   
>   #define CPU_BITS_NONE						\
> diff --git a/include/linux/find.h b/include/linux/find.h
> index 424ef67d4a42..454cde69b30b 100644
> --- a/include/linux/find.h
> +++ b/include/linux/find.h
> @@ -10,7 +10,8 @@
>   
>   extern unsigned long _find_next_bit(const unsigned long *addr1,
>   		const unsigned long *addr2, unsigned long nbits,
> -		unsigned long start, unsigned long invert, unsigned long le);
> +		unsigned long start, unsigned long invert, unsigned long le,
> +		bool negate);
>   extern unsigned long _find_first_bit(const unsigned long *addr, unsigned long size);
>   extern unsigned long _find_first_and_bit(const unsigned long *addr1,
>   					 const unsigned long *addr2, unsigned long size);
> @@ -41,7 +42,7 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>   		return val ? __ffs(val) : size;
>   	}
>   
> -	return _find_next_bit(addr, NULL, size, offset, 0UL, 0);
> +	return _find_next_bit(addr, NULL, size, offset, 0UL, 0, 0);
>   }
>   #endif
>   
> @@ -71,7 +72,38 @@ unsigned long find_next_and_bit(const unsigned long *addr1,
>   		return val ? __ffs(val) : size;
>   	}
>   
> -	return _find_next_bit(addr1, addr2, size, offset, 0UL, 0);
> +	return _find_next_bit(addr1, addr2, size, offset, 0UL, 0, 0);
> +}
> +#endif
> +
> +#ifndef find_next_andnot_bit
> +/**
> + * find_next_andnot_bit - find the next set bit in one memory region
> + *                        but not in the other
> + * @addr1: The first address to base the search on
> + * @addr2: The second address to base the search on
> + * @size: The bitmap size in bits
> + * @offset: The bitnumber to start searching at
> + *
> + * Returns the bit number for the next set bit.
> + * If no bits are set, returns @size.
> + */
> +static inline
> +unsigned long find_next_andnot_bit(const unsigned long *addr1,
> +		const unsigned long *addr2, unsigned long size,
> +		unsigned long offset)
> +{
> +	if (small_const_nbits(size)) {
> +		unsigned long val;
> +
> +		if (unlikely(offset >= size))
> +			return size;
> +
> +		val = *addr1 & ~*addr2 & GENMASK(size - 1, offset);
> +		return val ? __ffs(val) : size;
> +	}
> +
> +	return _find_next_bit(addr1, addr2, size, offset, 0UL, 0, 1);
>   }
>   #endif
>   
> @@ -99,7 +131,7 @@ unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
>   		return val == ~0UL ? size : ffz(val);
>   	}
>   
> -	return _find_next_bit(addr, NULL, size, offset, ~0UL, 0);
> +	return _find_next_bit(addr, NULL, size, offset, ~0UL, 0, 0);
>   }
>   #endif
>   
> @@ -247,7 +279,7 @@ unsigned long find_next_zero_bit_le(const void *addr, unsigned
>   		return val == ~0UL ? size : ffz(val);
>   	}
>   
> -	return _find_next_bit(addr, NULL, size, offset, ~0UL, 1);
> +	return _find_next_bit(addr, NULL, size, offset, ~0UL, 1, 0);
>   }
>   #endif
>   
> @@ -266,7 +298,7 @@ unsigned long find_next_bit_le(const void *addr, unsigned
>   		return val ? __ffs(val) : size;
>   	}
>   
> -	return _find_next_bit(addr, NULL, size, offset, 0UL, 1);
> +	return _find_next_bit(addr, NULL, size, offset, 0UL, 1, 0);
>   }
>   #endif
>   
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index 4564faafd0e1..41bed4b883d3 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -245,5 +245,50 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
>   	return cpumask_of_node(cpu_to_node(cpu));
>   }
>   
> +#ifdef CONFIG_NUMA
> +extern const struct cpumask *sched_numa_hop_mask(int node, int hops);
> +#else
> +static inline const struct cpumask *sched_numa_hop_mask(int node, int hops)
> +{
> +	return ERR_PTR(-ENOTSUPP);
> +}
> +#endif	/* CONFIG_NUMA */
> +
> +/**
> + * for_each_numa_hop_cpu - iterate over CPUs by increasing NUMA distance,
> + *                         starting from a given node.
> + * @cpu: the iteration variable.
> + * @node: the NUMA node to start the search from.
> + *
> + * Requires rcu_read_lock() to be held.
> + * Careful: this is a double loop; 'break' won't work as expected.
> + *
> + *
> + * Implementation notes:
> + *
> + * Providing it is valid, the mask returned by
> + *  sched_numa_hop_mask(node, hops+1)
> + * is a superset of the one returned by
> + *   sched_numa_hop_mask(node, hops)
> + * which may not be that useful for drivers that try to spread things out and
> + * want to visit a CPU not more than once.
> + *
> + * To accommodate that, we use for_each_cpu_andnot() to iterate over the cpus
> + * of sched_numa_hop_mask(node, hops+1) with the CPUs of
> + * sched_numa_hop_mask(node, hops) removed, IOW we only iterate over CPUs
> + * a given distance away (rather than *up to* a given distance).
> + *
> + * hops == 0 forces us to play silly games and pass cpu_none_mask to
> + * for_each_cpu_andnot(), which turns it into for_each_cpu().
> + */
> +#define for_each_numa_hop_cpu(cpu, node)				       \
> +	for (struct { const struct cpumask *mask; int hops; } __v__ =	       \
> +		     { sched_numa_hop_mask(node, 0), 0 };		       \
> +	     !IS_ERR_OR_NULL(__v__.mask);				       \
> +	     __v__.hops++, __v__.mask = sched_numa_hop_mask(node, __v__.hops)) \
> +		for_each_cpu_andnot(cpu, __v__.mask,			       \
> +				    __v__.hops ?			       \
> +				    sched_numa_hop_mask(node, __v__.hops - 1) :\
> +				    cpu_none_mask)
>   
>   #endif /* _LINUX_TOPOLOGY_H */
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 8739c2a5a54e..f0236a0ae65c 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2067,6 +2067,34 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
>   	return found;
>   }
>   
> +/**
> + * sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops away.
> + * @node: The node to count hops from.
> + * @hops: Include CPUs up to that many hops away. 0 means local node.
> + *
> + * Requires rcu_read_lock() to be held. The returned cpumask is only valid
> + * within that read-side section; copy it if required beyond that.
> + *
> + * Note that not all hops are equal in size; see sched_init_numa() for how
> + * distances and masks are handled.
> + *
> + * Also note that this is a reflection of sched_domains_numa_masks, which may change
> + * during the lifetime of the system (offline nodes are taken out of the masks).
> + */
> +const struct cpumask *sched_numa_hop_mask(int node, int hops)
> +{
> +	struct cpumask ***masks = rcu_dereference(sched_domains_numa_masks);
> +
> +	if (node >= nr_node_ids || hops >= sched_domains_numa_levels)
> +		return ERR_PTR(-EINVAL);
> +
> +	if (!masks)
> +		return NULL;
> +
> +	return masks[hops][node];
> +}
> +EXPORT_SYMBOL_GPL(sched_numa_hop_mask);
> +
>   #endif /* CONFIG_NUMA */
>   
>   static int __sdt_alloc(const struct cpumask *cpu_map)
> diff --git a/lib/cpumask.c b/lib/cpumask.c
> index a971a82d2f43..8bcf7e919193 100644
> --- a/lib/cpumask.c
> +++ b/lib/cpumask.c
> @@ -42,6 +42,25 @@ int cpumask_next_and(int n, const struct cpumask *src1p,
>   }
>   EXPORT_SYMBOL(cpumask_next_and);
>   
> +/**
> + * cpumask_next_andnot - get the next cpu in *src1p & ~*src2p
> + * @n: the cpu prior to the place to search (ie. return will be > @n)
> + * @src1p: the first cpumask pointer
> + * @src2p: the second cpumask pointer
> + *
> + * Returns >= nr_cpu_ids if no further cpus are set in *src1p & ~*src2p.
> + */
> +int cpumask_next_andnot(int n, const struct cpumask *src1p,
> +		     const struct cpumask *src2p)
> +{
> +	/* -1 is a legal arg here. */
> +	if (n != -1)
> +		cpumask_check(n);
> +	return find_next_andnot_bit(cpumask_bits(src1p), cpumask_bits(src2p),
> +		nr_cpumask_bits, n + 1);
> +}
> +EXPORT_SYMBOL(cpumask_next_andnot);
> +
>   /**
>    * cpumask_any_but - return a "random" in a cpumask, but not this one.
>    * @mask: the cpumask to search
> diff --git a/lib/find_bit.c b/lib/find_bit.c
> index 1b8e4b2a9cba..6e5f42c621a9 100644
> --- a/lib/find_bit.c
> +++ b/lib/find_bit.c
> @@ -21,17 +21,19 @@
>   
>   #if !defined(find_next_bit) || !defined(find_next_zero_bit) ||			\
>   	!defined(find_next_bit_le) || !defined(find_next_zero_bit_le) ||	\
> -	!defined(find_next_and_bit)
> +	!defined(find_next_and_bit) || !defined(find_next_andnot_bit)
>   /*
>    * This is a common helper function for find_next_bit, find_next_zero_bit, and
>    * find_next_and_bit. The differences are:
>    *  - The "invert" argument, which is XORed with each fetched word before
>    *    searching it for one bits.
> - *  - The optional "addr2", which is anded with "addr1" if present.
> + *  - The optional "addr2", negated if "negate" and ANDed with "addr1" if
> + *    present.
>    */
>   unsigned long _find_next_bit(const unsigned long *addr1,
>   		const unsigned long *addr2, unsigned long nbits,
> -		unsigned long start, unsigned long invert, unsigned long le)
> +		unsigned long start, unsigned long invert, unsigned long le,
> +		bool negate)
>   {
>   	unsigned long tmp, mask;
>   
> @@ -40,7 +42,9 @@ unsigned long _find_next_bit(const unsigned long *addr1,
>   
>   	tmp = addr1[start / BITS_PER_LONG];
>   	if (addr2)
> -		tmp &= addr2[start / BITS_PER_LONG];
> +		tmp &= negate ?
> +		       ~addr2[start / BITS_PER_LONG] :
> +			addr2[start / BITS_PER_LONG];
>   	tmp ^= invert;
>   
>   	/* Handle 1st word. */
> @@ -59,7 +63,9 @@ unsigned long _find_next_bit(const unsigned long *addr1,
>   
>   		tmp = addr1[start / BITS_PER_LONG];
>   		if (addr2)
> -			tmp &= addr2[start / BITS_PER_LONG];
> +			tmp &= negate ?
> +			       ~addr2[start / BITS_PER_LONG] :
> +				addr2[start / BITS_PER_LONG];
>   		tmp ^= invert;
>   	}
>   
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] sched/topology: Introduce sched_numa_hop_mask()
  2022-08-14  8:19                           ` Tariq Toukan
@ 2022-08-14  8:26                             ` Tariq Toukan
  2022-08-15 14:20                             ` Valentin Schneider
  1 sibling, 0 replies; 27+ messages in thread
From: Tariq Toukan @ 2022-08-14  8:26 UTC (permalink / raw)
  To: Valentin Schneider, netdev, linux-kernel
  Cc: Tariq Toukan, David S. Miller, Saeed Mahameed, Jakub Kicinski,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Eric Dumazet,
	Paolo Abeni, Gal Pressman, Vincent Guittot


>> If you don't have major gripes with it, I'll shape that into a proper
>> series and will let you handle the mlx5/enic bits.
>>

Sure, I can take the drivers/networking parts. To submit the API and its 
usage combined in one patchset, I can send you these parts privately and 
you can fold them into your submitted series, or the other way around if 
you want me to do the submission.
Both options work for me.
I will do it once we converge on the API.

Important note: I'll be out of office, with very limited access to 
email, until Sep 1st. I doubt I can make much progress before then.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] sched/topology: Introduce sched_numa_hop_mask()
  2022-08-14  8:19                           ` Tariq Toukan
  2022-08-14  8:26                             ` Tariq Toukan
@ 2022-08-15 14:20                             ` Valentin Schneider
  1 sibling, 0 replies; 27+ messages in thread
From: Valentin Schneider @ 2022-08-15 14:20 UTC (permalink / raw)
  To: Tariq Toukan, netdev, linux-kernel
  Cc: Tariq Toukan, David S. Miller, Saeed Mahameed, Jakub Kicinski,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Eric Dumazet,
	Paolo Abeni, Gal Pressman, Vincent Guittot

On 14/08/22 11:19, Tariq Toukan wrote:
> The API is indeed easy to use, and the driver part looks straightforward.
>
> I appreciate the tricks you used to make it work!
> However, the implementation is relatively complicated, not easy to read
> or understand, and it touches several files. I do understand what you did
> here, but I'm not sure all the respective maintainers will like it. Let's see.
>

Dumping it all into a single diff also doesn't help :-) I think the changes to
get a for_each_cpu_andnot() are straightforward enough; the one eyesore is
the macro, but I consider it a necessary evil to get an allocation-free
interface.
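
For illustration, the consumer side stays allocation-free. A minimal
sketch, assuming a caller-provided cpus[] array, a vector count nvecs,
and a home NUMA node 'node' (all stand-in names, mirroring the mlx5
hunk posted earlier in the thread):

	int i = 0, cpu;

	rcu_read_lock();
	for_each_numa_hop_cpu(cpu, node) {
		cpus[i] = cpu;
		if (++i == nvecs)
			goto spread_done;
	}
spread_done:
	rcu_read_unlock();

The goto is what the "double loop" warning in the kerneldoc is about:
a plain break would only terminate the inner for_each_cpu_andnot()
iteration, not the hop loop.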

> One alternative to consider, which would simplify things, is switching
> back to returning an array of cpus, ordered by their distance, up to a
> provided argument 'ncpus'.
> This way, you would iterate over sched_numa_hop_mask() internally, easily
> maintaining the cpumask diffs between two hops, without the need to build
> them on-the-fly as part of an exposed for-loop macro.
>

That requires extra storage, however: at the very least the array itself,
plus a temporary cpumask to remember already-visited CPUs (the alternative
being to scan the array on every iteration to figure out whether a CPU has
already been added; see the sketch below).
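
Concretely, dropping the cpumask would mean a linear scan like this
throwaway helper, run once per candidate CPU (illustrative only, not
part of any posted patch):

	/* O(n) membership test on the partially filled array. */
	static bool cpu_already_picked(const u16 *cpus, int n, u16 cpu)
	{
		int i;

		for (i = 0; i < n; i++)
			if (cpus[i] == cpu)
				return true;
		return false;
	}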

I'm going to submit the cpumask / sched changes; hopefully I'll have
something by the time you're back from PTO.


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2022-08-15 14:21 UTC | newest]

Thread overview: 27+ messages
2022-07-28 19:12 [PATCH net-next V4 0/3] Introduce and use NUMA distance metrics Tariq Toukan
2022-07-28 19:12 ` [PATCH net-next V4 1/3] sched/topology: Add NUMA-based CPUs spread API Tariq Toukan
2022-07-30 17:29   ` Tariq Toukan
2022-08-02  6:40     ` Tariq Toukan
2022-08-02  9:38       ` Valentin Schneider
2022-08-02 16:05         ` Jakub Kicinski
2022-08-04 17:28   ` Valentin Schneider
2022-08-08 14:39     ` Tariq Toukan
2022-08-09 10:02       ` Valentin Schneider
2022-08-09 10:18         ` Tariq Toukan
2022-08-09 12:52           ` Valentin Schneider
2022-08-09 14:04             ` Tariq Toukan
2022-08-09 17:36               ` Valentin Schneider
2022-08-10 10:46                 ` Valentin Schneider
2022-08-10 10:51                   ` [PATCH 1/2] sched/topology: Introduce sched_numa_hop_mask() Valentin Schneider
2022-08-10 10:51                     ` [PATCH 2/2] net/mlx5e: Leverage sched_numa_hop_mask() Valentin Schneider
2022-08-10 12:57                       ` Tariq Toukan
2022-08-10 17:42                         ` Jakub Kicinski
2022-08-11 14:26                         ` Valentin Schneider
2022-08-10 12:42                     ` [PATCH 1/2] sched/topology: Introduce sched_numa_hop_mask() Tariq Toukan
2022-08-10 12:57                       ` Tariq Toukan
2022-08-11 14:26                         ` Valentin Schneider
2022-08-14  8:19                           ` Tariq Toukan
2022-08-14  8:26                             ` Tariq Toukan
2022-08-15 14:20                             ` Valentin Schneider
2022-07-28 19:12 ` [PATCH net-next V4 2/3] net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints Tariq Toukan
2022-07-28 19:12 ` [PATCH net-next V4 3/3] enic: Use NUMA distances logic when setting " Tariq Toukan
