[PATCH 0/8] sched/topology: add for_each_numa_cpu() macro

* [PATCH 0/8] sched/topology: add for_each_numa_cpu() macro
@ 2023-03-25 18:55 Yury Norov
  2023-03-25 18:55 ` [PATCH 1/8] lib/find: add find_next_and_andnot_bit() Yury Norov
                   ` (8 more replies)
  0 siblings, 9 replies; 14+ messages in thread
From: Yury Norov @ 2023-03-25 18:55 UTC (permalink / raw)
  To: Jakub Kicinski, netdev, linux-rdma, linux-kernel
  Cc: Yury Norov, Saeed Mahameed, Pawel Chmielewski, Leon Romanovsky,
	David S. Miller, Eric Dumazet, Paolo Abeni, Andy Shevchenko,
	Rasmus Villemoes, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Tariq Toukan, Gal Pressman, Greg Kroah-Hartman, Heiko Carstens,
	Barry Song

for_each_cpu() is widely used in kernel, and it's beneficial to create
a NUMA-aware version of the macro.

Recently added for_each_numa_hop_mask() works, but switching existing
codebase to it is not an easy process.

This series adds for_each_numa_cpu(), which is designed to be similar to
the for_each_cpu(). It allows to convert existing code to NUMA-aware as
simple as adding a hop iterator variable and passing it inside new macro.
for_each_numa_cpu() takes care of the rest.

At the moment, we have 2 users of NUMA-aware enumerators. One is
Melanox's in-tree driver, and another is Intel's in-review driver:

https://lore.kernel.org/lkml/20230216145455.661709-1-pawel.chmielewski@intel.com/

Both real-life examples follow the same pattern:

	for_each_numa_hop_mask(cpus, prev, node) {
 		for_each_cpu_andnot(cpu, cpus, prev) {
 			if (cnt++ == max_num)
 				goto out;
 			do_something(cpu);
 		}
		prev = cpus;
 	}

With the new macro, it has a more standard look, like this:

	for_each_numa_cpu(cpu, hop, node, cpu_possible_mask) {
		if (cnt++ == max_num)
			break;
		do_something(cpu);
 	}

Straight conversion of existing for_each_cpu() codebase to NUMA-aware
version with for_each_numa_hop_mask() is difficult because it doesn't
take a user-provided cpu mask, and eventually ends up with open-coded
double loop. With for_each_numa_cpu() it shouldn't be a brainteaser.
Consider the NUMA-ignorant example:

	cpumask_t cpus = get_mask();
	int cnt = 0, cpu;

	for_each_cpu(cpu, cpus) {
		if (cnt++ == max_num)
			break;
		do_something(cpu);
 	}

Converting it to NUMA-aware version would be as simple as:

	cpumask_t cpus = get_mask();
	int node = get_node();
	int cnt = 0, hop, cpu;

	for_each_numa_cpu(cpu, hop, node, cpus) {
		if (cnt++ == max_num)
			break;
		do_something(cpu);
 	}

The latter looks more verbose and avoids from open-coding that annoying
double loop. Another advantage is that it works with a 'hop' parameter with
the clear meaning of NUMA distance, and doesn't make people not familiar
to enumerator internals bothering with current and previous masks machinery.

Yury Norov (8):
  lib/find: add find_next_and_andnot_bit()
  sched/topology: introduce sched_numa_find_next_cpu()
  sched/topology: add for_each_numa_cpu() macro
  net: mlx5: switch comp_irqs_request() to using for_each_numa_cpu
  lib/cpumask: update comment to cpumask_local_spread()
  sched/topology: export sched_domains_numa_levels
  lib: add test for for_each_numa_{cpu,hop_mask}()
  sched: drop for_each_numa_hop_mask()

 drivers/net/ethernet/mellanox/mlx5/core/eq.c | 16 ++----
 include/linux/find.h                         | 43 ++++++++++++++
 include/linux/topology.h                     | 39 ++++++++-----
 kernel/sched/topology.c                      | 59 +++++++++++---------
 lib/cpumask.c                                |  7 +--
 lib/find_bit.c                               | 12 ++++
 lib/test_bitmap.c                            | 16 ++++++
 7 files changed, 136 insertions(+), 56 deletions(-)

-- 
2.34.1

^ permalink raw reply	[flat|nested] 14+ messages in thread