linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/4] cpumask: improve on cpumask_local_spread() locality
@ 2022-11-12 19:09 Yury Norov
  2022-11-12 19:09 ` [PATCH v2 1/4] lib/find: introduce find_nth_and_andnot_bit Yury Norov
                   ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Yury Norov @ 2022-11-12 19:09 UTC (permalink / raw)
  To: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Gal Pressman, Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar,
	Jakub Kicinski, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Mel Gorman,
	Peter Zijlstra, Rasmus Villemoes, Saeed Mahameed, Steven Rostedt,
	Tariq Toukan, Tariq Toukan, Tony Luck, Valentin Schneider,
	Vincent Guittot
  Cc: Yury Norov, linux-crypto, netdev, linux-rdma

cpumask_local_spread() currently checks the local node for presence of the
i'th CPU, and if it finds nothing, makes a flat search among all non-local
CPUs. We can do better by checking CPUs per NUMA hop.

This series is inspired by Tariq Toukan and Valentin Schneider's "net/mlx5e:
Improve remote NUMA preferences used for the IRQ affinity hints"

https://patchwork.kernel.org/project/netdevbpf/patch/20220728191203.4055-3-tariqt@nvidia.com/

According to their measurements, for mlx5e:

        Bottleneck in RX side is released, reached linerate (~1.8x speedup).
        ~30% less cpu util on TX.

This series makes cpumask_local_spread() traverse CPUs based on NUMA
distance, just as the mlx5e series does, and I expect a comparable
improvement for its users.

I tested new behavior on my VM with the following NUMA configuration:

root@debian:~# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3
node 0 size: 3869 MB
node 0 free: 3740 MB
node 1 cpus: 4 5
node 1 size: 1969 MB
node 1 free: 1937 MB
node 2 cpus: 6 7
node 2 size: 1967 MB
node 2 free: 1873 MB
node 3 cpus: 8 9 10 11 12 13 14 15
node 3 size: 7842 MB
node 3 free: 7723 MB
node distances:
node   0   1   2   3
  0:  10  50  30  70
  1:  50  10  70  30
  2:  30  70  10  50
  3:  70  30  50  10

And the cpumask_local_spread() traversal for each node and offset looks
like this:

node 0:   0   1   2   3   6   7   4   5   8   9  10  11  12  13  14  15
node 1:   4   5   8   9  10  11  12  13  14  15   0   1   2   3   6   7
node 2:   6   7   0   1   2   3   8   9  10  11  12  13  14  15   4   5
node 3:   8   9  10  11  12  13  14  15   4   5   6   7   0   1   2   3
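For reference, a typical consumer pattern that benefits from this is
spreading IRQ affinity hints over a device's vectors. A hypothetical
driver sketch ('nvecs', 'irqs' and 'dev' are made-up names, not taken
from this series):

	unsigned int i, cpu;

	for (i = 0; i < nvecs; i++) {
		/* i'th closest online CPU to the device's home node */
		cpu = cpumask_local_spread(i, dev_to_node(dev));
		irq_set_affinity_hint(irqs[i], cpumask_of(cpu));
	}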

v1: https://lore.kernel.org/lkml/20221111040027.621646-5-yury.norov@gmail.com/T/
v2: 
 - use bsearch() in sched_numa_find_nth_cpu();
 - fix missing 'static inline' in 3rd patch.

Yury Norov (4):
  lib/find: introduce find_nth_and_andnot_bit
  cpumask: introduce cpumask_nth_and_andnot
  sched: add sched_numa_find_nth_cpu()
  cpumask: improve on cpumask_local_spread() locality

 include/linux/cpumask.h  | 20 +++++++++++++++
 include/linux/find.h     | 33 ++++++++++++++++++++++++
 include/linux/topology.h |  8 ++++++
 kernel/sched/topology.c  | 55 ++++++++++++++++++++++++++++++++++++++++
 lib/cpumask.c            | 12 ++-------
 lib/find_bit.c           |  9 +++++++
 6 files changed, 127 insertions(+), 10 deletions(-)

-- 
2.34.1



* [PATCH v2 1/4] lib/find: introduce find_nth_and_andnot_bit
  2022-11-12 19:09 [PATCH v2 0/4] cpumask: improve on cpumask_local_spread() locality Yury Norov
@ 2022-11-12 19:09 ` Yury Norov
  2022-11-12 19:09 ` [PATCH v2 2/4] cpumask: introduce cpumask_nth_and_andnot Yury Norov
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Yury Norov @ 2022-11-12 19:09 UTC (permalink / raw)
  To: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Gal Pressman, Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar,
	Jakub Kicinski, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Mel Gorman,
	Peter Zijlstra, Rasmus Villemoes, Saeed Mahameed, Steven Rostedt,
	Tariq Toukan, Tariq Toukan, Tony Luck, Valentin Schneider,
	Vincent Guittot
  Cc: Yury Norov, linux-crypto, netdev, linux-rdma

The function is used in the following patches to implement in-place bitmap
traversing without storing intermediate results in temporary bitmaps.
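To illustrate the intended semantics with example values (not from the
patch itself): if *addr1 == 0b1111, *addr2 == 0b1110 and *addr3 == 0b0100,
the bits set in (*addr1 & *addr2 & ~*addr3) are {1, 3}, so:

	find_nth_and_andnot_bit(addr1, addr2, addr3, 4, 0);	/* returns 1 */
	find_nth_and_andnot_bit(addr1, addr2, addr3, 4, 1);	/* returns 3 */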

Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 include/linux/find.h | 33 +++++++++++++++++++++++++++++++++
 lib/find_bit.c       |  9 +++++++++
 2 files changed, 42 insertions(+)

diff --git a/include/linux/find.h b/include/linux/find.h
index ccaf61a0f5fd..7a97b492b447 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -22,6 +22,9 @@ unsigned long __find_nth_and_bit(const unsigned long *addr1, const unsigned long
 				unsigned long size, unsigned long n);
 unsigned long __find_nth_andnot_bit(const unsigned long *addr1, const unsigned long *addr2,
 					unsigned long size, unsigned long n);
+unsigned long __find_nth_and_andnot_bit(const unsigned long *addr1, const unsigned long *addr2,
+					const unsigned long *addr3, unsigned long size,
+					unsigned long n);
 extern unsigned long _find_first_and_bit(const unsigned long *addr1,
 					 const unsigned long *addr2, unsigned long size);
 extern unsigned long _find_first_zero_bit(const unsigned long *addr, unsigned long size);
@@ -255,6 +258,36 @@ unsigned long find_nth_andnot_bit(const unsigned long *addr1, const unsigned lon
 	return __find_nth_andnot_bit(addr1, addr2, size, n);
 }
 
+/**
+ * find_nth_and_andnot_bit - find N'th set bit in 2 memory regions,
+ *			     excluding those set in 3rd region
+ * @addr1: The 1st address to start the search at
+ * @addr2: The 2nd address to start the search at
+ * @addr3: The 3rd address to start the search at
+ * @size: The maximum number of bits to search
+ * @n: The index of the set bit to find, counting from 0
+ *
+ * Returns the bit number of the N'th set bit.
+ * If no such bit exists, returns @size.
+ */
+static inline
+unsigned long find_nth_and_andnot_bit(const unsigned long *addr1,
+					const unsigned long *addr2,
+					const unsigned long *addr3,
+					unsigned long size, unsigned long n)
+{
+	if (n >= size)
+		return size;
+
+	if (small_const_nbits(size)) {
+		unsigned long val = *addr1 & *addr2 & ~*addr3 & GENMASK(size - 1, 0);
+
+		return val ? fns(val, n) : size;
+	}
+
+	return __find_nth_and_andnot_bit(addr1, addr2, addr3, size, n);
+}
+
 #ifndef find_first_and_bit
 /**
  * find_first_and_bit - find the first set bit in both memory regions
diff --git a/lib/find_bit.c b/lib/find_bit.c
index 18bc0a7ac8ee..c10920e66788 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -155,6 +155,15 @@ unsigned long __find_nth_andnot_bit(const unsigned long *addr1, const unsigned l
 }
 EXPORT_SYMBOL(__find_nth_andnot_bit);
 
+unsigned long __find_nth_and_andnot_bit(const unsigned long *addr1,
+					const unsigned long *addr2,
+					const unsigned long *addr3,
+					unsigned long size, unsigned long n)
+{
+	return FIND_NTH_BIT(addr1[idx] & addr2[idx] & ~addr3[idx], size, n);
+}
+EXPORT_SYMBOL(__find_nth_and_andnot_bit);
+
 #ifndef find_next_and_bit
 unsigned long _find_next_and_bit(const unsigned long *addr1, const unsigned long *addr2,
 					unsigned long nbits, unsigned long start)
-- 
2.34.1



* [PATCH v2 2/4] cpumask: introduce cpumask_nth_and_andnot
  2022-11-12 19:09 [PATCH v2 0/4] cpumask: improve on cpumask_local_spread() locality Yury Norov
  2022-11-12 19:09 ` [PATCH v2 1/4] lib/find: introduce find_nth_and_andnot_bit Yury Norov
@ 2022-11-12 19:09 ` Yury Norov
  2022-11-12 19:09 ` [PATCH v2 3/4] sched: add sched_numa_find_nth_cpu() Yury Norov
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Yury Norov @ 2022-11-12 19:09 UTC (permalink / raw)
  To: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Gal Pressman, Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar,
	Jakub Kicinski, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Mel Gorman,
	Peter Zijlstra, Rasmus Villemoes, Saeed Mahameed, Steven Rostedt,
	Tariq Toukan, Tariq Toukan, Tony Luck, Valentin Schneider,
	Vincent Guittot
  Cc: Yury Norov, linux-crypto, netdev, linux-rdma

Introduce cpumask_nth_and_andnot() based on find_nth_and_andnot_bit().
It's used in the following patch to traverse cpumasks without storing
intermediate results in a temporary cpumask.
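Usage sketch (illustrative only; 'hop_mask' and 'prev_mask' are made-up
names for two consecutive per-node NUMA masks):

	/* 2nd online CPU that is set in hop_mask but clear in prev_mask */
	cpu = cpumask_nth_and_andnot(1, cpu_online_mask, hop_mask, prev_mask);
	if (cpu >= nr_cpu_ids)
		return -ENOENT;	/* no such CPU */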

Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 include/linux/cpumask.h | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index c2aa0aa26b45..debfa2261569 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -391,6 +391,26 @@ unsigned int cpumask_nth_andnot(unsigned int cpu, const struct cpumask *srcp1,
 				nr_cpumask_bits, cpumask_check(cpu));
 }
 
+/**
+ * cpumask_nth_and_andnot - get the Nth cpu set in both 1st and 2nd cpumasks, and clear in 3rd.
+ * @srcp1: the cpumask pointer
+ * @srcp2: the cpumask pointer
+ * @srcp3: the cpumask pointer
+ * @cpu: the N'th cpu to find, starting from 0
+ *
+ * Returns >= nr_cpu_ids if such cpu doesn't exist.
+ */
+static inline
+unsigned int cpumask_nth_and_andnot(unsigned int cpu, const struct cpumask *srcp1,
+							const struct cpumask *srcp2,
+							const struct cpumask *srcp3)
+{
+	return find_nth_and_andnot_bit(cpumask_bits(srcp1),
+					cpumask_bits(srcp2),
+					cpumask_bits(srcp3),
+					nr_cpumask_bits, cpumask_check(cpu));
+}
+
 #define CPU_BITS_NONE						\
 {								\
 	[0 ... BITS_TO_LONGS(NR_CPUS)-1] = 0UL			\
-- 
2.34.1



* [PATCH v2 3/4] sched: add sched_numa_find_nth_cpu()
  2022-11-12 19:09 [PATCH v2 0/4] cpumask: improve on cpumask_local_spread() locality Yury Norov
  2022-11-12 19:09 ` [PATCH v2 1/4] lib/find: introduce find_nth_and_andnot_bit Yury Norov
  2022-11-12 19:09 ` [PATCH v2 2/4] cpumask: introduce cpumask_nth_and_andnot Yury Norov
@ 2022-11-12 19:09 ` Yury Norov
  2022-11-14 14:32   ` Andy Shevchenko
  2022-11-15 17:25   ` Valentin Schneider
  2022-11-12 19:09 ` [PATCH v2 4/4] cpumask: improve on cpumask_local_spread() locality Yury Norov
  2022-11-15 17:24 ` [PATCH v2 0/4] " Valentin Schneider
  4 siblings, 2 replies; 16+ messages in thread
From: Yury Norov @ 2022-11-12 19:09 UTC (permalink / raw)
  To: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Gal Pressman, Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar,
	Jakub Kicinski, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Mel Gorman,
	Peter Zijlstra, Rasmus Villemoes, Saeed Mahameed, Steven Rostedt,
	Tariq Toukan, Tariq Toukan, Tony Luck, Valentin Schneider,
	Vincent Guittot
  Cc: Yury Norov, linux-crypto, netdev, linux-rdma

The function finds the Nth set CPU in a given cpumask, starting from a given
node.

Leveraging the fact that each hop in sched_domains_numa_masks includes the
same or greater number of CPUs than the previous one, we can use binary
search on hops instead of a linear walk, which makes the overall complexity
O(log n) in terms of the number of cpumask_weight() calls.
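For reference, the bsearch is equivalent to the following linear walk
(a sketch, with locking and bounds checks omitted):

	/* find the first hop whose cumulative CPU count exceeds @cpu */
	for (hop = 0; hop < sched_domains_numa_levels; hop++) {
		if (cpumask_weight_and(cpus, masks[hop][node]) > cpu)
			break;
	}

but takes O(log(nr_hops)) cpumask_weight_and() calls instead of O(nr_hops).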

Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 include/linux/topology.h |  8 ++++++
 kernel/sched/topology.c  | 55 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 63 insertions(+)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 4564faafd0e1..b2e87728caea 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -245,5 +245,13 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
 	return cpumask_of_node(cpu_to_node(cpu));
 }
 
+#ifdef CONFIG_NUMA
+int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node);
+#else
+static inline int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
+{
+	return cpumask_nth(cpu, cpus);
+}
+#endif	/* CONFIG_NUMA */
 
 #endif /* _LINUX_TOPOLOGY_H */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8739c2a5a54e..024f1da0e941 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1764,6 +1764,8 @@ bool find_numa_distance(int distance)
  *   there is an intermediary node C, which is < N hops away from both
  *   nodes A and B, the system is a glueless mesh.
  */
+#include <linux/bsearch.h>
+
 static void init_numa_topology_type(int offline_node)
 {
 	int a, b, c, n;
@@ -2067,6 +2069,59 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
 	return found;
 }
 
+struct __cmp_key {
+	const struct cpumask *cpus;
+	struct cpumask ***masks;
+	int node;
+	int cpu;
+	int w;
+};
+
+static int cmp(const void *a, const void *b)
+{
+	struct cpumask **prev_hop = *((struct cpumask ***)b - 1);
+	struct cpumask **cur_hop = *(struct cpumask ***)b;
+	struct __cmp_key *k = (struct __cmp_key *)a;
+
+	if (cpumask_weight_and(k->cpus, cur_hop[k->node]) <= k->cpu)
+		return 1;
+
+	k->w = (b == k->masks) ? 0 : cpumask_weight_and(k->cpus, prev_hop[k->node]);
+	if (k->w <= k->cpu)
+		return 0;
+
+	return -1;
+}
+
+/**
+ * sched_numa_find_nth_cpu() - given the NUMA topology, find the Nth
+ *                             closest CPU to @node out of @cpus.
+ * @cpus: cpumask to find a cpu from
+ * @cpu: Nth cpu to find, counting from 0
+ * @node: the node to start searching from
+ * Return: cpu, or nr_cpu_ids when nothing is found.
+ */
+int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
+{
+	struct __cmp_key k = { cpus, NULL, node, cpu, 0 };
+	int hop, ret = nr_cpu_ids;
+
+	rcu_read_lock();
+	k.masks = rcu_dereference(sched_domains_numa_masks);
+	if (!k.masks)
+		goto unlock;
+
+	hop = (struct cpumask ***)
+		bsearch(&k, k.masks, sched_domains_numa_levels, sizeof(k.masks[0]), cmp) - k.masks;
+
+	ret = hop ?
+		cpumask_nth_and_andnot(cpu - k.w, cpus, k.masks[hop][node], k.masks[hop-1][node]) :
+		cpumask_nth_and(cpu - k.w, cpus, k.masks[0][node]);
+unlock:
+	rcu_read_unlock();
+	return ret;
+}
+EXPORT_SYMBOL_GPL(sched_numa_find_nth_cpu);
 #endif /* CONFIG_NUMA */
 
 static int __sdt_alloc(const struct cpumask *cpu_map)
-- 
2.34.1



* [PATCH v2 4/4] cpumask: improve on cpumask_local_spread() locality
  2022-11-12 19:09 [PATCH v2 0/4] cpumask: improve on cpumask_local_spread() locality Yury Norov
                   ` (2 preceding siblings ...)
  2022-11-12 19:09 ` [PATCH v2 3/4] sched: add sched_numa_find_nth_cpu() Yury Norov
@ 2022-11-12 19:09 ` Yury Norov
  2022-11-15 17:24 ` [PATCH v2 0/4] " Valentin Schneider
  4 siblings, 0 replies; 16+ messages in thread
From: Yury Norov @ 2022-11-12 19:09 UTC (permalink / raw)
  To: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Gal Pressman, Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar,
	Jakub Kicinski, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Mel Gorman,
	Peter Zijlstra, Rasmus Villemoes, Saeed Mahameed, Steven Rostedt,
	Tariq Toukan, Tariq Toukan, Tony Luck, Valentin Schneider,
	Vincent Guittot
  Cc: Yury Norov, linux-crypto, netdev, linux-rdma

Switch cpumask_local_spread() to use newly added sched_numa_find_nth_cpu(),
which takes into account distances to each node in the system.

For the following NUMA configuration:

root@debian:~# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3
node 0 size: 3869 MB
node 0 free: 3740 MB
node 1 cpus: 4 5
node 1 size: 1969 MB
node 1 free: 1937 MB
node 2 cpus: 6 7
node 2 size: 1967 MB
node 2 free: 1873 MB
node 3 cpus: 8 9 10 11 12 13 14 15
node 3 size: 7842 MB
node 3 free: 7723 MB
node distances:
node   0   1   2   3
  0:  10  50  30  70
  1:  50  10  70  30
  2:  30  70  10  50
  3:  70  30  50  10

The new cpumask_local_spread() traverses cpus for each node like this:

node 0:   0   1   2   3   6   7   4   5   8   9  10  11  12  13  14  15
node 1:   4   5   8   9  10  11  12  13  14  15   0   1   2   3   6   7
node 2:   6   7   0   1   2   3   8   9  10  11  12  13  14  15   4   5
node 3:   8   9  10  11  12  13  14  15   4   5   6   7   0   1   2   3
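For comparison, the pre-patch behavior (local node first, then a flat
walk over the rest) produces for node 1:

node 1:   4   5   0   1   2   3   6   7   8   9  10  11  12  13  14  15

ignoring that CPUs 8-15 (distance 30) are closer to node 1 than CPUs 0-3
(distance 50) and CPUs 6-7 (distance 70).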

Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 lib/cpumask.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/lib/cpumask.c b/lib/cpumask.c
index c7c392514fd3..255974cd6734 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -110,7 +110,7 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
 #endif
 
 /**
- * cpumask_local_spread - select the i'th cpu with local numa cpu's first
+ * cpumask_local_spread - select the i'th cpu based on NUMA distances
  * @i: index number
  * @node: local numa_node
  *
@@ -132,15 +132,7 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
 		if (cpu < nr_cpu_ids)
 			return cpu;
 	} else {
-		/* NUMA first. */
-		cpu = cpumask_nth_and(i, cpu_online_mask, cpumask_of_node(node));
-		if (cpu < nr_cpu_ids)
-			return cpu;
-
-		i -= cpumask_weight_and(cpu_online_mask, cpumask_of_node(node));
-
-		/* Skip NUMA nodes, done above. */
-		cpu = cpumask_nth_andnot(i, cpu_online_mask, cpumask_of_node(node));
+		cpu = sched_numa_find_nth_cpu(cpu_online_mask, i, node);
 		if (cpu < nr_cpu_ids)
 			return cpu;
 	}
-- 
2.34.1



* Re: [PATCH v2 3/4] sched: add sched_numa_find_nth_cpu()
  2022-11-12 19:09 ` [PATCH v2 3/4] sched: add sched_numa_find_nth_cpu() Yury Norov
@ 2022-11-14 14:32   ` Andy Shevchenko
  2022-11-14 15:02     ` Andy Shevchenko
  2022-12-08  2:55     ` Yury Norov
  2022-11-15 17:25   ` Valentin Schneider
  1 sibling, 2 replies; 16+ messages in thread
From: Andy Shevchenko @ 2022-11-14 14:32 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, David S. Miller, Barry Song, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Gal Pressman,
	Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar, Jakub Kicinski,
	Jason Gunthorpe, Jesse Brandeburg, Jonathan Cameron, Juri Lelli,
	Leon Romanovsky, Mel Gorman, Peter Zijlstra, Rasmus Villemoes,
	Saeed Mahameed, Steven Rostedt, Tariq Toukan, Tariq Toukan,
	Tony Luck, Valentin Schneider, Vincent Guittot, linux-crypto,
	netdev, linux-rdma

On Sat, Nov 12, 2022 at 11:09:45AM -0800, Yury Norov wrote:
> The function finds the Nth set CPU in a given cpumask, starting from a given
> node.
> 
> Leveraging the fact that each hop in sched_domains_numa_masks includes the
> same or greater number of CPUs than the previous one, we can use binary
> search on hops instead of a linear walk, which makes the overall complexity
> O(log n) in terms of the number of cpumask_weight() calls.

...

> +struct __cmp_key {
> +	const struct cpumask *cpus;
> +	struct cpumask ***masks;
> +	int node;
> +	int cpu;
> +	int w;
> +};
> +
> +static int cmp(const void *a, const void *b)

Calling them key and pivot (as in the caller) would make more sense.

> +{

What about

	const (?) struct cpumask ***masks = (...)pivot;

> +	struct cpumask **prev_hop = *((struct cpumask ***)b - 1);

	= masks[-1];

> +	struct cpumask **cur_hop = *(struct cpumask ***)b;

	= masks[0];

?

> +	struct __cmp_key *k = (struct __cmp_key *)a;

> +	if (cpumask_weight_and(k->cpus, cur_hop[k->node]) <= k->cpu)
> +		return 1;

> +	k->w = (b == k->masks) ? 0 : cpumask_weight_and(k->cpus, prev_hop[k->node]);
> +	if (k->w <= k->cpu)
> +		return 0;

Can k->cpu be negative? If no, we can rewrite above as

	k->w = 0;
	if (b == k->masks)
		return 0;

	k->w = cpumask_weight_and(k->cpus, prev_hop[k->node]);

> +	return -1;
> +}

...

> +int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
> +{
> +	struct __cmp_key k = { cpus, NULL, node, cpu, 0 };

You can drop NULL and 0 while using C99 assignments.

> +	int hop, ret = nr_cpu_ids;

> +	rcu_read_lock();

+ Blank line?

> +	k.masks = rcu_dereference(sched_domains_numa_masks);
> +	if (!k.masks)
> +		goto unlock;

> +	hop = (struct cpumask ***)
> +		bsearch(&k, k.masks, sched_domains_numa_levels, sizeof(k.masks[0]), cmp) - k.masks;

Strange indentation. I would rather see the split on parameters and
maybe '-' operator.

sizeof(*k.masks) is a bit shorter, right?

Also we may go with


	struct cpumask ***masks;
	struct __cmp_key k = { .cpus = cpus, .node = node, .cpu = cpu };



> +	ret = hop ?
> +		cpumask_nth_and_andnot(cpu - k.w, cpus, k.masks[hop][node], k.masks[hop-1][node]) :
> +		cpumask_nth_and(cpu - k.w, cpus, k.masks[0][node]);

> +unlock:

out_unlock: shows the intention more clearly, no?

> +	rcu_read_unlock();
> +	return ret;
> +}

-- 
With Best Regards,
Andy Shevchenko




* Re: [PATCH v2 3/4] sched: add sched_numa_find_nth_cpu()
  2022-11-14 14:32   ` Andy Shevchenko
@ 2022-11-14 15:02     ` Andy Shevchenko
  2022-12-08  2:55     ` Yury Norov
  1 sibling, 0 replies; 16+ messages in thread
From: Andy Shevchenko @ 2022-11-14 15:02 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, David S. Miller, Barry Song, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Gal Pressman,
	Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar, Jakub Kicinski,
	Jason Gunthorpe, Jesse Brandeburg, Jonathan Cameron, Juri Lelli,
	Leon Romanovsky, Mel Gorman, Peter Zijlstra, Rasmus Villemoes,
	Saeed Mahameed, Steven Rostedt, Tariq Toukan, Tariq Toukan,
	Tony Luck, Valentin Schneider, Vincent Guittot, linux-crypto,
	netdev, linux-rdma

On Mon, Nov 14, 2022 at 04:32:10PM +0200, Andy Shevchenko wrote:
> On Sat, Nov 12, 2022 at 11:09:45AM -0800, Yury Norov wrote:
> > The function finds the Nth set CPU in a given cpumask, starting from a given
> > node.
> > 
> > Leveraging the fact that each hop in sched_domains_numa_masks includes the
> > same or greater number of CPUs than the previous one, we can use binary
> > search on hops instead of a linear walk, which makes the overall complexity
> > O(log n) in terms of the number of cpumask_weight() calls.
> 
> ...
> 
> > +struct __cmp_key {
> > +	const struct cpumask *cpus;
> > +	struct cpumask ***masks;
> > +	int node;
> > +	int cpu;
> > +	int w;
> > +};
> > +
> > +static int cmp(const void *a, const void *b)
> 
> Calling them key and pivot (as in the caller) would make more sense.
> 
> > +{
> 
> What about
> 
> 	const (?) struct cpumask ***masks = (...)pivot;
> 
> > +	struct cpumask **prev_hop = *((struct cpumask ***)b - 1);
> 
> 	= masks[-1];
> 
> > +	struct cpumask **cur_hop = *(struct cpumask ***)b;
> 
> 	= masks[0];
> 
> ?
> 
> > +	struct __cmp_key *k = (struct __cmp_key *)a;
> 
> > +	if (cpumask_weight_and(k->cpus, cur_hop[k->node]) <= k->cpu)
> > +		return 1;
> 
> > +	k->w = (b == k->masks) ? 0 : cpumask_weight_and(k->cpus, prev_hop[k->node]);
> > +	if (k->w <= k->cpu)
> > +		return 0;
> 
> Can k->cpu be negative? If no, we can rewrite above as
> 
> 	k->w = 0;
> 	if (b == k->masks)
> 		return 0;
> 
> 	k->w = cpumask_weight_and(k->cpus, prev_hop[k->node]);
> 
> > +	return -1;
> > +}
> 
> ...
> 
> > +int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
> > +{
> > +	struct __cmp_key k = { cpus, NULL, node, cpu, 0 };
> 
> You can drop NULL and 0 while using C99 assignments.
> 
> > +	int hop, ret = nr_cpu_ids;
> 
> > +	rcu_read_lock();
> 
> + Blank line?
> 
> > +	k.masks = rcu_dereference(sched_domains_numa_masks);
> > +	if (!k.masks)
> > +		goto unlock;
> 
> > +	hop = (struct cpumask ***)
> > +		bsearch(&k, k.masks, sched_domains_numa_levels, sizeof(k.masks[0]), cmp) - k.masks;
> 
> Strange indentation. I would rather see the split on parameters and
> maybe '-' operator.
> 
> sizeof(*k.masks) is a bit shorter, right?
> 
> Also we may go with
> 
> 
> 	struct cpumask ***masks;
> 	struct __cmp_key k = { .cpus = cpus, .node = node, .cpu = cpu };
> 
> 
> 
> > +	ret = hop ?
> > +		cpumask_nth_and_andnot(cpu - k.w, cpus, k.masks[hop][node], k.masks[hop-1][node]) :
> > +		cpumask_nth_and(cpu - k.w, cpus, k.masks[0][node]);
> 
> > +unlock:
> 
> out_unlock: shows the intention more clearly, no?
> 
> > +	rcu_read_unlock();
> > +	return ret;
> > +}

Below is a diff I have got on top of your patch, only compile tested:

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 024f1da0e941..e04262578b52 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2070,26 +2070,28 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
 }
 
 struct __cmp_key {
-	const struct cpumask *cpus;
 	struct cpumask ***masks;
+	const struct cpumask *cpus;
 	int node;
 	int cpu;
 	int w;
 };
 
-static int cmp(const void *a, const void *b)
+static int cmp(const void *key, const void *pivot)
 {
-	struct cpumask **prev_hop = *((struct cpumask ***)b - 1);
-	struct cpumask **cur_hop = *(struct cpumask ***)b;
-	struct __cmp_key *k = (struct __cmp_key *)a;
+	struct __cmp_key *k = container_of(key, struct __cmp_key, masks);
+	const struct cpumask ***masks = (const struct cpumask ***)pivot;
+	const struct cpumask **prev = masks[-1];
+	const struct cpumask **cur = masks[0];
 
-	if (cpumask_weight_and(k->cpus, cur_hop[k->node]) <= k->cpu)
+	if (cpumask_weight_and(k->cpus, cur[k->node]) <= k->cpu)
 		return 1;
 
-	k->w = (b == k->masks) ? 0 : cpumask_weight_and(k->cpus, prev_hop[k->node]);
-	if (k->w <= k->cpu)
+	k->w = 0;
+	if (masks == (const struct cpumask ***)k->masks)
 		return 0;
 
+	k->w = cpumask_weight_and(k->cpus, prev[k->node]);
 	return -1;
 }
 
@@ -2103,17 +2105,17 @@ static int cmp(const void *a, const void *b)
  */
 int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
 {
-	struct __cmp_key k = { cpus, NULL, node, cpu, 0 };
+	struct __cmp_key k = { .cpus = cpus, .node = node, .cpu = cpu };
 	int hop, ret = nr_cpu_ids;
+	struct cpumask ***masks;
 
 	rcu_read_lock();
 	k.masks = rcu_dereference(sched_domains_numa_masks);
 	if (!k.masks)
 		goto unlock;
 
-	hop = (struct cpumask ***)
-		bsearch(&k, k.masks, sched_domains_numa_levels, sizeof(k.masks[0]), cmp) - k.masks;
-
+	masks = bsearch(&k.masks, k.masks, sched_domains_numa_levels, sizeof(*k.masks), cmp);
+	hop = masks - k.masks;
 	ret = hop ?
 		cpumask_nth_and_andnot(cpu - k.w, cpus, k.masks[hop][node], k.masks[hop-1][node]) :
 		cpumask_nth_and(cpu - k.w, cpus, k.masks[0][node]);

-- 
With Best Regards,
Andy Shevchenko




* Re: [PATCH v2 0/4] cpumask: improve on cpumask_local_spread() locality
  2022-11-12 19:09 [PATCH v2 0/4] cpumask: improve on cpumask_local_spread() locality Yury Norov
                   ` (3 preceding siblings ...)
  2022-11-12 19:09 ` [PATCH v2 4/4] cpumask: improve on cpumask_local_spread() locality Yury Norov
@ 2022-11-15 17:24 ` Valentin Schneider
  2022-11-15 18:32   ` Yury Norov
  4 siblings, 1 reply; 16+ messages in thread
From: Valentin Schneider @ 2022-11-15 17:24 UTC (permalink / raw)
  To: Yury Norov, linux-kernel, David S. Miller, Andy Shevchenko,
	Barry Song, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Gal Pressman, Greg Kroah-Hartman,
	Heiko Carstens, Ingo Molnar, Jakub Kicinski, Jason Gunthorpe,
	Jesse Brandeburg, Jonathan Cameron, Juri Lelli, Leon Romanovsky,
	Mel Gorman, Peter Zijlstra, Rasmus Villemoes, Saeed Mahameed,
	Steven Rostedt, Tariq Toukan, Tariq Toukan, Tony Luck,
	Vincent Guittot
  Cc: Yury Norov, linux-crypto, netdev, linux-rdma

Hi,

On 12/11/22 11:09, Yury Norov wrote:
> cpumask_local_spread() currently checks the local node for presence of the
> i'th CPU, and if it finds nothing, makes a flat search among all non-local
> CPUs. We can do better by checking CPUs per NUMA hop.
>
> This series is inspired by Tariq Toukan and Valentin Schneider's "net/mlx5e:
> Improve remote NUMA preferences used for the IRQ affinity hints"
>
> https://patchwork.kernel.org/project/netdevbpf/patch/20220728191203.4055-3-tariqt@nvidia.com/
>
> According to their measurements, for mlx5e:
>
>         Bottleneck in RX side is released, reached linerate (~1.8x speedup).
>         ~30% less cpu util on TX.
>
> This series makes cpumask_local_spread() traverse CPUs based on NUMA
> distance, just as the mlx5e series does, and I expect a comparable
> improvement for its users.
>
> I tested new behavior on my VM with the following NUMA configuration:
>
> root@debian:~# numactl -H
> available: 4 nodes (0-3)
> node 0 cpus: 0 1 2 3
> node 0 size: 3869 MB
> node 0 free: 3740 MB
> node 1 cpus: 4 5
> node 1 size: 1969 MB
> node 1 free: 1937 MB
> node 2 cpus: 6 7
> node 2 size: 1967 MB
> node 2 free: 1873 MB
> node 3 cpus: 8 9 10 11 12 13 14 15
> node 3 size: 7842 MB
> node 3 free: 7723 MB
> node distances:
> node   0   1   2   3
>   0:  10  50  30  70
>   1:  50  10  70  30
>   2:  30  70  10  50
>   3:  70  30  50  10
>
> And the cpumask_local_spread() traversal for each node and offset looks
> like this:
>
> node 0:   0   1   2   3   6   7   4   5   8   9  10  11  12  13  14  15
> node 1:   4   5   8   9  10  11  12  13  14  15   0   1   2   3   6   7
> node 2:   6   7   0   1   2   3   8   9  10  11  12  13  14  15   4   5
> node 3:   8   9  10  11  12  13  14  15   4   5   6   7   0   1   2   3
>

Is this meant as a replacement for [1]?

I like that this is changing an existing interface so that all current
users directly benefit from the change. Now, about half of the users of
cpumask_local_spread() use it in a loop with incremental @i parameter,
which makes the repeated bsearch a bit of a shame, but then I'm tempted to
say the first point makes it worth it.

[1]: https://lore.kernel.org/all/20221028164959.1367250-1-vschneid@redhat.com/
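The pattern in question, schematically ('nvecs' is a stand-in name):

	for (i = 0; i < nvecs; i++)
		cpu = cpumask_local_spread(i, node);	/* redoes the bsearch */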

> v1: https://lore.kernel.org/lkml/20221111040027.621646-5-yury.norov@gmail.com/T/
> v2:
>  - use bsearch() in sched_numa_find_nth_cpu();
>  - fix missing 'static inline' in 3rd patch.
>
> Yury Norov (4):
>   lib/find: introduce find_nth_and_andnot_bit
>   cpumask: introduce cpumask_nth_and_andnot
>   sched: add sched_numa_find_nth_cpu()
>   cpumask: improve on cpumask_local_spread() locality
>
>  include/linux/cpumask.h  | 20 +++++++++++++++
>  include/linux/find.h     | 33 ++++++++++++++++++++++++
>  include/linux/topology.h |  8 ++++++
>  kernel/sched/topology.c  | 55 ++++++++++++++++++++++++++++++++++++++++
>  lib/cpumask.c            | 12 ++-------
>  lib/find_bit.c           |  9 +++++++
>  6 files changed, 127 insertions(+), 10 deletions(-)
>
> --
> 2.34.1



* Re: [PATCH v2 3/4] sched: add sched_numa_find_nth_cpu()
  2022-11-12 19:09 ` [PATCH v2 3/4] sched: add sched_numa_find_nth_cpu() Yury Norov
  2022-11-14 14:32   ` Andy Shevchenko
@ 2022-11-15 17:25   ` Valentin Schneider
  1 sibling, 0 replies; 16+ messages in thread
From: Valentin Schneider @ 2022-11-15 17:25 UTC (permalink / raw)
  To: Yury Norov, linux-kernel, David S. Miller, Andy Shevchenko,
	Barry Song, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Gal Pressman, Greg Kroah-Hartman,
	Heiko Carstens, Ingo Molnar, Jakub Kicinski, Jason Gunthorpe,
	Jesse Brandeburg, Jonathan Cameron, Juri Lelli, Leon Romanovsky,
	Mel Gorman, Peter Zijlstra, Rasmus Villemoes, Saeed Mahameed,
	Steven Rostedt, Tariq Toukan, Tariq Toukan, Tony Luck,
	Vincent Guittot
  Cc: Yury Norov, linux-crypto, netdev, linux-rdma

On 12/11/22 11:09, Yury Norov wrote:
> The function finds the Nth set CPU in a given cpumask, starting from a given
> node.
> 
> Leveraging the fact that each hop in sched_domains_numa_masks includes the
> same or greater number of CPUs than the previous one, we can use binary
> search on hops instead of a linear walk, which makes the overall complexity
> O(log n) in terms of the number of cpumask_weight() calls.
>

So one thing regarding the bsearch and NUMA levels; until not so long ago
we couldn't even support 3 hops [1], and this only got detected when such
machines started showing up.

Your bsearch here operates on NUMA levels, which represent hops, and so far
we know of systems that have up to 4 levels. I'd be surprised (and also
appalled) if we even doubled that in the next decade, so with that in mind,
a linear walk might not be so horrible.

[1]: https://lore.kernel.org/all/20210224030944.15232-1-song.bao.hua@hisilicon.com/


> Signed-off-by: Yury Norov <yury.norov@gmail.com>
> ---
> +int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
> +{
> +	struct __cmp_key k = { cpus, NULL, node, cpu, 0 };
> +	int hop, ret = nr_cpu_ids;
> +
> +	rcu_read_lock();
> +	k.masks = rcu_dereference(sched_domains_numa_masks);
> +	if (!k.masks)
> +		goto unlock;
> +
> +	hop = (struct cpumask ***)
> +		bsearch(&k, k.masks, sched_domains_numa_levels, sizeof(k.masks[0]), cmp) - k.masks;
> +
> +	ret = hop ?
> +		cpumask_nth_and_andnot(cpu - k.w, cpus, k.masks[hop][node], k.masks[hop-1][node]) :
> +		cpumask_nth_and(cpu - k.w, cpus, k.masks[0][node]);
                                      ^^^
                  wouldn't this always be 0 here?

> +unlock:
> +	rcu_read_unlock();
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(sched_numa_find_nth_cpu);
>  #endif /* CONFIG_NUMA */
>
>  static int __sdt_alloc(const struct cpumask *cpu_map)
> --
> 2.34.1



* Re: [PATCH v2 0/4] cpumask: improve on cpumask_local_spread() locality
  2022-11-15 17:24 ` [PATCH v2 0/4] " Valentin Schneider
@ 2022-11-15 18:32   ` Yury Norov
  2022-11-17 12:23     ` Valentin Schneider
  0 siblings, 1 reply; 16+ messages in thread
From: Yury Norov @ 2022-11-15 18:32 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Gal Pressman, Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar,
	Jakub Kicinski, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Mel Gorman,
	Peter Zijlstra, Rasmus Villemoes, Saeed Mahameed, Steven Rostedt,
	Tariq Toukan, Tariq Toukan, Tony Luck, Vincent Guittot,
	linux-crypto, netdev, linux-rdma

On Tue, Nov 15, 2022 at 05:24:56PM +0000, Valentin Schneider wrote:
> Hi,
> 
> On 12/11/22 11:09, Yury Norov wrote:
> > cpumask_local_spread() currently checks the local node for presence of the
> > i'th CPU, and if it finds nothing, makes a flat search among all non-local
> > CPUs. We can do better by checking CPUs per NUMA hop.
> >
> > This series is inspired by Tariq Toukan and Valentin Schneider's "net/mlx5e:
> > Improve remote NUMA preferences used for the IRQ affinity hints"
> >
> > https://patchwork.kernel.org/project/netdevbpf/patch/20220728191203.4055-3-tariqt@nvidia.com/
> >
> > According to their measurements, for mlx5e:
> >
> >         Bottleneck in RX side is released, reached linerate (~1.8x speedup).
> >         ~30% less cpu util on TX.
> >
> > This series makes cpumask_local_spread() traverse CPUs based on NUMA
> > distance, just as the mlx5e series does, and I expect a comparable
> > improvement for its users.
> >
> > I tested new behavior on my VM with the following NUMA configuration:
> >
> > root@debian:~# numactl -H
> > available: 4 nodes (0-3)
> > node 0 cpus: 0 1 2 3
> > node 0 size: 3869 MB
> > node 0 free: 3740 MB
> > node 1 cpus: 4 5
> > node 1 size: 1969 MB
> > node 1 free: 1937 MB
> > node 2 cpus: 6 7
> > node 2 size: 1967 MB
> > node 2 free: 1873 MB
> > node 3 cpus: 8 9 10 11 12 13 14 15
> > node 3 size: 7842 MB
> > node 3 free: 7723 MB
> > node distances:
> > node   0   1   2   3
> >   0:  10  50  30  70
> >   1:  50  10  70  30
> >   2:  30  70  10  50
> >   3:  70  30  50  10
> >
> > And the cpumask_local_spread() traversal for each node and offset looks
> > like this:
> >
> > node 0:   0   1   2   3   6   7   4   5   8   9  10  11  12  13  14  15
> > node 1:   4   5   8   9  10  11  12  13  14  15   0   1   2   3   6   7
> > node 2:   6   7   0   1   2   3   8   9  10  11  12  13  14  15   4   5
> > node 3:   8   9  10  11  12  13  14  15   4   5   6   7   0   1   2   3
> >
> 
> Is this meant as a replacement for [1]?

No. Your series adds an iterator, and in my experience the code that
uses iterators of that sort is almost always better and easier to
understand than cpumask_nth() or cpumask_next()-like users.

The only advantage of my series is that it keeps the existing codebase
untouched.
 
> I like that this is changing an existing interface so that all current
> users directly benefit from the change. Now, about half of the users of
> cpumask_local_spread() use it in a loop with incremental @i parameter,
> which makes the repeated bsearch a bit of a shame, but then I'm tempted to
> say the first point makes it worth it.
> 
> [1]: https://lore.kernel.org/all/20221028164959.1367250-1-vschneid@redhat.com/

In terms of the very common case of sequential invocation of local_spread()
for cpus from 0 to nr_cpu_ids, the complexity of my approach is O(n * log n),
and your approach is amortized O(n), which is better. Not a big deal _now_,
as you mentioned in the other email. But we never know how things will
evolve, right?

So, I would take both and maybe in a comment to cpumask_local_spread()
mention that there's a better alternative for those who call the
function for all CPUs incrementally.
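Something like this, perhaps (just a sketch of the wording; [1] meaning
your series above):

	/*
	 * NOTE: callers that need a CPU for every index in [0, nr_cpu_ids)
	 * are better served by iterating over NUMA hops directly (see [1]),
	 * which amortizes the search instead of redoing it per index.
	 */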

Thanks,
Yury


* Re: [PATCH v2 0/4] cpumask: improve on cpumask_local_spread() locality
  2022-11-15 18:32   ` Yury Norov
@ 2022-11-17 12:23     ` Valentin Schneider
  2022-11-28  6:39       ` Tariq Toukan
  0 siblings, 1 reply; 16+ messages in thread
From: Valentin Schneider @ 2022-11-17 12:23 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Gal Pressman, Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar,
	Jakub Kicinski, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Mel Gorman,
	Peter Zijlstra, Rasmus Villemoes, Saeed Mahameed, Steven Rostedt,
	Tariq Toukan, Tariq Toukan, Tony Luck, Vincent Guittot,
	linux-crypto, netdev, linux-rdma

On 15/11/22 10:32, Yury Norov wrote:
> On Tue, Nov 15, 2022 at 05:24:56PM +0000, Valentin Schneider wrote:
>>
>> Is this meant as a replacement for [1]?
>
> No. Your series adds an iterator, and in my experience the code that
> uses iterators of that sort is almost always better and easier to
> understand than cpumask_nth() or cpumask_next()-like users.
>
> The only advantage of my series is that it keeps the existing codebase
> untouched.
>

Right

>> I like that this is changing an existing interface so that all current
>> users directly benefit from the change. Now, about half of the users of
>> cpumask_local_spread() use it in a loop with incremental @i parameter,
>> which makes the repeated bsearch a bit of a shame, but then I'm tempted to
>> say the first point makes it worth it.
>>
>> [1]: https://lore.kernel.org/all/20221028164959.1367250-1-vschneid@redhat.com/
>
> In terms of the very common case of sequential invocation of local_spread()
> for cpus from 0 to nr_cpu_ids, the complexity of my approach is O(n * log n),
> and your approach is amortized O(n), which is better. Not a big deal _now_,
> as you mentioned in the other email. But we never know how things will
> evolve, right?
>
> So, I would take both and maybe in a comment to cpumask_local_spread()
> mention that there's a better alternative for those who call the
> function for all CPUs incrementally.
>

Ack, sounds good.

> Thanks,
> Yury



* Re: [PATCH v2 0/4] cpumask: improve on cpumask_local_spread() locality
  2022-11-17 12:23     ` Valentin Schneider
@ 2022-11-28  6:39       ` Tariq Toukan
  2022-11-30  1:47         ` Yury Norov
  0 siblings, 1 reply; 16+ messages in thread
From: Tariq Toukan @ 2022-11-28  6:39 UTC (permalink / raw)
  To: Valentin Schneider, Yury Norov
  Cc: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Gal Pressman, Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar,
	Jakub Kicinski, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Mel Gorman,
	Peter Zijlstra, Rasmus Villemoes, Saeed Mahameed, Steven Rostedt,
	Tariq Toukan, Tony Luck, Vincent Guittot, linux-crypto, netdev,
	linux-rdma



On 11/17/2022 2:23 PM, Valentin Schneider wrote:
> On 15/11/22 10:32, Yury Norov wrote:
>> On Tue, Nov 15, 2022 at 05:24:56PM +0000, Valentin Schneider wrote:
>>>
>>> Is this meant as a replacement for [1]?
>>
>> No. Your series adds an iterator, and in my experience the code that
>> uses iterators of that sort is almost always better and easier to
>> understand than cpumask_nth() or cpumask_next()-like users.
>>
>> The only advantage of my series is that it keeps the existing codebase
>> untouched.
>>
> 
> Right
> 
>>> I like that this is changing an existing interface so that all current
>>> users directly benefit from the change. Now, about half of the users of
>>> cpumask_local_spread() use it in a loop with incremental @i parameter,
>>> which makes the repeated bsearch a bit of a shame, but then I'm tempted to
>>> say the first point makes it worth it.
>>>
>>> [1]: https://lore.kernel.org/all/20221028164959.1367250-1-vschneid@redhat.com/
>>
>> In terms of the very common case of sequential invocation of local_spread()
>> for cpus from 0 to nr_cpu_ids, the complexity of my approach is O(n * log n),
>> and your approach is amortized O(n), which is better. Not a big deal _now_,
>> as you mentioned in the other email. But we never know how things will
>> evolve, right?
>>
>> So, I would take both and maybe in a comment to cpumask_local_spread()
>> mention that there's a better alternative for those who call the
>> function for all CPUs incrementally.
>>
> 
> Ack, sounds good.
> 

Good.
Is a respin needed, to add the comment mentioned above?


* Re: [PATCH v2 0/4] cpumask: improve on cpumask_local_spread() locality
  2022-11-28  6:39       ` Tariq Toukan
@ 2022-11-30  1:47         ` Yury Norov
  2022-12-07 12:53           ` Tariq Toukan
  0 siblings, 1 reply; 16+ messages in thread
From: Yury Norov @ 2022-11-30  1:47 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Valentin Schneider, linux-kernel, David S. Miller,
	Andy Shevchenko, Barry Song, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Gal Pressman,
	Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar, Jakub Kicinski,
	Jason Gunthorpe, Jesse Brandeburg, Jonathan Cameron, Juri Lelli,
	Leon Romanovsky, Mel Gorman, Peter Zijlstra, Rasmus Villemoes,
	Saeed Mahameed, Steven Rostedt, Tariq Toukan, Tony Luck,
	Vincent Guittot, linux-crypto, netdev, linux-rdma

On Mon, Nov 28, 2022 at 08:39:24AM +0200, Tariq Toukan wrote:
> 
> 
> On 11/17/2022 2:23 PM, Valentin Schneider wrote:
> > On 15/11/22 10:32, Yury Norov wrote:
> > > On Tue, Nov 15, 2022 at 05:24:56PM +0000, Valentin Schneider wrote:
> > > > 
> > > > Is this meant as a replacement for [1]?
> > > 
> > > No. Your series adds an iterator, and in my experience the code that
> > > uses iterators of that sort is almost always better and easier to
> > > understand than cpumask_nth() or cpumask_next()-like users.
> > > 
> > > The only advantage of my series is that it keeps the existing codebase
> > > untouched.
> > > 
> > 
> > Right
> > 
> > > > I like that this is changing an existing interface so that all current
> > > > users directly benefit from the change. Now, about half of the users of
> > > > cpumask_local_spread() use it in a loop with incremental @i parameter,
> > > > which makes the repeated bsearch a bit of a shame, but then I'm tempted to
> > > > say the first point makes it worth it.
> > > > 
> > > > [1]: https://lore.kernel.org/all/20221028164959.1367250-1-vschneid@redhat.com/
> > > 
> > > In terms of the very common case of sequential invocation of local_spread()
> > > for cpus from 0 to nr_cpu_ids, the complexity of my approach is O(n * log n),
> > > and your approach is amortized O(n), which is better. Not a big deal _now_,
> > > as you mentioned in the other email. But we never know how things will
> > > evolve, right?
> > > 
> > > So, I would take both and maybe in a comment to cpumask_local_spread()
> > > mention that there's a better alternative for those who call the
> > > function for all CPUs incrementally.
> > > 
> > 
> > Ack, sounds good.
> > 
> 
> Good.
> Is a respin needed, to add the comment mentioned above?

If you think it's worth the effort.


* Re: [PATCH v2 0/4] cpumask: improve on cpumask_local_spread() locality
  2022-11-30  1:47         ` Yury Norov
@ 2022-12-07 12:53           ` Tariq Toukan
  2022-12-07 20:45             ` Yury Norov
  0 siblings, 1 reply; 16+ messages in thread
From: Tariq Toukan @ 2022-12-07 12:53 UTC (permalink / raw)
  To: Yury Norov
  Cc: Valentin Schneider, linux-kernel, David S. Miller,
	Andy Shevchenko, Barry Song, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Gal Pressman,
	Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar, Jakub Kicinski,
	Jason Gunthorpe, Jesse Brandeburg, Jonathan Cameron, Juri Lelli,
	Leon Romanovsky, Mel Gorman, Peter Zijlstra, Rasmus Villemoes,
	Saeed Mahameed, Steven Rostedt, Tariq Toukan, Tony Luck,
	Vincent Guittot, linux-crypto, netdev, linux-rdma



On 11/30/2022 3:47 AM, Yury Norov wrote:
> On Mon, Nov 28, 2022 at 08:39:24AM +0200, Tariq Toukan wrote:
>>
>>
>> On 11/17/2022 2:23 PM, Valentin Schneider wrote:
>>> On 15/11/22 10:32, Yury Norov wrote:
>>>> On Tue, Nov 15, 2022 at 05:24:56PM +0000, Valentin Schneider wrote:
>>>>>
>>>>> Is this meant as a replacement for [1]?
>>>>
>>>> No. Your series adds an iterator, and in my experience the code that
>>>> uses iterators of that sort is almost always better and easier to
>>>> understand than cpumask_nth() or cpumask_next()-like users.
>>>>
>>>> The only advantage of my series is that it keeps the existing codebase
>>>> untouched.
>>>>
>>>
>>> Right
>>>
>>>>> I like that this is changing an existing interface so that all current
>>>>> users directly benefit from the change. Now, about half of the users of
>>>>> cpumask_local_spread() use it in a loop with incremental @i parameter,
>>>>> which makes the repeated bsearch a bit of a shame, but then I'm tempted to
>>>>> say the first point makes it worth it.
>>>>>
>>>>> [1]: https://lore.kernel.org/all/20221028164959.1367250-1-vschneid@redhat.com/
>>>>
>>>> In terms of the very common case of sequential invocation of local_spread()
>>>> for cpus from 0 to nr_cpu_ids, the complexity of my approach is O(n * log n),
>>>> and your approach is amortized O(n), which is better. Not a big deal _now_,
>>>> as you mentioned in the other email. But we never know how things will
>>>> evolve, right?
>>>>
>>>> So, I would take both and maybe in a comment to cpumask_local_spread()
>>>> mention that there's a better alternative for those who call the
>>>> function for all CPUs incrementally.
>>>>
>>>
>>> Ack, sounds good.
>>>
>>
>> Good.
>> Is a respin needed, to add the comment mentioned above?
> 
> If you think it's worth the effort.

No, not sure it is...

I asked because this mail thread was inactive for a while, with the 
patches not accepted to the kernel yet.

If everyone is happy with it, let's get it into this kernel while possible.

To which tree should it go?

Thanks,
Tariq


* Re: [PATCH v2 0/4] cpumask: improve on cpumask_local_spread() locality
  2022-12-07 12:53           ` Tariq Toukan
@ 2022-12-07 20:45             ` Yury Norov
  0 siblings, 0 replies; 16+ messages in thread
From: Yury Norov @ 2022-12-07 20:45 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Valentin Schneider, linux-kernel, David S. Miller,
	Andy Shevchenko, Barry Song, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Gal Pressman,
	Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar, Jakub Kicinski,
	Jason Gunthorpe, Jesse Brandeburg, Jonathan Cameron, Juri Lelli,
	Leon Romanovsky, Mel Gorman, Peter Zijlstra, Rasmus Villemoes,
	Saeed Mahameed, Steven Rostedt, Tariq Toukan, Tony Luck,
	Vincent Guittot, linux-crypto, netdev, linux-rdma

On Wed, Dec 07, 2022 at 02:53:58PM +0200, Tariq Toukan wrote:
> 
> 
> On 11/30/2022 3:47 AM, Yury Norov wrote:
> > On Mon, Nov 28, 2022 at 08:39:24AM +0200, Tariq Toukan wrote:
> > > 
> > > 
> > > On 11/17/2022 2:23 PM, Valentin Schneider wrote:
> > > > On 15/11/22 10:32, Yury Norov wrote:
> > > > > On Tue, Nov 15, 2022 at 05:24:56PM +0000, Valentin Schneider wrote:
> > > > > > 
> > > > > > Is this meant as a replacement for [1]?
> > > > > 
> > > > > No. Your series adds an iterator, and in my experience the code that
> > > > > uses iterators of that sort is almost always better and easier to
> > > > > understand than cpumask_nth() or cpumask_next()-like users.
> > > > > 
> > > > > The only advantage of my series is that it keeps the existing codebase
> > > > > untouched.
> > > > > 
> > > > 
> > > > Right
> > > > 
> > > > > > I like that this is changing an existing interface so that all current
> > > > > > users directly benefit from the change. Now, about half of the users of
> > > > > > cpumask_local_spread() use it in a loop with incremental @i parameter,
> > > > > > which makes the repeated bsearch a bit of a shame, but then I'm tempted to
> > > > > > say the first point makes it worth it.
> > > > > > 
> > > > > > [1]: https://lore.kernel.org/all/20221028164959.1367250-1-vschneid@redhat.com/
> > > > > 
> > > > > In terms of the very common case of sequential invocation of local_spread()
> > > > > for cpus from 0 to nr_cpu_ids, the complexity of my approach is O(n * log n),
> > > > > and your approach is amortized O(n), which is better. Not a big deal _now_,
> > > > > as you mentioned in the other email. But we never know how things will
> > > > > evolve, right?
> > > > > 
> > > > > So, I would take both and maybe in a comment to cpumask_local_spread()
> > > > > mention that there's a better alternative for those who call the
> > > > > function for all CPUs incrementally.
> > > > > 
> > > > 
> > > > Ack, sounds good.
> > > > 
> > > 
> > > Good.
> > > Is a respin needed, to add the comment mentioned above?
> > 
> > If you think it's worth the effort.
> 
> No, not sure it is...
> 
> I asked because this mail thread was inactive for a while, with the patches
> not accepted to the kernel yet.
> 
> If everyone is happy with it, let's get it into this kernel while possible.
> 
> To which tree should it go?

I've got the bitmap tree and can move it there, but this series is related
to the scheduler and NUMA as well, and I'd prefer to move it through those
trees.

If it moves through bitmap, I'd like to collect more reviews and testing.

Thanks,
Yury


* Re: [PATCH v2 3/4] sched: add sched_numa_find_nth_cpu()
  2022-11-14 14:32   ` Andy Shevchenko
  2022-11-14 15:02     ` Andy Shevchenko
@ 2022-12-08  2:55     ` Yury Norov
  1 sibling, 0 replies; 16+ messages in thread
From: Yury Norov @ 2022-12-08  2:55 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: linux-kernel, David S. Miller, Barry Song, Ben Segall,
	Daniel Bristot de Oliveira, Dietmar Eggemann, Gal Pressman,
	Greg Kroah-Hartman, Heiko Carstens, Ingo Molnar, Jakub Kicinski,
	Jason Gunthorpe, Jesse Brandeburg, Jonathan Cameron, Juri Lelli,
	Leon Romanovsky, Mel Gorman, Peter Zijlstra, Rasmus Villemoes,
	Saeed Mahameed, Steven Rostedt, Tariq Toukan, Tariq Toukan,
	Tony Luck, Valentin Schneider, Vincent Guittot, linux-crypto,
	netdev, linux-rdma

On Mon, Nov 14, 2022 at 04:32:09PM +0200, Andy Shevchenko wrote:
> On Sat, Nov 12, 2022 at 11:09:45AM -0800, Yury Norov wrote:
> > The function finds the Nth set CPU in a given cpumask, starting from a given
> > node.
> > 
> > Leveraging the fact that each hop in sched_domains_numa_masks includes the
> > same or greater number of CPUs than the previous one, we can use binary
> > search on hops instead of a linear walk, which makes the overall complexity
> > O(log n) in terms of the number of cpumask_weight() calls.
> 
> ...
> 
> > +struct __cmp_key {
> > +	const struct cpumask *cpus;
> > +	struct cpumask ***masks;
> > +	int node;
> > +	int cpu;
> > +	int w;
> > +};
> > +
> > +static int cmp(const void *a, const void *b)
> 
> Calling them key and pivot (as in the caller) would make more sense.

I think they are intentionally named opaquely, so that the user (me) would
cast them to proper data structures and give them meaningful names. So I did.
 
> > +{
> 
> What about
> 
> 	const (?) struct cpumask ***masks = (...)pivot;
> 
> > +	struct cpumask **prev_hop = *((struct cpumask ***)b - 1);
> 
> 	= masks[-1];
> 
> > +	struct cpumask **cur_hop = *(struct cpumask ***)b;
> 
> 	= masks[0];
> 
> ?

It would work as well. Neither better nor worse.

> > +	struct __cmp_key *k = (struct __cmp_key *)a;
> 
> > +	if (cpumask_weight_and(k->cpus, cur_hop[k->node]) <= k->cpu)
> > +		return 1;
> 
> > +	k->w = (b == k->masks) ? 0 : cpumask_weight_and(k->cpus, prev_hop[k->node]);
> > +	if (k->w <= k->cpu)
> > +		return 0;
> 
> Can k->cpu be negative?

A user may pass a negative value. Currently cpumask_local_spread() will
return nr_cpu_ids.

After the rework, bsearch() will return hop #0. After that, cpumask_nth_and()
will cast the negative cpu to unsigned long, and because that's a huge number,
it will again return nr_cpu_ids.

> If no, we can rewrite above as
> 
> 	k->w = 0;
> 	if (b == k->masks)
> 		return 0;
> 
> 	k->w = cpumask_weight_and(k->cpus, prev_hop[k->node]);

Here we still need to compare the weight of prev_hop against k->cpu.
Returning -1 unconditionally is wrong.
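I.e., the tail of your version would need to be something like (sketch):

	k->w = cpumask_weight_and(k->cpus, prev_hop[k->node]);
	return k->w <= k->cpu ? 0 : -1;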

> > +	return -1;
> > +}
> 
> ...
> 
> > +int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
> > +{
> > +	struct __cmp_key k = { cpus, NULL, node, cpu, 0 };
> 
> You can drop NULL and 0 while using C99 assignments.
> 
> > +	int hop, ret = nr_cpu_ids;
> 
> > +	rcu_read_lock();
> 
> + Blank line?
> 
> > +	k.masks = rcu_dereference(sched_domains_numa_masks);
> > +	if (!k.masks)
> > +		goto unlock;
> 
> > +	hop = (struct cpumask ***)
> > +		bsearch(&k, k.masks, sched_domains_numa_levels, sizeof(k.masks[0]), cmp) - k.masks;
> 
> Strange indentation. I would rather see the split on parameters and
> maybe '-' operator.
> 
> sizeof(*k.masks) is a bit shorter, right?
> 
> Also we may go with
> 
> 
> 	struct cpumask ***masks;
> 	struct __cmp_key k = { .cpus = cpus, .node = node, .cpu = cpu };
> 
> 
> 
> > +	ret = hop ?
> > +		cpumask_nth_and_andnot(cpu - k.w, cpus, k.masks[hop][node], k.masks[hop-1][node]) :
> > +		cpumask_nth_and(cpu - k.w, cpus, k.masks[0][node]);
> 
> > +unlock:
> 
> out_unlock: shows the intention more clearly, no?

No

> > +	rcu_read_unlock();
> > +	return ret;
> > +}
> 
> -- 
> With Best Regards,
> Andy Shevchenko
> 


end of thread, other threads:[~2022-12-08  2:55 UTC | newest]

Thread overview: 16+ messages
2022-11-12 19:09 [PATCH v2 0/4] cpumask: improve on cpumask_local_spread() locality Yury Norov
2022-11-12 19:09 ` [PATCH v2 1/4] lib/find: introduce find_nth_and_andnot_bit Yury Norov
2022-11-12 19:09 ` [PATCH v2 2/4] cpumask: introduce cpumask_nth_and_andnot Yury Norov
2022-11-12 19:09 ` [PATCH v2 3/4] sched: add sched_numa_find_nth_cpu() Yury Norov
2022-11-14 14:32   ` Andy Shevchenko
2022-11-14 15:02     ` Andy Shevchenko
2022-12-08  2:55     ` Yury Norov
2022-11-15 17:25   ` Valentin Schneider
2022-11-12 19:09 ` [PATCH v2 4/4] cpumask: improve on cpumask_local_spread() locality Yury Norov
2022-11-15 17:24 ` [PATCH v2 0/4] " Valentin Schneider
2022-11-15 18:32   ` Yury Norov
2022-11-17 12:23     ` Valentin Schneider
2022-11-28  6:39       ` Tariq Toukan
2022-11-30  1:47         ` Yury Norov
2022-12-07 12:53           ` Tariq Toukan
2022-12-07 20:45             ` Yury Norov
