* [PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local_spread() locality
@ 2023-01-21  4:24 Yury Norov
  2023-01-21  4:24 ` [PATCH 1/9] lib/find: introduce find_nth_and_andnot_bit Yury Norov
                   ` (11 more replies)
  0 siblings, 12 replies; 36+ messages in thread
From: Yury Norov @ 2023-01-21  4:24 UTC (permalink / raw)
  To: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Dietmar Eggemann, Gal Pressman, Greg Kroah-Hartman,
	Daniel Bristot de Oliveira, Heiko Carstens, Ingo Molnar,
	Jacob Keller, Jakub Kicinski, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Linus Torvalds,
	Mel Gorman, Peter Lafreniere, Peter Zijlstra, Rasmus Villemoes,
	Saeed Mahameed, Steven Rostedt, Tariq Toukan,
	Tony Luck, Valentin Schneider, Vincent Guittot
  Cc: Yury Norov, linux-crypto, netdev, linux-rdma

cpumask_local_spread() currently checks the local node for the presence of
the i'th CPU, and if it finds nothing, falls back to a flat search among all
non-local CPUs. We can do better by searching CPUs per NUMA hop.
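
In rough terms, the i'th-CPU lookup becomes a walk over NUMA hops outward
from the node. A simplified linear sketch of the idea (patch 3 actually
binary-searches the hops; masks[][], nr_hops and prev_w here stand in for
the sched_domains_numa_masks plumbing):

	/* masks[hop][node]: CPUs within @hop hops of @node (cumulative) */
	for (hop = 0, prev_w = 0; hop < nr_hops; hop++) {
		cpu = hop ? cpumask_nth_and_andnot(i - prev_w, cpus,
						   masks[hop][node],
						   masks[hop - 1][node])
			  : cpumask_nth_and(i, cpus, masks[0][node]);
		if (cpu < nr_cpu_ids)
			return cpu;
		prev_w = cpumask_weight_and(cpus, masks[hop][node]);
	}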

This has significant performance implications on NUMA machines, for example
when using NUMA-aware allocated memory together with NUMA-aware IRQ
affinity hints.

Performance tests from patch 8 of this series for the Mellanox network
driver show:

  TCP multi-stream, using 16 iperf3 instances pinned to 16 cores (with aRFS on).
  Active cores: 64,65,72,73,80,81,88,89,96,97,104,105,112,113,120,121
  
  +-------------------------+-----------+------------------+------------------+
  |                         | BW (Gbps) | TX side CPU util | RX side CPU util |
  +-------------------------+-----------+------------------+------------------+
  | Baseline                | 52.3      | 6.4 %            | 17.9 %           |
  +-------------------------+-----------+------------------+------------------+
  | Applied on TX side only | 52.6      | 5.2 %            | 18.5 %           |
  +-------------------------+-----------+------------------+------------------+
  | Applied on RX side only | 94.9      | 11.9 %           | 27.2 %           |
  +-------------------------+-----------+------------------+------------------+
  | Applied on both sides   | 95.1      | 8.4 %            | 27.3 %           |
  +-------------------------+-----------+------------------+------------------+
  
  The bottleneck on the RX side is released, reaching line rate (~1.8x speedup).
  ~30% less CPU utilization on the TX side.

This series was supposed to be included in v6.2, but that didn't happen. It
spent enough time in -next without any issues, so I hope we'll finally see it
in v6.3.

I believe the best way would be to move it together with the scheduler
patches, but I'm OK with trying the bitmap branch again as well.

Tariq Toukan (1):
  net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity
    hints

Valentin Schneider (2):
  sched/topology: Introduce sched_numa_hop_mask()
  sched/topology: Introduce for_each_numa_hop_mask()

Yury Norov (6):
  lib/find: introduce find_nth_and_andnot_bit
  cpumask: introduce cpumask_nth_and_andnot
  sched: add sched_numa_find_nth_cpu()
  cpumask: improve on cpumask_local_spread() locality
  lib/cpumask: reorganize cpumask_local_spread() logic
  lib/cpumask: update comment for cpumask_local_spread()

 drivers/net/ethernet/mellanox/mlx5/core/eq.c | 18 +++-
 include/linux/cpumask.h                      | 20 +++++
 include/linux/find.h                         | 33 +++++++
 include/linux/topology.h                     | 33 +++++++
 kernel/sched/topology.c                      | 90 ++++++++++++++++++++
 lib/cpumask.c                                | 52 ++++++-----
 lib/find_bit.c                               |  9 ++
 7 files changed, 230 insertions(+), 25 deletions(-)

-- 
2.34.1



* [PATCH 1/9] lib/find: introduce find_nth_and_andnot_bit
  2023-01-21  4:24 [PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local_spread() locality Yury Norov
@ 2023-01-21  4:24 ` Yury Norov
  2023-01-21  4:24 ` [PATCH 2/9] cpumask: introduce cpumask_nth_and_andnot Yury Norov
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Yury Norov @ 2023-01-21  4:24 UTC (permalink / raw)
  To: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Dietmar Eggemann, Gal Pressman, Greg Kroah-Hartman,
	Daniel Bristot de Oliveira, Heiko Carstens, Ingo Molnar,
	Jacob Keller, Jakub Kicinski, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Linus Torvalds,
	Mel Gorman, Peter Lafreniere, Peter Zijlstra, Rasmus Villemoes,
	Saeed Mahameed, Steven Rostedt, Tariq Toukan,
	Tony Luck, Valentin Schneider, Vincent Guittot
  Cc: Yury Norov, linux-crypto, netdev, linux-rdma

In the following patches, the function is used to implement in-place bitmap
traversal without storing intermediate results in temporary bitmaps.
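
For a single word, the semantics reduce to picking the n'th set bit of
(addr1 & addr2 & ~addr3), which is what the small_const_nbits() fast path
below does via fns(). A minimal userspace model of that operation (fns()
here mimics the kernel's n'th-set-bit-in-word helper):

	#include <stdio.h>

	/*
	 * Model of fns(): position of the n'th (0-based) set bit in @word,
	 * or 64 (BITS_PER_LONG here) if there is no such bit.
	 */
	static unsigned int fns(unsigned long word, unsigned int n)
	{
		unsigned int bit;

		for (bit = 0; word; bit++, word >>= 1)
			if ((word & 1) && n-- == 0)
				return bit;
		return 64;
	}

	int main(void)
	{
		unsigned long a = 0xf0, b = 0xff, c = 0x10;

		/* a & b & ~c == 0xe0: bits 5, 6, 7 are set */
		printf("%u\n", fns(a & b & ~c, 1));	/* prints 6 */
		return 0;
	}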

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Peter Lafreniere <peter@n8pjl.ca>
---
 include/linux/find.h | 33 +++++++++++++++++++++++++++++++++
 lib/find_bit.c       |  9 +++++++++
 2 files changed, 42 insertions(+)

diff --git a/include/linux/find.h b/include/linux/find.h
index ccaf61a0f5fd..4647864a5ffd 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -22,6 +22,9 @@ unsigned long __find_nth_and_bit(const unsigned long *addr1, const unsigned long
 				unsigned long size, unsigned long n);
 unsigned long __find_nth_andnot_bit(const unsigned long *addr1, const unsigned long *addr2,
 					unsigned long size, unsigned long n);
+unsigned long __find_nth_and_andnot_bit(const unsigned long *addr1, const unsigned long *addr2,
+					const unsigned long *addr3, unsigned long size,
+					unsigned long n);
 extern unsigned long _find_first_and_bit(const unsigned long *addr1,
 					 const unsigned long *addr2, unsigned long size);
 extern unsigned long _find_first_zero_bit(const unsigned long *addr, unsigned long size);
@@ -255,6 +258,36 @@ unsigned long find_nth_andnot_bit(const unsigned long *addr1, const unsigned lon
 	return __find_nth_andnot_bit(addr1, addr2, size, n);
 }
 
+/**
+ * find_nth_and_andnot_bit - find N'th set bit in 2 memory regions,
+ *			     excluding those set in 3rd region
+ * @addr1: The 1st address to start the search at
+ * @addr2: The 2nd address to start the search at
+ * @addr3: The 3rd address to start the search at
+ * @size: The maximum number of bits to search
+ * @n: The index of the set bit whose position is needed, counting from 0
+ *
+ * Returns the bit number of the N'th set bit.
+ * If no such bit exists, returns @size.
+ */
+static __always_inline
+unsigned long find_nth_and_andnot_bit(const unsigned long *addr1,
+					const unsigned long *addr2,
+					const unsigned long *addr3,
+					unsigned long size, unsigned long n)
+{
+	if (n >= size)
+		return size;
+
+	if (small_const_nbits(size)) {
+		unsigned long val = *addr1 & *addr2 & (~*addr3) & GENMASK(size - 1, 0);
+
+		return val ? fns(val, n) : size;
+	}
+
+	return __find_nth_and_andnot_bit(addr1, addr2, addr3, size, n);
+}
+
 #ifndef find_first_and_bit
 /**
  * find_first_and_bit - find the first set bit in both memory regions
diff --git a/lib/find_bit.c b/lib/find_bit.c
index 18bc0a7ac8ee..c10920e66788 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -155,6 +155,15 @@ unsigned long __find_nth_andnot_bit(const unsigned long *addr1, const unsigned l
 }
 EXPORT_SYMBOL(__find_nth_andnot_bit);
 
+unsigned long __find_nth_and_andnot_bit(const unsigned long *addr1,
+					const unsigned long *addr2,
+					const unsigned long *addr3,
+					unsigned long size, unsigned long n)
+{
+	return FIND_NTH_BIT(addr1[idx] & addr2[idx] & ~addr3[idx], size, n);
+}
+EXPORT_SYMBOL(__find_nth_and_andnot_bit);
+
 #ifndef find_next_and_bit
 unsigned long _find_next_and_bit(const unsigned long *addr1, const unsigned long *addr2,
 					unsigned long nbits, unsigned long start)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 2/9] cpumask: introduce cpumask_nth_and_andnot
  2023-01-21  4:24 [PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local_spread() locality Yury Norov
  2023-01-21  4:24 ` [PATCH 1/9] lib/find: introduce find_nth_and_andnot_bit Yury Norov
@ 2023-01-21  4:24 ` Yury Norov
  2023-01-21  4:24 ` [PATCH 3/9] sched: add sched_numa_find_nth_cpu() Yury Norov
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Yury Norov @ 2023-01-21  4:24 UTC (permalink / raw)
  To: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Dietmar Eggemann, Gal Pressman, Greg Kroah-Hartman,
	Daniel Bristot de Oliveira, Heiko Carstens, Ingo Molnar,
	Jacob Keller, Jakub Kicinski, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Linus Torvalds,
	Mel Gorman, Peter Lafreniere, Peter Zijlstra, Rasmus Villemoes,
	Saeed Mahameed, Steven Rostedt, Tariq Toukan,
	Tony Luck, Valentin Schneider, Vincent Guittot
  Cc: Yury Norov, linux-crypto, netdev, linux-rdma

Introduce cpumask_nth_and_andnot() based on find_nth_and_andnot_bit().
It's used in the following patch to traverse cpumasks without storing
an intermediate result in a temporary cpumask.
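
A sketch of the intended call pattern (this_hop and prev_hop are
hypothetical placeholders for consecutive per-node entries of
sched_domains_numa_masks, as used in the next patch): pick the i'th online
CPU that lies within the current hop but not within any closer hop:

	/* i'th CPU that is online, in this hop, and not in a closer hop */
	cpu = cpumask_nth_and_andnot(i, cpu_online_mask, this_hop, prev_hop);
	if (cpu < nr_cpu_ids)
		do_something(cpu);	/* found it */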

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Peter Lafreniere <peter@n8pjl.ca>
---
 include/linux/cpumask.h | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index c2aa0aa26b45..7b16aede7ac5 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -391,6 +391,26 @@ unsigned int cpumask_nth_andnot(unsigned int cpu, const struct cpumask *srcp1,
 				nr_cpumask_bits, cpumask_check(cpu));
 }
 
+/**
+ * cpumask_nth_and_andnot - get the Nth cpu set in the 1st and 2nd cpumask, and clear in the 3rd.
+ * @srcp1: the first cpumask pointer
+ * @srcp2: the second cpumask pointer
+ * @srcp3: the third cpumask pointer
+ * @cpu: the N'th cpu to find, starting from 0
+ *
+ * Returns >= nr_cpu_ids if no such cpu exists.
+ */
+static __always_inline
+unsigned int cpumask_nth_and_andnot(unsigned int cpu, const struct cpumask *srcp1,
+							const struct cpumask *srcp2,
+							const struct cpumask *srcp3)
+{
+	return find_nth_and_andnot_bit(cpumask_bits(srcp1),
+					cpumask_bits(srcp2),
+					cpumask_bits(srcp3),
+					nr_cpumask_bits, cpumask_check(cpu));
+}
+
 #define CPU_BITS_NONE						\
 {								\
 	[0 ... BITS_TO_LONGS(NR_CPUS)-1] = 0UL			\
-- 
2.34.1



* [PATCH 3/9] sched: add sched_numa_find_nth_cpu()
  2023-01-21  4:24 [PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local_spread() locality Yury Norov
  2023-01-21  4:24 ` [PATCH 1/9] lib/find: introduce find_nth_and_andnot_bit Yury Norov
  2023-01-21  4:24 ` [PATCH 2/9] cpumask: introduce cpumask_nth_and_andnot Yury Norov
@ 2023-01-21  4:24 ` Yury Norov
  2023-02-03  0:58   ` Chen Yu
                     ` (2 more replies)
  2023-01-21  4:24 ` [PATCH 4/9] cpumask: improve on cpumask_local_spread() locality Yury Norov
                   ` (8 subsequent siblings)
  11 siblings, 3 replies; 36+ messages in thread
From: Yury Norov @ 2023-01-21  4:24 UTC (permalink / raw)
  To: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Dietmar Eggemann, Gal Pressman, Greg Kroah-Hartman,
	Daniel Bristot de Oliveira, Heiko Carstens, Ingo Molnar,
	Jacob Keller, Jakub Kicinski, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Linus Torvalds,
	Mel Gorman, Peter Lafreniere, Peter Zijlstra, Rasmus Villemoes,
	Saeed Mahameed, Steven Rostedt, Tariq Toukan,
	Tony Luck, Valentin Schneider, Vincent Guittot
  Cc: Yury Norov, linux-crypto, netdev, linux-rdma

The function finds the Nth set CPU in a given cpumask, starting from a given
node.

Leveraging the fact that each hop in sched_domains_numa_masks includes the
same or a greater number of CPUs than the previous one, we can use binary
search on hops instead of a linear walk, which makes the overall complexity
O(log n) in terms of the number of cpumask_weight() calls.
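
The invariant behind the bsearch() call below is that the cumulative
per-hop weights never decrease, so the first hop whose cumulative weight
exceeds the requested index can be found by bisection. A userspace model
with made-up weights (in the kernel, hop_cmp() plays this role against
sched_domains_numa_masks):

	#include <stdio.h>

	/* Hypothetical cumulative CPU counts per hop for some node. */
	static const int hop_weight[] = { 4, 8, 16 };

	/* First hop whose cumulative weight is greater than @n. */
	static int find_hop(int n, int nhops)
	{
		int lo = 0, hi = nhops;

		while (lo < hi) {
			int mid = lo + (hi - lo) / 2;

			if (hop_weight[mid] > n)
				hi = mid;
			else
				lo = mid + 1;
		}
		return lo;
	}

	int main(void)
	{
		printf("%d\n", find_hop(5, 3));	/* CPU #5 lives in hop 1 */
		return 0;
	}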

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Peter Lafreniere <peter@n8pjl.ca>
---
 include/linux/topology.h |  8 ++++++
 kernel/sched/topology.c  | 57 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 65 insertions(+)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 4564faafd0e1..72f264575698 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -245,5 +245,13 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
 	return cpumask_of_node(cpu_to_node(cpu));
 }
 
+#ifdef CONFIG_NUMA
+int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node);
+#else
+static __always_inline int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
+{
+	return cpumask_nth(cpu, cpus);
+}
+#endif	/* CONFIG_NUMA */
 
 #endif /* _LINUX_TOPOLOGY_H */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8739c2a5a54e..2bf89186a10f 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -3,6 +3,8 @@
  * Scheduler topology setup/handling methods
  */
 
+#include <linux/bsearch.h>
+
 DEFINE_MUTEX(sched_domains_mutex);
 
 /* Protected by sched_domains_mutex: */
@@ -2067,6 +2069,61 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
 	return found;
 }
 
+struct __cmp_key {
+	const struct cpumask *cpus;
+	struct cpumask ***masks;
+	int node;
+	int cpu;
+	int w;
+};
+
+static int hop_cmp(const void *a, const void *b)
+{
+	struct cpumask **prev_hop = *((struct cpumask ***)b - 1);
+	struct cpumask **cur_hop = *(struct cpumask ***)b;
+	struct __cmp_key *k = (struct __cmp_key *)a;
+
+	if (cpumask_weight_and(k->cpus, cur_hop[k->node]) <= k->cpu)
+		return 1;
+
+	k->w = (b == k->masks) ? 0 : cpumask_weight_and(k->cpus, prev_hop[k->node]);
+	if (k->w <= k->cpu)
+		return 0;
+
+	return -1;
+}
+
+/*
+ * sched_numa_find_nth_cpu() - given the NUMA topology, find the Nth closest
+ *                             cpu to @node from @cpus.
+ * @cpus: cpumask to find a cpu from
+ * @cpu: Nth cpu to find
+ * @node: the node to count hops from
+ * returns: cpu, or nr_cpu_ids when nothing found.
+ */
+int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
+{
+	struct __cmp_key k = { .cpus = cpus, .node = node, .cpu = cpu };
+	struct cpumask ***hop_masks;
+	int hop, ret = nr_cpu_ids;
+
+	rcu_read_lock();
+
+	k.masks = rcu_dereference(sched_domains_numa_masks);
+	if (!k.masks)
+		goto unlock;
+
+	hop_masks = bsearch(&k, k.masks, sched_domains_numa_levels, sizeof(k.masks[0]), hop_cmp);
+	hop = hop_masks - k.masks;
+
+	ret = hop ?
+		cpumask_nth_and_andnot(cpu - k.w, cpus, k.masks[hop][node], k.masks[hop-1][node]) :
+		cpumask_nth_and(cpu, cpus, k.masks[0][node]);
+unlock:
+	rcu_read_unlock();
+	return ret;
+}
+EXPORT_SYMBOL_GPL(sched_numa_find_nth_cpu);
 #endif /* CONFIG_NUMA */
 
 static int __sdt_alloc(const struct cpumask *cpu_map)
-- 
2.34.1



* [PATCH 4/9] cpumask: improve on cpumask_local_spread() locality
  2023-01-21  4:24 [PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local_spread() locality Yury Norov
                   ` (2 preceding siblings ...)
  2023-01-21  4:24 ` [PATCH 3/9] sched: add sched_numa_find_nth_cpu() Yury Norov
@ 2023-01-21  4:24 ` Yury Norov
  2023-01-21  4:24 ` [PATCH 5/9] lib/cpumask: reorganize cpumask_local_spread() logic Yury Norov
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Yury Norov @ 2023-01-21  4:24 UTC (permalink / raw)
  To: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Dietmar Eggemann, Gal Pressman, Greg Kroah-Hartman,
	Daniel Bristot de Oliveira, Heiko Carstens, Ingo Molnar,
	Jacob Keller, Jakub Kicinski, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Linus Torvalds,
	Mel Gorman, Peter Lafreniere, Peter Zijlstra, Rasmus Villemoes,
	Saeed Mahameed, Steven Rostedt, Tariq Toukan,
	Tony Luck, Valentin Schneider, Vincent Guittot
  Cc: Yury Norov, linux-crypto, netdev, linux-rdma

Switch cpumask_local_spread() to use the newly added sched_numa_find_nth_cpu(),
which takes into account the distances to each node in the system.

For the following NUMA configuration:

root@debian:~# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3
node 0 size: 3869 MB
node 0 free: 3740 MB
node 1 cpus: 4 5
node 1 size: 1969 MB
node 1 free: 1937 MB
node 2 cpus: 6 7
node 2 size: 1967 MB
node 2 free: 1873 MB
node 3 cpus: 8 9 10 11 12 13 14 15
node 3 size: 7842 MB
node 3 free: 7723 MB
node distances:
node   0   1   2   3
  0:  10  50  30  70
  1:  50  10  70  30
  2:  30  70  10  50
  3:  70  30  50  10

The new cpumask_local_spread() traverses cpus for each node like this:

node 0:   0   1   2   3   6   7   4   5   8   9  10  11  12  13  14  15
node 1:   4   5   8   9  10  11  12  13  14  15   0   1   2   3   6   7
node 2:   6   7   0   1   2   3   8   9  10  11  12  13  14  15   4   5
node 3:   8   9  10  11  12  13  14  15   4   5   6   7   0   1   2   3
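
An order like the above can be reproduced with a loop of this shape (a
debugging sketch for kernel context; see also the comment that patch 9
adds to cpumask_local_spread()):

	for (i = 0; i < num_online_cpus(); i++)
		pr_cont("%u ", cpumask_local_spread(i, node));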

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Peter Lafreniere <peter@n8pjl.ca>
---
 lib/cpumask.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/lib/cpumask.c b/lib/cpumask.c
index c7c392514fd3..255974cd6734 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -110,7 +110,7 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
 #endif
 
 /**
- * cpumask_local_spread - select the i'th cpu with local numa cpu's first
+ * cpumask_local_spread - select the i'th cpu based on NUMA distances
  * @i: index number
  * @node: local numa_node
  *
@@ -132,15 +132,7 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
 		if (cpu < nr_cpu_ids)
 			return cpu;
 	} else {
-		/* NUMA first. */
-		cpu = cpumask_nth_and(i, cpu_online_mask, cpumask_of_node(node));
-		if (cpu < nr_cpu_ids)
-			return cpu;
-
-		i -= cpumask_weight_and(cpu_online_mask, cpumask_of_node(node));
-
-		/* Skip NUMA nodes, done above. */
-		cpu = cpumask_nth_andnot(i, cpu_online_mask, cpumask_of_node(node));
+		cpu = sched_numa_find_nth_cpu(cpu_online_mask, i, node);
 		if (cpu < nr_cpu_ids)
 			return cpu;
 	}
-- 
2.34.1



* [PATCH 5/9] lib/cpumask: reorganize cpumask_local_spread() logic
  2023-01-21  4:24 [PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local_spread() locality Yury Norov
                   ` (3 preceding siblings ...)
  2023-01-21  4:24 ` [PATCH 4/9] cpumask: improve on cpumask_local_spread() locality Yury Norov
@ 2023-01-21  4:24 ` Yury Norov
  2023-01-21  4:24 ` [PATCH 6/9] sched/topology: Introduce sched_numa_hop_mask() Yury Norov
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Yury Norov @ 2023-01-21  4:24 UTC (permalink / raw)
  To: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Dietmar Eggemann, Gal Pressman, Greg Kroah-Hartman,
	Daniel Bristot de Oliveira, Heiko Carstens, Ingo Molnar,
	Jacob Keller, Jakub Kicinski, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Linus Torvalds,
	Mel Gorman, Peter Lafreniere, Peter Zijlstra, Rasmus Villemoes,
	Saeed Mahameed, Steven Rostedt, Tariq Toukan,
	Tony Luck, Valentin Schneider, Vincent Guittot
  Cc: Yury Norov, linux-crypto, netdev, linux-rdma

Now that all NUMA logic has been moved into sched_numa_find_nth_cpu(), the
else-branch of cpumask_local_spread() is just a function call, and we can
simplify the logic by using a ternary operator.

While here, replace BUG() with WARN_ON().

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Peter Lafreniere <peter@n8pjl.ca>
---
 lib/cpumask.c | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/lib/cpumask.c b/lib/cpumask.c
index 255974cd6734..10aa15715c0d 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -127,16 +127,12 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
 	/* Wrap: we always want a cpu. */
 	i %= num_online_cpus();
 
-	if (node == NUMA_NO_NODE) {
-		cpu = cpumask_nth(i, cpu_online_mask);
-		if (cpu < nr_cpu_ids)
-			return cpu;
-	} else {
-		cpu = sched_numa_find_nth_cpu(cpu_online_mask, i, node);
-		if (cpu < nr_cpu_ids)
-			return cpu;
-	}
-	BUG();
+	cpu = (node == NUMA_NO_NODE) ?
+		cpumask_nth(i, cpu_online_mask) :
+		sched_numa_find_nth_cpu(cpu_online_mask, i, node);
+
+	WARN_ON(cpu >= nr_cpu_ids);
+	return cpu;
 }
 EXPORT_SYMBOL(cpumask_local_spread);
 
-- 
2.34.1



* [PATCH 6/9] sched/topology: Introduce sched_numa_hop_mask()
  2023-01-21  4:24 [PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local_spread() locality Yury Norov
                   ` (4 preceding siblings ...)
  2023-01-21  4:24 ` [PATCH 5/9] lib/cpumask: reorganize cpumask_local_spread() logic Yury Norov
@ 2023-01-21  4:24 ` Yury Norov
  2023-01-21  4:24 ` [PATCH 7/9] sched/topology: Introduce for_each_numa_hop_mask() Yury Norov
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Yury Norov @ 2023-01-21  4:24 UTC (permalink / raw)
  To: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Dietmar Eggemann, Gal Pressman, Greg Kroah-Hartman,
	Daniel Bristot de Oliveira, Heiko Carstens, Ingo Molnar,
	Jacob Keller, Jakub Kicinski, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Linus Torvalds,
	Mel Gorman, Peter Lafreniere, Peter Zijlstra, Rasmus Villemoes,
	Saeed Mahameed, Steven Rostedt, Tariq Toukan,
	Tony Luck, Valentin Schneider, Vincent Guittot
  Cc: Yury Norov, linux-crypto, netdev, linux-rdma

From: Valentin Schneider <vschneid@redhat.com>

Tariq has pointed out that drivers allocating IRQ vectors would benefit
from having smarter NUMA awareness: cpumask_local_spread() only knows
about the local node, and everything outside it lands in the same bucket.

sched_domains_numa_masks is pretty much what we want to hand out (a cpumask
of CPUs reachable within a given distance budget), so introduce
sched_numa_hop_mask() to export those cpumasks.
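
A sketch of the expected calling convention, with do_something() as a
placeholder (per the kernel-doc below, the returned mask is only valid
under RCU):

	rcu_read_lock();
	mask = sched_numa_hop_mask(node, hops);
	if (!IS_ERR(mask))
		do_something(mask);	/* copy it if needed past this section */
	rcu_read_unlock();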

Link: http://lore.kernel.org/r/20220728191203.4055-1-tariqt@nvidia.com
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 include/linux/topology.h |  7 +++++++
 kernel/sched/topology.c  | 33 +++++++++++++++++++++++++++++++++
 2 files changed, 40 insertions(+)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 72f264575698..344c2362755a 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -247,11 +247,18 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
 
 #ifdef CONFIG_NUMA
 int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node);
+extern const struct cpumask *sched_numa_hop_mask(unsigned int node, unsigned int hops);
 #else
 static __always_inline int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
 {
 	return cpumask_nth(cpu, cpus);
 }
+
+static inline const struct cpumask *
+sched_numa_hop_mask(unsigned int node, unsigned int hops)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
 #endif	/* CONFIG_NUMA */
 
 #endif /* _LINUX_TOPOLOGY_H */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 2bf89186a10f..1233affc106c 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2124,6 +2124,39 @@ int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
 	return ret;
 }
 EXPORT_SYMBOL_GPL(sched_numa_find_nth_cpu);
+
+/**
+ * sched_numa_hop_mask() - Get the cpumask of CPUs at most @hops hops away from
+ *                         @node
+ * @node: The node to count hops from.
+ * @hops: Include CPUs up to that many hops away. 0 means local node.
+ *
+ * Return: On success, a pointer to a cpumask of CPUs at most @hops away from
+ * @node, an error value otherwise.
+ *
+ * Requires rcu_read_lock to be held. The returned cpumask is only valid
+ * within that read-side section; copy it if required beyond that.
+ *
+ * Note that not all hops are equal in distance; see sched_init_numa() for how
+ * distances and masks are handled.
+ * Also note that this is a reflection of sched_domains_numa_masks, which may change
+ * during the lifetime of the system (offline nodes are taken out of the masks).
+ */
+const struct cpumask *sched_numa_hop_mask(unsigned int node, unsigned int hops)
+{
+	struct cpumask ***masks;
+
+	if (node >= nr_node_ids || hops >= sched_domains_numa_levels)
+		return ERR_PTR(-EINVAL);
+
+	masks = rcu_dereference(sched_domains_numa_masks);
+	if (!masks)
+		return ERR_PTR(-EBUSY);
+
+	return masks[hops][node];
+}
+EXPORT_SYMBOL_GPL(sched_numa_hop_mask);
+
 #endif /* CONFIG_NUMA */
 
 static int __sdt_alloc(const struct cpumask *cpu_map)
-- 
2.34.1



* [PATCH 7/9] sched/topology: Introduce for_each_numa_hop_mask()
  2023-01-21  4:24 [PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local_spread() locality Yury Norov
                   ` (5 preceding siblings ...)
  2023-01-21  4:24 ` [PATCH 6/9] sched/topology: Introduce sched_numa_hop_mask() Yury Norov
@ 2023-01-21  4:24 ` Yury Norov
  2023-01-21  4:24 ` [PATCH 8/9] net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints Yury Norov
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Yury Norov @ 2023-01-21  4:24 UTC (permalink / raw)
  To: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Dietmar Eggemann, Gal Pressman, Greg Kroah-Hartman,
	Daniel Bristot de Oliveira, Heiko Carstens, Ingo Molnar,
	Jacob Keller, Jakub Kicinski, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Linus Torvalds,
	Mel Gorman, Peter Lafreniere, Peter Zijlstra, Rasmus Villemoes,
	Saeed Mahameed, Steven Rostedt, Tariq Toukan,
	Tony Luck, Valentin Schneider, Vincent Guittot
  Cc: Yury Norov, linux-crypto, netdev, linux-rdma

From: Valentin Schneider <vschneid@redhat.com>

The recently introduced sched_numa_hop_mask() exposes cpumasks of CPUs
reachable within a given distance budget; wrap the logic for iterating over
all (distance, mask) values inside an iterator macro.
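
Typical usage, as patch 8 adopts in mlx5: walk the hops outward, visiting
each CPU exactly once by masking out the previous hop (do_something() is a
placeholder):

	const struct cpumask *prev = cpu_none_mask, *mask;
	int cpu;

	rcu_read_lock();
	for_each_numa_hop_mask(mask, node) {
		for_each_cpu_andnot(cpu, mask, prev)
			do_something(cpu);
		prev = mask;
	}
	rcu_read_unlock();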

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 include/linux/topology.h | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 344c2362755a..fea32377f7c7 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -261,4 +261,22 @@ sched_numa_hop_mask(unsigned int node, unsigned int hops)
 }
 #endif	/* CONFIG_NUMA */
 
+/**
+ * for_each_numa_hop_mask - iterate over cpumasks of increasing NUMA distance
+ *                          from a given node.
+ * @mask: the iteration variable.
+ * @node: the NUMA node to start the search from.
+ *
+ * Requires rcu_read_lock to be held.
+ *
+ * Yields cpu_online_mask for @node == NUMA_NO_NODE.
+ */
+#define for_each_numa_hop_mask(mask, node)				       \
+	for (unsigned int __hops = 0;					       \
+	     mask = (node != NUMA_NO_NODE || __hops) ?			       \
+		     sched_numa_hop_mask(node, __hops) :		       \
+		     cpu_online_mask,					       \
+	     !IS_ERR_OR_NULL(mask);					       \
+	     __hops++)
+
 #endif /* _LINUX_TOPOLOGY_H */
-- 
2.34.1



* [PATCH 8/9] net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints
  2023-01-21  4:24 [PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local_spread() locality Yury Norov
                   ` (6 preceding siblings ...)
  2023-01-21  4:24 ` [PATCH 7/9] sched/topology: Introduce for_each_numa_hop_mask() Yury Norov
@ 2023-01-21  4:24 ` Yury Norov
  2023-01-21  4:24 ` [PATCH 9/9] lib/cpumask: update comment for cpumask_local_spread() Yury Norov
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Yury Norov @ 2023-01-21  4:24 UTC (permalink / raw)
  To: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Dietmar Eggemann, Gal Pressman, Greg Kroah-Hartman,
	Daniel Bristot de Oliveira, Heiko Carstens, Ingo Molnar,
	Jacob Keller, Jakub Kicinski, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Linus Torvalds,
	Mel Gorman, Peter Lafreniere, Peter Zijlstra, Rasmus Villemoes,
	Saeed Mahameed, Steven Rostedt, Tariq Toukan,
	Tony Luck, Valentin Schneider, Vincent Guittot
  Cc: Yury Norov, linux-crypto, netdev, linux-rdma

From: Tariq Toukan <tariqt@nvidia.com>

In the IRQ affinity hints, replace the binary NUMA preference (local /
remote) with the improved for_each_numa_hop_mask() API, which minds the
actual distances, so that remote nodes at a short distance are preferred
over farther ones.

This has significant performance implications when using NUMA-aware
allocated memory (follow [1] and derivatives for example).

[1]
drivers/net/ethernet/mellanox/mlx5/core/en_main.c :: mlx5e_open_channel()
   int cpu = cpumask_first(mlx5_comp_irq_get_affinity_mask(priv->mdev, ix));

Performance tests:

TCP multi-stream, using 16 iperf3 instances pinned to 16 cores (with aRFS on).
Active cores: 64,65,72,73,80,81,88,89,96,97,104,105,112,113,120,121

+-------------------------+-----------+------------------+------------------+
|                         | BW (Gbps) | TX side CPU util | RX side CPU util |
+-------------------------+-----------+------------------+------------------+
| Baseline                | 52.3      | 6.4 %            | 17.9 %           |
+-------------------------+-----------+------------------+------------------+
| Applied on TX side only | 52.6      | 5.2 %            | 18.5 %           |
+-------------------------+-----------+------------------+------------------+
| Applied on RX side only | 94.9      | 11.9 %           | 27.2 %           |
+-------------------------+-----------+------------------+------------------+
| Applied on both sides   | 95.1      | 8.4 %            | 27.3 %           |
+-------------------------+-----------+------------------+------------------+

The bottleneck on the RX side is released, reaching line rate (~1.8x speedup).
~30% less CPU utilization on the TX side.

* CPU util on active cores only.

Setup details (similar for both sides):

NIC: ConnectX6-DX dual port, 100 Gbps each.
Single port used in the tests.

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              256
On-line CPU(s) list: 0-255
Thread(s) per core:  2
Core(s) per socket:  64
Socket(s):           2
NUMA node(s):        16
Vendor ID:           AuthenticAMD
CPU family:          25
Model:               1
Model name:          AMD EPYC 7763 64-Core Processor
Stepping:            1
CPU MHz:             2594.804
BogoMIPS:            4890.73
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            32768K
NUMA node0 CPU(s):   0-7,128-135
NUMA node1 CPU(s):   8-15,136-143
NUMA node2 CPU(s):   16-23,144-151
NUMA node3 CPU(s):   24-31,152-159
NUMA node4 CPU(s):   32-39,160-167
NUMA node5 CPU(s):   40-47,168-175
NUMA node6 CPU(s):   48-55,176-183
NUMA node7 CPU(s):   56-63,184-191
NUMA node8 CPU(s):   64-71,192-199
NUMA node9 CPU(s):   72-79,200-207
NUMA node10 CPU(s):  80-87,208-215
NUMA node11 CPU(s):  88-95,216-223
NUMA node12 CPU(s):  96-103,224-231
NUMA node13 CPU(s):  104-111,232-239
NUMA node14 CPU(s):  112-119,240-247
NUMA node15 CPU(s):  120-127,248-255
..

$ numactl -H
..
node distances:
node   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
  0:  10  11  11  11  12  12  12  12  32  32  32  32  32  32  32  32
  1:  11  10  11  11  12  12  12  12  32  32  32  32  32  32  32  32
  2:  11  11  10  11  12  12  12  12  32  32  32  32  32  32  32  32
  3:  11  11  11  10  12  12  12  12  32  32  32  32  32  32  32  32
  4:  12  12  12  12  10  11  11  11  32  32  32  32  32  32  32  32
  5:  12  12  12  12  11  10  11  11  32  32  32  32  32  32  32  32
  6:  12  12  12  12  11  11  10  11  32  32  32  32  32  32  32  32
  7:  12  12  12  12  11  11  11  10  32  32  32  32  32  32  32  32
  8:  32  32  32  32  32  32  32  32  10  11  11  11  12  12  12  12
  9:  32  32  32  32  32  32  32  32  11  10  11  11  12  12  12  12
 10:  32  32  32  32  32  32  32  32  11  11  10  11  12  12  12  12
 11:  32  32  32  32  32  32  32  32  11  11  11  10  12  12  12  12
 12:  32  32  32  32  32  32  32  32  12  12  12  12  10  11  11  11
 13:  32  32  32  32  32  32  32  32  12  12  12  12  11  10  11  11
 14:  32  32  32  32  32  32  32  32  12  12  12  12  11  11  10  11
 15:  32  32  32  32  32  32  32  32  12  12  12  12  11  11  11  10

$ cat /sys/class/net/ens5f0/device/numa_node
14

Affinity hints (127 IRQs):
Before:
331: 00000000,00000000,00000000,00000000,00010000,00000000,00000000,00000000
332: 00000000,00000000,00000000,00000000,00020000,00000000,00000000,00000000
333: 00000000,00000000,00000000,00000000,00040000,00000000,00000000,00000000
334: 00000000,00000000,00000000,00000000,00080000,00000000,00000000,00000000
335: 00000000,00000000,00000000,00000000,00100000,00000000,00000000,00000000
336: 00000000,00000000,00000000,00000000,00200000,00000000,00000000,00000000
337: 00000000,00000000,00000000,00000000,00400000,00000000,00000000,00000000
338: 00000000,00000000,00000000,00000000,00800000,00000000,00000000,00000000
339: 00010000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
340: 00020000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
341: 00040000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
342: 00080000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
343: 00100000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
344: 00200000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
345: 00400000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
346: 00800000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
347: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
348: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000002
349: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000004
350: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000008
351: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000010
352: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000020
353: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000040
354: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000080
355: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000100
356: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000200
357: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000400
358: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000800
359: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00001000
360: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00002000
361: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00004000
362: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00008000
363: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00010000
364: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00020000
365: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00040000
366: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00080000
367: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00100000
368: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00200000
369: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00400000
370: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00800000
371: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,01000000
372: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,02000000
373: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,04000000
374: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,08000000
375: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,10000000
376: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,20000000
377: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,40000000
378: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,80000000
379: 00000000,00000000,00000000,00000000,00000000,00000000,00000001,00000000
380: 00000000,00000000,00000000,00000000,00000000,00000000,00000002,00000000
381: 00000000,00000000,00000000,00000000,00000000,00000000,00000004,00000000
382: 00000000,00000000,00000000,00000000,00000000,00000000,00000008,00000000
383: 00000000,00000000,00000000,00000000,00000000,00000000,00000010,00000000
384: 00000000,00000000,00000000,00000000,00000000,00000000,00000020,00000000
385: 00000000,00000000,00000000,00000000,00000000,00000000,00000040,00000000
386: 00000000,00000000,00000000,00000000,00000000,00000000,00000080,00000000
387: 00000000,00000000,00000000,00000000,00000000,00000000,00000100,00000000
388: 00000000,00000000,00000000,00000000,00000000,00000000,00000200,00000000
389: 00000000,00000000,00000000,00000000,00000000,00000000,00000400,00000000
390: 00000000,00000000,00000000,00000000,00000000,00000000,00000800,00000000
391: 00000000,00000000,00000000,00000000,00000000,00000000,00001000,00000000
392: 00000000,00000000,00000000,00000000,00000000,00000000,00002000,00000000
393: 00000000,00000000,00000000,00000000,00000000,00000000,00004000,00000000
394: 00000000,00000000,00000000,00000000,00000000,00000000,00008000,00000000
395: 00000000,00000000,00000000,00000000,00000000,00000000,00010000,00000000
396: 00000000,00000000,00000000,00000000,00000000,00000000,00020000,00000000
397: 00000000,00000000,00000000,00000000,00000000,00000000,00040000,00000000
398: 00000000,00000000,00000000,00000000,00000000,00000000,00080000,00000000
399: 00000000,00000000,00000000,00000000,00000000,00000000,00100000,00000000
400: 00000000,00000000,00000000,00000000,00000000,00000000,00200000,00000000
401: 00000000,00000000,00000000,00000000,00000000,00000000,00400000,00000000
402: 00000000,00000000,00000000,00000000,00000000,00000000,00800000,00000000
403: 00000000,00000000,00000000,00000000,00000000,00000000,01000000,00000000
404: 00000000,00000000,00000000,00000000,00000000,00000000,02000000,00000000
405: 00000000,00000000,00000000,00000000,00000000,00000000,04000000,00000000
406: 00000000,00000000,00000000,00000000,00000000,00000000,08000000,00000000
407: 00000000,00000000,00000000,00000000,00000000,00000000,10000000,00000000
408: 00000000,00000000,00000000,00000000,00000000,00000000,20000000,00000000
409: 00000000,00000000,00000000,00000000,00000000,00000000,40000000,00000000
410: 00000000,00000000,00000000,00000000,00000000,00000000,80000000,00000000
411: 00000000,00000000,00000000,00000000,00000000,00000001,00000000,00000000
412: 00000000,00000000,00000000,00000000,00000000,00000002,00000000,00000000
413: 00000000,00000000,00000000,00000000,00000000,00000004,00000000,00000000
414: 00000000,00000000,00000000,00000000,00000000,00000008,00000000,00000000
415: 00000000,00000000,00000000,00000000,00000000,00000010,00000000,00000000
416: 00000000,00000000,00000000,00000000,00000000,00000020,00000000,00000000
417: 00000000,00000000,00000000,00000000,00000000,00000040,00000000,00000000
418: 00000000,00000000,00000000,00000000,00000000,00000080,00000000,00000000
419: 00000000,00000000,00000000,00000000,00000000,00000100,00000000,00000000
420: 00000000,00000000,00000000,00000000,00000000,00000200,00000000,00000000
421: 00000000,00000000,00000000,00000000,00000000,00000400,00000000,00000000
422: 00000000,00000000,00000000,00000000,00000000,00000800,00000000,00000000
423: 00000000,00000000,00000000,00000000,00000000,00001000,00000000,00000000
424: 00000000,00000000,00000000,00000000,00000000,00002000,00000000,00000000
425: 00000000,00000000,00000000,00000000,00000000,00004000,00000000,00000000
426: 00000000,00000000,00000000,00000000,00000000,00008000,00000000,00000000
427: 00000000,00000000,00000000,00000000,00000000,00010000,00000000,00000000
428: 00000000,00000000,00000000,00000000,00000000,00020000,00000000,00000000
429: 00000000,00000000,00000000,00000000,00000000,00040000,00000000,00000000
430: 00000000,00000000,00000000,00000000,00000000,00080000,00000000,00000000
431: 00000000,00000000,00000000,00000000,00000000,00100000,00000000,00000000
432: 00000000,00000000,00000000,00000000,00000000,00200000,00000000,00000000
433: 00000000,00000000,00000000,00000000,00000000,00400000,00000000,00000000
434: 00000000,00000000,00000000,00000000,00000000,00800000,00000000,00000000
435: 00000000,00000000,00000000,00000000,00000000,01000000,00000000,00000000
436: 00000000,00000000,00000000,00000000,00000000,02000000,00000000,00000000
437: 00000000,00000000,00000000,00000000,00000000,04000000,00000000,00000000
438: 00000000,00000000,00000000,00000000,00000000,08000000,00000000,00000000
439: 00000000,00000000,00000000,00000000,00000000,10000000,00000000,00000000
440: 00000000,00000000,00000000,00000000,00000000,20000000,00000000,00000000
441: 00000000,00000000,00000000,00000000,00000000,40000000,00000000,00000000
442: 00000000,00000000,00000000,00000000,00000000,80000000,00000000,00000000
443: 00000000,00000000,00000000,00000000,00000001,00000000,00000000,00000000
444: 00000000,00000000,00000000,00000000,00000002,00000000,00000000,00000000
445: 00000000,00000000,00000000,00000000,00000004,00000000,00000000,00000000
446: 00000000,00000000,00000000,00000000,00000008,00000000,00000000,00000000
447: 00000000,00000000,00000000,00000000,00000010,00000000,00000000,00000000
448: 00000000,00000000,00000000,00000000,00000020,00000000,00000000,00000000
449: 00000000,00000000,00000000,00000000,00000040,00000000,00000000,00000000
450: 00000000,00000000,00000000,00000000,00000080,00000000,00000000,00000000
451: 00000000,00000000,00000000,00000000,00000100,00000000,00000000,00000000
452: 00000000,00000000,00000000,00000000,00000200,00000000,00000000,00000000
453: 00000000,00000000,00000000,00000000,00000400,00000000,00000000,00000000
454: 00000000,00000000,00000000,00000000,00000800,00000000,00000000,00000000
455: 00000000,00000000,00000000,00000000,00001000,00000000,00000000,00000000
456: 00000000,00000000,00000000,00000000,00002000,00000000,00000000,00000000
457: 00000000,00000000,00000000,00000000,00004000,00000000,00000000,00000000

After:
331: 00000000,00000000,00000000,00000000,00010000,00000000,00000000,00000000
332: 00000000,00000000,00000000,00000000,00020000,00000000,00000000,00000000
333: 00000000,00000000,00000000,00000000,00040000,00000000,00000000,00000000
334: 00000000,00000000,00000000,00000000,00080000,00000000,00000000,00000000
335: 00000000,00000000,00000000,00000000,00100000,00000000,00000000,00000000
336: 00000000,00000000,00000000,00000000,00200000,00000000,00000000,00000000
337: 00000000,00000000,00000000,00000000,00400000,00000000,00000000,00000000
338: 00000000,00000000,00000000,00000000,00800000,00000000,00000000,00000000
339: 00010000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
340: 00020000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
341: 00040000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
342: 00080000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
343: 00100000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
344: 00200000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
345: 00400000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
346: 00800000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
347: 00000000,00000000,00000000,00000000,00000001,00000000,00000000,00000000
348: 00000000,00000000,00000000,00000000,00000002,00000000,00000000,00000000
349: 00000000,00000000,00000000,00000000,00000004,00000000,00000000,00000000
350: 00000000,00000000,00000000,00000000,00000008,00000000,00000000,00000000
351: 00000000,00000000,00000000,00000000,00000010,00000000,00000000,00000000
352: 00000000,00000000,00000000,00000000,00000020,00000000,00000000,00000000
353: 00000000,00000000,00000000,00000000,00000040,00000000,00000000,00000000
354: 00000000,00000000,00000000,00000000,00000080,00000000,00000000,00000000
355: 00000000,00000000,00000000,00000000,00000100,00000000,00000000,00000000
356: 00000000,00000000,00000000,00000000,00000200,00000000,00000000,00000000
357: 00000000,00000000,00000000,00000000,00000400,00000000,00000000,00000000
358: 00000000,00000000,00000000,00000000,00000800,00000000,00000000,00000000
359: 00000000,00000000,00000000,00000000,00001000,00000000,00000000,00000000
360: 00000000,00000000,00000000,00000000,00002000,00000000,00000000,00000000
361: 00000000,00000000,00000000,00000000,00004000,00000000,00000000,00000000
362: 00000000,00000000,00000000,00000000,00008000,00000000,00000000,00000000
363: 00000000,00000000,00000000,00000000,01000000,00000000,00000000,00000000
364: 00000000,00000000,00000000,00000000,02000000,00000000,00000000,00000000
365: 00000000,00000000,00000000,00000000,04000000,00000000,00000000,00000000
366: 00000000,00000000,00000000,00000000,08000000,00000000,00000000,00000000
367: 00000000,00000000,00000000,00000000,10000000,00000000,00000000,00000000
368: 00000000,00000000,00000000,00000000,20000000,00000000,00000000,00000000
369: 00000000,00000000,00000000,00000000,40000000,00000000,00000000,00000000
370: 00000000,00000000,00000000,00000000,80000000,00000000,00000000,00000000
371: 00000001,00000000,00000000,00000000,00000000,00000000,00000000,00000000
372: 00000002,00000000,00000000,00000000,00000000,00000000,00000000,00000000
373: 00000004,00000000,00000000,00000000,00000000,00000000,00000000,00000000
374: 00000008,00000000,00000000,00000000,00000000,00000000,00000000,00000000
375: 00000010,00000000,00000000,00000000,00000000,00000000,00000000,00000000
376: 00000020,00000000,00000000,00000000,00000000,00000000,00000000,00000000
377: 00000040,00000000,00000000,00000000,00000000,00000000,00000000,00000000
378: 00000080,00000000,00000000,00000000,00000000,00000000,00000000,00000000
379: 00000100,00000000,00000000,00000000,00000000,00000000,00000000,00000000
380: 00000200,00000000,00000000,00000000,00000000,00000000,00000000,00000000
381: 00000400,00000000,00000000,00000000,00000000,00000000,00000000,00000000
382: 00000800,00000000,00000000,00000000,00000000,00000000,00000000,00000000
383: 00001000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
384: 00002000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
385: 00004000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
386: 00008000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
387: 01000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
388: 02000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
389: 04000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
390: 08000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
391: 10000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
392: 20000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
393: 40000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
394: 80000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
395: 00000000,00000000,00000000,00000000,00000000,00000001,00000000,00000000
396: 00000000,00000000,00000000,00000000,00000000,00000002,00000000,00000000
397: 00000000,00000000,00000000,00000000,00000000,00000004,00000000,00000000
398: 00000000,00000000,00000000,00000000,00000000,00000008,00000000,00000000
399: 00000000,00000000,00000000,00000000,00000000,00000010,00000000,00000000
400: 00000000,00000000,00000000,00000000,00000000,00000020,00000000,00000000
401: 00000000,00000000,00000000,00000000,00000000,00000040,00000000,00000000
402: 00000000,00000000,00000000,00000000,00000000,00000080,00000000,00000000
403: 00000000,00000000,00000000,00000000,00000000,00000100,00000000,00000000
404: 00000000,00000000,00000000,00000000,00000000,00000200,00000000,00000000
405: 00000000,00000000,00000000,00000000,00000000,00000400,00000000,00000000
406: 00000000,00000000,00000000,00000000,00000000,00000800,00000000,00000000
407: 00000000,00000000,00000000,00000000,00000000,00001000,00000000,00000000
408: 00000000,00000000,00000000,00000000,00000000,00002000,00000000,00000000
409: 00000000,00000000,00000000,00000000,00000000,00004000,00000000,00000000
410: 00000000,00000000,00000000,00000000,00000000,00008000,00000000,00000000
411: 00000000,00000000,00000000,00000000,00000000,00010000,00000000,00000000
412: 00000000,00000000,00000000,00000000,00000000,00020000,00000000,00000000
413: 00000000,00000000,00000000,00000000,00000000,00040000,00000000,00000000
414: 00000000,00000000,00000000,00000000,00000000,00080000,00000000,00000000
415: 00000000,00000000,00000000,00000000,00000000,00100000,00000000,00000000
416: 00000000,00000000,00000000,00000000,00000000,00200000,00000000,00000000
417: 00000000,00000000,00000000,00000000,00000000,00400000,00000000,00000000
418: 00000000,00000000,00000000,00000000,00000000,00800000,00000000,00000000
419: 00000000,00000000,00000000,00000000,00000000,01000000,00000000,00000000
420: 00000000,00000000,00000000,00000000,00000000,02000000,00000000,00000000
421: 00000000,00000000,00000000,00000000,00000000,04000000,00000000,00000000
422: 00000000,00000000,00000000,00000000,00000000,08000000,00000000,00000000
423: 00000000,00000000,00000000,00000000,00000000,10000000,00000000,00000000
424: 00000000,00000000,00000000,00000000,00000000,20000000,00000000,00000000
425: 00000000,00000000,00000000,00000000,00000000,40000000,00000000,00000000
426: 00000000,00000000,00000000,00000000,00000000,80000000,00000000,00000000
427: 00000000,00000001,00000000,00000000,00000000,00000000,00000000,00000000
428: 00000000,00000002,00000000,00000000,00000000,00000000,00000000,00000000
429: 00000000,00000004,00000000,00000000,00000000,00000000,00000000,00000000
430: 00000000,00000008,00000000,00000000,00000000,00000000,00000000,00000000
431: 00000000,00000010,00000000,00000000,00000000,00000000,00000000,00000000
432: 00000000,00000020,00000000,00000000,00000000,00000000,00000000,00000000
433: 00000000,00000040,00000000,00000000,00000000,00000000,00000000,00000000
434: 00000000,00000080,00000000,00000000,00000000,00000000,00000000,00000000
435: 00000000,00000100,00000000,00000000,00000000,00000000,00000000,00000000
436: 00000000,00000200,00000000,00000000,00000000,00000000,00000000,00000000
437: 00000000,00000400,00000000,00000000,00000000,00000000,00000000,00000000
438: 00000000,00000800,00000000,00000000,00000000,00000000,00000000,00000000
439: 00000000,00001000,00000000,00000000,00000000,00000000,00000000,00000000
440: 00000000,00002000,00000000,00000000,00000000,00000000,00000000,00000000
441: 00000000,00004000,00000000,00000000,00000000,00000000,00000000,00000000
442: 00000000,00008000,00000000,00000000,00000000,00000000,00000000,00000000
443: 00000000,00010000,00000000,00000000,00000000,00000000,00000000,00000000
444: 00000000,00020000,00000000,00000000,00000000,00000000,00000000,00000000
445: 00000000,00040000,00000000,00000000,00000000,00000000,00000000,00000000
446: 00000000,00080000,00000000,00000000,00000000,00000000,00000000,00000000
447: 00000000,00100000,00000000,00000000,00000000,00000000,00000000,00000000
448: 00000000,00200000,00000000,00000000,00000000,00000000,00000000,00000000
449: 00000000,00400000,00000000,00000000,00000000,00000000,00000000,00000000
450: 00000000,00800000,00000000,00000000,00000000,00000000,00000000,00000000
451: 00000000,01000000,00000000,00000000,00000000,00000000,00000000,00000000
452: 00000000,02000000,00000000,00000000,00000000,00000000,00000000,00000000
453: 00000000,04000000,00000000,00000000,00000000,00000000,00000000,00000000
454: 00000000,08000000,00000000,00000000,00000000,00000000,00000000,00000000
455: 00000000,10000000,00000000,00000000,00000000,00000000,00000000,00000000
456: 00000000,20000000,00000000,00000000,00000000,00000000,00000000,00000000
457: 00000000,40000000,00000000,00000000,00000000,00000000,00000000,00000000

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
[Tweaked API use]
Suggested-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/eq.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 8f7580fec193..c111e8657d54 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -817,9 +817,12 @@ static void comp_irqs_release(struct mlx5_core_dev *dev)
 static int comp_irqs_request(struct mlx5_core_dev *dev)
 {
 	struct mlx5_eq_table *table = dev->priv.eq_table;
+	const struct cpumask *prev = cpu_none_mask;
+	const struct cpumask *mask;
 	int ncomp_eqs = table->num_comp_eqs;
 	u16 *cpus;
 	int ret;
+	int cpu;
 	int i;
 
 	ncomp_eqs = table->num_comp_eqs;
@@ -838,8 +841,19 @@ static int comp_irqs_request(struct mlx5_core_dev *dev)
 		ret = -ENOMEM;
 		goto free_irqs;
 	}
-	for (i = 0; i < ncomp_eqs; i++)
-		cpus[i] = cpumask_local_spread(i, dev->priv.numa_node);
+
+	i = 0;
+	rcu_read_lock();
+	for_each_numa_hop_mask(mask, dev->priv.numa_node) {
+		for_each_cpu_andnot(cpu, mask, prev) {
+			cpus[i] = cpu;
+			if (++i == ncomp_eqs)
+				goto spread_done;
+		}
+		prev = mask;
+	}
+spread_done:
+	rcu_read_unlock();
 	ret = mlx5_irqs_request_vectors(dev, cpus, ncomp_eqs, table->comp_irqs);
 	kfree(cpus);
 	if (ret < 0)
-- 
2.34.1



* [PATCH 9/9] lib/cpumask: update comment for cpumask_local_spread()
  2023-01-21  4:24 [PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local_spread() locality Yury Norov
                   ` (7 preceding siblings ...)
  2023-01-21  4:24 ` [PATCH 8/9] net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints Yury Norov
@ 2023-01-21  4:24 ` Yury Norov
  2023-01-22 12:57 ` [PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local_spread() locality Tariq Toukan
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Yury Norov @ 2023-01-21  4:24 UTC (permalink / raw)
  To: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Dietmar Eggemann, Gal Pressman, Greg Kroah-Hartman,
	Daniel Bristot de Oliveira, Heiko Carstens, Ingo Molnar,
	Jacob Keller, Jakub Kicinski, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Linus Torvalds,
	Mel Gorman, Peter Lafreniere, Peter Zijlstra, Rasmus Villemoes,
	Saeed Mahameed, Steven Rostedt, Tariq Toukan,
	Tony Luck, Valentin Schneider, Vincent Guittot
  Cc: Yury Norov, linux-crypto, netdev, linux-rdma

Now that we have an iterator-based alternative for the very common case of
using cpumask_local_spread() for all cpus in a row, it's worth mentioning
that in the comment to cpumask_local_spread().

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
---
 lib/cpumask.c | 26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/lib/cpumask.c b/lib/cpumask.c
index 10aa15715c0d..98291b07c756 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -114,11 +114,29 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
  * @i: index number
  * @node: local numa_node
  *
- * This function selects an online CPU according to a numa aware policy;
- * local cpus are returned first, followed by non-local ones, then it
- * wraps around.
+ * Returns an online CPU according to a NUMA-aware policy; local CPUs are
+ * returned first, followed by non-local ones, then it wraps around.
  *
- * It's not very efficient, but useful for setup.
+ * For those who want to enumerate all CPUs based on their NUMA distances,
+ * i.e. call this function in a loop, like:
+ *
+ * for (i = 0; i < num_online_cpus(); i++) {
+ *	cpu = cpumask_local_spread(i, node);
+ *	do_something(cpu);
+ * }
+ *
+ * There's a better alternative based on for_each()-like iterators:
+ *
+ *	for_each_numa_hop_mask(mask, node) {
+ *		for_each_cpu_andnot(cpu, mask, prev)
+ *			do_something(cpu);
+ *		prev = mask;
+ *	}
+ *
+ * It's more verbose than the above, but simpler and cheaper: iterator-based
+ * enumeration is O(sched_domains_numa_levels * nr_cpu_ids), while
+ * cpumask_local_spread() called for each CPU is
+ * O(sched_domains_numa_levels * nr_cpu_ids * log(nr_cpu_ids)).
  */
 unsigned int cpumask_local_spread(unsigned int i, int node)
 {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread
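
One detail worth spelling out: the iterator snippet in the comment above
elides the initialization of prev and the RCU read-side lock that
for_each_numa_hop_mask() needs (compare the mlx5 patch earlier in this
series). A complete version of the example, with do_something() as a
hypothetical per-CPU callback, would be:

	const struct cpumask *mask, *prev = cpu_none_mask;
	unsigned int cpu;

	rcu_read_lock();
	for_each_numa_hop_mask(mask, node) {
		for_each_cpu_andnot(cpu, mask, prev)
			do_something(cpu);	/* hypothetical callback */
		prev = mask;
	}
	rcu_read_unlock();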

* Re: [PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local_spread() locality
  2023-01-21  4:24 [PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local_spread() locality Yury Norov
                   ` (8 preceding siblings ...)
  2023-01-21  4:24 ` [PATCH 9/9] lib/cpumask: update comment for cpumask_local_spread() Yury Norov
@ 2023-01-22 12:57 ` Tariq Toukan
  2023-01-23  9:57   ` Valentin Schneider
  2023-02-08  4:20 ` patchwork-bot+netdevbpf
  2023-02-08 15:39 ` [PATCH 1/1] ice: Change assigning method of the CPU affinity masks Pawel Chmielewski
  11 siblings, 1 reply; 36+ messages in thread
From: Tariq Toukan @ 2023-01-22 12:57 UTC (permalink / raw)
  To: Yury Norov, linux-kernel, David S. Miller, Andy Shevchenko,
	Barry Song, Ben Segall, Dietmar Eggemann, Gal Pressman,
	Greg Kroah-Hartman, Haniel Bristot de Oliveira, Heiko Carstens,
	Ingo Molnar, Jacob Keller, Jakub Kicinski, Jason Gunthorpe,
	Jesse Brandeburg, Jonathan Cameron, Juri Lelli, Leon Romanovsky,
	Linus Torvalds, Mel Gorman, Peter Lafreniere, Peter Zijlstra,
	Rasmus Villemoes, Saeed Mahameed, Steven Rostedt, Tariq Toukan,
	Tony Luck, Valentin Schneider, Vincent Guittot
  Cc: linux-crypto, netdev, linux-rdma



On 21/01/2023 6:24, Yury Norov wrote:
> cpumask_local_spread() currently checks local node for presence of i'th
> CPU, and then if it finds nothing makes a flat search among all non-local
> CPUs. We can do it better by checking CPUs per NUMA hops.
> 
> This has significant performance implications on NUMA machines, for example
> when using NUMA-aware allocated memory together with NUMA-aware IRQ
> affinity hints.
> 
> Performance tests from patch 8 of this series for mellanox network
> driver show:
> 
>    TCP multi-stream, using 16 iperf3 instances pinned to 16 cores (with aRFS on).
>    Active cores: 64,65,72,73,80,81,88,89,96,97,104,105,112,113,120,121
>    
>    +-------------------------+-----------+------------------+------------------+
>    |                         | BW (Gbps) | TX side CPU util | RX side CPU util |
>    +-------------------------+-----------+------------------+------------------+
>    | Baseline                | 52.3      | 6.4 %            | 17.9 %           |
>    +-------------------------+-----------+------------------+------------------+
>    | Applied on TX side only | 52.6      | 5.2 %            | 18.5 %           |
>    +-------------------------+-----------+------------------+------------------+
>    | Applied on RX side only | 94.9      | 11.9 %           | 27.2 %           |
>    +-------------------------+-----------+------------------+------------------+
>    | Applied on both sides   | 95.1      | 8.4 %            | 27.3 %           |
>    +-------------------------+-----------+------------------+------------------+
>    
>    Bottleneck in RX side is released, reached linerate (~1.8x speedup).
>    ~30% less cpu util on TX.
> 
> This series was supposed to be included in v6.2, but that didn't happen. It
> spent enough in -next without any issues, so I hope we'll finally see it
> in v6.3.
> 
> I believe, the best way would be moving it with scheduler patches, but I'm
> OK to try again with bitmap branch as well.

Now that Yury dropped several controversial bitmap patches from the PR, 
the rest are mostly in sched, or new API that's used by sched.

Valentin, what do you think? Can you take it to your sched branch?

> 
> Tariq Toukan (1):
>    net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity
>      hints
> 
> Valentin Schneider (2):
>    sched/topology: Introduce sched_numa_hop_mask()
>    sched/topology: Introduce for_each_numa_hop_mask()
> 
> Yury Norov (6):
>    lib/find: introduce find_nth_and_andnot_bit
>    cpumask: introduce cpumask_nth_and_andnot
>    sched: add sched_numa_find_nth_cpu()
>    cpumask: improve on cpumask_local_spread() locality
>    lib/cpumask: reorganize cpumask_local_spread() logic
>    lib/cpumask: update comment for cpumask_local_spread()
> 
>   drivers/net/ethernet/mellanox/mlx5/core/eq.c | 18 +++-
>   include/linux/cpumask.h                      | 20 +++++
>   include/linux/find.h                         | 33 +++++++
>   include/linux/topology.h                     | 33 +++++++
>   kernel/sched/topology.c                      | 90 ++++++++++++++++++++
>   lib/cpumask.c                                | 52 ++++++-----
>   lib/find_bit.c                               |  9 ++
>   7 files changed, 230 insertions(+), 25 deletions(-)
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local_spread() locality
  2023-01-22 12:57 ` [PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local_spread() locality Tariq Toukan
@ 2023-01-23  9:57   ` Valentin Schneider
  2023-01-29  8:07     ` Tariq Toukan
  2023-02-08  2:25     ` Jakub Kicinski
  0 siblings, 2 replies; 36+ messages in thread
From: Valentin Schneider @ 2023-01-23  9:57 UTC (permalink / raw)
  To: Tariq Toukan, Yury Norov, linux-kernel, David S. Miller,
	Andy Shevchenko, Barry Song, Ben Segall, Dietmar Eggemann,
	Gal Pressman, Greg Kroah-Hartman, Haniel Bristot de Oliveira,
	Heiko Carstens, Ingo Molnar, Jacob Keller, Jakub Kicinski,
	Jason Gunthorpe, Jesse Brandeburg, Jonathan Cameron, Juri Lelli,
	Leon Romanovsky, Linus Torvalds, Mel Gorman, Peter Lafreniere,
	Peter Zijlstra, Rasmus Villemoes, Saeed Mahameed, Steven Rostedt,
	Tariq Toukan, Tony Luck, Vincent Guittot
  Cc: linux-crypto, netdev, linux-rdma

On 22/01/23 14:57, Tariq Toukan wrote:
> On 21/01/2023 6:24, Yury Norov wrote:
>>
>> This series was supposed to be included in v6.2, but that didn't happen. It
>> spent enough in -next without any issues, so I hope we'll finally see it
>> in v6.3.
>>
>> I believe, the best way would be moving it with scheduler patches, but I'm
>> OK to try again with bitmap branch as well.
>
> Now that Yury dropped several controversial bitmap patches form the PR,
> the rest are mostly in sched, or new API that's used by sched.
>
> Valentin, what do you think? Can you take it to your sched branch?
>

I would if I had one :-)

Peter/Ingo, any objections to stashing this in tip/sched/core?


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local_spread() locality
  2023-01-23  9:57   ` Valentin Schneider
@ 2023-01-29  8:07     ` Tariq Toukan
  2023-01-30 20:22       ` Jakub Kicinski
  2023-02-08  2:25     ` Jakub Kicinski
  1 sibling, 1 reply; 36+ messages in thread
From: Tariq Toukan @ 2023-01-29  8:07 UTC (permalink / raw)
  To: Valentin Schneider, Yury Norov, linux-kernel, David S. Miller,
	Andy Shevchenko, Barry Song, Ben Segall, Dietmar Eggemann,
	Gal Pressman, Greg Kroah-Hartman, Haniel Bristot de Oliveira,
	Heiko Carstens, Ingo Molnar, Jacob Keller, Jakub Kicinski,
	Jason Gunthorpe, Jesse Brandeburg, Jonathan Cameron, Juri Lelli,
	Leon Romanovsky, Linus Torvalds, Mel Gorman, Peter Lafreniere,
	Peter Zijlstra, Rasmus Villemoes, Saeed Mahameed, Steven Rostedt,
	Tariq Toukan, Tony Luck, Vincent Guittot
  Cc: linux-crypto, netdev, linux-rdma



On 23/01/2023 11:57, Valentin Schneider wrote:
> On 22/01/23 14:57, Tariq Toukan wrote:
>> On 21/01/2023 6:24, Yury Norov wrote:
>>>
>>> This series was supposed to be included in v6.2, but that didn't happen. It
>>> spent enough in -next without any issues, so I hope we'll finally see it
>>> in v6.3.
>>>
>>> I believe, the best way would be moving it with scheduler patches, but I'm
>>> OK to try again with bitmap branch as well.
>>
>> Now that Yury dropped several controversial bitmap patches form the PR,
>> the rest are mostly in sched, or new API that's used by sched.
>>
>> Valentin, what do you think? Can you take it to your sched branch?
>>
> 
> I would if I had one :-)
> 

Oh I see :)

> Peter/Ingo, any objections to stashing this in tip/sched/core?
> 

Hi Peter and Ingo,

Can you please look into it, so we'll have enough time to act (in
case...) during this kernel cycle?

We already missed one kernel...

Thanks,
Tariq

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local_spread() locality
  2023-01-29  8:07     ` Tariq Toukan
@ 2023-01-30 20:22       ` Jakub Kicinski
  2023-02-02 17:33         ` Jakub Kicinski
  0 siblings, 1 reply; 36+ messages in thread
From: Jakub Kicinski @ 2023-01-30 20:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tariq Toukan, Valentin Schneider, Yury Norov, linux-kernel,
	David S. Miller, Andy Shevchenko, Barry Song, Ben Segall,
	Dietmar Eggemann, Gal Pressman, Greg Kroah-Hartman,
	Haniel Bristot de Oliveira, Heiko Carstens, Ingo Molnar,
	Jacob Keller, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Linus Torvalds,
	Mel Gorman, Peter Lafreniere, Rasmus Villemoes, Saeed Mahameed,
	Steven Rostedt, Tariq Toukan, Tony Luck, Vincent Guittot,
	linux-crypto, netdev, linux-rdma

On Sun, 29 Jan 2023 10:07:58 +0200 Tariq Toukan wrote:
> > Peter/Ingo, any objections to stashing this in tip/sched/core?
> 
> Can you please look into it? So we'll have enough time to act (in 
> case...) during this kernel.
> 
> We already missed one kernel...

We really need this in linux-next by the end of the week. PTAL.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local_spread() locality
  2023-01-30 20:22       ` Jakub Kicinski
@ 2023-02-02 17:33         ` Jakub Kicinski
  2023-02-02 17:37           ` Yury Norov
  0 siblings, 1 reply; 36+ messages in thread
From: Jakub Kicinski @ 2023-02-02 17:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tariq Toukan, Valentin Schneider, Yury Norov, linux-kernel,
	David S. Miller, Andy Shevchenko, Barry Song, Ben Segall,
	Dietmar Eggemann, Gal Pressman, Greg Kroah-Hartman,
	Haniel Bristot de Oliveira, Heiko Carstens, Ingo Molnar,
	Jacob Keller, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Linus Torvalds,
	Mel Gorman, Peter Lafreniere, Rasmus Villemoes, Saeed Mahameed,
	Steven Rostedt, Tariq Toukan, Tony Luck, Vincent Guittot,
	linux-crypto, netdev, linux-rdma

On Mon, 30 Jan 2023 12:22:06 -0800 Jakub Kicinski wrote:
> On Sun, 29 Jan 2023 10:07:58 +0200 Tariq Toukan wrote:
> > > Peter/Ingo, any objections to stashing this in tip/sched/core?  
> > 
> > Can you please look into it? So we'll have enough time to act (in 
> > case...) during this kernel.
> > 
> > We already missed one kernel...  
> 
> We really need this in linux-next by the end of the week. PTAL.

Peter, could you please take a look? Linux doesn't have an API for
basic, common-sense IRQ distribution on AMD systems. It's important :(

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local_spread() locality
  2023-02-02 17:33         ` Jakub Kicinski
@ 2023-02-02 17:37           ` Yury Norov
  0 siblings, 0 replies; 36+ messages in thread
From: Yury Norov @ 2023-02-02 17:37 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Peter Zijlstra, Tariq Toukan, Valentin Schneider, linux-kernel,
	David S. Miller, Andy Shevchenko, Barry Song, Ben Segall,
	Dietmar Eggemann, Gal Pressman, Greg Kroah-Hartman,
	Haniel Bristot de Oliveira, Heiko Carstens, Ingo Molnar,
	Jacob Keller, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Linus Torvalds,
	Mel Gorman, Peter Lafreniere, Rasmus Villemoes, Saeed Mahameed,
	Steven Rostedt, Tariq Toukan, Tony Luck, Vincent Guittot,
	linux-crypto, netdev, linux-rdma

On Thu, Feb 2, 2023 at 9:33 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 30 Jan 2023 12:22:06 -0800 Jakub Kicinski wrote:
> > On Sun, 29 Jan 2023 10:07:58 +0200 Tariq Toukan wrote:
> > > > Peter/Ingo, any objections to stashing this in tip/sched/core?
> > >
> > > Can you please look into it? So we'll have enough time to act (in
> > > case...) during this kernel.
> > >
> > > We already missed one kernel...
> >
> > We really need this in linux-next by the end of the week. PTAL.
>
> Peter, could you please take a look? Linux doesn't have an API for
> basic, common sense IRQ distribution on AMD systems. It's important :(

FWIW, it's already been in linux-next since mid-December through the
bitmap branch, and no issues have been reported so far.

Thanks,
Yury

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/9] sched: add sched_numa_find_nth_cpu()
  2023-01-21  4:24 ` [PATCH 3/9] sched: add sched_numa_find_nth_cpu() Yury Norov
@ 2023-02-03  0:58   ` Chen Yu
  2023-02-07  5:09   ` Jakub Kicinski
  2023-02-17  1:39   ` Yury Norov
  2 siblings, 0 replies; 36+ messages in thread
From: Chen Yu @ 2023-02-03  0:58 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Dietmar Eggemann, Gal Pressman, Greg Kroah-Hartman,
	Haniel Bristot de Oliveira, Heiko Carstens, Ingo Molnar,
	Jacob Keller, Jakub Kicinski, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Linus Torvalds,
	Mel Gorman, Peter Lafreniere, Peter Zijlstra, Rasmus Villemoes,
	Saeed Mahameed, Steven Rostedt, Tariq Toukan, Tariq Toukan,
	Tony Luck, Valentin Schneider, Vincent Guittot, linux-crypto,
	netdev, linux-rdma

On 2023-01-20 at 20:24:30 -0800, Yury Norov wrote:
> The function finds Nth set CPU in a given cpumask starting from a given
> node.
> 
> Leveraging the fact that each hop in sched_domains_numa_masks includes the
> same or greater number of CPUs than the previous one, we can use binary
> search on hops instead of linear walk, which makes the overall complexity
> of O(log n) in terms of number of cpumask_weight() calls.
> 
> Signed-off-by: Yury Norov <yury.norov@gmail.com>
> Acked-by: Tariq Toukan <tariqt@nvidia.com>
> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
> Reviewed-by: Peter Lafreniere <peter@n8pjl.ca>
> ---
>  include/linux/topology.h |  8 ++++++
>  kernel/sched/topology.c  | 57 ++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 65 insertions(+)
>
[snip] 
> + * sched_numa_find_nth_cpu() - given the NUMA topology, find the Nth next cpu
> + *                             closest to @cpu from @cpumask.
Just a minor question: the @cpu below is used as the index, right? What does
"closest to @cpu" mean above?
> + * cpumask: cpumask to find a cpu from
> + * cpu: Nth cpu to find
Maybe also add a description for @node?

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 36+ messages in thread
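
As an aside, the binary search mentioned in the commit message works
because the hop masks are cumulative: each hop's mask is a superset of the
previous one, so the weight of (@cpumask AND hop mask) never decreases
with the hop index. A rough sketch of the idea under that assumption; this
is an illustration, not the kernel's actual sched_numa_find_nth_cpu()
implementation:

	/*
	 * Sketch: return the n-th CPU of @cpumask, ordered by NUMA distance
	 * from the node whose cumulative hop masks are hop_masks[0..nr_hops-1].
	 * The caller must guarantee that
	 * n < cpumask_weight_and(cpumask, hop_masks[nr_hops - 1]).
	 */
	static unsigned int nth_cpu_by_hops(const struct cpumask *cpumask,
					    const struct cpumask **hop_masks,
					    int nr_hops, unsigned int n)
	{
		int lo = 0, hi = nr_hops - 1;

		/* find the first hop whose cumulative weight exceeds n */
		while (lo < hi) {
			int mid = lo + (hi - lo) / 2;

			if (cpumask_weight_and(cpumask, hop_masks[mid]) > n)
				hi = mid;
			else
				lo = mid + 1;
		}

		/* discount CPUs already accounted for at closer hops */
		if (lo)
			n -= cpumask_weight_and(cpumask, hop_masks[lo - 1]);

		/* n-th CPU in cpumask & hop_masks[lo] & ~(previous hop) */
		return cpumask_nth_and_andnot(n, cpumask, hop_masks[lo],
					      lo ? hop_masks[lo - 1]
						 : cpu_none_mask);
	}

This costs O(log(nr_hops)) cpumask_weight_and() calls plus one
cpumask_nth_and_andnot() (patch 2/9), which is where the O(log n) number
of cpumask_weight() calls in the commit message comes from.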

* Re: [PATCH 3/9] sched: add sched_numa_find_nth_cpu()
  2023-01-21  4:24 ` [PATCH 3/9] sched: add sched_numa_find_nth_cpu() Yury Norov
  2023-02-03  0:58   ` Chen Yu
@ 2023-02-07  5:09   ` Jakub Kicinski
  2023-02-07 10:29     ` Valentin Schneider
  2023-02-17  1:39   ` Yury Norov
  2 siblings, 1 reply; 36+ messages in thread
From: Jakub Kicinski @ 2023-02-07  5:09 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Yury Norov, linux-kernel, David S. Miller, Andy Shevchenko,
	Barry Song, Ben Segall, Dietmar Eggemann, Gal Pressman,
	Greg Kroah-Hartman, Haniel Bristot de Oliveira, Heiko Carstens,
	Ingo Molnar, Jacob Keller, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Linus Torvalds,
	Mel Gorman, Peter Lafreniere, Peter Zijlstra, Rasmus Villemoes,
	Saeed Mahameed, Steven Rostedt, Tariq Toukan, Tariq Toukan,
	Tony Luck, Vincent Guittot, linux-crypto, netdev, linux-rdma

On Fri, 20 Jan 2023 20:24:30 -0800 Yury Norov wrote:
> The function finds Nth set CPU in a given cpumask starting from a given
> node.
> 
> Leveraging the fact that each hop in sched_domains_numa_masks includes the
> same or greater number of CPUs than the previous one, we can use binary
> search on hops instead of linear walk, which makes the overall complexity
> of O(log n) in terms of number of cpumask_weight() calls.

Valentin, would you be willing to give us a SoB or Review tag for 
this one?  We'd like to take the whole series via networking, if 
that's okay.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/9] sched: add sched_numa_find_nth_cpu()
  2023-02-07  5:09   ` Jakub Kicinski
@ 2023-02-07 10:29     ` Valentin Schneider
  0 siblings, 0 replies; 36+ messages in thread
From: Valentin Schneider @ 2023-02-07 10:29 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Yury Norov, linux-kernel, David S. Miller, Andy Shevchenko,
	Barry Song, Ben Segall, Dietmar Eggemann, Gal Pressman,
	Greg Kroah-Hartman, Haniel Bristot de Oliveira, Heiko Carstens,
	Ingo Molnar, Jacob Keller, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Leon Romanovsky, Linus Torvalds,
	Mel Gorman, Peter Lafreniere, Peter Zijlstra, Rasmus Villemoes,
	Saeed Mahameed, Steven Rostedt, Tariq Toukan, Tariq Toukan,
	Tony Luck, Vincent Guittot, linux-crypto, netdev, linux-rdma

On 06/02/23 21:09, Jakub Kicinski wrote:
> On Fri, 20 Jan 2023 20:24:30 -0800 Yury Norov wrote:
>> The function finds Nth set CPU in a given cpumask starting from a given
>> node.
>>
>> Leveraging the fact that each hop in sched_domains_numa_masks includes the
>> same or greater number of CPUs than the previous one, we can use binary
>> search on hops instead of linear walk, which makes the overall complexity
>> of O(log n) in terms of number of cpumask_weight() calls.
>
> Valentin, would you be willing to give us a SoB or Review tag for
> this one?  We'd like to take the whole series via networking, if
> that's okay.

Sure, feel free to add

  Reviewed-by: Valentin Schneider <vschneid@redhat.com>

to patches that don't already have it.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local_spread() locality
  2023-01-23  9:57   ` Valentin Schneider
  2023-01-29  8:07     ` Tariq Toukan
@ 2023-02-08  2:25     ` Jakub Kicinski
  1 sibling, 0 replies; 36+ messages in thread
From: Jakub Kicinski @ 2023-02-08  2:25 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Tariq Toukan, Yury Norov, linux-kernel, David S. Miller,
	Andy Shevchenko, Barry Song, Ben Segall, Dietmar Eggemann,
	Gal Pressman, Greg Kroah-Hartman, Haniel Bristot de Oliveira,
	Heiko Carstens, Ingo Molnar, Jacob Keller, Jason Gunthorpe,
	Jesse Brandeburg, Jonathan Cameron, Juri Lelli, Leon Romanovsky,
	Linus Torvalds, Mel Gorman, Peter Lafreniere, Peter Zijlstra,
	Rasmus Villemoes, Saeed Mahameed, Steven Rostedt, Tariq Toukan,
	Tony Luck, Vincent Guittot, linux-crypto, netdev, linux-rdma

On Mon, 23 Jan 2023 09:57:43 +0000 Valentin Schneider wrote:
> On 22/01/23 14:57, Tariq Toukan wrote:
> > On 21/01/2023 6:24, Yury Norov wrote:  
> >>
> >> This series was supposed to be included in v6.2, but that didn't happen. It
> >> spent enough in -next without any issues, so I hope we'll finally see it
> >> in v6.3.
> >>
> >> I believe, the best way would be moving it with scheduler patches, but I'm
> >> OK to try again with bitmap branch as well.  
> >
> > Now that Yury dropped several controversial bitmap patches form the PR,
> > the rest are mostly in sched, or new API that's used by sched.
> >
> > Valentin, what do you think? Can you take it to your sched branch?
>
> I would if I had one :-)
> 
> Peter/Ingo, any objections to stashing this in tip/sched/core?

No replies... so let me take it via networking.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local_spread() locality
  2023-01-21  4:24 [PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local_spread() locality Yury Norov
                   ` (9 preceding siblings ...)
  2023-01-22 12:57 ` [PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local_spread() locality Tariq Toukan
@ 2023-02-08  4:20 ` patchwork-bot+netdevbpf
  2023-02-08 15:39 ` [PATCH 1/1] ice: Change assigning method of the CPU affinity masks Pawel Chmielewski
  11 siblings, 0 replies; 36+ messages in thread
From: patchwork-bot+netdevbpf @ 2023-02-08  4:20 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, davem, andriy.shevchenko, baohua, bsegall,
	dietmar.eggemann, gal, gregkh, bristot, hca, mingo,
	jacob.e.keller, kuba, jgg, jesse.brandeburg, Jonathan.Cameron,
	juri.lelli, leonro, torvalds, mgorman, peter, peterz, linux,
	saeedm, rostedt, tariqt, ttoukan.linux, tony.luck, vschneid,
	vincent.guittot, linux-crypto, netdev, linux-rdma

Hello:

This series was applied to netdev/net-next.git (master)
by Jakub Kicinski <kuba@kernel.org>:

On Fri, 20 Jan 2023 20:24:27 -0800 you wrote:
> cpumask_local_spread() currently checks local node for presence of i'th
> CPU, and then if it finds nothing makes a flat search among all non-local
> CPUs. We can do it better by checking CPUs per NUMA hops.
> 
> This has significant performance implications on NUMA machines, for example
> when using NUMA-aware allocated memory together with NUMA-aware IRQ
> affinity hints.
> 
> [...]

Here is the summary with links:
  - [1/9] lib/find: introduce find_nth_and_andnot_bit
    https://git.kernel.org/netdev/net-next/c/43245117806f
  - [2/9] cpumask: introduce cpumask_nth_and_andnot
    https://git.kernel.org/netdev/net-next/c/62f4386e564d
  - [3/9] sched: add sched_numa_find_nth_cpu()
    https://git.kernel.org/netdev/net-next/c/cd7f55359c90
  - [4/9] cpumask: improve on cpumask_local_spread() locality
    https://git.kernel.org/netdev/net-next/c/406d394abfcd
  - [5/9] lib/cpumask: reorganize cpumask_local_spread() logic
    https://git.kernel.org/netdev/net-next/c/b1beed72b8b7
  - [6/9] sched/topology: Introduce sched_numa_hop_mask()
    https://git.kernel.org/netdev/net-next/c/9feae65845f7
  - [7/9] sched/topology: Introduce for_each_numa_hop_mask()
    https://git.kernel.org/netdev/net-next/c/06ac01721f7d
  - [8/9] net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints
    https://git.kernel.org/netdev/net-next/c/2acda57736de
  - [9/9] lib/cpumask: update comment for cpumask_local_spread()
    https://git.kernel.org/netdev/net-next/c/2ac4980c57f5

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 1/1] ice: Change assigning method of the CPU affinity masks
  2023-01-21  4:24 [PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local_spread() locality Yury Norov
                   ` (10 preceding siblings ...)
  2023-02-08  4:20 ` patchwork-bot+netdevbpf
@ 2023-02-08 15:39 ` Pawel Chmielewski
       [not found]   ` <CAH-L+nO+KyzPSX_F0fh+9i=0rW1hoBPFTGbXc1EX+4MGYOR1kA@mail.gmail.com>
                     ` (6 more replies)
  11 siblings, 7 replies; 36+ messages in thread
From: Pawel Chmielewski @ 2023-02-08 15:39 UTC (permalink / raw)
  To: yury.norov
  Cc: Jonathan.Cameron, andriy.shevchenko, baohua, bristot, bsegall,
	davem, dietmar.eggemann, gal, gregkh, hca, jacob.e.keller,
	jesse.brandeburg, jgg, juri.lelli, kuba, leonro, linux-crypto,
	linux-kernel, linux-rdma, linux, mgorman, mingo, netdev, peter,
	peterz, rostedt, saeedm, tariqt, tony.luck, torvalds,
	ttoukan.linux, vincent.guittot, vschneid, Pawel Chmielewski

With the introduction of sched_numa_hop_mask() and
for_each_numa_hop_mask(), the affinity masks for queue vectors can be
conveniently set by preferring the CPUs that are closest to the NUMA node
of the parent PCI device.

Signed-off-by: Pawel Chmielewski <pawel.chmielewski@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_base.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
index e864634d66bc..fd3550d15c9e 100644
--- a/drivers/net/ethernet/intel/ice/ice_base.c
+++ b/drivers/net/ethernet/intel/ice/ice_base.c
@@ -122,8 +122,6 @@ static int ice_vsi_alloc_q_vector(struct ice_vsi *vsi, u16 v_idx)
 	if (vsi->type == ICE_VSI_VF)
 		goto out;
 	/* only set affinity_mask if the CPU is online */
-	if (cpu_online(v_idx))
-		cpumask_set_cpu(v_idx, &q_vector->affinity_mask);
 
 	/* This will not be called in the driver load path because the netdev
 	 * will not be created yet. All other cases with register the NAPI
@@ -659,8 +657,10 @@ int ice_vsi_wait_one_rx_ring(struct ice_vsi *vsi, bool ena, u16 rxq_idx)
  */
 int ice_vsi_alloc_q_vectors(struct ice_vsi *vsi)
 {
+	cpumask_t *aff_mask, *last_aff_mask = cpu_none_mask;
 	struct device *dev = ice_pf_to_dev(vsi->back);
-	u16 v_idx;
+	int numa_node = dev->numa_node;
+	u16 v_idx, cpu = 0;
 	int err;
 
 	if (vsi->q_vectors[0]) {
@@ -674,6 +674,17 @@ int ice_vsi_alloc_q_vectors(struct ice_vsi *vsi)
 			goto err_out;
 	}
 
+	v_idx = 0;
+	for_each_numa_hop_mask(aff_mask, numa_node) {
+		for_each_cpu_andnot(cpu, aff_mask, last_aff_mask)
+			if (v_idx < vsi->num_q_vectors) {
+				if (cpu_online(cpu))
+					cpumask_set_cpu(cpu, &vsi->q_vectors[v_idx]->affinity_mask);
+				v_idx++;
+			}
+		last_aff_mask = aff_mask;
+	}
+
 	return 0;
 
 err_out:
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/1] ice: Change assigning method of the CPU affinity masks
       [not found]   ` <CAH-L+nO+KyzPSX_F0fh+9i=0rW1hoBPFTGbXc1EX+4MGYOR1kA@mail.gmail.com>
@ 2023-02-08 16:08     ` Andy Shevchenko
  0 siblings, 0 replies; 36+ messages in thread
From: Andy Shevchenko @ 2023-02-08 16:08 UTC (permalink / raw)
  To: Kalesh Anakkur Purayil
  Cc: Pawel Chmielewski, yury.norov, Jonathan.Cameron, baohua, bristot,
	bsegall, davem, dietmar.eggemann, gal, gregkh, hca,
	jacob.e.keller, jesse.brandeburg, jgg, juri.lelli, kuba, leonro,
	linux-crypto, linux-kernel, linux-rdma, linux, mgorman, mingo,
	netdev, peter, peterz, rostedt, saeedm, tariqt, tony.luck,
	torvalds, ttoukan.linux, vincent.guittot, vschneid

On Wed, Feb 08, 2023 at 09:18:14PM +0530, Kalesh Anakkur Purayil wrote:
> On Wed, Feb 8, 2023 at 9:11 PM Pawel Chmielewski <
> pawel.chmielewski@intel.com> wrote:

...

> > +       u16 v_idx, cpu = 0;
> >
> [Kalesh]: if you initialize v_idx to 0 here, you can avoid the assignment
> below

I would avoid doing this.

The problem is that it will become harder to maintain and more error-prone
during development.

So, please leave as is.

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/1] ice: Change assigning method of the CPU affinity masks
  2023-02-08 15:39 ` [PATCH 1/1] ice: Change assigning method of the CPU affinity masks Pawel Chmielewski
       [not found]   ` <CAH-L+nO+KyzPSX_F0fh+9i=0rW1hoBPFTGbXc1EX+4MGYOR1kA@mail.gmail.com>
@ 2023-02-08 16:39   ` Yury Norov
  2023-02-08 16:58     ` Andy Shevchenko
  2023-02-08 19:11   ` kernel test robot
                     ` (4 subsequent siblings)
  6 siblings, 1 reply; 36+ messages in thread
From: Yury Norov @ 2023-02-08 16:39 UTC (permalink / raw)
  To: Pawel Chmielewski
  Cc: Jonathan.Cameron, andriy.shevchenko, baohua, bristot, bsegall,
	davem, dietmar.eggemann, gal, gregkh, hca, jacob.e.keller,
	jesse.brandeburg, jgg, juri.lelli, kuba, leonro, linux-crypto,
	linux-kernel, linux-rdma, linux, mgorman, mingo, netdev, peter,
	peterz, rostedt, saeedm, tariqt, tony.luck, torvalds,
	ttoukan.linux, vincent.guittot, vschneid

On Wed, Feb 08, 2023 at 04:39:05PM +0100, Pawel Chmielewski wrote:
> With the introduction of sched_numa_hop_mask() and
> for_each_numa_hop_mask(), the affinity masks for queue vectors can be
> conveniently set by preferring the CPUs that are closest to the NUMA node
> of the parent PCI device.
> 
> Signed-off-by: Pawel Chmielewski <pawel.chmielewski@intel.com>
> ---
>  drivers/net/ethernet/intel/ice/ice_base.c | 17 ++++++++++++++---
>  1 file changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
> index e864634d66bc..fd3550d15c9e 100644
> --- a/drivers/net/ethernet/intel/ice/ice_base.c
> +++ b/drivers/net/ethernet/intel/ice/ice_base.c
> @@ -122,8 +122,6 @@ static int ice_vsi_alloc_q_vector(struct ice_vsi *vsi, u16 v_idx)
>  	if (vsi->type == ICE_VSI_VF)
>  		goto out;
>  	/* only set affinity_mask if the CPU is online */
> -	if (cpu_online(v_idx))
> -		cpumask_set_cpu(v_idx, &q_vector->affinity_mask);
>  
>  	/* This will not be called in the driver load path because the netdev
>  	 * will not be created yet. All other cases with register the NAPI
> @@ -659,8 +657,10 @@ int ice_vsi_wait_one_rx_ring(struct ice_vsi *vsi, bool ena, u16 rxq_idx)
>   */
>  int ice_vsi_alloc_q_vectors(struct ice_vsi *vsi)
>  {
> +	cpumask_t *aff_mask, *last_aff_mask = cpu_none_mask;
>  	struct device *dev = ice_pf_to_dev(vsi->back);
> -	u16 v_idx;
> +	int numa_node = dev->numa_node;
> +	u16 v_idx, cpu = 0;
>  	int err;
>  
>  	if (vsi->q_vectors[0]) {
> @@ -674,6 +674,17 @@ int ice_vsi_alloc_q_vectors(struct ice_vsi *vsi)
>  			goto err_out;
>  	}
>  
> +	v_idx = 0;
> +	for_each_numa_hop_mask(aff_mask, numa_node) {
> +		for_each_cpu_andnot(cpu, aff_mask, last_aff_mask)
> +			if (v_idx < vsi->num_q_vectors) {
> +				if (cpu_online(cpu))
> +					cpumask_set_cpu(cpu, &vsi->q_vectors[v_idx]->affinity_mask);
> +				v_idx++;
> +			}
                        
                        else
                                goto out;

> +		last_aff_mask = aff_mask;
> +	}
> +
        out:

>  	return 0;
>  
>  err_out:
> -- 
> 2.37.3

Would it make sense to increment v_idx only if the matched CPU is online?
That would create a less sparse array of vectors...

Thanks,
Yury

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/1] ice: Change assigning method of the CPU affinity masks
  2023-02-08 16:39   ` Yury Norov
@ 2023-02-08 16:58     ` Andy Shevchenko
  0 siblings, 0 replies; 36+ messages in thread
From: Andy Shevchenko @ 2023-02-08 16:58 UTC (permalink / raw)
  To: Yury Norov
  Cc: Pawel Chmielewski, Jonathan.Cameron, baohua, bristot, bsegall,
	davem, dietmar.eggemann, gal, gregkh, hca, jacob.e.keller,
	jesse.brandeburg, jgg, juri.lelli, kuba, leonro, linux-crypto,
	linux-kernel, linux-rdma, linux, mgorman, mingo, netdev, peter,
	peterz, rostedt, saeedm, tariqt, tony.luck, torvalds,
	ttoukan.linux, vincent.guittot, vschneid

On Wed, Feb 08, 2023 at 08:39:20AM -0800, Yury Norov wrote:
> On Wed, Feb 08, 2023 at 04:39:05PM +0100, Pawel Chmielewski wrote:

...

> > +	v_idx = 0;
> > +	for_each_numa_hop_mask(aff_mask, numa_node) {
> > +		for_each_cpu_andnot(cpu, aff_mask, last_aff_mask)
> > +			if (v_idx < vsi->num_q_vectors) {
> > +				if (cpu_online(cpu))
> > +					cpumask_set_cpu(cpu, &vsi->q_vectors[v_idx]->affinity_mask);
> > +				v_idx++;
> > +			}

>                         else
>                                 goto out;

In this case the inverted conditional makes more sense:

			if (v_idx >= vsi->num_q_vectors)
				goto out;

			if (cpu_online(cpu))
				cpumask_set_cpu(cpu, &vsi->q_vectors[v_idx]->affinity_mask);
			v_idx++;

(indentation level will be decreased).

> > +		last_aff_mask = aff_mask;
> > +	}
> > +
>         out:
> 
> >  	return 0;

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/1] ice: Change assigning method of the CPU affinity masks
  2023-02-08 15:39 ` [PATCH 1/1] ice: Change assigning method of the CPU affinity masks Pawel Chmielewski
       [not found]   ` <CAH-L+nO+KyzPSX_F0fh+9i=0rW1hoBPFTGbXc1EX+4MGYOR1kA@mail.gmail.com>
  2023-02-08 16:39   ` Yury Norov
@ 2023-02-08 19:11   ` kernel test robot
  2023-02-09  2:41     ` Philip Li
  2023-02-08 19:22   ` kernel test robot
                     ` (3 subsequent siblings)
  6 siblings, 1 reply; 36+ messages in thread
From: kernel test robot @ 2023-02-08 19:11 UTC (permalink / raw)
  To: Pawel Chmielewski, yury.norov
  Cc: oe-kbuild-all, Jonathan.Cameron, andriy.shevchenko, baohua,
	bristot, bsegall, davem, dietmar.eggemann, gal, gregkh, hca,
	jacob.e.keller, jesse.brandeburg, jgg, juri.lelli, kuba, leonro,
	linux-crypto, linux-kernel, linux-rdma, linux, mgorman, mingo,
	netdev, peter, rostedt, saeedm, tariqt, tony.luck

Hi Pawel,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tnguy-next-queue/dev-queue]
[also build test WARNING on linus/master v6.2-rc7]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Pawel-Chmielewski/ice-Change-assigning-method-of-the-CPU-affinity-masks/20230208-234144
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue.git dev-queue
patch link:    https://lore.kernel.org/r/20230208153905.109912-1-pawel.chmielewski%40intel.com
patch subject: [PATCH 1/1] ice: Change assigning method of the CPU affinity masks
config: x86_64-allyesconfig (https://download.01.org/0day-ci/archive/20230209/202302090307.GQOJ4jik-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/33971c3245ae75900dbc4cc9aa2b76ff9cdb534c
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Pawel-Chmielewski/ice-Change-assigning-method-of-the-CPU-affinity-masks/20230208-234144
        git checkout 33971c3245ae75900dbc4cc9aa2b76ff9cdb534c
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 O=build_dir ARCH=x86_64 olddefconfig
        make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash drivers/net/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/20230208153905.109912-1-pawel.chmielewski@intel.com

All warnings (new ones prefixed by >>):

   drivers/net/ethernet/intel/ice/ice_base.c: In function 'ice_vsi_alloc_q_vectors':
   drivers/net/ethernet/intel/ice/ice_base.c:678:9: error: implicit declaration of function 'for_each_numa_hop_mask'; did you mean 'for_each_node_mask'? [-Werror=implicit-function-declaration]
     678 |         for_each_numa_hop_mask(aff_mask, numa_node) {
         |         ^~~~~~~~~~~~~~~~~~~~~~
         |         for_each_node_mask
   drivers/net/ethernet/intel/ice/ice_base.c:678:52: error: expected ';' before '{' token
     678 |         for_each_numa_hop_mask(aff_mask, numa_node) {
         |                                                    ^~
         |                                                    ;
>> drivers/net/ethernet/intel/ice/ice_base.c:663:20: warning: unused variable 'cpu' [-Wunused-variable]
     663 |         u16 v_idx, cpu = 0;
         |                    ^~~
>> drivers/net/ethernet/intel/ice/ice_base.c:660:31: warning: unused variable 'last_aff_mask' [-Wunused-variable]
     660 |         cpumask_t *aff_mask, *last_aff_mask = cpu_none_mask;
         |                               ^~~~~~~~~~~~~
   cc1: some warnings being treated as errors


vim +/cpu +663 drivers/net/ethernet/intel/ice/ice_base.c

   650	
   651	/**
   652	 * ice_vsi_alloc_q_vectors - Allocate memory for interrupt vectors
   653	 * @vsi: the VSI being configured
   654	 *
   655	 * We allocate one q_vector per queue interrupt. If allocation fails we
   656	 * return -ENOMEM.
   657	 */
   658	int ice_vsi_alloc_q_vectors(struct ice_vsi *vsi)
   659	{
 > 660		cpumask_t *aff_mask, *last_aff_mask = cpu_none_mask;
   661		struct device *dev = ice_pf_to_dev(vsi->back);
   662		int numa_node = dev->numa_node;
 > 663		u16 v_idx, cpu = 0;
   664		int err;
   665	
   666		if (vsi->q_vectors[0]) {
   667			dev_dbg(dev, "VSI %d has existing q_vectors\n", vsi->vsi_num);
   668			return -EEXIST;
   669		}
   670	
   671		for (v_idx = 0; v_idx < vsi->num_q_vectors; v_idx++) {
   672			err = ice_vsi_alloc_q_vector(vsi, v_idx);
   673			if (err)
   674				goto err_out;
   675		}
   676	
   677		v_idx = 0;
 > 678		for_each_numa_hop_mask(aff_mask, numa_node) {
   679			for_each_cpu_andnot(cpu, aff_mask, last_aff_mask)
   680				if (v_idx < vsi->num_q_vectors) {
   681					if (cpu_online(cpu))
   682						cpumask_set_cpu(cpu, &vsi->q_vectors[v_idx]->affinity_mask);
   683					v_idx++;
   684				}
   685			last_aff_mask = aff_mask;
   686		}
   687	
   688		return 0;
   689	
   690	err_out:
   691		while (v_idx--)
   692			ice_free_q_vector(vsi, v_idx);
   693	
   694		dev_err(dev, "Failed to allocate %d q_vector for VSI %d, ret=%d\n",
   695			vsi->num_q_vectors, vsi->vsi_num, err);
   696		vsi->num_q_vectors = 0;
   697		return err;
   698	}
   699	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/1] ice: Change assigning method of the CPU affinity masks
  2023-02-08 15:39 ` [PATCH 1/1] ice: Change assigning method of the CPU affinity masks Pawel Chmielewski
                     ` (2 preceding siblings ...)
  2023-02-08 19:11   ` kernel test robot
@ 2023-02-08 19:22   ` kernel test robot
  2023-02-08 23:21   ` Jakub Kicinski
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 36+ messages in thread
From: kernel test robot @ 2023-02-08 19:22 UTC (permalink / raw)
  To: Pawel Chmielewski, yury.norov
  Cc: oe-kbuild-all, Jonathan.Cameron, andriy.shevchenko, baohua,
	bristot, bsegall, davem, dietmar.eggemann, gal, gregkh, hca,
	jacob.e.keller, jesse.brandeburg, jgg, juri.lelli, kuba, leonro,
	linux-crypto, linux-kernel, linux-rdma, linux, mgorman, mingo,
	netdev, peter, peterz, rostedt, saeedm, tariqt, tony.luck

Hi Pawel,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tnguy-next-queue/dev-queue]
[also build test ERROR on linus/master v6.2-rc7]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Pawel-Chmielewski/ice-Change-assigning-method-of-the-CPU-affinity-masks/20230208-234144
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue.git dev-queue
patch link:    https://lore.kernel.org/r/20230208153905.109912-1-pawel.chmielewski%40intel.com
patch subject: [PATCH 1/1] ice: Change assigning method of the CPU affinity masks
config: arm-randconfig-r046-20230209 (https://download.01.org/0day-ci/archive/20230209/202302090331.7L2Y93PP-lkp@intel.com/config)
compiler: arm-linux-gnueabi-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/33971c3245ae75900dbc4cc9aa2b76ff9cdb534c
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Pawel-Chmielewski/ice-Change-assigning-method-of-the-CPU-affinity-masks/20230208-234144
        git checkout 33971c3245ae75900dbc4cc9aa2b76ff9cdb534c
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arm olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arm SHELL=/bin/bash drivers/net/ethernet/intel/ice/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/20230208153905.109912-1-pawel.chmielewski@intel.com

All errors (new ones prefixed by >>):

   drivers/net/ethernet/intel/ice/ice_base.c: In function 'ice_vsi_alloc_q_vectors':
>> drivers/net/ethernet/intel/ice/ice_base.c:662:28: error: 'struct device' has no member named 'numa_node'
     662 |         int numa_node = dev->numa_node;
         |                            ^~
>> drivers/net/ethernet/intel/ice/ice_base.c:678:9: error: implicit declaration of function 'for_each_numa_hop_mask'; did you mean 'for_each_node_mask'? [-Werror=implicit-function-declaration]
     678 |         for_each_numa_hop_mask(aff_mask, numa_node) {
         |         ^~~~~~~~~~~~~~~~~~~~~~
         |         for_each_node_mask
>> drivers/net/ethernet/intel/ice/ice_base.c:678:52: error: expected ';' before '{' token
     678 |         for_each_numa_hop_mask(aff_mask, numa_node) {
         |                                                    ^~
         |                                                    ;
   drivers/net/ethernet/intel/ice/ice_base.c:663:20: warning: unused variable 'cpu' [-Wunused-variable]
     663 |         u16 v_idx, cpu = 0;
         |                    ^~~
   drivers/net/ethernet/intel/ice/ice_base.c:660:31: warning: unused variable 'last_aff_mask' [-Wunused-variable]
     660 |         cpumask_t *aff_mask, *last_aff_mask = cpu_none_mask;
         |                               ^~~~~~~~~~~~~
   cc1: some warnings being treated as errors


vim +662 drivers/net/ethernet/intel/ice/ice_base.c

   650	
   651	/**
   652	 * ice_vsi_alloc_q_vectors - Allocate memory for interrupt vectors
   653	 * @vsi: the VSI being configured
   654	 *
   655	 * We allocate one q_vector per queue interrupt. If allocation fails we
   656	 * return -ENOMEM.
   657	 */
   658	int ice_vsi_alloc_q_vectors(struct ice_vsi *vsi)
   659	{
   660		cpumask_t *aff_mask, *last_aff_mask = cpu_none_mask;
   661		struct device *dev = ice_pf_to_dev(vsi->back);
 > 662		int numa_node = dev->numa_node;
   663		u16 v_idx, cpu = 0;
   664		int err;
   665	
   666		if (vsi->q_vectors[0]) {
   667			dev_dbg(dev, "VSI %d has existing q_vectors\n", vsi->vsi_num);
   668			return -EEXIST;
   669		}
   670	
   671		for (v_idx = 0; v_idx < vsi->num_q_vectors; v_idx++) {
   672			err = ice_vsi_alloc_q_vector(vsi, v_idx);
   673			if (err)
   674				goto err_out;
   675		}
   676	
   677		v_idx = 0;
 > 678		for_each_numa_hop_mask(aff_mask, numa_node) {
   679			for_each_cpu_andnot(cpu, aff_mask, last_aff_mask)
   680				if (v_idx < vsi->num_q_vectors) {
   681					if (cpu_online(cpu))
   682						cpumask_set_cpu(cpu, &vsi->q_vectors[v_idx]->affinity_mask);
   683					v_idx++;
   684				}
   685			last_aff_mask = aff_mask;
   686		}
   687	
   688		return 0;
   689	
   690	err_out:
   691		while (v_idx--)
   692			ice_free_q_vector(vsi, v_idx);
   693	
   694		dev_err(dev, "Failed to allocate %d q_vector for VSI %d, ret=%d\n",
   695			vsi->num_q_vectors, vsi->vsi_num, err);
   696		vsi->num_q_vectors = 0;
   697		return err;
   698	}
   699	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/1] ice: Change assigning method of the CPU affinity masks
  2023-02-08 15:39 ` [PATCH 1/1] ice: Change assigning method of the CPU affinity masks Pawel Chmielewski
                     ` (3 preceding siblings ...)
  2023-02-08 19:22   ` kernel test robot
@ 2023-02-08 23:21   ` Jakub Kicinski
  2023-02-09  5:14   ` kernel test robot
  2023-02-16 14:54   ` [PATCH v2 " Pawel Chmielewski
  6 siblings, 0 replies; 36+ messages in thread
From: Jakub Kicinski @ 2023-02-08 23:21 UTC (permalink / raw)
  To: Pawel Chmielewski
  Cc: yury.norov, Jonathan.Cameron, andriy.shevchenko, baohua, bristot,
	bsegall, davem, dietmar.eggemann, gal, gregkh, hca,
	jacob.e.keller, jesse.brandeburg, jgg, juri.lelli, leonro,
	linux-crypto, linux-kernel, linux-rdma, linux, mgorman, mingo,
	netdev, peter, peterz, rostedt, saeedm, tariqt, tony.luck,
	torvalds, ttoukan.linux, vincent.guittot, vschneid

On Wed,  8 Feb 2023 16:39:05 +0100 Pawel Chmielewski wrote:
> With the introduction of sched_numa_hop_mask() and
> for_each_numa_hop_mask(), the affinity masks for queue vectors can be
> conveniently set by preferring the CPUs that are closest to the NUMA node
> of the parent PCI device.

Damn, you had this one locked and ready, didn't you.. :)

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/1] ice: Change assigning method of the CPU affinity masks
  2023-02-08 19:11   ` kernel test robot
@ 2023-02-09  2:41     ` Philip Li
  0 siblings, 0 replies; 36+ messages in thread
From: Philip Li @ 2023-02-09  2:41 UTC (permalink / raw)
  To: kernel test robot
  Cc: Pawel Chmielewski, yury.norov, oe-kbuild-all, Jonathan.Cameron,
	andriy.shevchenko, baohua, bristot, bsegall, davem,
	dietmar.eggemann, gal, gregkh, hca, jacob.e.keller,
	jesse.brandeburg, jgg, juri.lelli, kuba, leonro, linux-crypto,
	linux-kernel, linux-rdma, linux, mgorman, mingo, netdev, peter,
	rostedt, saeedm, tariqt, tony.luck

On Thu, Feb 09, 2023 at 03:11:55AM +0800, kernel test robot wrote:
> Hi Pawel,
> 
> Thank you for the patch! Perhaps something to improve:
> 
> [auto build test WARNING on tnguy-next-queue/dev-queue]
> [also build test WARNING on linus/master v6.2-rc7]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/Pawel-Chmielewski/ice-Change-assigning-method-of-the-CPU-affinity-masks/20230208-234144
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue.git dev-queue
> patch link:    https://lore.kernel.org/r/20230208153905.109912-1-pawel.chmielewski%40intel.com
> patch subject: [PATCH 1/1] ice: Change assigning method of the CPU affinity masks
> config: x86_64-allyesconfig (https://download.01.org/0day-ci/archive/20230209/202302090307.GQOJ4jik-lkp@intel.com/config)
> compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
> reproduce (this is a W=1 build):
>         # https://github.com/intel-lab-lkp/linux/commit/33971c3245ae75900dbc4cc9aa2b76ff9cdb534c
>         git remote add linux-review https://github.com/intel-lab-lkp/linux
>         git fetch --no-tags linux-review Pawel-Chmielewski/ice-Change-assigning-method-of-the-CPU-affinity-masks/20230208-234144
>         git checkout 33971c3245ae75900dbc4cc9aa2b76ff9cdb534c
>         # save the config file
>         mkdir build_dir && cp config build_dir/.config
>         make W=1 O=build_dir ARCH=x86_64 olddefconfig
>         make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash drivers/net/
> 
> If you fix the issue, kindly add following tag where applicable
> | Reported-by: kernel test robot <lkp@intel.com>
> | Link: https://lore.kernel.org/oe-kbuild-all/20230208153905.109912-1-pawel.chmielewski@intel.com

Sorry, the link above is wrong; it should be

	Link: https://lore.kernel.org/oe-kbuild-all/202302090307.GQOJ4jik-lkp@intel.com/

> 
> All warnings (new ones prefixed by >>):
> 
>    drivers/net/ethernet/intel/ice/ice_base.c: In function 'ice_vsi_alloc_q_vectors':
>    drivers/net/ethernet/intel/ice/ice_base.c:678:9: error: implicit declaration of function 'for_each_numa_hop_mask'; did you mean 'for_each_node_mask'? [-Werror=implicit-function-declaration]
>      678 |         for_each_numa_hop_mask(aff_mask, numa_node) {
>          |         ^~~~~~~~~~~~~~~~~~~~~~
>          |         for_each_node_mask
>    drivers/net/ethernet/intel/ice/ice_base.c:678:52: error: expected ';' before '{' token
>      678 |         for_each_numa_hop_mask(aff_mask, numa_node) {
>          |                                                    ^~
>          |                                                    ;
> >> drivers/net/ethernet/intel/ice/ice_base.c:663:20: warning: unused variable 'cpu' [-Wunused-variable]
>      663 |         u16 v_idx, cpu = 0;
>          |                    ^~~
> >> drivers/net/ethernet/intel/ice/ice_base.c:660:31: warning: unused variable 'last_aff_mask' [-Wunused-variable]
>      660 |         cpumask_t *aff_mask, *last_aff_mask = cpu_none_mask;
>          |                               ^~~~~~~~~~~~~
>    cc1: some warnings being treated as errors
> 
> 
> vim +/cpu +663 drivers/net/ethernet/intel/ice/ice_base.c
> 
>    650	
>    651	/**
>    652	 * ice_vsi_alloc_q_vectors - Allocate memory for interrupt vectors
>    653	 * @vsi: the VSI being configured
>    654	 *
>    655	 * We allocate one q_vector per queue interrupt. If allocation fails we
>    656	 * return -ENOMEM.
>    657	 */
>    658	int ice_vsi_alloc_q_vectors(struct ice_vsi *vsi)
>    659	{
>  > 660		cpumask_t *aff_mask, *last_aff_mask = cpu_none_mask;
>    661		struct device *dev = ice_pf_to_dev(vsi->back);
>    662		int numa_node = dev->numa_node;
>  > 663		u16 v_idx, cpu = 0;
>    664		int err;
>    665	
>    666		if (vsi->q_vectors[0]) {
>    667			dev_dbg(dev, "VSI %d has existing q_vectors\n", vsi->vsi_num);
>    668			return -EEXIST;
>    669		}
>    670	
>    671		for (v_idx = 0; v_idx < vsi->num_q_vectors; v_idx++) {
>    672			err = ice_vsi_alloc_q_vector(vsi, v_idx);
>    673			if (err)
>    674				goto err_out;
>    675		}
>    676	
>    677		v_idx = 0;
>  > 678		for_each_numa_hop_mask(aff_mask, numa_node) {
>    679			for_each_cpu_andnot(cpu, aff_mask, last_aff_mask)
>    680				if (v_idx < vsi->num_q_vectors) {
>    681					if (cpu_online(cpu))
>    682						cpumask_set_cpu(cpu, &vsi->q_vectors[v_idx]->affinity_mask);
>    683					v_idx++;
>    684				}
>    685			last_aff_mask = aff_mask;
>    686		}
>    687	
>    688		return 0;
>    689	
>    690	err_out:
>    691		while (v_idx--)
>    692			ice_free_q_vector(vsi, v_idx);
>    693	
>    694		dev_err(dev, "Failed to allocate %d q_vector for VSI %d, ret=%d\n",
>    695			vsi->num_q_vectors, vsi->vsi_num, err);
>    696		vsi->num_q_vectors = 0;
>    697		return err;
>    698	}
>    699	
> 
> -- 
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/1] ice: Change assigning method of the CPU affinity masks
  2023-02-08 15:39 ` [PATCH 1/1] ice: Change assigning method of the CPU affinity masks Pawel Chmielewski
                     ` (4 preceding siblings ...)
  2023-02-08 23:21   ` Jakub Kicinski
@ 2023-02-09  5:14   ` kernel test robot
  2023-02-16 14:54   ` [PATCH v2 " Pawel Chmielewski
  6 siblings, 0 replies; 36+ messages in thread
From: kernel test robot @ 2023-02-09  5:14 UTC (permalink / raw)
  To: Pawel Chmielewski, yury.norov
  Cc: llvm, oe-kbuild-all, Jonathan.Cameron, andriy.shevchenko, baohua,
	bristot, bsegall, davem, dietmar.eggemann, gal, gregkh, hca,
	jacob.e.keller, jesse.brandeburg, jgg, juri.lelli, kuba, leonro,
	linux-crypto, linux-kernel, linux-rdma, linux, mgorman, mingo,
	netdev, peter, peterz, rostedt, saeedm, tariqt, tony.luck

Hi Pawel,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tnguy-next-queue/dev-queue]
[also build test ERROR on linus/master v6.2-rc7 next-20230208]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Pawel-Chmielewski/ice-Change-assigning-method-of-the-CPU-affinity-masks/20230208-234144
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue.git dev-queue
patch link:    https://lore.kernel.org/r/20230208153905.109912-1-pawel.chmielewski%40intel.com
patch subject: [PATCH 1/1] ice: Change assigning method of the CPU affinity masks
config: i386-randconfig-a013 (https://download.01.org/0day-ci/archive/20230209/202302091302.PaJoE0gU-lkp@intel.com/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/33971c3245ae75900dbc4cc9aa2b76ff9cdb534c
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Pawel-Chmielewski/ice-Change-assigning-method-of-the-CPU-affinity-masks/20230208-234144
        git checkout 33971c3245ae75900dbc4cc9aa2b76ff9cdb534c
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 SHELL=/bin/bash drivers/net/ethernet/intel/ice/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202302091302.PaJoE0gU-lkp@intel.com

All errors (new ones prefixed by >>):

>> drivers/net/ethernet/intel/ice/ice_base.c:662:23: error: no member named 'numa_node' in 'struct device'
           int numa_node = dev->numa_node;
                           ~~~  ^
>> drivers/net/ethernet/intel/ice/ice_base.c:678:2: error: implicit declaration of function 'for_each_numa_hop_mask' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
           for_each_numa_hop_mask(aff_mask, numa_node) {
           ^
>> drivers/net/ethernet/intel/ice/ice_base.c:678:45: error: expected ';' after expression
           for_each_numa_hop_mask(aff_mask, numa_node) {
                                                      ^
                                                      ;
   3 errors generated.


vim +662 drivers/net/ethernet/intel/ice/ice_base.c

   650	
   651	/**
   652	 * ice_vsi_alloc_q_vectors - Allocate memory for interrupt vectors
   653	 * @vsi: the VSI being configured
   654	 *
   655	 * We allocate one q_vector per queue interrupt. If allocation fails we
   656	 * return -ENOMEM.
   657	 */
   658	int ice_vsi_alloc_q_vectors(struct ice_vsi *vsi)
   659	{
   660		cpumask_t *aff_mask, *last_aff_mask = cpu_none_mask;
   661		struct device *dev = ice_pf_to_dev(vsi->back);
 > 662		int numa_node = dev->numa_node;
   663		u16 v_idx, cpu = 0;
   664		int err;
   665	
   666		if (vsi->q_vectors[0]) {
   667			dev_dbg(dev, "VSI %d has existing q_vectors\n", vsi->vsi_num);
   668			return -EEXIST;
   669		}
   670	
   671		for (v_idx = 0; v_idx < vsi->num_q_vectors; v_idx++) {
   672			err = ice_vsi_alloc_q_vector(vsi, v_idx);
   673			if (err)
   674				goto err_out;
   675		}
   676	
   677		v_idx = 0;
 > 678		for_each_numa_hop_mask(aff_mask, numa_node) {
   679			for_each_cpu_andnot(cpu, aff_mask, last_aff_mask)
   680				if (v_idx < vsi->num_q_vectors) {
   681					if (cpu_online(cpu))
   682						cpumask_set_cpu(cpu, &vsi->q_vectors[v_idx]->affinity_mask);
   683					v_idx++;
   684				}
   685			last_aff_mask = aff_mask;
   686		}
   687	
   688		return 0;
   689	
   690	err_out:
   691		while (v_idx--)
   692			ice_free_q_vector(vsi, v_idx);
   693	
   694		dev_err(dev, "Failed to allocate %d q_vector for VSI %d, ret=%d\n",
   695			vsi->num_q_vectors, vsi->vsi_num, err);
   696		vsi->num_q_vectors = 0;
   697		return err;
   698	}
   699	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH v2 1/1] ice: Change assigning method of the CPU affinity masks
  2023-02-08 15:39 ` [PATCH 1/1] ice: Change assigning method of the CPU affinity masks Pawel Chmielewski
                     ` (5 preceding siblings ...)
  2023-02-09  5:14   ` kernel test robot
@ 2023-02-16 14:54   ` Pawel Chmielewski
  2023-02-16 15:14     ` Andy Shevchenko
  6 siblings, 1 reply; 36+ messages in thread
From: Pawel Chmielewski @ 2023-02-16 14:54 UTC (permalink / raw)
  To: pawel.chmielewski
  Cc: Jonathan.Cameron, andriy.shevchenko, baohua, bristot, bsegall,
	davem, dietmar.eggemann, gal, gregkh, hca, jacob.e.keller,
	jesse.brandeburg, jgg, juri.lelli, kuba, leonro, linux-crypto,
	linux-kernel, linux-rdma, linux, mgorman, mingo, netdev, peter,
	peterz, rostedt, saeedm, tariqt, tony.luck, torvalds,
	ttoukan.linux, vincent.guittot, vschneid, yury.norov

With the introduction of sched_numa_hop_mask() and for_each_numa_hop_mask(),
the affinity masks for queue vectors can be conveniently set by preferring the
CPUs that are closest to the NUMA node of the parent PCI device.

Signed-off-by: Pawel Chmielewski <pawel.chmielewski@intel.com>
---

Changes since v1:
 * Removed obsolete comment
 * Inverted condition for loop escape
 * Incremented v_idx only when the CPU is online
---
 drivers/net/ethernet/intel/ice/ice_base.c | 24 +++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
index 9e36f01dfa4f..27b00d224c5d 100644
--- a/drivers/net/ethernet/intel/ice/ice_base.c
+++ b/drivers/net/ethernet/intel/ice/ice_base.c
@@ -121,9 +121,6 @@ static int ice_vsi_alloc_q_vector(struct ice_vsi *vsi, u16 v_idx)
 
 	if (vsi->type == ICE_VSI_VF)
 		goto out;
-	/* only set affinity_mask if the CPU is online */
-	if (cpu_online(v_idx))
-		cpumask_set_cpu(v_idx, &q_vector->affinity_mask);
 
 	/* This will not be called in the driver load path because the netdev
 	 * will not be created yet. All other cases with register the NAPI
@@ -659,8 +656,10 @@ int ice_vsi_wait_one_rx_ring(struct ice_vsi *vsi, bool ena, u16 rxq_idx)
  */
 int ice_vsi_alloc_q_vectors(struct ice_vsi *vsi)
 {
+	cpumask_t *aff_mask, *last_aff_mask = cpu_none_mask;
 	struct device *dev = ice_pf_to_dev(vsi->back);
-	u16 v_idx;
+	int numa_node = dev_to_node(dev);
+	u16 v_idx, cpu = 0;
 	int err;
 
 	if (vsi->q_vectors[0]) {
@@ -674,6 +673,23 @@ int ice_vsi_alloc_q_vectors(struct ice_vsi *vsi)
 			goto err_out;
 	}
 
+	v_idx = 0;
+
+	for_each_numa_hop_mask(aff_mask, numa_node) {
+		for_each_cpu_andnot(cpu, aff_mask, last_aff_mask) {
+			if (v_idx >= vsi->num_q_vectors)
+				goto out;
+
+			if (cpu_online(cpu)) {
+				cpumask_set_cpu(cpu, &vsi->q_vectors[v_idx]->affinity_mask);
+				v_idx++;
+			}
+		}
+
+		last_aff_mask = aff_mask;
+	}
+
+out:
 	return 0;
 
 err_out:
-- 
2.37.3
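
For reference, the spreading pattern above generalizes as below. This
is a sketch only: dev, nvec and masks[] stand in for driver state, and
the hop masks are RCU-protected, so the walk belongs in an RCU
read-side section (as the mlx5 patch earlier in this series does):

	/* Spread nvec vectors across CPUs, closest NUMA hop first. */
	const struct cpumask *mask, *prev = cpu_none_mask;
	unsigned int cpu, v = 0;

	rcu_read_lock();
	for_each_numa_hop_mask(mask, dev_to_node(dev)) {
		/* Hop masks are cumulative, so skip CPUs already
		 * seen at a closer hop.
		 */
		for_each_cpu_andnot(cpu, mask, prev) {
			if (v >= nvec)
				goto done;
			if (cpu_online(cpu))
				cpumask_set_cpu(cpu, &masks[v++]);
		}
		prev = mask;
	}
done:
	rcu_read_unlock();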


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 1/1] ice: Change assigning method of the CPU affinity masks
  2023-02-16 14:54   ` [PATCH v2 " Pawel Chmielewski
@ 2023-02-16 15:14     ` Andy Shevchenko
  2023-02-16 15:16       ` Andy Shevchenko
  0 siblings, 1 reply; 36+ messages in thread
From: Andy Shevchenko @ 2023-02-16 15:14 UTC (permalink / raw)
  To: Pawel Chmielewski
  Cc: Jonathan.Cameron, baohua, bristot, bsegall, davem,
	dietmar.eggemann, gal, gregkh, hca, jacob.e.keller,
	jesse.brandeburg, jgg, juri.lelli, kuba, leonro, linux-crypto,
	linux-kernel, linux-rdma, linux, mgorman, mingo, netdev, peter,
	peterz, rostedt, saeedm, tariqt, tony.luck, torvalds,
	ttoukan.linux, vincent.guittot, vschneid, yury.norov

On Thu, Feb 16, 2023 at 03:54:55PM +0100, Pawel Chmielewski wrote:
> With the introduction of sched_numa_hop_mask() and for_each_numa_hop_mask(),
> the affinity masks for queue vectors can be conveniently set by preferring the
> CPUs that are closest to the NUMA node of the parent PCI device.

...

> +	v_idx = 0;

> +

Redundant blank line.

> +	for_each_numa_hop_mask(aff_mask, numa_node) {
> +		for_each_cpu_andnot(cpu, aff_mask, last_aff_mask) {
> +			if (v_idx >= vsi->num_q_vectors)

> +				goto out;

Useless. You can return 0; here.

> +			if (cpu_online(cpu)) {
> +				cpumask_set_cpu(cpu, &vsi->q_vectors[v_idx]->affinity_mask);
> +				v_idx++;
> +			}
> +		}
> +
> +		last_aff_mask = aff_mask;
> +	}
> +
> +out:
>  	return 0;

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 1/1] ice: Change assigning method of the CPU affinity masks
  2023-02-16 15:14     ` Andy Shevchenko
@ 2023-02-16 15:16       ` Andy Shevchenko
  0 siblings, 0 replies; 36+ messages in thread
From: Andy Shevchenko @ 2023-02-16 15:16 UTC (permalink / raw)
  To: Pawel Chmielewski
  Cc: Jonathan.Cameron, baohua, bristot, bsegall, davem,
	dietmar.eggemann, gal, gregkh, hca, jacob.e.keller,
	jesse.brandeburg, jgg, juri.lelli, kuba, leonro, linux-crypto,
	linux-kernel, linux-rdma, linux, mgorman, mingo, netdev, peter,
	peterz, rostedt, saeedm, tariqt, tony.luck, torvalds,
	ttoukan.linux, vincent.guittot, vschneid, yury.norov

On Thu, Feb 16, 2023 at 05:14:33PM +0200, Andy Shevchenko wrote:
> On Thu, Feb 16, 2023 at 03:54:55PM +0100, Pawel Chmielewski wrote:

...

> > +		for_each_cpu_andnot(cpu, aff_mask, last_aff_mask) {
> > +			if (v_idx >= vsi->num_q_vectors)
> 
> > +				goto out;
> 
> Useless. You can return 0; here.
> 
> > +		}

Btw, I briefly checked the other functions nearby in the code, and none
of them uses goto for the success path.
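
For illustration, a goto-free shape of the loop might look like this
(sketch only, reusing the variables from the patch):

	for_each_numa_hop_mask(aff_mask, numa_node) {
		for_each_cpu_andnot(cpu, aff_mask, last_aff_mask) {
			if (v_idx >= vsi->num_q_vectors)
				return 0;	/* all vectors assigned */
			if (cpu_online(cpu))
				cpumask_set_cpu(cpu,
						&vsi->q_vectors[v_idx++]->affinity_mask);
		}
		last_aff_mask = aff_mask;
	}

	return 0;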


-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/9] sched: add sched_numa_find_nth_cpu()
  2023-01-21  4:24 ` [PATCH 3/9] sched: add sched_numa_find_nth_cpu() Yury Norov
  2023-02-03  0:58   ` Chen Yu
  2023-02-07  5:09   ` Jakub Kicinski
@ 2023-02-17  1:39   ` Yury Norov
  2023-02-17 11:11     ` Andy Shevchenko
  2023-02-20 19:46     ` Jakub Kicinski
  2 siblings, 2 replies; 36+ messages in thread
From: Yury Norov @ 2023-02-17  1:39 UTC (permalink / raw)
  To: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Bruno Goncalves, Dietmar Eggemann, Gal Pressman,
	Greg Kroah-Hartman, Haniel Bristot de Oliveira, Heiko Carstens,
	Ingo Molnar, Jacob Keller, Jakub Kicinski, Jason Gunthorpe,
	Jesse Brandeburg, Jonathan Cameron, Juri Lelli, Kees Cook,
	Leon Romanovsky, Linus Torvalds, Mel Gorman, Peter Lafreniere,
	Peter Zijlstra, Rasmus Villemoes, Saeed Mahameed, Steven Rostedt,
	Tariq Toukan, Tariq Toukan, Tony Luck, Valentin Schneider,
	Vincent Guittot
  Cc: linux-crypto, netdev, linux-rdma

Hi Jakub,

Can you please fold in the following patch?

Thanks,
Yury

From: Yury Norov <yury.norov@gmail.com>
Date: Thu, 16 Feb 2023 17:03:30 -0800
Subject: [PATCH] sched/topology: fix KASAN warning in hop_cmp()

Although prev_hop is used only when curr_hop is not the first hop, it's
initialized unconditionally.

Because initialization implies dereferencing, it might happen that
the code dereferences uninitialized memory, which has been spotted by
KASAN. Fix it by reorganizing hop_cmp() logic.

Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 kernel/sched/topology.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 48838a05c008..c21b8b1f3537 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2081,14 +2081,19 @@ struct __cmp_key {
 
 static int hop_cmp(const void *a, const void *b)
 {
-	struct cpumask **prev_hop = *((struct cpumask ***)b - 1);
-	struct cpumask **cur_hop = *(struct cpumask ***)b;
+	struct cpumask **prev_hop, **cur_hop = *(struct cpumask ***)b;
 	struct __cmp_key *k = (struct __cmp_key *)a;
 
 	if (cpumask_weight_and(k->cpus, cur_hop[k->node]) <= k->cpu)
 		return 1;
 
-	k->w = (b == k->masks) ? 0 : cpumask_weight_and(k->cpus, prev_hop[k->node]);
+	if (b == k->masks) {
+		k->w = 0;
+		return 0;
+	}
+
+	prev_hop = *((struct cpumask ***)b - 1);
+	k->w = cpumask_weight_and(k->cpus, prev_hop[k->node]);
 	if (k->w <= k->cpu)
 		return 0;
 
-- 
2.34.1
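
For context, hop_cmp() is the comparator handed to bsearch() when
sched_numa_find_nth_cpu() searches the hop-mask array, and bsearch()
may probe the first element, which has no predecessor. A standalone
userspace illustration of the pattern (hypothetical example, not the
kernel code):

#include <stdio.h>
#include <stdlib.h>

static const int *arr_base;	/* first element; set before bsearch() */

/*
 * Find the smallest bound strictly greater than the key. Like the
 * fixed hop_cmp(), check for the first element *before* touching
 * cur[-1]: reading cur[-1] unconditionally goes out of bounds
 * whenever bsearch() probes index 0, which is what KASAN caught.
 */
static int range_cmp(const void *key, const void *elem)
{
	int k = *(const int *)key;
	const int *cur = elem;

	if (k >= *cur)
		return 1;		/* search above this slot */
	if (cur == arr_base)
		return 0;		/* first slot, no predecessor */
	return k >= cur[-1] ? 0 : -1;	/* safe: cur[-1] exists here */
}

int main(void)
{
	int bounds[] = { 10, 20, 30, 40 };
	int key = 25;

	arr_base = bounds;
	int *slot = bsearch(&key, bounds, 4, sizeof(bounds[0]), range_cmp);

	printf("%d falls below bound %d\n", key, slot ? *slot : -1);
	return 0;
}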


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/9] sched: add sched_numa_find_nth_cpu()
  2023-02-17  1:39   ` Yury Norov
@ 2023-02-17 11:11     ` Andy Shevchenko
  2023-02-20 19:46     ` Jakub Kicinski
  1 sibling, 0 replies; 36+ messages in thread
From: Andy Shevchenko @ 2023-02-17 11:11 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, David S. Miller, Barry Song, Ben Segall,
	Bruno Goncalves, Dietmar Eggemann, Gal Pressman,
	Greg Kroah-Hartman, Haniel Bristot de Oliveira, Heiko Carstens,
	Ingo Molnar, Jacob Keller, Jakub Kicinski, Jason Gunthorpe,
	Jesse Brandeburg, Jonathan Cameron, Juri Lelli, Kees Cook,
	Leon Romanovsky, Linus Torvalds, Mel Gorman, Peter Lafreniere,
	Peter Zijlstra, Rasmus Villemoes, Saeed Mahameed, Steven Rostedt,
	Tariq Toukan, Tariq Toukan, Tony Luck, Valentin Schneider,
	Vincent Guittot, linux-crypto, netdev, linux-rdma

On Thu, Feb 16, 2023 at 05:39:08PM -0800, Yury Norov wrote:

> From: Yury Norov <yury.norov@gmail.com>
> Date: Thu, 16 Feb 2023 17:03:30 -0800
> Subject: [PATCH] sched/topology: fix KASAN warning in hop_cmp()
> 
> Although prev_hop is used only when curr_hop is not the first hop,

curr --> cur

> it's initialized unconditionally.
> 
> Because initialization implies dereferencing, it might happen that
> the code dereferences uninitialized memory, which has been spotted by
> KASAN. Fix it by reorganizing hop_cmp() logic.

Nice catch! I guess it deserves a comment inside the code.
(IIRC, I was puzzled by the logic behind it, and it was changed
due to the lack of this knowledge.)
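
Something along these lines, perhaps (hypothetical wording, applied
on top of the fix above):

	/*
	 * b may point at the first element of k->masks, in which case
	 * there is no previous hop to consult: bail out before deriving
	 * prev_hop, or the dereference reads before the array (the
	 * out-of-bounds access KASAN reported).
	 */
	if (b == k->masks) {
		k->w = 0;
		return 0;
	}

	prev_hop = *((struct cpumask ***)b - 1);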


-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/9] sched: add sched_numa_find_nth_cpu()
  2023-02-17  1:39   ` Yury Norov
  2023-02-17 11:11     ` Andy Shevchenko
@ 2023-02-20 19:46     ` Jakub Kicinski
  1 sibling, 0 replies; 36+ messages in thread
From: Jakub Kicinski @ 2023-02-20 19:46 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, David S. Miller, Andy Shevchenko, Barry Song,
	Ben Segall, Bruno Goncalves, Dietmar Eggemann, Gal Pressman,
	Greg Kroah-Hartman, Haniel Bristot de Oliveira, Heiko Carstens,
	Ingo Molnar, Jacob Keller, Jason Gunthorpe, Jesse Brandeburg,
	Jonathan Cameron, Juri Lelli, Kees Cook, Leon Romanovsky,
	Linus Torvalds, Mel Gorman, Peter Lafreniere, Peter Zijlstra,
	Rasmus Villemoes, Saeed Mahameed, Steven Rostedt, Tariq Toukan,
	Tariq Toukan, Tony Luck, Valentin Schneider, Vincent Guittot,
	linux-crypto, netdev, linux-rdma

On Thu, 16 Feb 2023 17:39:08 -0800 Yury Norov wrote:
> Although prev_hop is used only when curr_hop is not the first hop,
> it's initialized unconditionally.
> 
> Because initialization implies dereferencing, it might happen that
> the code dereferences uninitialized memory, which has been spotted by
> KASAN. Fix it by reorganizing hop_cmp() logic.
> 
> Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
> Signed-off-by: Yury Norov <yury.norov@gmail.com>

Fixed the spelling pointed out by Andy and applied, thanks!

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2023-02-20 19:47 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-21  4:24 [PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local_spread() locality Yury Norov
2023-01-21  4:24 ` [PATCH 1/9] lib/find: introduce find_nth_and_andnot_bit Yury Norov
2023-01-21  4:24 ` [PATCH 2/9] cpumask: introduce cpumask_nth_and_andnot Yury Norov
2023-01-21  4:24 ` [PATCH 3/9] sched: add sched_numa_find_nth_cpu() Yury Norov
2023-02-03  0:58   ` Chen Yu
2023-02-07  5:09   ` Jakub Kicinski
2023-02-07 10:29     ` Valentin Schneider
2023-02-17  1:39   ` Yury Norov
2023-02-17 11:11     ` Andy Shevchenko
2023-02-20 19:46     ` Jakub Kicinski
2023-01-21  4:24 ` [PATCH 4/9] cpumask: improve on cpumask_local_spread() locality Yury Norov
2023-01-21  4:24 ` [PATCH 5/9] lib/cpumask: reorganize cpumask_local_spread() logic Yury Norov
2023-01-21  4:24 ` [PATCH 6/9] sched/topology: Introduce sched_numa_hop_mask() Yury Norov
2023-01-21  4:24 ` [PATCH 7/9] sched/topology: Introduce for_each_numa_hop_mask() Yury Norov
2023-01-21  4:24 ` [PATCH 8/9] net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints Yury Norov
2023-01-21  4:24 ` [PATCH 9/9] lib/cpumask: update comment for cpumask_local_spread() Yury Norov
2023-01-22 12:57 ` [PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local_spread() locality Tariq Toukan
2023-01-23  9:57   ` Valentin Schneider
2023-01-29  8:07     ` Tariq Toukan
2023-01-30 20:22       ` Jakub Kicinski
2023-02-02 17:33         ` Jakub Kicinski
2023-02-02 17:37           ` Yury Norov
2023-02-08  2:25     ` Jakub Kicinski
2023-02-08  4:20 ` patchwork-bot+netdevbpf
2023-02-08 15:39 ` [PATCH 1/1] ice: Change assigning method of the CPU affinity masks Pawel Chmielewski
     [not found]   ` <CAH-L+nO+KyzPSX_F0fh+9i=0rW1hoBPFTGbXc1EX+4MGYOR1kA@mail.gmail.com>
2023-02-08 16:08     ` Andy Shevchenko
2023-02-08 16:39   ` Yury Norov
2023-02-08 16:58     ` Andy Shevchenko
2023-02-08 19:11   ` kernel test robot
2023-02-09  2:41     ` Philip Li
2023-02-08 19:22   ` kernel test robot
2023-02-08 23:21   ` Jakub Kicinski
2023-02-09  5:14   ` kernel test robot
2023-02-16 14:54   ` [PATCH v2 " Pawel Chmielewski
2023-02-16 15:14     ` Andy Shevchenko
2023-02-16 15:16       ` Andy Shevchenko
