* [RFC PATCH 0/5] introduce sched-idle balancing
@ 2022-02-17 15:43 Abel Wu
  2022-02-17 15:43 ` [RFC PATCH 1/5] sched/fair: record overloaded cpus Abel Wu
                   ` (7 more replies)
  0 siblings, 8 replies; 22+ messages in thread
From: Abel Wu @ 2022-02-17 15:43 UTC (permalink / raw)
  To: Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Ingo Molnar, Juri Lelli, Mel Gorman, Peter Zijlstra,
	Steven Rostedt, Vincent Guittot
  Cc: linux-kernel

Current load balancing is mainly based on cpu capacity
and task util, which makes sense from the POV of overall
throughput. Still, some improvement can be made by
reducing the number of overloaded cfs rqs when sched-idle
or idle rqs exist.

A CFS runqueue is considered overloaded when there is
more than one pullable non-idle task on it (since sched-
idle cpus are treated as idle cpus). Idle tasks are the
ones counted in rq->cfs.idle_h_nr_running, i.e. tasks
either assigned the SCHED_IDLE policy or placed under
idle cgroups.

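To make that concrete, the check boils down to a simple
predicate on two cfs_rq counters. Below is a minimal
userspace model of it (plain C with made-up counter
values; the struct and helper names are placeholders,
not the kernel ones):

    #include <stdio.h>

    /* Simplified stand-in for the cfs_rq counters. */
    struct cfs_rq_model {
        unsigned int h_nr_running;      /* all hierarchical cfs tasks */
        unsigned int idle_h_nr_running; /* SCHED_IDLE / idle-cgroup tasks */
    };

    /* Overloaded: more than one pullable non-idle task. */
    static int cfs_rq_overloaded(const struct cfs_rq_model *cfs)
    {
        return cfs->h_nr_running - cfs->idle_h_nr_running > 1;
    }

    int main(void)
    {
        struct cfs_rq_model a = { .h_nr_running = 3, .idle_h_nr_running = 2 };
        struct cfs_rq_model b = { .h_nr_running = 3, .idle_h_nr_running = 1 };

        /* a has one non-idle task, b has two */
        printf("a overloaded: %d\n", cfs_rq_overloaded(&a)); /* 0 */
        printf("b overloaded: %d\n", cfs_rq_overloaded(&b)); /* 1 */
        return 0;
    }
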
The overloaded cfs rqs can cause performance issues for
both task types:

  - for latency critical tasks like SCHED_NORMAL ones,
    the time spent waiting in the rq will increase,
    resulting in higher p99 latency, and

  - batch tasks may not be able to make full use of
    cpu capacity if sched-idle rqs exist, and thus
    present poorer throughput.

So in short, the goal of sched-idle balancing is to
let the *non-idle tasks* make full use of cpu resources.
To achieve that, we mainly do two things:

  - pull non-idle tasks from the overloaded rqs to
    sched-idle or idle ones, and

  - never pull the last non-idle task off an rq

The mask of overloaded cpus is updated in the periodic
tick and in the idle path, on a per-LLC-domain basis.
This cpumask will also be used in SIS as a filter,
improving idle cpu searching.

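As a rough single-threaded sketch of how the mask is
maintained and then consumed by the wakeup path (a
userspace model only; the 64-cpu limit and helper names
such as filter_candidates() are assumptions made for
brevity):

    #include <stdio.h>
    #include <stdatomic.h>

    #define NR_CPUS 64                      /* assumed: fits one unsigned long */

    static unsigned long overloaded_mask;   /* per-LLC mask of overloaded cpus */
    static atomic_int nr_overloaded;        /* per-LLC counter */
    static int rq_overloaded[NR_CPUS];      /* per-rq cached state */

    /* Only state transitions touch the shared data (rate limiting). */
    static void update_overload_status(int cpu, int overloaded)
    {
        if (rq_overloaded[cpu] == overloaded)
            return;

        if (overloaded) {
            overloaded_mask |= 1UL << cpu;
            atomic_fetch_add(&nr_overloaded, 1);
        } else {
            overloaded_mask &= ~(1UL << cpu);
            atomic_fetch_sub(&nr_overloaded, 1);
        }
        rq_overloaded[cpu] = overloaded;
    }

    /* SIS-style filter: drop overloaded cpus from the candidates. */
    static unsigned long filter_candidates(unsigned long candidates)
    {
        if (atomic_load(&nr_overloaded))
            candidates &= ~overloaded_mask;
        return candidates;
    }

    int main(void)
    {
        unsigned long candidates = 0xful;   /* cpus 0-3 allowed */

        update_overload_status(2, 1);       /* cpu 2 becomes overloaded */
        printf("search mask: %#lx\n", filter_candidates(candidates)); /* 0xb */
        return 0;
    }
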
Tests were done on an Intel Xeon E5-2650 v4 server with
2 NUMA nodes, each of which has 12 cores, and with SMT2
enabled, so 48 CPUs in total. Test results are listed
as follows.

  - we used the perf messaging benchmark to measure
    throughput at different loads (groups); the numbers
    below are total run time in seconds (lower is better).

      perf bench sched messaging -g [N] -l 40000

	N	w/o	w/	diff
	1	2.897	2.834	-2.17%
	3	5.156	4.904	-4.89%
	5	7.850	7.617	-2.97%
	10	15.140	14.574	-3.74%
	20	29.387	27.602	-6.07%

    the results show an approximate 2~6% improvement.

  - and schbench to test latency in two scenarios:
    quiet and noisy. In the quiet test, we run schbench
    in a normal cpu cgroup on an otherwise quiet system,
    while the noisy test additionally runs the perf
    messaging workload inside an idle cgroup as noise.
    Latencies below are in usec.

      schbench -m 2 -t 24 -i 60 -r 60
      perf bench sched messaging -g 1 -l 4000000

	[quiet]
			w/o	w/
	50.0th		31	31
	75.0th		45	45
	90.0th		55	55
	95.0th		62	61
	*99.0th		85	86
	99.5th		565	318
	99.9th		11536	10992
	max		13029	13067

	[noisy]
			w/o	w/
	50.0th		34	32
	75.0th		48	45
	90.0th		58	55
	95.0th		65	61
	*99.0th		2364	208
	99.5th		6696	2068
	99.9th		12688	8816
	max		15209	14191

    it can be seen that the quiet test results are
    quite similar, but the p99 latency is greatly
    improved in the noisy test.

Comments and tests are appreciated!

Abel Wu (5):
  sched/fair: record overloaded cpus
  sched/fair: introduce sched-idle balance
  sched/fair: add stats for sched-idle balancing
  sched/fair: filter out overloaded cpus in sis
  sched/fair: favor cpu capacity for idle tasks

 include/linux/sched/idle.h     |   1 +
 include/linux/sched/topology.h |  15 ++++
 kernel/sched/core.c            |   1 +
 kernel/sched/fair.c            | 187 ++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h           |   6 ++
 kernel/sched/stats.c           |   5 +-
 kernel/sched/topology.c        |   4 +-
 7 files changed, 215 insertions(+), 4 deletions(-)

-- 
2.11.0



* [RFC PATCH 1/5] sched/fair: record overloaded cpus
  2022-02-17 15:43 [RFC PATCH 0/5] introduce sched-idle balancing Abel Wu
@ 2022-02-17 15:43 ` Abel Wu
  2022-02-24  7:10   ` Gautham R. Shenoy
  2022-02-17 15:43 ` [RFC PATCH 2/5] sched/fair: introduce sched-idle balance Abel Wu
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Abel Wu @ 2022-02-17 15:43 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira
  Cc: linux-kernel

A CFS runqueue is considered overloaded when there is
more than one pullable non-idle task on it (since sched-
idle cpus are treated as idle cpus). Idle tasks are the
ones counted in rq->cfs.idle_h_nr_running, i.e. tasks
either assigned the SCHED_IDLE policy or placed under
idle cgroups.

The overloaded cfs rqs can cause performance issues for
both task types:

  - for latency critical tasks like SCHED_NORMAL ones,
    the time spent waiting in the rq will increase,
    resulting in higher p99 latency, and

  - batch tasks may not be able to make full use of
    cpu capacity if sched-idle rqs exist, and thus
    present poorer throughput.

The mask of overloaded cpus is updated in the periodic
tick and in the idle path, on a per-LLC-domain basis.
This cpumask will also be used in SIS as a filter,
improving idle cpu searching.

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
 include/linux/sched/topology.h | 10 ++++++++++
 kernel/sched/core.c            |  1 +
 kernel/sched/fair.c            | 43 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h           |  6 ++++++
 kernel/sched/topology.c        |  4 +++-
 5 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 56cffe42abbc..03c9c81dc886 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -81,6 +81,16 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
+
+	/*
+	 * The above variables are used in the idle path and
+	 * select_task_rq, while the following two are mainly
+	 * updated in the tick. They are all hot but serve
+	 * different purposes, so start a new cacheline to
+	 * avoid false sharing.
+	 */
+	atomic_t	nr_overloaded	____cacheline_aligned;
+	unsigned long	overloaded[];	/* Must be last */
 };
 
 struct sched_domain {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1d863d7f6ad7..a6da2998ec49 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9423,6 +9423,7 @@ void __init sched_init(void)
 		rq->wake_stamp = jiffies;
 		rq->wake_avg_idle = rq->avg_idle;
 		rq->max_idle_balance_cost = sysctl_sched_migration_cost;
+		rq->overloaded = 0;
 
 		INIT_LIST_HEAD(&rq->cfs_tasks);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5c4bfffe8c2c..0a0438c3319b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6968,6 +6968,46 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 	return newidle_balance(rq, rf) != 0;
 }
+
+static inline int cfs_rq_overloaded(struct rq *rq)
+{
+	return rq->cfs.h_nr_running - rq->cfs.idle_h_nr_running > 1;
+}
+
+/* Must be called with rq locked */
+static void update_overload_status(struct rq *rq)
+{
+	struct sched_domain_shared *sds;
+	int overloaded = cfs_rq_overloaded(rq);
+	int cpu = cpu_of(rq);
+
+	lockdep_assert_rq_held(rq);
+
+	if (rq->overloaded == overloaded)
+		return;
+
+	rcu_read_lock();
+	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+	if (unlikely(!sds))
+		goto unlock;
+
+	if (overloaded) {
+		cpumask_set_cpu(cpu, sdo_mask(sds));
+		atomic_inc(&sds->nr_overloaded);
+	} else {
+		cpumask_clear_cpu(cpu, sdo_mask(sds));
+		atomic_dec(&sds->nr_overloaded);
+	}
+
+	rq->overloaded = overloaded;
+unlock:
+	rcu_read_unlock();
+}
+
+#else
+
+static inline void update_overload_status(struct rq *rq) { }
+
 #endif /* CONFIG_SMP */
 
 static unsigned long wakeup_gran(struct sched_entity *se)
@@ -7315,6 +7355,8 @@ done: __maybe_unused;
 	if (new_tasks > 0)
 		goto again;
 
+	update_overload_status(rq);
+
 	/*
 	 * rq is about to be idle, check if we need to update the
 	 * lost_idle_time of clock_pelt
@@ -11131,6 +11173,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);
 
+	update_overload_status(rq);
 	update_misfit_status(curr, rq);
 	update_overutilized_status(task_rq(curr));
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9b33ba9c3c42..c81a87082b8b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1012,6 +1012,7 @@ struct rq {
 
 	unsigned char		nohz_idle_balance;
 	unsigned char		idle_balance;
+	unsigned char		overloaded;
 
 	unsigned long		misfit_task_load;
 
@@ -1762,6 +1763,11 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
 	return sd;
 }
 
+static inline struct cpumask *sdo_mask(struct sched_domain_shared *sds)
+{
+	return to_cpumask(sds->overloaded);
+}
+
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_size);
 DECLARE_PER_CPU(int, sd_llc_id);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index e6cd55951304..641f11415819 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1623,6 +1623,8 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
 		atomic_inc(&sd->shared->ref);
 		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
+		atomic_set(&sd->shared->nr_overloaded, 0);
+		cpumask_clear(sdo_mask(sd->shared));
 	}
 
 	sd->private = sdd;
@@ -2050,7 +2052,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 
 			*per_cpu_ptr(sdd->sd, j) = sd;
 
-			sds = kzalloc_node(sizeof(struct sched_domain_shared),
+			sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
 					GFP_KERNEL, cpu_to_node(j));
 			if (!sds)
 				return -ENOMEM;
-- 
2.11.0



* [RFC PATCH 2/5] sched/fair: introduce sched-idle balance
  2022-02-17 15:43 [RFC PATCH 0/5] introduce sched-idle balancing Abel Wu
  2022-02-17 15:43 ` [RFC PATCH 1/5] sched/fair: record overloaded cpus Abel Wu
@ 2022-02-17 15:43 ` Abel Wu
  2022-02-17 15:43 ` [RFC PATCH 3/5] sched/fair: add stats for sched-idle balancing Abel Wu
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Abel Wu @ 2022-02-17 15:43 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira
  Cc: linux-kernel

The goal of sched-idle balancing is to let the non-idle
tasks make full use of cpu resources. To achieve that, we
mainly do two things:

  - pull non-idle tasks from the overloaded rqs to
    sched-idle or idle ones, and

  - never pull the last non-idle task off an rq

We do sched-idle balancing during normal load balancing
and on the newly-idle path if necessary. The idle
balancing path is skipped due to its high wakeup latency.

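A condensed userspace model of the pull decision above,
as a rough sketch only (the struct and helper names are
placeholders, not the kernel ones; the real checks live
in can_migrate_task() below):

    #include <stdio.h>

    struct task_model { int h_idle; };       /* hierarchically idle task? */
    struct rq_model { int nr_non_idle; };    /* pullable non-idle tasks   */

    /*
     * Sched-idle pull decision: only non-idle tasks are pulled,
     * and never the last non-idle task of the source rq.
     */
    static int can_pull(const struct rq_model *src, const struct task_model *p)
    {
        if (p->h_idle)                       /* rule 1: skip idle tasks   */
            return 0;
        if (src->nr_non_idle <= 1)           /* rule 2: keep the last one */
            return 0;
        return 1;
    }

    int main(void)
    {
        struct rq_model busy = { .nr_non_idle = 3 }, lean = { .nr_non_idle = 1 };
        struct task_model normal = { .h_idle = 0 }, idle = { .h_idle = 1 };

        printf("%d %d %d\n", can_pull(&busy, &normal),   /* 1: pull it       */
                             can_pull(&busy, &idle),     /* 0: idle task     */
                             can_pull(&lean, &normal));  /* 0: last non-idle */
        return 0;
    }
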
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
 include/linux/sched/idle.h |   1 +
 kernel/sched/fair.c        | 128 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 129 insertions(+)

diff --git a/include/linux/sched/idle.h b/include/linux/sched/idle.h
index d73d314d59c6..50ec5c770f85 100644
--- a/include/linux/sched/idle.h
+++ b/include/linux/sched/idle.h
@@ -8,6 +8,7 @@ enum cpu_idle_type {
 	CPU_IDLE,
 	CPU_NOT_IDLE,
 	CPU_NEWLY_IDLE,
+	CPU_SCHED_IDLE,
 	CPU_MAX_IDLE_TYPES
 };
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0a0438c3319b..070a6fb1d2bf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -456,6 +456,21 @@ static int se_is_idle(struct sched_entity *se)
 	return cfs_rq_is_idle(group_cfs_rq(se));
 }
 
+/* Is task idle from the top hierarchy POV */
+static int task_h_idle(struct task_struct *p)
+{
+	struct sched_entity *se = &p->se;
+
+	if (task_has_idle_policy(p))
+		return 1;
+
+	for_each_sched_entity(se)
+		if (cfs_rq_is_idle(cfs_rq_of(se)))
+			return 1;
+
+	return 0;
+}
+
 #else	/* !CONFIG_FAIR_GROUP_SCHED */
 
 #define for_each_sched_entity(se) \
@@ -508,6 +523,11 @@ static int se_is_idle(struct sched_entity *se)
 	return 0;
 }
 
+static inline int task_h_idle(struct task_struct *p)
+{
+	return task_has_idle_policy(p);
+}
+
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
 static __always_inline
@@ -6974,6 +6994,11 @@ static inline int cfs_rq_overloaded(struct rq *rq)
 	return rq->cfs.h_nr_running - rq->cfs.idle_h_nr_running > 1;
 }
 
+static inline bool need_pull_cfs_task(struct rq *rq)
+{
+	return rq->cfs.h_nr_running == rq->cfs.idle_h_nr_running;
+}
+
 /* Must be called with rq locked */
 static void update_overload_status(struct rq *rq)
 {
@@ -7767,6 +7792,22 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	if (kthread_is_per_cpu(p))
 		return 0;
 
+	/*
+	 * Disregard hierarchically idle tasks during sched-idle
+	 * load balancing.
+	 */
+	if (env->idle == CPU_SCHED_IDLE && task_h_idle(p))
+		return 0;
+
+	/*
+	 * Skip p if it is the last non-idle task in src_rq. This
+	 * protects latency and throughput for non-idle tasks at
+	 * the cost of temporary load imbalance (which will probably
+	 * be fixed soon).
+	 */
+	if (!cfs_rq_overloaded(env->src_rq) && !task_h_idle(p))
+		return 0;
+
 	if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
 		int cpu;
 
@@ -10265,6 +10306,83 @@ static inline bool update_newidle_cost(struct sched_domain *sd, u64 cost)
 }
 
 /*
+ * The sched-idle balancing tries to eliminate overloaded cfs rqs
+ * by spreading out non-idle tasks prior to normal load balancing.
+ */
+static void sched_idle_balance(struct rq *dst_rq)
+{
+	struct sched_domain *sd;
+	struct task_struct *p;
+	int dst_cpu = cpu_of(dst_rq), cpu;
+
+	sd = rcu_dereference(per_cpu(sd_llc, dst_cpu));
+	if (unlikely(!sd))
+		return;
+
+	if (!atomic_read(&sd->shared->nr_overloaded))
+		return;
+
+	for_each_cpu_wrap(cpu, sdo_mask(sd->shared), dst_cpu + 1) {
+		struct rq *rq = cpu_rq(cpu);
+		struct rq_flags rf;
+		struct lb_env env;
+
+		if (cpu == dst_cpu)
+			continue;
+
+		if (!cfs_rq_overloaded(rq))
+			continue;
+
+		rq_lock_irqsave(rq, &rf);
+
+		/*
+		 * Check again to ensure there are pullable tasks.
+		 * This is necessary because multiple rqs can pull
+		 * tasks at the same time. IOW contention on this
+		 * rq is heavy, so it is better to clear this cpu
+		 * from the overloaded mask.
+		 */
+		if (unlikely(!cfs_rq_overloaded(rq))) {
+			update_overload_status(rq);
+			rq_unlock_irqrestore(rq, &rf);
+			continue;
+		}
+
+		env = (struct lb_env) {
+			.sd		= sd,
+			.dst_cpu	= dst_cpu,
+			.dst_rq		= dst_rq,
+			.src_cpu	= cpu,
+			.src_rq		= rq,
+			.idle		= CPU_SCHED_IDLE, /* non-idle only */
+			.flags		= LBF_DST_PINNED, /* pin dst_cpu */
+		};
+
+		update_rq_clock(rq);
+		p = detach_one_task(&env);
+
+		/*
+		 * Update the overloaded mask lazily here. If the rq
+		 * is still overloaded, updating it just wastes cycles.
+		 * And it's OK even if the rq becomes un-overloaded,
+		 * since peeking at the rq's data without the lock in
+		 * later loops costs little (during which the rq can
+		 * even become overloaded again).
+		 */
+
+		rq_unlock(rq, &rf);
+
+		if (p) {
+			attach_one_task(dst_rq, p);
+			local_irq_restore(rf.flags);
+			return;
+		}
+
+		local_irq_restore(rf.flags);
+	}
+}
+
+/*
  * It checks each scheduling domain to see if it is due to be balanced,
  * and initiates a balancing operation if so.
  *
@@ -10284,6 +10402,10 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
 	u64 max_cost = 0;
 
 	rcu_read_lock();
+
+	if (need_pull_cfs_task(rq))
+		sched_idle_balance(rq);
+
 	for_each_domain(cpu, sd) {
 		/*
 		 * Decay the newidle max times here because this is a regular
@@ -10913,6 +11035,12 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	update_blocked_averages(this_cpu);
 
 	rcu_read_lock();
+
+	sched_idle_balance(this_rq);
+	t1 = sched_clock_cpu(this_cpu);
+	curr_cost += t1 - t0;
+	t0 = t1;
+
 	for_each_domain(this_cpu, sd) {
 		int continue_balancing = 1;
 		u64 domain_cost;
-- 
2.11.0



* [RFC PATCH 3/5] sched/fair: add stats for sched-idle balancing
  2022-02-17 15:43 [RFC PATCH 0/5] introduce sched-idle balancing Abel Wu
  2022-02-17 15:43 ` [RFC PATCH 1/5] sched/fair: record overloaded cpus Abel Wu
  2022-02-17 15:43 ` [RFC PATCH 2/5] sched/fair: introduce sched-idle balance Abel Wu
@ 2022-02-17 15:43 ` Abel Wu
  2022-02-17 15:44 ` [RFC PATCH 4/5] sched/fair: filter out overloaded cpus in sis Abel Wu
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Abel Wu @ 2022-02-17 15:43 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira
  Cc: linux-kernel

To better understand the behavior of sched-idle balancing, add
some statistics as the other load balancing mechanisms do.

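For reference, a small consumer-side sketch (not part of
the patch) that prints the new counters; it assumes the
three sib_* fields end up as the last three columns of
each domainN line in /proc/schedstat, which is how they
are appended here:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Print the last three fields of every "domainN ..." line. */
    int main(void)
    {
        char line[4096];
        FILE *fp = fopen("/proc/schedstat", "r");

        if (!fp) {
            perror("fopen");
            return 1;
        }

        while (fgets(line, sizeof(line), fp)) {
            unsigned long long f[64];
            char name[32];
            char *tok;
            int n = 0;

            if (strncmp(line, "domain", 6))
                continue;

            tok = strtok(line, " \n");        /* "domainN"        */
            snprintf(name, sizeof(name), "%s", tok);
            strtok(NULL, " \n");              /* skip the cpumask */

            while ((tok = strtok(NULL, " \n")) && n < 64)
                f[n++] = strtoull(tok, NULL, 10);

            if (n >= 3)
                printf("%s sib_peeked=%llu sib_pulled=%llu sib_failed=%llu\n",
                       name, f[n - 3], f[n - 2], f[n - 1]);
        }

        fclose(fp);
        return 0;
    }
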
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
 include/linux/sched/topology.h | 5 +++++
 kernel/sched/fair.c            | 6 +++++-
 kernel/sched/stats.c           | 5 +++--
 3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 03c9c81dc886..4259963d3e5e 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -150,6 +150,11 @@ struct sched_domain {
 	unsigned int ttwu_wake_remote;
 	unsigned int ttwu_move_affine;
 	unsigned int ttwu_move_balance;
+
+	/* sched-idle balancing */
+	unsigned int sib_peeked;
+	unsigned int sib_pulled;
+	unsigned int sib_failed;
 #endif
 #ifdef CONFIG_SCHED_DEBUG
 	char *name;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 070a6fb1d2bf..c83c0864e429 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10330,8 +10330,10 @@ static void sched_idle_balance(struct rq *dst_rq)
 		if (cpu == dst_cpu)
 			continue;
 
-		if (!cfs_rq_overloaded(rq))
+		if (!cfs_rq_overloaded(rq)) {
+			schedstat_inc(sd->sib_peeked);
 			continue;
+		}
 
 		rq_lock_irqsave(rq, &rf);
 
@@ -10375,10 +10377,12 @@ static void sched_idle_balance(struct rq *dst_rq)
 		if (p) {
 			attach_one_task(dst_rq, p);
 			local_irq_restore(rf.flags);
+			schedstat_inc(sd->sib_pulled);
 			return;
 		}
 
 		local_irq_restore(rf.flags);
+		schedstat_inc(sd->sib_failed);
 	}
 }
 
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 07dde2928c79..3ee476c72806 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -164,12 +164,13 @@ static int show_schedstat(struct seq_file *seq, void *v)
 				    sd->lb_nobusyg[itype]);
 			}
 			seq_printf(seq,
-				   " %u %u %u %u %u %u %u %u %u %u %u %u\n",
+				   " %u %u %u %u %u %u %u %u %u %u %u %u %u %u %u\n",
 			    sd->alb_count, sd->alb_failed, sd->alb_pushed,
 			    sd->sbe_count, sd->sbe_balanced, sd->sbe_pushed,
 			    sd->sbf_count, sd->sbf_balanced, sd->sbf_pushed,
 			    sd->ttwu_wake_remote, sd->ttwu_move_affine,
-			    sd->ttwu_move_balance);
+			    sd->ttwu_move_balance, sd->sib_peeked,
+			    sd->sib_pulled, sd->sib_failed);
 		}
 		rcu_read_unlock();
 #endif
-- 
2.11.0



* [RFC PATCH 4/5] sched/fair: filter out overloaded cpus in sis
  2022-02-17 15:43 [RFC PATCH 0/5] introduce sched-idle balancing Abel Wu
                   ` (2 preceding siblings ...)
  2022-02-17 15:43 ` [RFC PATCH 3/5] sched/fair: add stats for sched-idle balancing Abel Wu
@ 2022-02-17 15:44 ` Abel Wu
  2022-02-17 15:44 ` [RFC PATCH 5/5] sched/fair: favor cpu capacity for idle tasks Abel Wu
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Abel Wu @ 2022-02-17 15:44 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira
  Cc: linux-kernel

Skip overloaded cpus in SIS, if there are any. This
improves idle cpu searching, especially under the
SIS_PROP constraint where the search depth is limited.

The mask of overloaded cpus might not be entirely
accurate since it is generally updated at tick
granularity, but the overloaded cpus are unlikely to
go idle shortly anyway.

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
 kernel/sched/fair.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c83c0864e429..1d8f396e6f41 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6273,6 +6273,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
 
+	if (atomic_read(&sd->shared->nr_overloaded))
+		cpumask_andnot(cpus, cpus, sdo_mask(sd->shared));
+
 	if (sched_feat(SIS_PROP) && !has_idle_core) {
 		u64 avg_cost, avg_idle, span_avg;
 		unsigned long now = jiffies;
-- 
2.11.0



* [RFC PATCH 5/5] sched/fair: favor cpu capacity for idle tasks
  2022-02-17 15:43 [RFC PATCH 0/5] introduce sched-idle balancing Abel Wu
                   ` (3 preceding siblings ...)
  2022-02-17 15:44 ` [RFC PATCH 4/5] sched/fair: filter out overloaded cpus in sis Abel Wu
@ 2022-02-17 15:44 ` Abel Wu
  2022-02-24  3:19 ` [RFC PATCH 0/5] introduce sched-idle balancing Abel Wu
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Abel Wu @ 2022-02-17 15:44 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira
  Cc: linux-kernel

Unlike select_idle_sibling(), where we need to find a
not-so-bad candidate as soon as possible, the slowpath
gives us more tolerance: ignore sched-idle cpus for idle
tasks, since they prefer cpu capacity over latency; and
besides, spreading out idle tasks is also good for the
latency of normal tasks.

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
 kernel/sched/fair.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d8f396e6f41..57f1d8c43228 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6007,6 +6007,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu);
 static int
 find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 {
+	bool ignore_si = task_h_idle(p);
 	unsigned long load, min_load = ULONG_MAX;
 	unsigned int min_exit_latency = UINT_MAX;
 	u64 latest_idle_timestamp = 0;
@@ -6025,7 +6026,13 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
 		if (!sched_core_cookie_match(rq, p))
 			continue;
 
-		if (sched_idle_cpu(i))
+		/*
+		 * Idle tasks prefer cpu capacity rather than
+		 * latency. Spreading out idle tasks is also good
+		 * for the latency of normal tasks since they won't
+		 * suffer high cpu wakeup delays.
+		 */
+		if (!ignore_si && sched_idle_cpu(i))
 			return i;
 
 		if (available_idle_cpu(i)) {
-- 
2.11.0



* Re: [RFC PATCH 0/5] introduce sched-idle balancing
  2022-02-17 15:43 [RFC PATCH 0/5] introduce sched-idle balancing Abel Wu
                   ` (4 preceding siblings ...)
  2022-02-17 15:44 ` [RFC PATCH 5/5] sched/fair: favor cpu capacity for idle tasks Abel Wu
@ 2022-02-24  3:19 ` Abel Wu
  2022-02-24 15:20 ` Peter Zijlstra
  2022-02-24 16:47 ` Mel Gorman
  7 siblings, 0 replies; 22+ messages in thread
From: Abel Wu @ 2022-02-24  3:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ben Segall, Juri Lelli, Steven Rostedt, Mel Gorman,
	Vincent Guittot, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Ingo Molnar, linux-kernel, Abel Wu

Ping :)



* Re: [RFC PATCH 1/5] sched/fair: record overloaded cpus
  2022-02-17 15:43 ` [RFC PATCH 1/5] sched/fair: record overloaded cpus Abel Wu
@ 2022-02-24  7:10   ` Gautham R. Shenoy
  2022-02-24 14:36     ` Abel Wu
  2022-02-27  8:08     ` Aubrey Li
  0 siblings, 2 replies; 22+ messages in thread
From: Gautham R. Shenoy @ 2022-02-24  7:10 UTC (permalink / raw)
  To: Abel Wu
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, linux-kernel, srikar, aubrey.li

Hello Abel,

(+ Aubrey Li, Srikar)

On Thu, Feb 17, 2022 at 11:43:57PM +0800, Abel Wu wrote:
> An CFS runqueue is considered overloaded when there are
> more than one pullable non-idle tasks on it (since sched-
> idle cpus are treated as idle cpus). And idle tasks are
> counted towards rq->cfs.idle_h_nr_running, that is either
> assigned SCHED_IDLE policy or placed under idle cgroups.
> 
> The overloaded cfs rqs can cause performance issues to
> both task types:
> 
>   - for latency critical tasks like SCHED_NORMAL,
>     time of waiting in the rq will increase and
>     result in higher pct99 latency, and
> 
>   - batch tasks may not be able to make full use
>     of cpu capacity if sched-idle rq exists, thus
>     presents poorer throughput.
> 
> The mask of overloaded cpus is updated in periodic tick
> and the idle path at the LLC domain basis. This cpumask
> will also be used in SIS as a filter, improving idle cpu
> searching.

This is an interesting approach to minimise the tail latencies by
keeping track of the overloaded cpus in the LLC so that
idle/sched-idle CPUs can pull from them. This approach contrasts with the
following approaches that were previously tried :

1. Maintain the idle cpumask at the LLC level by Aubrey Li
   https://lore.kernel.org/all/1615872606-56087-1-git-send-email-aubrey.li@intel.com/
   
2. Maintain the identity of the idle core itself at the LLC level, by Srikar :
   https://lore.kernel.org/lkml/20210513074027.543926-3-srikar@linux.vnet.ibm.com/

There have been concerns in the past about having to update the shared
mask/counter at regular intervals. Srikar, Aubrey, any thoughts on
this?



> 
> @@ -7315,6 +7355,8 @@ done: __maybe_unused;
>  	if (new_tasks > 0)
>  		goto again;
>  
> +	update_overload_status(rq);
> +

So here, we are calling update_overload_status() after
newidle_balance(). If we pulled a single task as part of
newidle_balance(), in your current code we do not update the overload
status. While this should get remedied at the next tick, should we
move update_overload_status(rq) prior to the new_tasks > 0 check?


--
Thanks and Regards
gautham.


* Re: [RFC PATCH 1/5] sched/fair: record overloaded cpus
  2022-02-24  7:10   ` Gautham R. Shenoy
@ 2022-02-24 14:36     ` Abel Wu
  2022-02-27  8:08     ` Aubrey Li
  1 sibling, 0 replies; 22+ messages in thread
From: Abel Wu @ 2022-02-24 14:36 UTC (permalink / raw)
  To: Gautham R. Shenoy
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, linux-kernel, srikar, aubrey.li,
	Abel Wu

Hi Gautham, thanks for your comment!

On 2/24/22 3:10 PM, Gautham R. Shenoy wrote:
> Hello Abel,
> 
> (+ Aubrey Li, Srikar)
> 
> On Thu, Feb 17, 2022 at 11:43:57PM +0800, Abel Wu wrote:
>> An CFS runqueue is considered overloaded when there are
>> more than one pullable non-idle tasks on it (since sched-
>> idle cpus are treated as idle cpus). And idle tasks are
>> counted towards rq->cfs.idle_h_nr_running, that is either
>> assigned SCHED_IDLE policy or placed under idle cgroups.
>>
>> The overloaded cfs rqs can cause performance issues to
>> both task types:
>>
>>    - for latency critical tasks like SCHED_NORMAL,
>>      time of waiting in the rq will increase and
>>      result in higher pct99 latency, and
>>
>>    - batch tasks may not be able to make full use
>>      of cpu capacity if sched-idle rq exists, thus
>>      presents poorer throughput.
>>
>> The mask of overloaded cpus is updated in periodic tick
>> and the idle path at the LLC domain basis. This cpumask
>> will also be used in SIS as a filter, improving idle cpu
>> searching.
> 
> This is an interesting approach to minimise the tail latencies by
> keeping track of the overloaded cpus in the LLC so that
> idle/sched-idle CPUs can pull from them. This approach contrasts with the
> following approaches that were previously tried :
> 
> 1. Maintain the idle cpumask at the LLC level by Aubrey Li
>     https://lore.kernel.org/all/1615872606-56087-1-git-send-email-aubrey.li@intel.com/

It's a similar approach to SIS from a different angle. Both have pros
and cons, and I couldn't tell which one is more appropriate.
But since SIS can fail to find an idle cpu due to scaling issues,
sched-idle load balancing might be a valuable supplement to it in
consuming the idle/sched-idle cpus.

>     
> 2. Maintain the identity of the idle core itself at the LLC level, by Srikar :
>     https://lore.kernel.org/lkml/20210513074027.543926-3-srikar@linux.vnet.ibm.com/

Srikar's efforts seem focused on idle core searching, which
has a different goal from my approach, I think.
The case of short-running tasks pointed out by Vincent should not be a
problem for updating the overloaded cpu mask/counter, since they are
updated neither when a cpu becomes busy, nor when a cpu frequently
goes idle during a tick period.

> 
> There have been concerns in the past about having to update the shared
> mask/counter at regular intervals. Srikar, Aubrey any thoughts on this
> ?
> 

I'm afraid I haven't fully caught up with those discussions; it would
be appreciated if someone could shed some light, thanks!

> 
> 
>>
>> @@ -7315,6 +7355,8 @@ done: __maybe_unused;
>>   	if (new_tasks > 0)
>>   		goto again;
>>   
>> +	update_overload_status(rq);
>> +
> 
> So here, we are calling update_overload_status() after
> newidle_balance(). If we had pulled a single task as a part of
> newidle_balance(), in your current code, we do not update the overload
> status. While this should get remedied in the next tick, should we
> move update_overload_status(rq) prior to the new_tasks > 0 check ?

A single task won't change the overloaded status :)
And I think it would be better not to do an update even if several
tasks were pulled, because that would break the rate limiting, which
is undesired.

Best Regards,
	Abel

> 
> 
> --
> Thanks and Regards
> gautham.


* Re: [RFC PATCH 0/5] introduce sched-idle balancing
  2022-02-17 15:43 [RFC PATCH 0/5] introduce sched-idle balancing Abel Wu
                   ` (5 preceding siblings ...)
  2022-02-24  3:19 ` [RFC PATCH 0/5] introduce sched-idle balancing Abel Wu
@ 2022-02-24 15:20 ` Peter Zijlstra
  2022-02-24 15:29   ` Vincent Guittot
  2022-02-25  6:46   ` Abel Wu
  2022-02-24 16:47 ` Mel Gorman
  7 siblings, 2 replies; 22+ messages in thread
From: Peter Zijlstra @ 2022-02-24 15:20 UTC (permalink / raw)
  To: Abel Wu
  Cc: Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Ingo Molnar, Juri Lelli, Mel Gorman, Steven Rostedt,
	Vincent Guittot, linux-kernel

On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
> Current load balancing is mainly based on cpu capacity
> and task util, which makes sense in the POV of overall
> throughput. While there still might be some improvement
> can be done by reducing number of overloaded cfs rqs if
> sched-idle or idle rq exists.

I'm much confused, there is an explicit new-idle balancer and a periodic
idle balancer already there.


* Re: [RFC PATCH 0/5] introduce sched-idle balancing
  2022-02-24 15:20 ` Peter Zijlstra
@ 2022-02-24 15:29   ` Vincent Guittot
  2022-02-25  6:51     ` Abel Wu
  2022-02-25  6:46   ` Abel Wu
  1 sibling, 1 reply; 22+ messages in thread
From: Vincent Guittot @ 2022-02-24 15:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Abel Wu, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Ingo Molnar, Juri Lelli, Mel Gorman,
	Steven Rostedt, linux-kernel

On Thu, 24 Feb 2022 at 16:20, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
> > Current load balancing is mainly based on cpu capacity
> > and task util, which makes sense in the POV of overall
> > throughput. While there still might be some improvement
> > can be done by reducing number of overloaded cfs rqs if
> > sched-idle or idle rq exists.
>
> I'm much confused, there is an explicit new-idle balancer and a periodic
> idle balancer already there.

I agree. You failed to explain why newly_idle and periodic idle load
balance are not enough and why we need this new one.


* Re: [RFC PATCH 0/5] introduce sched-idle balancing
  2022-02-17 15:43 [RFC PATCH 0/5] introduce sched-idle balancing Abel Wu
                   ` (6 preceding siblings ...)
  2022-02-24 15:20 ` Peter Zijlstra
@ 2022-02-24 16:47 ` Mel Gorman
  2022-02-25  8:15   ` Abel Wu
  7 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2022-02-24 16:47 UTC (permalink / raw)
  To: Abel Wu
  Cc: Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Ingo Molnar, Juri Lelli, Peter Zijlstra, Steven Rostedt,
	Vincent Guittot, linux-kernel

On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
> Current load balancing is mainly based on cpu capacity
> and task util, which makes sense in the POV of overall
> throughput. While there still might be some improvement
> can be done by reducing number of overloaded cfs rqs if
> sched-idle or idle rq exists.
> 
> An CFS runqueue is considered overloaded when there are
> more than one pullable non-idle tasks on it (since sched-
> idle cpus are treated as idle cpus). And idle tasks are
> counted towards rq->cfs.idle_h_nr_running, that is either
> assigned SCHED_IDLE policy or placed under idle cgroups.
> 

It's not clear how your tests evaluated the balancing of SCHED_IDLE tasks
versus the existing idle balancing and isolated that impact. I suspect
the tests may have primarily measured the effect of the SIS filter.

> So in short, the goal of the sched-idle balancing is to
> let the *non-idle tasks* make full use of cpu resources.
> To achieve that, we mainly do two things:
> 
>   - pull non-idle tasks for sched-idle or idle rqs
>     from the overloaded ones, and
> 
>   - prevent pulling the last non-idle task in an rq
> 
> The mask of overloaded cpus is updated in periodic tick
> and the idle path at the LLC domain basis. This cpumask
> will also be used in SIS as a filter, improving idle cpu
> searching.
> 

As the overloaded mask may be updated on each idle, it could be a
significant source of cache misses between CPUs sharing the domain for
workloads that rapidly idle so there should be data on whether cache misses
are increased heavily. It also potentially delays the CPU reaching idle
but it may not be by much.

The filter may be out of date. It takes up to one tick to detect
overloaded and the filter to have a positive impact. As a CPU is not
guaranteed to enter idle if there is at least one CPU-bound task, it may
also be up to 1 tick before the mask is cleared. I'm not sure this is a
serious problem though as SIS would not pick the CPU with the CPU-bound
task anyway.

At minimum, the filter should be split out and considered first as it
is the most likely reason why a performance difference was measured. It
has some oddities like why nr_overloaded is really a boolean and as
it's under rq lock, it's not clear why it's atomic. The changelog
would ideally contain some comment on the impact to cache misses
if any and some sort of proof that SIS search depth is reduced which
https://lore.kernel.org/lkml/20210726102247.21437-2-mgorman@techsingularity.net/
may be some help.

At that point, compare the idle task balancing on top to isolate how
much it improves things if any and identify why existing balancing is
insufficient. Split out the can_migrate_task change beforehand in case it
is the main source of difference as opposed to the new balancing mechanism.

-- 
Mel Gorman
SUSE Labs


* Re: [RFC PATCH 0/5] introduce sched-idle balancing
  2022-02-24 15:20 ` Peter Zijlstra
  2022-02-24 15:29   ` Vincent Guittot
@ 2022-02-25  6:46   ` Abel Wu
  2022-02-25  8:29     ` Vincent Guittot
  1 sibling, 1 reply; 22+ messages in thread
From: Abel Wu @ 2022-02-25  6:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Ingo Molnar, Juri Lelli, Mel Gorman, Steven Rostedt,
	Vincent Guittot, linux-kernel

Hi Peter,

On 2/24/22 11:20 PM, Peter Zijlstra Wrote:
> On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
>> Current load balancing is mainly based on cpu capacity
>> and task util, which makes sense in the POV of overall
>> throughput. While there still might be some improvement
>> can be done by reducing number of overloaded cfs rqs if
>> sched-idle or idle rq exists.
> 
> I'm much confused, there is an explicit new-idle balancer and a periodic
> idle balancer already there.

The two balancers are triggered on the rqs that have no tasks on them,
and load_balance() doesn't seem to show a preference for non-idle tasks,
so there is a possibility that only idle tasks are pulled during load
balance while overloaded rqs (rq->cfs.h_nr_running > 1) exist. As a
result the normal tasks, mostly latency-critical ones in our case, on
such an overloaded rq still suffer from waiting for each other. I
observed this through perf sched.

IOW the main difference from the POV of load_balance() between the
latency-critical tasks and the idle ones is load.

The sched-idle balancer is triggered periodically on the sched-idle
rqs and on the newly-idle ones. It does a 'fast' pull of non-idle tasks
from the overloaded rqs to the sched-idle/idle ones to let the non-idle
tasks make full use of cpu resources.

The sched-idle balancer only focuses on non-idle tasks' performance, so
it can introduce overall load imbalance, and that's why I put it before
load_balance().

Best Regards,
	Abel


* Re: [RFC PATCH 0/5] introduce sched-idle balancing
  2022-02-24 15:29   ` Vincent Guittot
@ 2022-02-25  6:51     ` Abel Wu
  0 siblings, 0 replies; 22+ messages in thread
From: Abel Wu @ 2022-02-25  6:51 UTC (permalink / raw)
  To: Vincent Guittot, Peter Zijlstra
  Cc: Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Ingo Molnar, Juri Lelli, Mel Gorman, Steven Rostedt,
	linux-kernel, Abel Wu

On 2/24/22 11:29 PM, Vincent Guittot Wrote:
> On Thu, 24 Feb 2022 at 16:20, Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
>>> Current load balancing is mainly based on cpu capacity
>>> and task util, which makes sense in the POV of overall
>>> throughput. While there still might be some improvement
>>> can be done by reducing number of overloaded cfs rqs if
>>> sched-idle or idle rq exists.
>>
>> I'm much confused, there is an explicit new-idle balancer and a periodic
>> idle balancer already there.
> 
> I agree, You failed to explain why newly_idle and periodic idle load
> balance are not enough and we need this new one

Hi Vincent, sorry for not giving a clearer explanation. Please check
my previous email replying to Peter, thanks.

Best Regards,
	Abel


* Re: [RFC PATCH 0/5] introduce sched-idle balancing
  2022-02-24 16:47 ` Mel Gorman
@ 2022-02-25  8:15   ` Abel Wu
  2022-02-25 10:16     ` Mel Gorman
  0 siblings, 1 reply; 22+ messages in thread
From: Abel Wu @ 2022-02-25  8:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Ingo Molnar, Juri Lelli, Peter Zijlstra, Steven Rostedt,
	Vincent Guittot, linux-kernel, Abel Wu

Hi Mel,

On 2/25/22 12:47 AM, Mel Gorman Wrote:
> On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
>> Current load balancing is mainly based on cpu capacity
>> and task util, which makes sense in the POV of overall
>> throughput. While there still might be some improvement
>> can be done by reducing number of overloaded cfs rqs if
>> sched-idle or idle rq exists.
>>
>> An CFS runqueue is considered overloaded when there are
>> more than one pullable non-idle tasks on it (since sched-
>> idle cpus are treated as idle cpus). And idle tasks are
>> counted towards rq->cfs.idle_h_nr_running, that is either
>> assigned SCHED_IDLE policy or placed under idle cgroups.
>>
> 
> It's not clear how your tests evaluated the balancing of SCHED_IDLE tasks
> versus the existing idle balancing and isolated that impact. I suspect
> the tests may primarily measured the effect of the SIS filter.

The sched-idle balancing doesn't really care about the idle tasks.
It tries to improve the non-idle tasks' performance by spreading
them out to make full use of cpu capacity.

I will do some individual tests on the SIS filter and on the
sched-idle balancer separately, and keep you informed.

> 
>> So in short, the goal of the sched-idle balancing is to
>> let the *non-idle tasks* make full use of cpu resources.
>> To achieve that, we mainly do two things:
>>
>>    - pull non-idle tasks for sched-idle or idle rqs
>>      from the overloaded ones, and
>>
>>    - prevent pulling the last non-idle task in an rq
>>
>> The mask of overloaded cpus is updated in periodic tick
>> and the idle path at the LLC domain basis. This cpumask
>> will also be used in SIS as a filter, improving idle cpu
>> searching.
>>
> 
> As the overloaded mask may be updated on each idle, it could be a
> significant source of cache misses between CPUs sharing the domain for
> workloads that rapidly idle so there should be data on whether cache misses
> are increased heavily. It also potentially delays the CPU reaching idle
> but it may not be by much.

Yes, that's why I cached the overloaded status in rq->overloaded. So
in this case of short-running tasks, when cpus rapidly/frequently go
idle, the cpumask/counter are not actually updated if the cached
status is already 0 (not overloaded).

> 
> The filter may be out of date. It takes up to one tick to detect
> overloaded and the filter to have a positive impact. As a CPU is not
> guaranteed to enter idle if there is at least one CPU-bound task, it may
> also be up to 1 tick before the mask is cleared. I'm not sure this is a
> serious problem though as SIS would not pick the CPU with the CPU-bound
> task anyway.

Yes, it can be out of date, but increasing the accuracy means more
frequent updates, which would introduce the cache issues you mentioned
above. Rate-limiting the updates to the tick, at the LLC level, might
be an acceptable tradeoff I presume.

> 
> At minimum, the filter should be split out and considered first as it
> is the most likely reason why a performance difference was measured. It
> has some oddities like why nr_overloaded is really a boolean and as
> it's under rq lock, it's not clear why it's atomic. The changelog
> would ideally contain some comment on the impact to cache misses
> if any and some sort of proof that SIS search depth is reduced which
> https://lore.kernel.org/lkml/20210726102247.21437-2-mgorman@techsingularity.net/
> may be some help.
> 
> At that point, compare the idle task balancing on top to isolate how
> much it improves things if any and identify why existing balancing is
> insufficient. Split out the can_migrate_task change beforehand in case it
> is the main source of difference as opposed to the new balancing mechanism.
> 

The nr_overloaded counter sits in the shared domain structure, thus
shared across the LLC domain, and needs to be atomic_t, while
rq->overloaded is a boolean updated under the rq lock. Maybe the
naming causes some confusion; please enlighten me if you have a
better idea :)

And yes, I agree it would be nice if test results on SIS search
depth could be shown. I actually did the test, but the result
didn't show a reduction in depth because sched-idle balancing
will also consume sched-idle/idle cpus. I will apply your patch
and do some further tests on that, thanks.

Best Regards,
	Abel


* Re: [RFC PATCH 0/5] introduce sched-idle balancing
  2022-02-25  6:46   ` Abel Wu
@ 2022-02-25  8:29     ` Vincent Guittot
  2022-02-25 10:46       ` Abel Wu
  0 siblings, 1 reply; 22+ messages in thread
From: Vincent Guittot @ 2022-02-25  8:29 UTC (permalink / raw)
  To: Abel Wu
  Cc: Peter Zijlstra, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Ingo Molnar, Juri Lelli, Mel Gorman,
	Steven Rostedt, linux-kernel

On Fri, 25 Feb 2022 at 07:46, Abel Wu <wuyun.abel@bytedance.com> wrote:
>
> Hi Peter,
>
> On 2/24/22 11:20 PM, Peter Zijlstra Wrote:
> > On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
> >> Current load balancing is mainly based on cpu capacity
> >> and task util, which makes sense in the POV of overall
> >> throughput. While there still might be some improvement
> >> can be done by reducing number of overloaded cfs rqs if
> >> sched-idle or idle rq exists.
> >
> > I'm much confused, there is an explicit new-idle balancer and a periodic
> > idle balancer already there.
>
> The two balancers are triggered on the rqs that have no tasks on them,
> and load_balance() seems don't show a preference for non-idle tasks so

The load balance will happen at the idle pace if a sched_idle task is
running on the cpu, so you will have an ILB on each cpu that runs a
sched-idle task.

> there might be possibility that only idle tasks are pulled during load
> balance while overloaded rqs (rq->cfs.h_nr_running > 1) exist. As a

There is an LB_MIN feature (disabled by default) that filters out
tasks with very low load (< 16), which includes sched-idle tasks,
which have a max load of 3.

> result the normal tasks, mostly latency-critical ones in our case, on
> that overloaded rq still suffer waiting for each other. I observed this
> through perf sched.
>
> IOW the main difference from the POV of load_balance() between the
> latency-critical tasks and the idle ones is load.
>
> The sched-idle balancer is triggered on the sched-idle rqs periodically
> and the newly-idle ones. It does a 'fast' pull of non-idle tasks from
> the overloaded rqs to the sched-idle/idle ones to let the non-idle tasks
> make full use of cpu resources.
>
> The sched-idle balancer only focuses on non-idle tasks' performance, so
> it can introduce overall load imbalance, and that's why I put it before
> load_balance().

Given the very low weight of a sched-idle task, I don't expect
much imbalance because of sched-idle tasks. But this also depends on
the number of sched-idle tasks.


>
> Best Regards,
>         Abel


* Re: [RFC PATCH 0/5] introduce sched-idle balancing
  2022-02-25  8:15   ` Abel Wu
@ 2022-02-25 10:16     ` Mel Gorman
  2022-02-25 13:20       ` Abel Wu
  0 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2022-02-25 10:16 UTC (permalink / raw)
  To: Abel Wu
  Cc: Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Ingo Molnar, Juri Lelli, Peter Zijlstra, Steven Rostedt,
	Vincent Guittot, linux-kernel

On Fri, Feb 25, 2022 at 04:15:06PM +0800, Abel Wu wrote:
> > As the overloaded mask may be updated on each idle, it could be a
> > significant source of cache misses between CPUs sharing the domain for
> > workloads that rapidly idle so there should be data on whether cache misses
> > are increased heavily. It also potentially delays the CPU reaching idle
> > but it may not be by much.
> 
> Yes, that's why I cached overloaded status in rq->overloaded. So in
> this case of short running tasks, when cpus rapidly/frequently go
> idle, the cpumask/counter are not actually updated if the cached
> status is already 0 (not overloaded).
> 

Which is a good idea in some respects. It tries to limit the number of
updates and treats it as a boolean, but it's probably prone to races.

> > The filter may be out of date. It takes up to one tick to detect
> > overloaded and the filter to have a positive impact. As a CPU is not
> > guaranteed to enter idle if there is at least one CPU-bound task, it may
> > also be up to 1 tick before the mask is cleared. I'm not sure this is a
> > serious problem though as SIS would not pick the CPU with the CPU-bound
> > task anyway.
> 
> Yes, it can be out of date, but increasing the accuracy means more
> frequent update which would introduce cache issues you mentioned
> above. Rate limit the updating to tick at the LLC basis might be an
> acceptable tradeoff I presume.
> 
> > 
> > At minimum, the filter should be split out and considered first as it
> > is the most likely reason why a performance difference was measured. It
> > has some oddities like why nr_overloaded is really a boolean and as
> > it's under rq lock, it's not clear why it's atomic. The changelog
> > would ideally contain some comment on the impact to cache misses
> > if any and some sort of proof that SIS search depth is reduced which
> > https://lore.kernel.org/lkml/20210726102247.21437-2-mgorman@techsingularity.net/
> > may be some help.
> > 
> > At that point, compare the idle task balancing on top to isolate how
> > much it improves things if any and identify why existing balancing is
> > insufficient. Split out the can_migrate_task change beforehand in case it
> > is the main source of difference as opposed to the new balancing mechanism.
> > 
> 
> The nr_overloaded sits in shared domain structure thus shared in
> LLC domain and needs to be atomic_t, while rq->overloaded is a
> boolean updated under rq lock. Maybe the naming can cause some
> confusion, please lighten me up if you have better idea :)
> 

The naming doesn't help because it's not really "the number of overloaded
rq's". atomic_t would be slightly safer against parallel updates, but
it's still race-prone. I didn't think about it deeply, but I suspect that
two separate rq's could disagree on what the boolean value should be if
one rq is overloaded, the other is not, and they are updating via the idle
path at the same time. This can probably happen because the locking is
rq-based and the cpumask is shared. On the flip side, making it an accurate
count would result in more updates and incur cache misses, as well as
probably needing a cpumask check instead of a nr_overloaded comparison to
determine whether the rq is already accounted for, so it costs more. You
are very likely trading accuracy against the cost of updates.

Whichever choice you make, add comments on the pros/cons and describe
the limitations of either approach. E.g. if overloaded is effectively a
boolean, describe its limitations in a comment.

> And yes, I agree it would be nice if test result on SIS search
> depth can be shown, and I actually did the test, but the result
> didn't show a reduction in depth due to sched-idle balancing
> will also consume sched-idle/idle cpus. I will apply your patch
> and make some further tests on that, thanks.
> 

Just remember to use the patch to measure changes in SIS depth, but
performance figures should not include the patch, as the schedstat
overhead distorts results.

Also, place the filter first and measure any change from the new
balancing against the filter alone. I'm suggesting placing the filter first
because it's less controversial than a new balancer. Just be aware that
the filter alone is not a guarantee of merging, as there have been a few
approaches to filtering and so far all of them had downsides on either cost
or accuracy. IIRC the only active approach to reducing search cost in SIS
is https://lore.kernel.org/all/20220207034013.599214-1-yu.c.chen@intel.com/
and it's likely to get a new version due to
https://lore.kernel.org/all/20220207135253.GF23216@worktop.programming.kicks-ass.net/.
It also updates sched_domain_shared but with a single boolean instead of
an atomic+cpumask.

-- 
Mel Gorman
SUSE Labs


* Re: [RFC PATCH 0/5] introduce sched-idle balancing
  2022-02-25  8:29     ` Vincent Guittot
@ 2022-02-25 10:46       ` Abel Wu
  2022-02-25 13:15         ` Vincent Guittot
  0 siblings, 1 reply; 22+ messages in thread
From: Abel Wu @ 2022-02-25 10:46 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Ingo Molnar, Juri Lelli, Mel Gorman,
	Steven Rostedt, linux-kernel, Abel Wu

On 2/25/22 4:29 PM, Vincent Guittot Wrote:
> On Fri, 25 Feb 2022 at 07:46, Abel Wu <wuyun.abel@bytedance.com> wrote:
>>
>> Hi Peter,
>>
>> On 2/24/22 11:20 PM, Peter Zijlstra Wrote:
>>> On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
>>>> Current load balancing is mainly based on cpu capacity
>>>> and task util, which makes sense in the POV of overall
>>>> throughput. While there still might be some improvement
>>>> can be done by reducing number of overloaded cfs rqs if
>>>> sched-idle or idle rq exists.
>>>
>>> I'm much confused, there is an explicit new-idle balancer and a periodic
>>> idle balancer already there.
>>
>> The two balancers are triggered on the rqs that have no tasks on them,
>> and load_balance() seems don't show a preference for non-idle tasks so
> 
> The load balance will happen at the idle pace if a sched_idle task is
> running on the cpu so you will have an ILB on each cpu that run a
> sched-idle task

I'm afraid I don't quite follow you. Since the sched-idle balancer doesn't
touch the ILB part, can you elaborate on this? Thanks.

> 
>> there might be possibility that only idle tasks are pulled during load
>> balance while overloaded rqs (rq->cfs.h_nr_running > 1) exist. As a
> 
> There is a LB_MIN feature (disable by default) that filters task with
> very low load ( < 16) which includes sched-idle task which has a max
> load of 3

This feature might not be that friendly to the situation where only
sched-idle tasks are running in the system. And this situation
can last more than half a day in our co-location systems, in which
the training/batch tasks are placed under idle cgroups or directly
assigned to SCHED_IDLE.
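
For context, "directly assigned to SCHED_IDLE" means something like the
minimal userspace sketch below (sched_setattr() has no glibc wrapper, so
it goes through syscall(); "chrt -i 0 <cmd>" does the same thing):

#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/sched/types.h>	/* struct sched_attr */

static int make_self_sched_idle(void)
{
	struct sched_attr attr = {
		.size		= sizeof(attr),
		.sched_policy	= SCHED_IDLE,
	};

	/* pid 0 == calling thread, flags == 0 */
	return syscall(SYS_sched_setattr, 0, &attr, 0);
}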

> 
>> result the normal tasks, mostly latency-critical ones in our case, on
>> that overloaded rq still suffer waiting for each other. I observed this
>> through perf sched.
>>
>> IOW the main difference from the POV of load_balance() between the
>> latency-critical tasks and the idle ones is load.
>>
>> The sched-idle balancer is triggered on the sched-idle rqs periodically
>> and the newly-idle ones. It does a 'fast' pull of non-idle tasks from
>> the overloaded rqs to the sched-idle/idle ones to let the non-idle tasks
>> make full use of cpu resources.
>>
>> The sched-idle balancer only focuses on non-idle tasks' performance, so
>> it can introduce overall load imbalance, and that's why I put it before
>> load_balance().
> 
> According to the very low weight of a sched-idle task, I don't expect
> much imbalance because of sched-idle tasks. But this also depends of
> the number of sched-idle task.
> 
> 
>>
>> Best Regards,
>>          Abel


* Re: [RFC PATCH 0/5] introduce sched-idle balancing
  2022-02-25 10:46       ` Abel Wu
@ 2022-02-25 13:15         ` Vincent Guittot
  0 siblings, 0 replies; 22+ messages in thread
From: Vincent Guittot @ 2022-02-25 13:15 UTC (permalink / raw)
  To: Abel Wu
  Cc: Peter Zijlstra, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Ingo Molnar, Juri Lelli, Mel Gorman,
	Steven Rostedt, linux-kernel

On Fri, 25 Feb 2022 at 11:46, Abel Wu <wuyun.abel@bytedance.com> wrote:
>
> On 2/25/22 4:29 PM, Vincent Guittot Wrote:
> > On Fri, 25 Feb 2022 at 07:46, Abel Wu <wuyun.abel@bytedance.com> wrote:
> >>
> >> Hi Peter,
> >>
> >> On 2/24/22 11:20 PM, Peter Zijlstra Wrote:
> >>> On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
> >>>> Current load balancing is mainly based on cpu capacity
> >>>> and task util, which makes sense in the POV of overall
> >>>> throughput. While there still might be some improvement
> >>>> can be done by reducing number of overloaded cfs rqs if
> >>>> sched-idle or idle rq exists.
> >>>
> >>> I'm much confused, there is an explicit new-idle balancer and a periodic
> >>> idle balancer already there.
> >>
> >> The two balancers are triggered on the rqs that have no tasks on them,
> >> and load_balance() seems don't show a preference for non-idle tasks so
> >
> > The load balance will happen at the idle pace if a sched_idle task is
> > running on the cpu so you will have an ILB on each cpu that run a
> > sched-idle task
>
> I'm afraid I don't quite follow you, since sched-idle balancer doesn't
> touch the ILB part, can you elaborate on this? Thanks.

I was referring to your sentence "The two balancers are triggered on
the rqs that have no tasks on them". When there are only sched-idle
tasks on a rq, the load_balance behaves like the Idle Load Balance when
there is no task, i.e. it runs as often.

>
> >
> >> there might be possibility that only idle tasks are pulled during load
> >> balance while overloaded rqs (rq->cfs.h_nr_running > 1) exist. As a
> >
> > There is a LB_MIN feature (disable by default) that filters task with
> > very low load ( < 16) which includes sched-idle task which has a max
> > load of 3

But we could easily change this, e.g. if !sched_idle_cpus then LB can
migrate only cfs tasks, otherwise it can migrate sched_idle tasks as well,
instead of creating another side channel.

>
> This feature might not that friendly to the situation that only
> sched-idle tasks are running in the system. And this situation
> can last more than half a day in our co-location systems in which
> the training/batch tasks are placed under idle groups or directly
> assigned to SCHED_IDLE.
>
> >
> >> result the normal tasks, mostly latency-critical ones in our case, on
> >> that overloaded rq still suffer waiting for each other. I observed this
> >> through perf sched.
> >>
> >> IOW the main difference from the POV of load_balance() between the
> >> latency-critical tasks and the idle ones is load.
> >>
> >> The sched-idle balancer is triggered on the sched-idle rqs periodically
> >> and the newly-idle ones. It does a 'fast' pull of non-idle tasks from
> >> the overloaded rqs to the sched-idle/idle ones to let the non-idle tasks
> >> make full use of cpu resources.
> >>
> >> The sched-idle balancer only focuses on non-idle tasks' performance, so
> >> it can introduce overall load imbalance, and that's why I put it before
> >> load_balance().
> >
> > According to the very low weight of a sched-idle task, I don't expect
> > much imbalance because of sched-idle tasks. But this also depends of
> > the number of sched-idle task.
> >
> >
> >>
> >> Best Regards,
> >>          Abel


* Re: [RFC PATCH 0/5] introduce sched-idle balancing
  2022-02-25 10:16     ` Mel Gorman
@ 2022-02-25 13:20       ` Abel Wu
  2022-03-02  0:41         ` Josh Don
  0 siblings, 1 reply; 22+ messages in thread
From: Abel Wu @ 2022-02-25 13:20 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ben Segall, Daniel Bristot de Oliveira, Dietmar Eggemann,
	Ingo Molnar, Juri Lelli, Peter Zijlstra, Steven Rostedt,
	Vincent Guittot, linux-kernel, Abel Wu

Hi Mel, thanks a lot for your review!

On 2/25/22 6:16 PM, Mel Gorman Wrote:
> On Fri, Feb 25, 2022 at 04:15:06PM +0800, Abel Wu wrote:
>>> As the overloaded mask may be updated on each idle, it could be a
>>> significant source of cache misses between CPUs sharing the domain for
>>> workloads that rapidly idle so there should be data on whether cache misses
>>> are increased heavily. It also potentially delays the CPU reaching idle
>>> but it may not be by much.
>>
>> Yes, that's why I cached overloaded status in rq->overloaded. So in
>> this case of short running tasks, when cpus rapidly/frequently go
>> idle, the cpumask/counter are not actually updated if the cached
>> status is already 0 (not overloaded).
>>
> 
> Which is a good idea in some respects. It tries to limit the number of
> updates and treats it as a boolean but it's probably prone to races.
> 
>>> The filter may be out of date. It takes up to one tick to detect
>>> overloaded and the filter to have a positive impact. As a CPU is not
>>> guaranteed to enter idle if there is at least one CPU-bound task, it may
>>> also be up to 1 tick before the mask is cleared. I'm not sure this is a
>>> serious problem though as SIS would not pick the CPU with the CPU-bound
>>> task anyway.
>>
>> Yes, it can be out of date, but increasing the accuracy means more
>> frequent update which would introduce cache issues you mentioned
>> above. Rate limit the updating to tick at the LLC basis might be an
>> acceptable tradeoff I presume.
>>
>>>
>>> At minimum, the filter should be split out and considered first as it
>>> is the most likely reason why a performance difference was measured. It
>>> has some oddities like why nr_overloaded is really a boolean and as
>>> it's under rq lock, it's not clear why it's atomic. The changelog
>>> would ideally contain some comment on the impact to cache misses
>>> if any and some sort of proof that SIS search depth is reduced which
>>> https://lore.kernel.org/lkml/20210726102247.21437-2-mgorman@techsingularity.net/
>>> may be some help.
>>>
>>> At that point, compare the idle task balancing on top to isolate how
>>> much it improves things if any and identify why existing balancing is
>>> insufficient. Split out the can_migrate_task change beforehand in case it
>>> is the main source of difference as opposed to the new balancing mechanism.
>>>
>>
>> The nr_overloaded sits in shared domain structure thus shared in
>> LLC domain and needs to be atomic_t, while rq->overloaded is a
>> boolean updated under rq lock. Maybe the naming can cause some
>> confusion, please lighten me up if you have better idea :)
>>
> 
> The naming doesn't help because it's not really "the number of overloaded
> rq's". atomic_t would be slightly safer against parallel updates but
> it's race prone. I didn't think about it deeply but I suspect that two
> separate rq's could disagree on what the boolean value should be if one rq
> is overloaded, the other is not and they are updating via the idle path at
> the same time. This probably can happen because the locking is rq based
> and the cpumask is shared. On the flip side, making it an accurate count
> would result in more updates and incur cache misses as well as probably
> needing a cpumask check instead of a nr_overloaded comparison to determine
> if the rq is already accounted for so it costs more. You are very likely
> trading accuracy versus cost of update.

The boolean value (rq->overloaded) is accessed under the rq lock, and is
almost always accessed by its own rq, except for the very rare case in
sched_idle_balance() where a double check on cfs_rq_overloaded() fails.
So this value should be accurate and have good data locality.
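
To make that concrete, the update path is roughly the sketch below;
cfs_rq_overloaded() and the shared field names are as used in this
discussion, not necessarily exactly as in the patch:

static void update_overloaded_rq(struct rq *rq)
{
	struct sched_domain_shared *sds;
	bool overloaded = cfs_rq_overloaded(rq);  /* >1 pullable non-idle tasks */

	lockdep_assert_rq_held(rq);

	/* cached per-rq state: bail out early, no shared cacheline touched */
	if (rq->overloaded == overloaded)
		return;

	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu_of(rq)));
	if (!sds)
		return;

	if (overloaded) {
		cpumask_set_cpu(cpu_of(rq), to_cpumask(sds->overloaded_mask));
		atomic_inc(&sds->nr_overloaded);
	} else {
		cpumask_clear_cpu(cpu_of(rq), to_cpumask(sds->overloaded_mask));
		atomic_dec(&sds->nr_overloaded);
	}

	rq->overloaded = overloaded;
}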

But as you said, nr_overloaded and the cpumask are race-prone in the
following pattern in my patches:

	if (nr_overloaded > 0)
		/* nr_overloaded can be zero now */
		read(overloaded_mask);

Since the mask is accessed without the rq locked, the cost might not be
too much. This is quite similar to the unlocked idle_cpu() usage in SIS,
I guess.
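
Concretely, the lockless reader side in SIS would be roughly the sketch
below (same assumed field names as above), reading the shared state
without any rq lock, just like the idle_cpu() checks:

	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
	struct sched_domain_shared *sds = sd->shared;

	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

	/* filter: drop cpus currently marked overloaded before scanning */
	if (sds && atomic_read(&sds->nr_overloaded))
		cpumask_andnot(cpus, cpus, to_cpumask(sds->overloaded_mask));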

> 
> Whichever choice you make, add comments on the pros/cons and describe
> the limitation of either approach. e.g. if overloaded is effectively a
> boolean, describe in a comment the limitations.

OK, will do.

> 
>> And yes, I agree it would be nice if test result on SIS search
>> depth can be shown, and I actually did the test, but the result
>> didn't show a reduction in depth due to sched-idle balancing
>> will also consume sched-idle/idle cpus. I will apply your patch
>> and make some further tests on that, thanks.
>>
> 
> Just remember to use the patch to measure changes in SIS depth but
> performance figures should not include the patch as the schedstat
> overhead distorts results.

Yes, agreed.

> 
> Also place the filter first and do any measurements of any change to
> balancing versus the filter. I'm suggesting placing the filter first
> because it's less controversial than a new balancer. Just be aware that
> the filter alone is not a guarantee of merging as there have been a few
> approaches to filtering and so far all of them had downsides on either cost

Yes, understood. I will adjust the patches as you suggested and send v2
together with more tests next week.

> or accuracy. IIRC the only active approach to reducing search cost in SIS
> is https://lore.kernel.org/all/20220207034013.599214-1-yu.c.chen@intel.com/
> and it's likely to get a new version due to
> https://lore.kernel.org/all/20220207135253.GF23216@worktop.programming.kicks-ass.net/.
> It also updates sched_domain_shared but with a single boolean instead of
> an atomic+cpumask.
> 

Chen Yu's patch disables idle cpu searching in SIS when the LLC domain
is overloaded (that is, 85% capacity usage), and Peter suggested he use
this metric to replace/improve the SIS_PROP feature so that the search
depth varies gently.

I don't think either of the two approaches conflicts with mine: they
reduce the effort of searching when the system is busy and cpus are
not likely to be idle, while mine has the sched-idle/idle cpus consume
themselves by pulling non-idle tasks from overloaded rqs, so there will
be fewer sched-idle/idle cpus.

Thanks and best regards,
	Abel


* Re: [RFC PATCH 1/5] sched/fair: record overloaded cpus
  2022-02-24  7:10   ` Gautham R. Shenoy
  2022-02-24 14:36     ` Abel Wu
@ 2022-02-27  8:08     ` Aubrey Li
  1 sibling, 0 replies; 22+ messages in thread
From: Aubrey Li @ 2022-02-27  8:08 UTC (permalink / raw)
  To: Gautham R. Shenoy, Abel Wu
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, linux-kernel, srikar, aubrey.li

On 2/24/22 3:10 PM, Gautham R. Shenoy wrote:
> Hello Abel,
> 
> (+ Aubrey Li, Srikar)
> 
> On Thu, Feb 17, 2022 at 11:43:57PM +0800, Abel Wu wrote:
>> An CFS runqueue is considered overloaded when there are
>> more than one pullable non-idle tasks on it (since sched-
>> idle cpus are treated as idle cpus). And idle tasks are
>> counted towards rq->cfs.idle_h_nr_running, that is either
>> assigned SCHED_IDLE policy or placed under idle cgroups.
>>
>> The overloaded cfs rqs can cause performance issues to
>> both task types:
>>
>>   - for latency critical tasks like SCHED_NORMAL,
>>     time of waiting in the rq will increase and
>>     result in higher pct99 latency, and
>>
>>   - batch tasks may not be able to make full use
>>     of cpu capacity if sched-idle rq exists, thus
>>     presents poorer throughput.
>>
>> The mask of overloaded cpus is updated in periodic tick
>> and the idle path at the LLC domain basis. This cpumask
>> will also be used in SIS as a filter, improving idle cpu
>> searching.
> 
> This is an interesting approach to minimise the tail latencies by
> keeping track of the overloaded cpus in the LLC so that
> idle/sched-idle CPUs can pull from them. This approach contrasts with the
> following approaches that were previously tried :
> 
> 1. Maintain the idle cpumask at the LLC level by Aubrey Li
>    https://lore.kernel.org/all/1615872606-56087-1-git-send-email-aubrey.li@intel.com/
>    
> 2. Maintain the identity of the idle core itself at the LLC level, by Srikar :
>    https://lore.kernel.org/lkml/20210513074027.543926-3-srikar@linux.vnet.ibm.com/
> 
> There have been concerns in the past about having to update the shared
> mask/counter at regular intervals. Srikar, Aubrey any thoughts on this
> ?
> 
https://lkml.org/lkml/2022/2/7/1129


* Re: [RFC PATCH 0/5] introduce sched-idle balancing
  2022-02-25 13:20       ` Abel Wu
@ 2022-03-02  0:41         ` Josh Don
  0 siblings, 0 replies; 22+ messages in thread
From: Josh Don @ 2022-03-02  0:41 UTC (permalink / raw)
  To: Abel Wu
  Cc: Mel Gorman, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, Ingo Molnar, Juri Lelli, Peter Zijlstra,
	Steven Rostedt, Vincent Guittot, linux-kernel

On Fri, Feb 25, 2022 at 5:36 AM Abel Wu <wuyun.abel@bytedance.com> wrote:
[snip]
> > Also place the filter first and do any measurements of any change to
> > balancing versus the filter. I'm suggesting placing the filter first
> > because it's less controversial than a new balancer. Just be aware that
> > the filter alone is not a guarantee of merging as there have been a few
> > approaches to filtering and so far all of them had downsides on either cost
>
> Yes, understood. I will adjust the patches as you suggested and send v2
> together with more tests next week.

+1 to trying the filter rather than introducing a new balance path.

We've found the sched_idle_cpu() checks in the wakeup path to be
adequate in allowing non-idle tasks to fully consume cpu resources
(but that of course relies on wakeup balancing, and not periodic
balancing).
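
Those wakeup-path checks are roughly the following (paraphrased and
simplified from select_idle_sibling() and the SIS scan):

	/* select_idle_sibling(): a sched-idle target is as good as an idle one */
	if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
	    asym_fits_capacity(task_util, target))
		return target;

	/* __select_idle_cpu(): same treatment while scanning the LLC */
	if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
		return cpu;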

Please cc me on the next series.

Thanks,
Josh

