* [PATCH v3 0/7] sched: Implement shared runqueue in CFS
@ 2023-08-09 22:12 David Vernet
  2023-08-09 22:12 ` [PATCH v3 1/7] sched: Expose move_queued_task() from core.c David Vernet
                   ` (9 more replies)
  0 siblings, 10 replies; 52+ messages in thread
From: David Vernet @ 2023-08-09 22:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot, vschneid, tj, roman.gushchin,
	gautham.shenoy, kprateek.nayak, aaron.lu, wuyun.abel,
	kernel-team

Changes
-------

This is v3 of the shared runqueue patchset. This patch set is based off
of commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs
bandwidth in use") on the sched/core branch of tip.git.

v1 (RFC): https://lore.kernel.org/lkml/20230613052004.2836135-1-void@manifault.com/
v2: https://lore.kernel.org/lkml/20230710200342.358255-1-void@manifault.com/

v2 -> v3 changes:
- Don't leave stale tasks in the lists when the SHARED_RUNQ feature is
  disabled (Abel Wu)

- Use raw spin lock instead of spinlock_t (Peter)

- Fix return value from shared_runq_pick_next_task() to match the
  semantics expected by newidle_balance() (Gautham, Abel)

- Fold patch __enqueue_entity() / __dequeue_entity() into previous patch
  (Peter)

- Skip <= LLC domains in newidle_balance() if SHARED_RUNQ is enabled
  (Peter)

- Properly support hotplug and recreating sched domains (Peter)

- Avoid unnecessary task_rq_unlock() + raw_spin_rq_lock() when src_rq ==
  target_rq in shared_runq_pick_next_task() (Abel)

- Only issue list_del_init() in shared_runq_dequeue_task() if the task
  is still in the list after acquiring the lock (Aaron Lu)

- Slightly change shared_runq_shard_idx() to make it more likely to keep
  SMT siblings on the same bucket (Peter)

v1 -> v2 changes:
- Change name from swqueue to shared_runq (Peter)

- Shard per-LLC shared runqueues to avoid contention on scheduler-heavy
  workloads (Peter)

- Pull tasks from the shared_runq in newidle_balance() rather than in
  pick_next_task_fair() (Peter and Vincent)

- Rename a few functions to reflect their actual purpose. For example,
  shared_runq_dequeue_task() instead of swqueue_remove_task() (Peter)

- Expose move_queued_task() from core.c rather than migrate_task_to()
  (Peter)

- Properly check is_cpu_allowed() when pulling a task from a shared_runq
  to ensure it can actually be migrated (Peter and Gautham)

- Dropped RFC tag

Overview
========

The scheduler must constantly strike a balance between work
conservation, and avoiding costly migrations which harm performance due
to e.g. decreased cache locality. The matter is further complicated by
the topology of the system. Migrating a task between cores on the same
LLC may be more optimal than keeping a task local to the CPU, whereas
migrating a task between LLCs or NUMA nodes may tip the balance in the
other direction.

With that in mind, while CFS is by and large a work conserving scheduler,
there are certain instances where the scheduler will choose to keep a task
local to a CPU, when it would have been more optimal to migrate it to an
idle core.

An example of such a workload is the HHVM / web workload at Meta. HHVM
is a VM that JITs Hack and PHP code in service of web requests. Like
other JIT / compilation workloads, it tends to be heavily CPU bound, and
exhibit generally poor cache locality. To try and address this, we set
several debugfs (/sys/kernel/debug/sched) knobs on our HHVM workloads:

- migration_cost_ns -> 0
- latency_ns -> 20000000
- min_granularity_ns -> 10000000
- wakeup_granularity_ns -> 12000000

These knobs are intended both to encourage the scheduler to be as work
conserving as possible (migration_cost_ns -> 0), and also to keep tasks
running for relatively long time slices so as to avoid the overhead of
context switching (the other knobs). Collectively, these knobs provide a
substantial performance win; resulting in roughly a 20% improvement in
throughput. Worth noting, however, is that this improvement is _not_ at
full machine saturation.

That said, even with these knobs, we noticed that CPUs were still going
idle even when the host was overcommitted. In response, we wrote the
"shared runqueue" (SHARED_RUNQ) feature proposed in this patch set. The
idea behind SHARED_RUNQ is simple: it enables the scheduler to be more
aggressively work conserving by placing a waking task into a sharded
per-LLC FIFO queue that another core in the LLC can pull from before it
goes idle.

With this simple change, we were able to achieve a 1 - 1.6% improvement
in throughput, as well as a small, consistent improvement in p95 and p99
latencies, in HHVM. These performance improvements were in addition to
the wins from the debugfs knobs mentioned above, and to other benchmarks
outlined below in the Results section.

Design
======

Note that the design described here reflects sharding, which is the
implementation added in the final patch of the series (following the
initial unsharded implementation added in patch 6/7). The design is
described that way in this cover letter because the benchmarks in the
Results section below all reflect a sharded SHARED_RUNQ.

The design of SHARED_RUNQ is quite simple. A shared_runq is simply a
list of struct shared_runq_shard objects, which itself is simply a
struct list_head of tasks, and a spinlock:

struct shared_runq_shard {
	struct list_head list;
	raw_spinlock_t lock;
} ____cacheline_aligned;

struct shared_runq {
	u32 num_shards;
	struct shared_runq_shard shards[];
} ____cacheline_aligned;

We create a struct shared_runq per LLC, ensuring they're in their own
cachelines to avoid false sharing between CPUs on different LLCs, and we
create a number of struct shared_runq_shard objects that are housed
there.

When a task first wakes up, it enqueues itself in the shared_runq_shard
of its current LLC at the end of enqueue_task_fair(). Enqueues only
happen if the task was not manually migrated to the current core by
select_task_rq(), and is not pinned to a specific CPU.
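
As a rough illustration, the enqueue side described above could look
something like the sketch below. The rq_shared_runq() helper, the
shared_runq_node list field in task_struct, and the exact locking context
are simplifying assumptions for illustration only, not the precise code in
patches 6/7 and 7/7:

/*
 * Hedged sketch of the wakeup-side enqueue: pick the shard for the waking
 * CPU and append the task to that shard's FIFO list under its raw spinlock.
 */
static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p)
{
	/* Per-LLC shared_runq instance; rq_shared_runq() is an assumed helper. */
	struct shared_runq *runq = rq_shared_runq(rq);
	struct shared_runq_shard *shard;
	unsigned long flags;

	shard = &runq->shards[shared_runq_shard_idx(runq, cpu_of(rq))];

	raw_spin_lock_irqsave(&shard->lock, flags);
	/* shared_runq_node is an assumed list_head embedded in task_struct. */
	list_add_tail(&p->shared_runq_node, &shard->list);
	raw_spin_unlock_irqrestore(&shard->lock, flags);
}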

A core will pull a task from the shards in its LLC's shared_runq at the
beginning of newidle_balance().
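
The pull side, which is the O(# shards) walk referenced in the Conclusion
below, could be sketched as follows. Again, the helper names and the
assumption that newidle_balance() runs with interrupts disabled are
illustrative rather than the exact implementation:

/*
 * Hedged sketch of the pull side: start at the caller's own shard and walk
 * the LLC's other shards until a queued task is found. The caller is then
 * responsible for migrating the returned task to the newly idle CPU.
 */
static struct task_struct *shared_runq_pull_task(struct rq *rq)
{
	struct shared_runq *runq = rq_shared_runq(rq);
	u32 i, start = shared_runq_shard_idx(runq, cpu_of(rq));
	struct task_struct *p = NULL;

	for (i = 0; i < runq->num_shards; i++) {
		struct shared_runq_shard *shard;

		shard = &runq->shards[(start + i) % runq->num_shards];
		raw_spin_lock(&shard->lock);
		p = list_first_entry_or_null(&shard->list, struct task_struct,
					     shared_runq_node);
		if (p)
			list_del_init(&p->shared_runq_node);
		raw_spin_unlock(&shard->lock);
		if (p)
			break;
	}
	return p;
}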

Difference between SHARED_RUNQ and SIS_NODE
===========================================

In [0] Peter proposed a patch that addresses Tejun's observations that
when workqueues are targeted towards a specific LLC on his Zen2 machine
with small CCXs, that there would be significant idle time due to
select_idle_sibling() not considering anything outside of the current
LLC.

This patch (SIS_NODE) is essentially the complement to the proposal
here. SIS_NODE causes waking tasks to look for idle cores in neighboring
LLCs on the same die, whereas SHARED_RUNQ causes cores about to go idle
to look for enqueued tasks. That said, in its current form, the two
features operate at different scopes: SIS_NODE searches for idle cores
between LLCs, while SHARED_RUNQ enqueues tasks within a single LLC.

The patch has since been removed in [1], and we compared the results to
SHARED_RUNQ (previously called "swqueue") in [2]. SIS_NODE did not
outperform SHARED_RUNQ on any of the benchmarks, so we elect not to
compare against it again for this patch set.

[0]: https://lore.kernel.org/all/20230530113249.GA156198@hirez.programming.kicks-ass.net/
[1]: https://lore.kernel.org/all/20230605175636.GA4253@hirez.programming.kicks-ass.net/
[2]: https://lore.kernel.org/lkml/20230613052004.2836135-1-void@manifault.com/

Worth noting as well is that it was pointed out in [3] that the logic behind
including SIS_NODE in the first place should apply to SHARED_RUNQ
(meaning that e.g. very small Zen2 CPUs with only 3/4 cores per LLC
should benefit from having a single shared_runq stretch across multiple
LLCs). I drafted a patch that implements this by having a minimum LLC
size for creating a shard, and stretches a shared_runq across multiple
LLCs if they're smaller than that size, and sent it to Tejun to test on
his Zen2. Tejun reported back that SIS_NODE did not seem to make a
difference:

[3]: https://lore.kernel.org/lkml/20230711114207.GK3062772@hirez.programming.kicks-ass.net/

			    o____________o__________o
			    |    mean    | Variance |
			    o------------o----------o
Vanilla:		    | 108.84s    | 0.0057   |
NO_SHARED_RUNQ:		    | 108.82s    | 0.119s   |
SHARED_RUNQ:		    | 108.17s    | 0.038s   |
SHARED_RUNQ w/ SIS_NODE:    | 108.87s    | 0.111s   |
			    o------------o----------o

I similarly tried running kcompile on SHARED_RUNQ with SIS_NODE on my
7950X Zen4, but didn't see any gain relative to plain SHARED_RUNQ (though
a gain was observed relative to NO_SHARED_RUNQ, as described below).

Results
=======

Note that the motivation for the shared runqueue feature was originally
arrived at using experiments in the sched_ext framework that's currently
being proposed upstream. The ~1 - 1.6% improvement in HHVM throughput
is similarly visible using work-conserving sched_ext schedulers (even
very simple ones like global FIFO).

In both single and multi socket / CCX hosts, this can measurably improve
performance. In addition to the performance gains observed on our
internal web workloads, we also observed an improvement in common
workloads such as kernel compile and hackbench, when running shared
runqueue.

On the other hand, some workloads suffer from SHARED_RUNQ. Workloads
that hammer the runqueue hard, such as netperf UDP_RR or schbench -L
-m 52 -p 512 -r 10 -t 1, run into contention on the shard locks. This
can be mitigated somewhat by sharding the shared data structures within
a CCX, but it doesn't seem to eliminate all contention in every
scenario. On the positive side, sharding does not appear to materially
harm the benchmarks run for this patch series; in fact it seems to
improve some workloads such as kernel compile.

Note that for the kernel compile workloads below, the compilation was
done by running make -j$(nproc) built-in.a on several different types of
hosts configured with make allyesconfig on commit a27648c74210 ("afs:
Fix setting of mtime when creating a file/dir/symlink") on Linus' tree
(boost and turbo were disabled on all of these hosts when the
experiments were performed).

Finally, note that these results were from the patch set built off of
commit ebb83d84e49b ("sched/core: Avoid multiple calling
update_rq_clock() in __cfsb_csd_unthrottle()") on the sched/core branch
of tip.git for easy comparison with the v2 patch set results. The
patches in their final form from this set were rebased onto commit
88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs bandwidth in
use") on the sched/core branch of tip.git.

=== Single-socket | 16 core / 32 thread | 2-CCX | AMD 7950X Zen4 ===

CPU max MHz: 5879.8818
CPU min MHz: 3000.0000

Command: make -j$(nproc) built-in.a
			    o____________o__________o
			    |    mean    | Variance |
			    o------------o----------o
NO_SHARED_RUNQ:		    | 581.95s    | 2.639s   |
SHARED_RUNQ:		    | 577.02s    | 0.084s   |
			    o------------o----------o

Takeaway: SHARED_RUNQ results in a statistically significant ~.85%
improvement over NO_SHARED_RUNQ. This suggests that enqueuing tasks in
the shared runqueue on every enqueue improves work conservation, and
thanks to sharding, does not result in contention.

Command: hackbench --loops 10000
                            o____________o__________o
                            |    mean    | Variance |
                            o------------o----------o
NO_SHARED_RUNQ:             | 2.2492s    | .00001s  |
SHARED_RUNQ:		    | 2.0217s    | .00065s  |
                            o------------o----------o

Takeaway: SHARED_RUNQ performs exceptionally well compared to
NO_SHARED_RUNQ here, beating it by over 10%. This was a surprising
result, given that it seems advantageous to err on the side of avoiding
migration in hackbench, as tasks are short lived and send only 10k bytes
worth of messages; but the results of the benchmark suggest that
minimizing runqueue delays is preferable.

Command:
for i in `seq 128`; do
    netperf -6 -t UDP_RR -c -C -l $runtime &
done
                            o_______________________o
                            | Throughput | Variance |
                            o-----------------------o
NO_SHARED_RUNQ:             | 25037.45   | 2243.44  |
SHARED_RUNQ:                | 24952.50   | 1268.06  |
                            o-----------------------o

Takeaway: No statistical significance, though it is worth noting that
there is no regression for shared runqueue on the 7950X, while there is
a small regression on the Skylake and Milan hosts for SHARED_RUNQ as
described below.

=== Single-socket | 18 core / 36 thread | 1-CCX | Intel Skylake ===

CPU max MHz: 1601.0000
CPU min MHz: 800.0000

Command: make -j$(nproc) built-in.a
			    o____________o__________o
			    |    mean    | Variance |
			    o------------o----------o
NO_SHARED_RUNQ:		    | 1517.44s   | 2.8322s  |
SHARED_RUNQ:		    | 1516.51s   | 2.9450s  |
			    o------------o----------o

Takeaway: There's no statistically significant gain here. I observed
what I claimed was a .23% win in v2, but it appears that this is not
actually statistically significant.

Command: hackbench --loops 10000
                            o____________o__________o
                            |    mean    | Variance |
                            o------------o----------o
NO_SHARED_RUNQ:             | 5.3370s    | .0012s   |
SHARED_RUNQ:		    | 5.2668s    | .0033s   |
                            o------------o----------o

Takeaway: SHARED_RUNQ results in a ~1.3% improvement over
NO_SHARED_RUNQ. Also statistically significant, but smaller than the
10+% improvement observed on the 7950X.

Command: netperf -n $(nproc) -l 60 -t TCP_RR
for i in `seq 128`; do
        netperf -6 -t UDP_RR -c -C -l $runtime &
done
                            o_______________________o
                            | Throughput | Variance |
                            o-----------------------o
NO_SHARED_RUNQ:             | 15699.32   | 377.01   |
SHARED_RUNQ:                | 14966.42   | 714.13   |
                            o-----------------------o

Takeaway: NO_SHARED_RUNQ beats SHARED_RUNQ by ~4.6%. This result makes
sense -- the workload is very heavy on the runqueue, so enqueuing tasks
in the shared runqueue in __enqueue_entity() would intuitively result in
increased contention on the shard lock.

=== Single-socket | 72-core | 6-CCX | AMD Milan Zen3 ===

CPU max MHz: 700.0000
CPU min MHz: 700.0000

Command: make -j$(nproc) built-in.a
			    o____________o__________o
			    |    mean    | Variance |
			    o------------o----------o
NO_SHARED_RUNQ:		    | 1568.55s   | 0.1568s  |
SHARED_RUNQ:		    | 1568.26s   | 1.2168s  |
			    o------------o----------o

Takeaway: No statistically significant difference here. It might be
worth experimenting with work stealing in a follow-on patch set.

Command: hackbench --loops 10000
                            o____________o__________o
                            |    mean    | Variance |
                            o------------o----------o
NO_SHARED_RUNQ:             | 5.2716s    | .00143s  |
SHARED_RUNQ:		    | 5.1716s    | .00289s  |
                            o------------o----------o

Takeaway: SHARED_RUNQ again wins, by about 2%.

Command: netperf -n $(nproc) -l 60 -t TCP_RR
for i in `seq 128`; do
        netperf -6 -t UDP_RR -c -C -l $runtime &
done
                            o_______________________o
                            | Throughput | Variance |
                            o-----------------------o
NO_SHARED_RUNQ:             | 17482.03   | 4675.99  |
SHARED_RUNQ:                | 16697.25   | 9812.23  |
                            o-----------------------o

Takeaway: Similar to the Skylake runs, NO_SHARED_RUNQ still beats
SHARED_RUNQ, this time by ~4.5%. It's worth noting that in v2, the
NO_SHARED_RUNQ was only ~1.8% faster. The variance is very high here, so
the results of this benchmark should be taken with a large grain of
salt (noting that we do consistently see NO_SHARED_RUNQ on top due to
not contending on the shard lock).

Finally, let's look at how sharding affects the following schbench
incantation suggested by Chris in [4]:

schbench -L -m 52 -p 512 -r 10 -t 1

[4]: https://lore.kernel.org/lkml/c8419d9b-2b31-2190-3058-3625bdbcb13d@meta.com/

The TL;DR is that sharding improves things a lot, but doesn't completely
fix the problem. Here are the results from running the schbench command
on the 18 core / 36 thread single CCX, single-socket Skylake:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name         con-bounces    contentions       waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&shard->lock:      31510503       31510711           0.08          19.98        168932319.64     5.36            31700383      31843851       0.03           17.50        10273968.33      0.32
------------
&shard->lock       15731657          [<0000000068c0fd75>] pick_next_task_fair+0x4dd/0x510
&shard->lock       15756516          [<000000001faf84f9>] enqueue_task_fair+0x459/0x530
&shard->lock          21766          [<00000000126ec6ab>] newidle_balance+0x45a/0x650
&shard->lock            772          [<000000002886c365>] dequeue_task_fair+0x4c9/0x540
------------
&shard->lock          23458          [<00000000126ec6ab>] newidle_balance+0x45a/0x650
&shard->lock       16505108          [<000000001faf84f9>] enqueue_task_fair+0x459/0x530
&shard->lock       14981310          [<0000000068c0fd75>] pick_next_task_fair+0x4dd/0x510
&shard->lock            835          [<000000002886c365>] dequeue_task_fair+0x4c9/0x540

These results are when we create only 3 shards (16 logical cores per
shard), so the contention may be a result of overly-coarse sharding. If
we run the schbench incantation with no sharding whatsoever, we see the
following significantly worse lock stats contention:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name        con-bounces    contentions         waittime-min   waittime-max waittime-total         waittime-avg    acq-bounces   acquisitions   holdtime-min  holdtime-max holdtime-total   holdtime-avg
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&shard->lock:     117868635      118361486           0.09           393.01       1250954097.25          10.57           119345882     119780601      0.05          343.35       38313419.51      0.32
------------
&shard->lock       59169196          [<0000000060507011>] __enqueue_entity+0xdc/0x110
&shard->lock       59084239          [<00000000f1c67316>] __dequeue_entity+0x78/0xa0
&shard->lock         108051          [<00000000084a6193>] newidle_balance+0x45a/0x650
------------
&shard->lock       60028355          [<0000000060507011>] __enqueue_entity+0xdc/0x110
&shard->lock         119882          [<00000000084a6193>] newidle_balance+0x45a/0x650
&shard->lock       58213249          [<00000000f1c67316>] __dequeue_entity+0x78/0xa0

The contention is ~3-4x worse if we don't shard at all. This roughly
matches the fact that we had 3 shards on the first workload run above.
If we make the shards even smaller, the contention is comparatively much
lower:

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name         con-bounces    contentions   waittime-min  waittime-max waittime-total   waittime-avg   acq-bounces   acquisitions   holdtime-min  holdtime-max holdtime-total   holdtime-avg
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&shard->lock:      13839849       13877596      0.08          13.23        5389564.95       0.39           46910241      48069307       0.06          16.40        16534469.35      0.34
------------
&shard->lock           3559          [<00000000ea455dcc>] newidle_balance+0x45a/0x650
&shard->lock        6992418          [<000000002266f400>] __dequeue_entity+0x78/0xa0
&shard->lock        6881619          [<000000002a62f2e0>] __enqueue_entity+0xdc/0x110
------------
&shard->lock        6640140          [<000000002266f400>] __dequeue_entity+0x78/0xa0
&shard->lock           3523          [<00000000ea455dcc>] newidle_balance+0x45a/0x650
&shard->lock        7233933          [<000000002a62f2e0>] __enqueue_entity+0xdc/0x110

Interestingly, SHARED_RUNQ performs worse than NO_SHARED_RUNQ on the schbench
benchmark on Milan as well, but we contend more on the rq lock than the
shard lock:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name         con-bounces    contentions   waittime-min  waittime-max waittime-total   waittime-avg   acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&rq->__lock:       9617614        9656091       0.10          79.64        69665812.00      7.21           18092700      67652829       0.11           82.38        344524858.87     5.09
-----------
&rq->__lock        6301611          [<000000003e63bf26>] task_rq_lock+0x43/0xe0
&rq->__lock        2530807          [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock         109360          [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
&rq->__lock         178218          [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170
-----------
&rq->__lock        3245506          [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock        1294355          [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170
&rq->__lock        2837804          [<000000003e63bf26>] task_rq_lock+0x43/0xe0
&rq->__lock        1627866          [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10

..................................................................................................................................................................................................

&shard->lock:       7338558       7343244       0.10          35.97        7173949.14       0.98           30200858      32679623       0.08           35.59        16270584.52      0.50
------------
&shard->lock        2004142          [<00000000f8aa2c91>] __dequeue_entity+0x78/0xa0
&shard->lock        2611264          [<00000000473978cc>] newidle_balance+0x45a/0x650
&shard->lock        2727838          [<0000000028f55bb5>] __enqueue_entity+0xdc/0x110
------------
&shard->lock        2737232          [<00000000473978cc>] newidle_balance+0x45a/0x650
&shard->lock        1693341          [<00000000f8aa2c91>] __dequeue_entity+0x78/0xa0
&shard->lock        2912671          [<0000000028f55bb5>] __enqueue_entity+0xdc/0x110

...................................................................................................................................................................................................

If we look at the lock stats with SHARED_RUNQ disabled, the rq lock still
contends the most, but it's significantly less than with it enabled:

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name          con-bounces    contentions   waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&rq->__lock:        791277         791690        0.12           110.54       4889787.63       6.18            1575996       62390275       0.13           112.66       316262440.56     5.07
-----------
&rq->__lock         263343          [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock          19394          [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
&rq->__lock           4143          [<000000003b542e83>] __task_rq_lock+0x51/0xf0
&rq->__lock          51094          [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170
-----------
&rq->__lock          23756          [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
&rq->__lock         379048          [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock            677          [<000000003b542e83>] __task_rq_lock+0x51/0xf0

Worth noting is that increasing the granularity of the shards (i.e. using
more, smaller shards) generally improves very runqueue-heavy workloads
such as netperf UDP_RR and this schbench command, but it doesn't
necessarily make a big difference for every workload, or for sufficiently
small CCXs such as the 7950X. It may make sense to eventually allow users
to control this with a debugfs knob, but for now we'll elect to choose a
default that resulted in good performance for the benchmarks run for this
patch series.
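
For reference, the shard mapping itself can be pictured as something like
the sketch below; the exact calculation (and the changelog note above about
keeping SMT siblings in the same bucket) belongs to patch 7/7, so treat
this purely as an illustration of what such a granularity knob would
control:

/*
 * Illustrative sketch: map a CPU to a shard. On topologies where SMT
 * siblings are numbered adjacently, dropping the low bit keeps sibling
 * pairs in the same bucket. num_shards controls the granularity discussed
 * above: more shards means less lock contention, but a longer walk on the
 * pull side.
 */
static unsigned int shared_runq_shard_idx(const struct shared_runq *runq, int cpu)
{
	return (cpu >> 1) % runq->num_shards;
}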

Conclusion
==========

SHARED_RUNQ in this form provides statistically significant wins for
several types of workloads, and various CPU topologies. The reason for
this is roughly the same for all workloads: SHARED_RUNQ encourages work
conservation inside of a CCX by having a CPU do an O(# per-LLC shards)
iteration over the shared_runq shards in an LLC. We could similarly do
an O(n) iteration over all of the runqueues in the current LLC when a
core is going idle, but that's quite costly (especially for larger
LLCs), and sharded SHARED_RUNQ seems to provide a performant middle
ground between doing an O(n) walk, and doing an O(1) pull from a single
per-LLC shared runq.

For the workloads above, kernel compile and hackbench were clear winners
for SHARED_RUNQ (especially in __enqueue_entity()). The reason for the
improvement in kernel compile is of course that we have a heavily
CPU-bound workload where cache locality doesn't mean much; getting a CPU
is the #1 goal. As mentioned above, while I didn't expect to see an
improvement in hackbench, the results of the benchmark suggest that
minimizing runqueue delays is preferable to optimizing for L1/L2
locality.

Not all workloads benefit from SHARED_RUNQ, however. Workloads that
hammer the runqueue hard, such as netperf UDP_RR, or schbench -L -m 52
-p 512 -r 10 -t 1, tend to run into contention on the shard locks;
especially when enqueuing tasks in __enqueue_entity(). This can be
mitigated significantly by sharding the shared datastructures within a
CCX, but it doesn't eliminate all contention, as described above.

Worth noting as well is that Gautham Shenoy ran some interesting
experiments on a few more ideas in [5], such as walking the shared_runq
on the pop path until a task is found that can be migrated to the
calling CPU. I didn't run those experiments in this patch set, but it
might be worth doing so.

[5]: https://lore.kernel.org/lkml/ZJkqeXkPJMTl49GB@BLR-5CG11610CF.amd.com/

Gautham also ran some other benchmarks in [6], which we may want to
again try on this v3, but with boost disabled.

[6]: https://lore.kernel.org/lkml/ZLpMGVPDXqWEu+gm@BLR-5CG11610CF.amd.com/

Finally, while SHARED_RUNQ in this form encourages work conservation, it
of course does not guarantee it given that we don't implement any kind
of work stealing between shared_runq's. In the future, we could
potentially push CPU utilization even higher by enabling work stealing
between shared_runq's, likely between CCXs on the same NUMA node.

Originally-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: David Vernet <void@manifault.com>

David Vernet (7):
  sched: Expose move_queued_task() from core.c
  sched: Move is_cpu_allowed() into sched.h
  sched: Check cpu_active() earlier in newidle_balance()
  sched: Enable sched_feat callbacks on enable/disable
  sched/fair: Add SHARED_RUNQ sched feature and skeleton calls
  sched: Implement shared runqueue in CFS
  sched: Shard per-LLC shared runqueues

 include/linux/sched.h   |   2 +
 kernel/sched/core.c     |  52 ++----
 kernel/sched/debug.c    |  18 ++-
 kernel/sched/fair.c     | 340 +++++++++++++++++++++++++++++++++++++++-
 kernel/sched/features.h |   1 +
 kernel/sched/sched.h    |  56 ++++++-
 kernel/sched/topology.c |   4 +-
 7 files changed, 420 insertions(+), 53 deletions(-)

-- 
2.41.0


* [PATCH v3 1/7] sched: Expose move_queued_task() from core.c
  2023-08-09 22:12 [PATCH v3 0/7] sched: Implement shared runqueue in CFS David Vernet
@ 2023-08-09 22:12 ` David Vernet
  2023-08-09 22:12 ` [PATCH v3 2/7] sched: Move is_cpu_allowed() into sched.h David Vernet
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 52+ messages in thread
From: David Vernet @ 2023-08-09 22:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot, vschneid, tj, roman.gushchin,
	gautham.shenoy, kprateek.nayak, aaron.lu, wuyun.abel,
	kernel-team

The migrate_task_to() function exposed from kernel/sched/core.c migrates
the current task, which is silently assumed to also be its first
argument, to the specified CPU. The function uses stop_one_cpu() to
migrate the task to the target CPU, which won't work if @p is not the
current task as the stop_one_cpu() callback isn't invoked on remote
CPUs.

While this behavior is sufficient for task_numa_migrate() in fair.c, it
would be useful if move_queued_task() in core.c were given external
linkage, as it can be used to migrate any queued task to a CPU.

A follow-on patch will call move_queued_task() from fair.c when
migrating a task in a shared runqueue to a remote CPU.
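
For context, that follow-on call site could look roughly like the sketch
below. The helper name is hypothetical, locking is simplified, and
is_cpu_allowed() only becomes usable from fair.c via the next patch in
this series:

/*
 * Illustrative sketch only: migrate a task popped from a shared runqueue
 * over to the newly idle CPU. move_queued_task() releases the old rq lock
 * and returns the new rq locked (per its comment in the diff below), so we
 * unlock whichever rq we end up holding.
 */
static void shared_runq_migrate_task(struct task_struct *p, struct rq *dst_rq)
{
	struct rq_flags rf;
	struct rq *src_rq;

	src_rq = task_rq_lock(p, &rf);
	if (task_on_rq_queued(p) && !task_on_cpu(src_rq, p) &&
	    is_cpu_allowed(p, cpu_of(dst_rq)) && src_rq != dst_rq)
		src_rq = move_queued_task(src_rq, &rf, p, cpu_of(dst_rq));
	task_rq_unlock(src_rq, p, &rf);
}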

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: David Vernet <void@manifault.com>
---
 kernel/sched/core.c  | 4 ++--
 kernel/sched/sched.h | 3 +++
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 614271a75525..394e216b9d37 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2519,8 +2519,8 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
  *
  * Returns (locked) new rq. Old rq's lock is released.
  */
-static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
-				   struct task_struct *p, int new_cpu)
+struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
+			    struct task_struct *p, int new_cpu)
 {
 	lockdep_assert_rq_held(rq);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 19af1766df2d..69b100267fd0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1763,6 +1763,9 @@ init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
 
 #ifdef CONFIG_SMP
 
+
+extern struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
+				   struct task_struct *p, int new_cpu);
 static inline void
 queue_balance_callback(struct rq *rq,
 		       struct balance_callback *head,
-- 
2.41.0



* [PATCH v3 2/7] sched: Move is_cpu_allowed() into sched.h
  2023-08-09 22:12 [PATCH v3 0/7] sched: Implement shared runqueue in CFS David Vernet
  2023-08-09 22:12 ` [PATCH v3 1/7] sched: Expose move_queued_task() from core.c David Vernet
@ 2023-08-09 22:12 ` David Vernet
  2023-08-09 22:12 ` [PATCH v3 3/7] sched: Check cpu_active() earlier in newidle_balance() David Vernet
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 52+ messages in thread
From: David Vernet @ 2023-08-09 22:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot, vschneid, tj, roman.gushchin,
	gautham.shenoy, kprateek.nayak, aaron.lu, wuyun.abel,
	kernel-team

is_cpu_allowed() exists as a static inline function in core.c. The
functionality offered by is_cpu_allowed() is useful to scheduling
policies as well, e.g. to determine whether a runnable task can be
migrated to another core that would otherwise go idle.

Let's move it to sched.h.

Signed-off-by: David Vernet <void@manifault.com>
---
 kernel/sched/core.c  | 31 -------------------------------
 kernel/sched/sched.h | 31 +++++++++++++++++++++++++++++++
 2 files changed, 31 insertions(+), 31 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 394e216b9d37..dd6412a49263 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -48,7 +48,6 @@
 #include <linux/kcov.h>
 #include <linux/kprobes.h>
 #include <linux/llist_api.h>
-#include <linux/mmu_context.h>
 #include <linux/mmzone.h>
 #include <linux/mutex_api.h>
 #include <linux/nmi.h>
@@ -2470,36 +2469,6 @@ static inline bool rq_has_pinned_tasks(struct rq *rq)
 	return rq->nr_pinned;
 }
 
-/*
- * Per-CPU kthreads are allowed to run on !active && online CPUs, see
- * __set_cpus_allowed_ptr() and select_fallback_rq().
- */
-static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
-{
-	/* When not in the task's cpumask, no point in looking further. */
-	if (!cpumask_test_cpu(cpu, p->cpus_ptr))
-		return false;
-
-	/* migrate_disabled() must be allowed to finish. */
-	if (is_migration_disabled(p))
-		return cpu_online(cpu);
-
-	/* Non kernel threads are not allowed during either online or offline. */
-	if (!(p->flags & PF_KTHREAD))
-		return cpu_active(cpu) && task_cpu_possible(cpu, p);
-
-	/* KTHREAD_IS_PER_CPU is always allowed. */
-	if (kthread_is_per_cpu(p))
-		return cpu_online(cpu);
-
-	/* Regular kernel threads don't get to stay during offline. */
-	if (cpu_dying(cpu))
-		return false;
-
-	/* But are allowed during online. */
-	return cpu_online(cpu);
-}
-
 /*
  * This is how migration works:
  *
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 69b100267fd0..88cca7cc00cf 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -44,6 +44,7 @@
 #include <linux/lockdep.h>
 #include <linux/minmax.h>
 #include <linux/mm.h>
+#include <linux/mmu_context.h>
 #include <linux/module.h>
 #include <linux/mutex_api.h>
 #include <linux/plist.h>
@@ -1203,6 +1204,36 @@ static inline bool is_migration_disabled(struct task_struct *p)
 #endif
 }
 
+/*
+ * Per-CPU kthreads are allowed to run on !active && online CPUs, see
+ * __set_cpus_allowed_ptr() and select_fallback_rq().
+ */
+static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
+{
+	/* When not in the task's cpumask, no point in looking further. */
+	if (!cpumask_test_cpu(cpu, p->cpus_ptr))
+		return false;
+
+	/* migrate_disabled() must be allowed to finish. */
+	if (is_migration_disabled(p))
+		return cpu_online(cpu);
+
+	/* Non kernel threads are not allowed during either online or offline. */
+	if (!(p->flags & PF_KTHREAD))
+		return cpu_active(cpu) && task_cpu_possible(cpu, p);
+
+	/* KTHREAD_IS_PER_CPU is always allowed. */
+	if (kthread_is_per_cpu(p))
+		return cpu_online(cpu);
+
+	/* Regular kernel threads don't get to stay during offline. */
+	if (cpu_dying(cpu))
+		return false;
+
+	/* But are allowed during online. */
+	return cpu_online(cpu);
+}
+
 DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 
 #define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))
-- 
2.41.0



* [PATCH v3 3/7] sched: Check cpu_active() earlier in newidle_balance()
  2023-08-09 22:12 [PATCH v3 0/7] sched: Implement shared runqueue in CFS David Vernet
  2023-08-09 22:12 ` [PATCH v3 1/7] sched: Expose move_queued_task() from core.c David Vernet
  2023-08-09 22:12 ` [PATCH v3 2/7] sched: Move is_cpu_allowed() into sched.h David Vernet
@ 2023-08-09 22:12 ` David Vernet
  2023-08-09 22:12 ` [PATCH v3 4/7] sched: Enable sched_feat callbacks on enable/disable David Vernet
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 52+ messages in thread
From: David Vernet @ 2023-08-09 22:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot, vschneid, tj, roman.gushchin,
	gautham.shenoy, kprateek.nayak, aaron.lu, wuyun.abel,
	kernel-team

In newidle_balance(), we check if the current CPU is inactive, and then
decline to pull any remote tasks to the core if so. Before this check,
however, we're currently updating rq->idle_stamp. If a core is offline,
setting its idle stamp is not useful. The core won't be chosen by any
task in select_task_rq_fair(), and setting rq->idle_stamp is misleading
anyway, given that an inactive core should have a very cold cache.

Let's set rq->idle_stamp in newidle_balance() only if the CPU is active.

Signed-off-by: David Vernet <void@manifault.com>
---
 kernel/sched/fair.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c28206499a3d..eb15d6f46479 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12037,18 +12037,18 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	if (this_rq->ttwu_pending)
 		return 0;
 
-	/*
-	 * We must set idle_stamp _before_ calling idle_balance(), such that we
-	 * measure the duration of idle_balance() as idle time.
-	 */
-	this_rq->idle_stamp = rq_clock(this_rq);
-
 	/*
 	 * Do not pull tasks towards !active CPUs...
 	 */
 	if (!cpu_active(this_cpu))
 		return 0;
 
+	/*
+	 * We must set idle_stamp _before_ calling idle_balance(), such that we
+	 * measure the duration of idle_balance() as idle time.
+	 */
+	this_rq->idle_stamp = rq_clock(this_rq);
+
 	/*
 	 * This is OK, because current is on_cpu, which avoids it being picked
 	 * for load-balance and preemption/IRQs are still disabled avoiding
-- 
2.41.0



* [PATCH v3 4/7] sched: Enable sched_feat callbacks on enable/disable
  2023-08-09 22:12 [PATCH v3 0/7] sched: Implement shared runqueue in CFS David Vernet
                   ` (2 preceding siblings ...)
  2023-08-09 22:12 ` [PATCH v3 3/7] sched: Check cpu_active() earlier in newidle_balance() David Vernet
@ 2023-08-09 22:12 ` David Vernet
  2023-08-09 22:12 ` [PATCH v3 5/7] sched/fair: Add SHARED_RUNQ sched feature and skeleton calls David Vernet
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 52+ messages in thread
From: David Vernet @ 2023-08-09 22:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot, vschneid, tj, roman.gushchin,
	gautham.shenoy, kprateek.nayak, aaron.lu, wuyun.abel,
	kernel-team

When a scheduler feature is enabled or disabled, the sched_feat_enable()
and sched_feat_disable() functions are invoked respectively for that
feature. For features that don't require resetting any state, this works
fine. However, there will be an upcoming feature called shared_runq
which needs to drain all tasks from a set of global shared runqueues in
order to prevent stale tasks from lingering in the queues after the
feature has been disabled.

This patch therefore defines a new SCHED_FEAT_CALLBACK macro which
allows scheduler features to specify a callback that should be invoked
when a feature is enabled or disabled respectively. The SCHED_FEAT macro
assumes a NULL callback.
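
For example, with this macro in place, a features.h entry can either keep
the existing SCHED_FEAT() form (which expands to a NULL callback) or
register a callback, as the later SHARED_RUNQ patch in this series does:

/* kernel/sched/features.h */

/* Existing features are unchanged; SCHED_FEAT() expands to a NULL callback: */
SCHED_FEAT(HZ_BW, true)

/*
 * A feature that must react to being toggled supplies a callback, which
 * sched_feat_enable()/sched_feat_disable() invoke with true/false:
 */
SCHED_FEAT_CALLBACK(SHARED_RUNQ, false, shared_runq_toggle)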

Signed-off-by: David Vernet <void@manifault.com>
---
 kernel/sched/core.c  |  4 ++--
 kernel/sched/debug.c | 18 ++++++++++++++----
 kernel/sched/sched.h | 16 ++++++++++------
 3 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index dd6412a49263..385c565da87f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -124,12 +124,12 @@ DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
  * sysctl_sched_features, defined in sched.h, to allow constants propagation
  * at compile time and compiler optimization based on features default.
  */
-#define SCHED_FEAT(name, enabled)	\
+#define SCHED_FEAT_CALLBACK(name, enabled, cb)	\
 	(1UL << __SCHED_FEAT_##name) * enabled |
 const_debug unsigned int sysctl_sched_features =
 #include "features.h"
 	0;
-#undef SCHED_FEAT
+#undef SCHED_FEAT_CALLBACK
 
 /*
  * Print a warning if need_resched is set for the given duration (if
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index aeeba46a096b..803dff75c56f 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -44,14 +44,14 @@ static unsigned long nsec_low(unsigned long long nsec)
 
 #define SPLIT_NS(x) nsec_high(x), nsec_low(x)
 
-#define SCHED_FEAT(name, enabled)	\
+#define SCHED_FEAT_CALLBACK(name, enabled, cb)	\
 	#name ,
 
 static const char * const sched_feat_names[] = {
 #include "features.h"
 };
 
-#undef SCHED_FEAT
+#undef SCHED_FEAT_CALLBACK
 
 static int sched_feat_show(struct seq_file *m, void *v)
 {
@@ -72,22 +72,32 @@ static int sched_feat_show(struct seq_file *m, void *v)
 #define jump_label_key__true  STATIC_KEY_INIT_TRUE
 #define jump_label_key__false STATIC_KEY_INIT_FALSE
 
-#define SCHED_FEAT(name, enabled)	\
+#define SCHED_FEAT_CALLBACK(name, enabled, cb)	\
 	jump_label_key__##enabled ,
 
 struct static_key sched_feat_keys[__SCHED_FEAT_NR] = {
 #include "features.h"
 };
 
-#undef SCHED_FEAT
+#undef SCHED_FEAT_CALLBACK
+
+#define SCHED_FEAT_CALLBACK(name, enabled, cb) cb,
+static const sched_feat_change_f sched_feat_cbs[__SCHED_FEAT_NR] = {
+#include "features.h"
+};
+#undef SCHED_FEAT_CALLBACK
 
 static void sched_feat_disable(int i)
 {
+	if (sched_feat_cbs[i])
+		sched_feat_cbs[i](false);
 	static_key_disable_cpuslocked(&sched_feat_keys[i]);
 }
 
 static void sched_feat_enable(int i)
 {
+	if (sched_feat_cbs[i])
+		sched_feat_cbs[i](true);
 	static_key_enable_cpuslocked(&sched_feat_keys[i]);
 }
 #else
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 88cca7cc00cf..2631da3c8a4d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2065,6 +2065,8 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
 #endif
 }
 
+#define SCHED_FEAT(name, enabled) SCHED_FEAT_CALLBACK(name, enabled, NULL)
+
 /*
  * Tunables that become constants when CONFIG_SCHED_DEBUG is off:
  */
@@ -2074,7 +2076,7 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
 # define const_debug const
 #endif
 
-#define SCHED_FEAT(name, enabled)	\
+#define SCHED_FEAT_CALLBACK(name, enabled, cb)	\
 	__SCHED_FEAT_##name ,
 
 enum {
@@ -2082,7 +2084,7 @@ enum {
 	__SCHED_FEAT_NR,
 };
 
-#undef SCHED_FEAT
+#undef SCHED_FEAT_CALLBACK
 
 #ifdef CONFIG_SCHED_DEBUG
 
@@ -2093,14 +2095,14 @@ enum {
 extern const_debug unsigned int sysctl_sched_features;
 
 #ifdef CONFIG_JUMP_LABEL
-#define SCHED_FEAT(name, enabled)					\
+#define SCHED_FEAT_CALLBACK(name, enabled, cb)				\
 static __always_inline bool static_branch_##name(struct static_key *key) \
 {									\
 	return static_key_##enabled(key);				\
 }
 
 #include "features.h"
-#undef SCHED_FEAT
+#undef SCHED_FEAT_CALLBACK
 
 extern struct static_key sched_feat_keys[__SCHED_FEAT_NR];
 #define sched_feat(x) (static_branch_##x(&sched_feat_keys[__SCHED_FEAT_##x]))
@@ -2118,17 +2120,19 @@ extern struct static_key sched_feat_keys[__SCHED_FEAT_NR];
  * constants propagation at compile time and compiler optimization based on
  * features default.
  */
-#define SCHED_FEAT(name, enabled)	\
+#define SCHED_FEAT_CALLBACK(name, enabled, cb)	\
 	(1UL << __SCHED_FEAT_##name) * enabled |
 static const_debug __maybe_unused unsigned int sysctl_sched_features =
 #include "features.h"
 	0;
-#undef SCHED_FEAT
+#undef SCHED_FEAT_CALLBACK
 
 #define sched_feat(x) !!(sysctl_sched_features & (1UL << __SCHED_FEAT_##x))
 
 #endif /* SCHED_DEBUG */
 
+typedef void (*sched_feat_change_f)(bool enabling);
+
 extern struct static_key_false sched_numa_balancing;
 extern struct static_key_false sched_schedstats;
 
-- 
2.41.0



* [PATCH v3 5/7] sched/fair: Add SHARED_RUNQ sched feature and skeleton calls
  2023-08-09 22:12 [PATCH v3 0/7] sched: Implement shared runqueue in CFS David Vernet
                   ` (3 preceding siblings ...)
  2023-08-09 22:12 ` [PATCH v3 4/7] sched: Enable sched_feat callbacks on enable/disable David Vernet
@ 2023-08-09 22:12 ` David Vernet
  2023-08-09 22:12 ` [PATCH v3 6/7] sched: Implement shared runqueue in CFS David Vernet
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 52+ messages in thread
From: David Vernet @ 2023-08-09 22:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot, vschneid, tj, roman.gushchin,
	gautham.shenoy, kprateek.nayak, aaron.lu, wuyun.abel,
	kernel-team

For certain workloads in CFS, CPU utilization is of the utmost
importance. For example, at Meta, our main web workload benefits from a
1 - 1.5% improvement in RPS, and a 1 - 2% improvement in p99 latency,
when CPU utilization is pushed as high as possible.

This is likely something that would be useful for any workload with long
slices, or for which avoiding migration is unlikely to result in
improved cache locality.

We will soon be enabling more aggressive load balancing via a new
feature called shared_runq, which places tasks into a FIFO queue in
__enqueue_entity() on the wakeup path, and then opportunistically
dequeues them in newidle_balance(). We don't want to enable the feature by
default, so this patch defines and declares a new scheduler feature
called SHARED_RUNQ which is disabled by default.

A set of future patches will implement these functions, and enable
shared_runq for both single and multi socket / CCX architectures.

Originally-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: David Vernet <void@manifault.com>
---
 kernel/sched/fair.c     | 34 ++++++++++++++++++++++++++++++++++
 kernel/sched/features.h |  2 ++
 kernel/sched/sched.h    |  1 +
 3 files changed, 37 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eb15d6f46479..9c23e3b948fc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -140,6 +140,20 @@ static int __init setup_sched_thermal_decay_shift(char *str)
 __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
 
 #ifdef CONFIG_SMP
+void shared_runq_toggle(bool enabling)
+{}
+
+static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p)
+{}
+
+static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
+{
+	return 0;
+}
+
+static void shared_runq_dequeue_task(struct task_struct *p)
+{}
+
 /*
  * For asym packing, by default the lower numbered CPU has higher priority.
  */
@@ -162,6 +176,15 @@ int __weak arch_asym_cpu_priority(int cpu)
  * (default: ~5%)
  */
 #define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)
+#else
+void shared_runq_toggle(bool enabling)
+{}
+
+static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p)
+{}
+
+static void shared_runq_dequeue_task(struct task_struct *p)
+{}
 #endif
 
 #ifdef CONFIG_CFS_BANDWIDTH
@@ -642,11 +665,15 @@ static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
  */
 static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	if (sched_feat(SHARED_RUNQ) && entity_is_task(se))
+		shared_runq_enqueue_task(rq_of(cfs_rq), task_of(se));
 	rb_add_cached(&se->run_node, &cfs_rq->tasks_timeline, __entity_less);
 }
 
 static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	if (sched_feat(SHARED_RUNQ) && entity_is_task(se))
+		shared_runq_dequeue_task(task_of(se));
 	rb_erase_cached(&se->run_node, &cfs_rq->tasks_timeline);
 }
 
@@ -12043,6 +12070,12 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	if (!cpu_active(this_cpu))
 		return 0;
 
+	if (sched_feat(SHARED_RUNQ)) {
+		pulled_task = shared_runq_pick_next_task(this_rq, rf);
+		if (pulled_task)
+			return pulled_task;
+	}
+
 	/*
 	 * We must set idle_stamp _before_ calling idle_balance(), such that we
 	 * measure the duration of idle_balance() as idle time.
@@ -12543,6 +12576,7 @@ static void attach_task_cfs_rq(struct task_struct *p)
 
 static void switched_from_fair(struct rq *rq, struct task_struct *p)
 {
+	shared_runq_dequeue_task(p);
 	detach_task_cfs_rq(p);
 }
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index e10074cb4be4..6d7c93fc1a8f 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -103,3 +103,5 @@ SCHED_FEAT(ALT_PERIOD, true)
 SCHED_FEAT(BASE_SLICE, true)
 
 SCHED_FEAT(HZ_BW, true)
+
+SCHED_FEAT_CALLBACK(SHARED_RUNQ, false, shared_runq_toggle)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2631da3c8a4d..a484bb527ee4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2132,6 +2132,7 @@ static const_debug __maybe_unused unsigned int sysctl_sched_features =
 #endif /* SCHED_DEBUG */
 
 typedef void (*sched_feat_change_f)(bool enabling);
+extern void shared_runq_toggle(bool enabling);
 
 extern struct static_key_false sched_numa_balancing;
 extern struct static_key_false sched_schedstats;
-- 
2.41.0



* [PATCH v3 6/7] sched: Implement shared runqueue in CFS
  2023-08-09 22:12 [PATCH v3 0/7] sched: Implement shared runqueue in CFS David Vernet
                   ` (4 preceding siblings ...)
  2023-08-09 22:12 ` [PATCH v3 5/7] sched/fair: Add SHARED_RUNQ sched feature and skeleton calls David Vernet
@ 2023-08-09 22:12 ` David Vernet
  2023-08-10  7:11   ` kernel test robot
                     ` (2 more replies)
  2023-08-09 22:12 ` [PATCH v3 7/7] sched: Shard per-LLC shared runqueues David Vernet
                   ` (3 subsequent siblings)
  9 siblings, 3 replies; 52+ messages in thread
From: David Vernet @ 2023-08-09 22:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot, vschneid, tj, roman.gushchin,
	gautham.shenoy, kprateek.nayak, aaron.lu, wuyun.abel,
	kernel-team

Overview
========

The scheduler must constantly strike a balance between work
conservation, and avoiding costly migrations which harm performance due
to e.g. decreased cache locality. The matter is further complicated by
the topology of the system. Migrating a task between cores on the same
LLC may be more optimal than keeping a task local to the CPU, whereas
migrating a task between LLCs or NUMA nodes may tip the balance in the
other direction.

With that in mind, while CFS is by and large a work conserving scheduler,
there are certain instances where the scheduler will choose to keep a task
local to a CPU, when it would have been more optimal to migrate it to an
idle core.

An example of such a workload is the HHVM / web workload at Meta. HHVM
is a VM that JITs Hack and PHP code in service of web requests. Like
other JIT / compilation workloads, it tends to be heavily CPU bound, and
exhibit generally poor cache locality. To try and address this, we set
several debugfs (/sys/kernel/debug/sched) knobs on our HHVM workloads:

- migration_cost_ns -> 0
- latency_ns -> 20000000
- min_granularity_ns -> 10000000
- wakeup_granularity_ns -> 12000000

These knobs are intended both to encourage the scheduler to be as work
conserving as possible (migration_cost_ns -> 0), and also to keep tasks
running for relatively long time slices so as to avoid the overhead of
context switching (the other knobs). Collectively, these knobs provide a
substantial performance win; resulting in roughly a 20% improvement in
throughput. Worth noting, however, is that this improvement is _not_ at
full machine saturation.

That said, even with these knobs, we noticed that CPUs were still going
idle even when the host was overcommitted. In response, we wrote the
"shared runqueue" (SHARED_RUNQ) feature proposed in this patch set. The
idea behind SHARED_RUNQ is simple: it enables the scheduler to be more
aggressively work conserving by placing a waking task into a sharded
per-LLC FIFO queue that another core in the LLC can pull from before it
goes idle.

With this simple change, we were able to achieve a 1 - 1.6% improvement
in throughput, as well as a small, consistent improvement in p95 and p99
latencies, in HHVM. These performance improvements were in addition to
the wins from the debugfs knobs mentioned above, and to other benchmarks
outlined below in the Results section.

Design
======

Note that the design described here reflects sharding, which will be
added in a subsequent patch. The design is described that way in this
commit summary as the benchmarks described in the results section below
all include sharded SHARED_RUNQ. The patches are not combined into one
to ease the burden of review.

The design of SHARED_RUNQ is quite simple. A shared_runq is simply a
list of struct shared_runq_shard objects, which itself is simply a
struct list_head of tasks, and a spinlock:

struct shared_runq_shard {
	struct list_head list;
	raw_spinlock_t lock;
} ____cacheline_aligned;

struct shared_runq {
	u32 num_shards;
	struct shared_runq_shard shards[];
} ____cacheline_aligned;

We create a struct shared_runq per LLC, ensuring they're in their own
cachelines to avoid false sharing between CPUs on different LLCs, and we
create a number of struct shared_runq_shard objects that are housed
there.

When a task first wakes up, it enqueues itself in the shared_runq_shard
of its current LLC at the end of enqueue_task_fair(). Enqueues only
happen if the task was not manually migrated to the current core by
select_task_rq(), and is not pinned to a specific CPU.

A core will pull a task from the shards in its LLC's shared_runq at the
beginning of newidle_balance().

Difference between SHARED_RUNQ and SIS_NODE
===========================================

In [0] Peter proposed a patch that addresses Tejun's observations that
when workqueues are targeted towards a specific LLC on his Zen2 machine
with small CCXs, that there would be significant idle time due to
select_idle_sibling() not considering anything outside of the current
LLC.

This patch (SIS_NODE) is essentially the complement to the proposal
here. SIS_NODE causes waking tasks to look for idle cores in neighboring
LLCs on the same die, whereas SHARED_RUNQ causes cores about to go idle
to look for enqueued tasks. That said, in its current form, the two
features operate at different scopes: SIS_NODE searches for idle cores
between LLCs, while SHARED_RUNQ enqueues tasks within a single LLC.

The patch has since been removed in [1], and we compared the results to
SHARED_RUNQ (previously called "swqueue") in [2]. SIS_NODE did not
outperform SHARED_RUNQ on any of the benchmarks, so we elect not to
compare against it again for this patch set.

[0]: https://lore.kernel.org/all/20230530113249.GA156198@hirez.programming.kicks-ass.net/
[1]: https://lore.kernel.org/all/20230605175636.GA4253@hirez.programming.kicks-ass.net/
[2]: https://lore.kernel.org/lkml/20230613052004.2836135-1-void@manifault.com/

Worth noting as well is that it was pointed out in [3] that the logic behind
including SIS_NODE in the first place should apply to SHARED_RUNQ
(meaning that e.g. very small Zen2 CPUs with only 3/4 cores per LLC
should benefit from having a single shared_runq stretch across multiple
LLCs). I drafted a patch that implements this by having a minimum LLC
size for creating a shard, and stretches a shared_runq across multiple
LLCs if they're smaller than that size, and sent it to Tejun to test on
his Zen2. Tejun reported back that SIS_NODE did not seem to make a
difference:

[3]: https://lore.kernel.org/lkml/20230711114207.GK3062772@hirez.programming.kicks-ass.net/

			    o____________o__________o
			    |    mean    | Variance |
			    o------------o----------o
Vanilla:		    | 108.84s    | 0.0057   |
NO_SHARED_RUNQ:		    | 108.82s    | 0.119s   |
SHARED_RUNQ:		    | 108.17s    | 0.038s   |
SHARED_RUNQ w/ SIS_NODE:    | 108.87s    | 0.111s   |
			    o------------o----------o

I similarly tried running kcompile on SHARED_RUNQ with SIS_NODE on my
7950X Zen3, but didn't see any gain relative to plain SHARED_RUNQ (though
a gain was observed relative to NO_SHARED_RUNQ, as described below).

Results
=======

Note that the motivation for the shared runqueue feature originally
came from experiments with the sched_ext framework that's currently
being proposed upstream. The ~1 - 1.6% improvement in HHVM throughput
is similarly visible using work-conserving sched_ext schedulers (even
very simple ones like global FIFO).

On both single- and multi-socket / CCX hosts, this can measurably
improve performance. In addition to the performance gains observed on
our internal web workloads, we also observed an improvement in common
workloads such as kernel compile and hackbench when running with the
shared runqueue.

On the other hand, some workloads suffer from SHARED_RUNQ, namely
workloads that hammer the runqueue hard, such as netperf UDP_RR, or
schbench -L -m 52 -p 512 -r 10 -t 1. This can be mitigated somewhat by
sharding the shared data structures within a CCX, but it doesn't seem
to eliminate all contention in every scenario. On the positive side,
sharding does not seem to materially harm the benchmarks run for this
patch series; in fact, it seems to improve some workloads such as
kernel compile.

Note that for the kernel compile workloads below, the compilation was
done by running make -j$(nproc) built-in.a on several different types of
hosts configured with make allyesconfig on commit a27648c74210 ("afs:
Fix setting of mtime when creating a file/dir/symlink") on Linus' tree
(boost and turbo were disabled on all of these hosts when the
experiments were performed).

Finally, note that these results were from the patch set built off of
commit ebb83d84e49b ("sched/core: Avoid multiple calling
update_rq_clock() in __cfsb_csd_unthrottle()") on the sched/core branch
of tip.git for easy comparison with the v2 patch set results. The
patches in their final form from this set were rebased onto commit
88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs bandwidth in
use").

=== Single-socket | 16 core / 32 thread | 2-CCX | AMD 7950X Zen4 ===

CPU max MHz: 5879.8818
CPU min MHz: 3000.0000

Command: make -j$(nproc) built-in.a
			    o____________o__________o
			    |    mean    | Variance |
			    o------------o----------o
NO_SHARED_RUNQ:		    | 581.95s    | 2.639s   |
SHARED_RUNQ:		    | 577.02s    | 0.084s   |
			    o------------o----------o

Takeaway: SHARED_RUNQ results in a statistically significant ~.85%
improvement over NO_SHARED_RUNQ. This suggests that enqueuing tasks in
the shared runqueue on every enqueue improves work conservation, and
thanks to sharding, does not result in contention.

Command: hackbench --loops 10000
                            o____________o__________o
                            |    mean    | Variance |
                            o------------o----------o
NO_SHARED_RUNQ:             | 2.2492s    | .00001s  |
SHARED_RUNQ:		    | 2.0217s    | .00065s  |
                            o------------o----------o

Takeaway: SHARED_RUNQ performs exceptionally well compared to
NO_SHARED_RUNQ here, beating it by over 10%. This was a surprising
result, as it seems advantageous to err on the side of avoiding
migration in hackbench, given that tasks are short lived and send only
10k bytes worth of messages. The results of the benchmark, however,
suggest that minimizing runqueue delays is preferable.

Command:
for i in `seq 128`; do
    netperf -6 -t UDP_RR -c -C -l $runtime &
done
                            o_______________________o
                            | Throughput | Variance |
                            o-----------------------o
NO_SHARED_RUNQ:             | 25037.45   | 2243.44  |
SHARED_RUNQ:                | 24952.50   | 1268.06  |
                            o-----------------------o

Takeaway: No statistical significance, though it is worth noting that
there is no regression for shared runqueue on the 7950X, while there is
a small regression on the Skylake and Milan hosts for SHARED_RUNQ as
described below.

=== Single-socket | 18 core / 36 thread | 1-CCX | Intel Skylake ===

CPU max MHz: 1601.0000
CPU min MHz: 800.0000

Command: make -j$(nproc) built-in.a
			    o____________o__________o
			    |    mean    | Variance |
			    o------------o----------o
NO_SHARED_RUNQ:		    | 1517.44s   | 2.8322s  |
SHARED_RUNQ:		    | 1516.51s   | 2.9450s  |
			    o------------o----------o

Takeaway: There's no statistically significant gain here. I observed
what I claimed was a .23% win in v2, but it appears that this is not
actually statistically significant.

Command: hackbench --loops 10000
                            o____________o__________o
                            |    mean    | Variance |
                            o------------o----------o
NO_SHARED_RUNQ:             | 5.3370s    | .0012s   |
SHARED_RUNQ:		    | 5.2668s    | .0033s   |
                            o------------o----------o

Takeaway: SHARED_RUNQ results in a ~1.3% improvement over
NO_SHARED_RUNQ. This is also statistically significant, but smaller
than the 10+% improvement observed on the 7950X.

Command: netperf -n $(nproc) -l 60 -t TCP_RR
for i in `seq 128`; do
        netperf -6 -t UDP_RR -c -C -l $runtime &
done
                            o_______________________o
                            | Throughput | Variance |
                            o-----------------------o
NO_SHARED_RUNQ:             | 15699.32   | 377.01   |
SHARED_RUNQ:                | 14966.42   | 714.13   |
                            o-----------------------o

Takeaway: NO_SHARED_RUNQ beats SHARED_RUNQ by ~4.6%. This result makes
sense -- the workload is very heavy on the runqueue, so enqueuing tasks
in the shared runqueue in __enqueue_entity() would intuitively result in
increased contention on the shard lock.

=== Single-socket | 72-core | 6-CCX | AMD Milan Zen3 ===

CPU max MHz: 700.0000
CPU min MHz: 700.0000

Command: make -j$(nproc) built-in.a
			    o____________o__________o
			    |    mean    | Variance |
			    o------------o----------o
NO_SHARED_RUNQ:		    | 1568.55s   | 0.1568s  |
SHARED_RUNQ:		    | 1568.26s   | 1.2168s  |
			    o------------o----------o

Takeaway: No statistically significant difference here. It might be
worth experimenting with work stealing in a follow-on patch set.

Command: hackbench --loops 10000
                            o____________o__________o
                            |    mean    | Variance |
                            o------------o----------o
NO_SHARED_RUNQ:             | 5.2716s    | .00143s  |
SHARED_RUNQ:		    | 5.1716s    | .00289s  |
                            o------------o----------o

Takeaway: SHARED_RUNQ again wins, by about 2%.

Command: netperf -n $(nproc) -l 60 -t TCP_RR
for i in `seq 128`; do
        netperf -6 -t UDP_RR -c -C -l $runtime &
done
                            o_______________________o
                            | Throughput | Variance |
                            o-----------------------o
NO_SHARED_RUNQ:             | 17482.03   | 4675.99  |
SHARED_RUNQ:                | 16697.25   | 9812.23  |
                            o-----------------------o

Takeaway: Similar to the Skylake runs, NO_SHARED_RUNQ still beats
SHARED_RUNQ, this time by ~4.5%. It's worth noting that in v2,
NO_SHARED_RUNQ was only ~1.8% faster. The variance is very high here, so
the results of this benchmark should be taken with a large grain of
salt (noting that we do consistently see NO_SHARED_RUNQ on top due to
not contending on the shard lock).

Finally, let's look at how sharding affects the following schbench
incantation suggested by Chris in [4]:

schbench -L -m 52 -p 512 -r 10 -t 1

[4]: https://lore.kernel.org/lkml/c8419d9b-2b31-2190-3058-3625bdbcb13d@meta.com/

The TL;DR is that sharding improves things a lot, but doesn't completely
fix the problem. Here are the results from running the schbench command
on the 18 core / 36 thread single CCX, single-socket Skylake:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name         con-bounces    contentions       waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&shard->lock:      31510503       31510711           0.08          19.98        168932319.64     5.36            31700383      31843851       0.03           17.50        10273968.33      0.32
------------
&shard->lock       15731657          [<0000000068c0fd75>] pick_next_task_fair+0x4dd/0x510
&shard->lock       15756516          [<000000001faf84f9>] enqueue_task_fair+0x459/0x530
&shard->lock          21766          [<00000000126ec6ab>] newidle_balance+0x45a/0x650
&shard->lock            772          [<000000002886c365>] dequeue_task_fair+0x4c9/0x540
------------
&shard->lock          23458          [<00000000126ec6ab>] newidle_balance+0x45a/0x650
&shard->lock       16505108          [<000000001faf84f9>] enqueue_task_fair+0x459/0x530
&shard->lock       14981310          [<0000000068c0fd75>] pick_next_task_fair+0x4dd/0x510
&shard->lock            835          [<000000002886c365>] dequeue_task_fair+0x4c9/0x540

These results are when we create only 3 shards (16 logical cores per
shard), so the contention may be a result of overly-coarse sharding. If
we run the schbench incantation with no sharding whatsoever, we see the
following significantly worse lock stats contention:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name        con-bounces    contentions         waittime-min   waittime-max waittime-total         waittime-avg    acq-bounces   acquisitions   holdtime-min  holdtime-max holdtime-total   holdtime-avg
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&shard->lock:     117868635      118361486           0.09           393.01       1250954097.25          10.57           119345882     119780601      0.05          343.35       38313419.51      0.32
------------
&shard->lock       59169196          [<0000000060507011>] __enqueue_entity+0xdc/0x110
&shard->lock       59084239          [<00000000f1c67316>] __dequeue_entity+0x78/0xa0
&shard->lock         108051          [<00000000084a6193>] newidle_balance+0x45a/0x650
------------
&shard->lock       60028355          [<0000000060507011>] __enqueue_entity+0xdc/0x110
&shard->lock         119882          [<00000000084a6193>] newidle_balance+0x45a/0x650
&shard->lock       58213249          [<00000000f1c67316>] __dequeue_entity+0x78/0xa0

The contention is ~3-4x worse if we don't shard at all. This roughly
matches the fact that we had 3 shards on the first workload run above.
If we make the shards even smaller, the contention is comparatively much
lower:

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name         con-bounces    contentions   waittime-min  waittime-max waittime-total   waittime-avg   acq-bounces   acquisitions   holdtime-min  holdtime-max holdtime-total   holdtime-avg
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&shard->lock:      13839849       13877596      0.08          13.23        5389564.95       0.39           46910241      48069307       0.06          16.40        16534469.35      0.34
------------
&shard->lock           3559          [<00000000ea455dcc>] newidle_balance+0x45a/0x650
&shard->lock        6992418          [<000000002266f400>] __dequeue_entity+0x78/0xa0
&shard->lock        6881619          [<000000002a62f2e0>] __enqueue_entity+0xdc/0x110
------------
&shard->lock        6640140          [<000000002266f400>] __dequeue_entity+0x78/0xa0
&shard->lock           3523          [<00000000ea455dcc>] newidle_balance+0x45a/0x650
&shard->lock        7233933          [<000000002a62f2e0>] __enqueue_entity+0xdc/0x110

Interestingly, SHARED_RUNQ performs worse than NO_SHARED_RUNQ on the schbench
benchmark on Milan as well, but we contend more on the rq lock than the
shard lock:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name         con-bounces    contentions   waittime-min  waittime-max waittime-total   waittime-avg   acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&rq->__lock:       9617614        9656091       0.10          79.64        69665812.00      7.21           18092700      67652829       0.11           82.38        344524858.87     5.09
-----------
&rq->__lock        6301611          [<000000003e63bf26>] task_rq_lock+0x43/0xe0
&rq->__lock        2530807          [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock         109360          [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
&rq->__lock         178218          [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170
-----------
&rq->__lock        3245506          [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock        1294355          [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170
&rq->__lock        2837804          [<000000003e63bf26>] task_rq_lock+0x43/0xe0
&rq->__lock        1627866          [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10

..................................................................................................................................................................................................

&shard->lock:       7338558       7343244       0.10          35.97        7173949.14       0.98           30200858      32679623       0.08           35.59        16270584.52      0.50
------------
&shard->lock        2004142          [<00000000f8aa2c91>] __dequeue_entity+0x78/0xa0
&shard->lock        2611264          [<00000000473978cc>] newidle_balance+0x45a/0x650
&shard->lock        2727838          [<0000000028f55bb5>] __enqueue_entity+0xdc/0x110
------------
&shard->lock        2737232          [<00000000473978cc>] newidle_balance+0x45a/0x650
&shard->lock        1693341          [<00000000f8aa2c91>] __dequeue_entity+0x78/0xa0
&shard->lock        2912671          [<0000000028f55bb5>] __enqueue_entity+0xdc/0x110

...................................................................................................................................................................................................

If we look at the lock stats with SHARED_RUNQ disabled, the rq lock still
contends the most, but it's significantly less than with it enabled:

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name          con-bounces    contentions   waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&rq->__lock:        791277         791690        0.12           110.54       4889787.63       6.18            1575996       62390275       0.13           112.66       316262440.56     5.07
-----------
&rq->__lock         263343          [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock          19394          [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
&rq->__lock           4143          [<000000003b542e83>] __task_rq_lock+0x51/0xf0
&rq->__lock          51094          [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170
-----------
&rq->__lock          23756          [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
&rq->__lock         379048          [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock            677          [<000000003b542e83>] __task_rq_lock+0x51/0xf0

Worth noting is that increasing the granularity of the shards in
general improves very runqueue-heavy workloads such as netperf UDP_RR
and this schbench command, but it doesn't necessarily make a big
difference for every workload, or for sufficiently small CCXs such as
the 7950X. It may make sense to eventually allow users to control this
with a debugfs knob, but for now we elect to use a default that
resulted in good performance for the benchmarks run for this patch
series.
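
For reference, the plumbing for such a knob would likely be small. The
sketch below is purely illustrative (the file name, the
shared_runq_shard_sz variable, and the helper are all hypothetical and
not part of this series), and a real implementation would also need to
tear down and rebuild the per-LLC shards when the value changes,
similar to what update_domains_fair() does when domains are rebuilt:

/*
 * Hypothetical sketch only: expose the shard size as
 * /sys/kernel/debug/sched/shared_runq_shard_sz. The helper would be
 * called from sched_init_debug() with the existing sched debugfs
 * directory. Rebuilding the shards when the value is written is not
 * shown.
 */
static u32 shared_runq_shard_sz = 6;

static void shared_runq_debugfs_init(struct dentry *sched_dir)
{
	debugfs_create_u32("shared_runq_shard_sz", 0644, sched_dir,
			   &shared_runq_shard_sz);
}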

Conclusion
==========

SHARED_RUNQ in this form provides statistically significant wins for
several types of workloads, and various CPU topologies. The reason for
this is roughly the same for all workloads: SHARED_RUNQ encourages work
conservation inside of a CCX by having a CPU do an O(# per-LLC shards)
iteration over the shared_runq shards in an LLC. We could similarly do
an O(n) iteration over all of the runqueues in the current LLC when a
core is going idle, but that's quite costly (especially for larger
LLCs), and sharded SHARED_RUNQ seems to provide a performant middle
ground between doing an O(n) walk, and doing an O(1) pull from a single
per-LLC shared runq.

For the workloads above, kernel compile and hackbench were clear winners
for SHARED_RUNQ (especially in __enqueue_entity()). The reason for the
improvement in kernel compile is of course that we have a heavily
CPU-bound workload where cache locality doesn't mean much; getting a CPU
is the #1 goal. As mentioned above, while I didn't expect to see an
improvement in hackbench, the results of the benchmark suggest that
minimizing runqueue delays is preferable to optimizing for L1/L2
locality.

Not all workloads benefit from SHARED_RUNQ, however. Workloads that
hammer the runqueue hard, such as netperf UDP_RR, or schbench -L -m 52
-p 512 -r 10 -t 1, tend to run into contention on the shard locks,
especially when enqueuing tasks in __enqueue_entity(). This can be
mitigated significantly by sharding the shared data structures within
a CCX, but it doesn't eliminate all contention, as described above.

Finally, while SHARED_RUNQ in this form encourages work conservation, it
of course does not guarantee it given that we don't implement any kind
of work stealing between shared_runq's. In the future, we could
potentially push CPU utilization even higher by enabling work stealing
between shared_runq's, likely between CCXs on the same NUMA node.

Originally-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: David Vernet <void@manifault.com>
---
 include/linux/sched.h   |   2 +
 kernel/sched/core.c     |  13 +++
 kernel/sched/fair.c     | 238 +++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h    |   4 +
 kernel/sched/topology.c |   4 +-
 5 files changed, 256 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2aab7be46f7e..8238069fd852 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -769,6 +769,8 @@ struct task_struct {
 	unsigned long			wakee_flip_decay_ts;
 	struct task_struct		*last_wakee;
 
+	struct list_head		shared_runq_node;
+
 	/*
 	 * recent_used_cpu is initially set as the last CPU used by a task
 	 * that wakes affine another task. Waker/wakee relationships can
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 385c565da87f..fb7e71d3dc0a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4529,6 +4529,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 #ifdef CONFIG_SMP
 	p->wake_entry.u_flags = CSD_TYPE_TTWU;
 	p->migration_pending = NULL;
+	INIT_LIST_HEAD(&p->shared_runq_node);
 #endif
 	init_sched_mm_cid(p);
 }
@@ -9764,6 +9765,18 @@ int sched_cpu_deactivate(unsigned int cpu)
 	return 0;
 }
 
+void sched_update_domains(void)
+{
+	const struct sched_class *class;
+
+	update_sched_domain_debugfs();
+
+	for_each_class(class) {
+		if (class->update_domains)
+			class->update_domains();
+	}
+}
+
 static void sched_rq_cpu_starting(unsigned int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9c23e3b948fc..6e740f8da578 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -139,20 +139,235 @@ static int __init setup_sched_thermal_decay_shift(char *str)
 }
 __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
 
+/**
+ * struct shared_runq - Per-LLC queue structure for enqueuing and migrating
+ * runnable tasks within an LLC.
+ *
+ * WHAT
+ * ====
+ *
+ * This structure enables the scheduler to be more aggressively work
+ * conserving, by placing waking tasks on a per-LLC FIFO queue that can then be
+ * pulled from when another core in the LLC is going to go idle.
+ *
+ * struct rq stores a pointer to its LLC's shared_runq via struct cfs_rq.
+ * Waking tasks are enqueued in the calling CPU's struct shared_runq in
+ * __enqueue_entity(), and are opportunistically pulled from the shared_runq
+ * in newidle_balance(). Tasks enqueued in a shared_runq may be scheduled prior
+ * to being pulled from the shared_runq, in which case they're simply dequeued
+ * from the shared_runq in __dequeue_entity().
+ *
+ * There is currently no task-stealing between shared_runqs in different LLCs,
+ * which means that shared_runq is not fully work conserving. This could be
+ * added at a later time, with tasks likely only being stolen across
+ * shared_runqs on the same NUMA node to avoid violating NUMA affinities.
+ *
+ * HOW
+ * ===
+ *
+ * A shared_runq is comprised of a list, and a spinlock for synchronization.
+ * Given that the critical section for a shared_runq is typically a fast list
+ * operation, and that the shared_runq is localized to a single LLC, the
+ * spinlock will typically only be contended on workloads that do little else
+ * other than hammer the runqueue.
+ *
+ * WHY
+ * ===
+ *
+ * As mentioned above, the main benefit of shared_runq is that it enables more
+ * aggressive work conservation in the scheduler. This can benefit workloads
+ * that benefit more from CPU utilization than from L1/L2 cache locality.
+ *
+ * shared_runqs are segmented across LLCs both to avoid contention on the
+ * shared_runq spinlock by minimizing the number of CPUs that could contend on
+ * it, as well as to strike a balance between work conservation, and L3 cache
+ * locality.
+ */
+struct shared_runq {
+	struct list_head list;
+	raw_spinlock_t lock;
+} ____cacheline_aligned;
+
 #ifdef CONFIG_SMP
+
+static DEFINE_PER_CPU(struct shared_runq, shared_runqs);
+
+static struct shared_runq *rq_shared_runq(struct rq *rq)
+{
+	return rq->cfs.shared_runq;
+}
+
+static void shared_runq_reassign_domains(void)
+{
+	int i;
+	struct shared_runq *shared_runq;
+	struct rq *rq;
+	struct rq_flags rf;
+
+	for_each_possible_cpu(i) {
+		rq = cpu_rq(i);
+		shared_runq = &per_cpu(shared_runqs, per_cpu(sd_llc_id, i));
+
+		rq_lock(rq, &rf);
+		rq->cfs.shared_runq = shared_runq;
+		rq_unlock(rq, &rf);
+	}
+}
+
+static void __shared_runq_drain(struct shared_runq *shared_runq)
+{
+	struct task_struct *p, *tmp;
+
+	raw_spin_lock(&shared_runq->lock);
+	list_for_each_entry_safe(p, tmp, &shared_runq->list, shared_runq_node)
+		list_del_init(&p->shared_runq_node);
+	raw_spin_unlock(&shared_runq->lock);
+}
+
+static void update_domains_fair(void)
+{
+	int i;
+	struct shared_runq *shared_runq;
+
+	/* Avoid racing with SHARED_RUNQ enable / disable. */
+	lockdep_assert_cpus_held();
+
+	shared_runq_reassign_domains();
+
+	/* Ensure every core sees its updated shared_runq pointers. */
+	synchronize_rcu();
+
+	/*
+	 * Drain all tasks from all shared_runq's to ensure there are no stale
+	 * tasks in any prior domain runq. This can cause us to drain live
+	 * tasks that would otherwise have been safe to schedule, but this
+	 * isn't a practical problem given how infrequently domains are
+	 * rebuilt.
+	 */
+	for_each_possible_cpu(i) {
+		shared_runq = &per_cpu(shared_runqs, i);
+		__shared_runq_drain(shared_runq);
+	}
+}
+
 void shared_runq_toggle(bool enabling)
-{}
+{
+	int cpu;
+
+	if (enabling)
+		return;
+
+	/* Avoid racing with hotplug. */
+	lockdep_assert_cpus_held();
+
+	/* Ensure all cores have stopped enqueueing / dequeuing tasks. */
+	synchronize_rcu();
+
+	for_each_possible_cpu(cpu) {
+		int sd_id;
+
+		sd_id = per_cpu(sd_llc_id, cpu);
+		if (cpu == sd_id)
+			__shared_runq_drain(rq_shared_runq(cpu_rq(cpu)));
+	}
+}
+
+static struct task_struct *shared_runq_pop_task(struct rq *rq)
+{
+	struct task_struct *p;
+	struct shared_runq *shared_runq;
+
+	shared_runq = rq_shared_runq(rq);
+	if (list_empty(&shared_runq->list))
+		return NULL;
+
+	raw_spin_lock(&shared_runq->lock);
+	p = list_first_entry_or_null(&shared_runq->list, struct task_struct,
+				     shared_runq_node);
+	if (p && is_cpu_allowed(p, cpu_of(rq)))
+		list_del_init(&p->shared_runq_node);
+	else
+		p = NULL;
+	raw_spin_unlock(&shared_runq->lock);
+
+	return p;
+}
+
+static void shared_runq_push_task(struct rq *rq, struct task_struct *p)
+{
+	struct shared_runq *shared_runq;
+
+	shared_runq = rq_shared_runq(rq);
+	raw_spin_lock(&shared_runq->lock);
+	list_add_tail(&p->shared_runq_node, &shared_runq->list);
+	raw_spin_unlock(&shared_runq->lock);
+}
 
 static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p)
-{}
+{
+	/*
+	 * Only enqueue the task in the shared runqueue if:
+	 *
+	 * - SHARED_RUNQ is enabled
+	 * - The task isn't pinned to a specific CPU
+	 */
+	if (p->nr_cpus_allowed == 1)
+		return;
+
+	shared_runq_push_task(rq, p);
+}
 
 static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
 {
-	return 0;
+	struct task_struct *p = NULL;
+	struct rq *src_rq;
+	struct rq_flags src_rf;
+	int ret = -1;
+
+	p = shared_runq_pop_task(rq);
+	if (!p)
+		return 0;
+
+	rq_unpin_lock(rq, rf);
+	raw_spin_rq_unlock(rq);
+
+	src_rq = task_rq_lock(p, &src_rf);
+
+	if (task_on_rq_queued(p) && !task_on_cpu(src_rq, p)) {
+		update_rq_clock(src_rq);
+		src_rq = move_queued_task(src_rq, &src_rf, p, cpu_of(rq));
+		ret = 1;
+	}
+
+	if (src_rq != rq) {
+		task_rq_unlock(src_rq, p, &src_rf);
+		raw_spin_rq_lock(rq);
+	} else {
+		rq_unpin_lock(rq, &src_rf);
+		raw_spin_unlock_irqrestore(&p->pi_lock, src_rf.flags);
+	}
+	rq_repin_lock(rq, rf);
+
+	return ret;
 }
 
 static void shared_runq_dequeue_task(struct task_struct *p)
-{}
+{
+	struct shared_runq *shared_runq;
+
+	if (!list_empty(&p->shared_runq_node)) {
+		shared_runq = rq_shared_runq(task_rq(p));
+		raw_spin_lock(&shared_runq->lock);
+		/*
+		 * Need to double-check for the list being empty to avoid
+		 * racing with the list being drained on the domain recreation
+		 * or SHARED_RUNQ feature enable / disable path.
+		 */
+		if (likely(!list_empty(&p->shared_runq_node)))
+			list_del_init(&p->shared_runq_node);
+		raw_spin_unlock(&shared_runq->lock);
+	}
+}
 
 /*
  * For asym packing, by default the lower numbered CPU has higher priority.
@@ -12093,6 +12308,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	rcu_read_lock();
 	sd = rcu_dereference_check_sched_domain(this_rq->sd);
 
+	/*
+	 * Skip <= LLC domains as they likely won't have any tasks if the
+	 * shared runq is empty.
+	 */
+	if (sched_feat(SHARED_RUNQ)) {
+		sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
+		if (likely(sd))
+			sd = sd->parent;
+	}
+
 	if (!READ_ONCE(this_rq->rd->overload) ||
 	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
 
@@ -12969,6 +13194,7 @@ DEFINE_SCHED_CLASS(fair) = {
 
 	.task_dead		= task_dead_fair,
 	.set_cpus_allowed	= set_cpus_allowed_common,
+	.update_domains		= update_domains_fair,
 #endif
 
 	.task_tick		= task_tick_fair,
@@ -13035,6 +13261,7 @@ __init void init_sched_fair_class(void)
 {
 #ifdef CONFIG_SMP
 	int i;
+	struct shared_runq *shared_runq;
 
 	for_each_possible_cpu(i) {
 		zalloc_cpumask_var_node(&per_cpu(load_balance_mask, i), GFP_KERNEL, cpu_to_node(i));
@@ -13044,6 +13271,9 @@ __init void init_sched_fair_class(void)
 		INIT_CSD(&cpu_rq(i)->cfsb_csd, __cfsb_csd_unthrottle, cpu_rq(i));
 		INIT_LIST_HEAD(&cpu_rq(i)->cfsb_csd_list);
 #endif
+		shared_runq = &per_cpu(shared_runqs, i);
+		INIT_LIST_HEAD(&shared_runq->list);
+		raw_spin_lock_init(&shared_runq->lock);
 	}
 
 	open_softirq(SCHED_SOFTIRQ, run_rebalance_domains);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a484bb527ee4..3665dd935649 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -487,9 +487,11 @@ extern int sched_group_set_idle(struct task_group *tg, long idle);
 #ifdef CONFIG_SMP
 extern void set_task_rq_fair(struct sched_entity *se,
 			     struct cfs_rq *prev, struct cfs_rq *next);
+extern void sched_update_domains(void);
 #else /* !CONFIG_SMP */
 static inline void set_task_rq_fair(struct sched_entity *se,
 			     struct cfs_rq *prev, struct cfs_rq *next) { }
+static inline void sched_update_domains(void) {}
 #endif /* CONFIG_SMP */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
@@ -578,6 +580,7 @@ struct cfs_rq {
 #endif
 
 #ifdef CONFIG_SMP
+	struct shared_runq	*shared_runq;
 	/*
 	 * CFS load tracking
 	 */
@@ -2282,6 +2285,7 @@ struct sched_class {
 	void (*rq_offline)(struct rq *rq);
 
 	struct rq *(*find_lock_rq)(struct task_struct *p, struct rq *rq);
+	void (*update_domains)(void);
 #endif
 
 	void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 05a5bc678c08..8aaf644d4f2c 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2576,6 +2576,8 @@ int __init sched_init_domains(const struct cpumask *cpu_map)
 		doms_cur = &fallback_doms;
 	cpumask_and(doms_cur[0], cpu_map, housekeeping_cpumask(HK_TYPE_DOMAIN));
 	err = build_sched_domains(doms_cur[0], NULL);
+	if (!err)
+		sched_update_domains();
 
 	return err;
 }
@@ -2741,7 +2743,7 @@ void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
 	dattr_cur = dattr_new;
 	ndoms_cur = ndoms_new;
 
-	update_sched_domain_debugfs();
+	sched_update_domains();
 }
 
 /*
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v3 7/7] sched: Shard per-LLC shared runqueues
  2023-08-09 22:12 [PATCH v3 0/7] sched: Implement shared runqueue in CFS David Vernet
                   ` (5 preceding siblings ...)
  2023-08-09 22:12 ` [PATCH v3 6/7] sched: Implement shared runqueue in CFS David Vernet
@ 2023-08-09 22:12 ` David Vernet
  2023-08-09 23:46   ` kernel test robot
                     ` (2 more replies)
  2023-08-17  8:42 ` [PATCH v3 0/7] sched: Implement shared runqueue in CFS Gautham R. Shenoy
                   ` (2 subsequent siblings)
  9 siblings, 3 replies; 52+ messages in thread
From: David Vernet @ 2023-08-09 22:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot, vschneid, tj, roman.gushchin,
	gautham.shenoy, kprateek.nayak, aaron.lu, wuyun.abel,
	kernel-team

The SHARED_RUNQ scheduler feature creates a FIFO queue per LLC that
tasks are put into on enqueue, and pulled from when a core in that LLC
would otherwise go idle. For CPUs with large LLCs, this can sometimes
cause significant contention, as illustrated in [0].

[0]: https://lore.kernel.org/all/c8419d9b-2b31-2190-3058-3625bdbcb13d@meta.com/

So as to try and mitigate this contention, we can instead shard the
per-LLC runqueue into multiple per-LLC shards.

While this doesn't outright prevent all contention, it does somewhat
mitigate it. For example, if we run the following schbench command
which does almost nothing other than pound the runqueue:

schbench -L -m 52 -p 512 -r 10 -t 1

we observe with lockstats that sharding significantly decreases
contention.

3 shards:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name         con-bounces    contentions       waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&shard->lock:      31510503       31510711           0.08          19.98        168932319.64     5.36            31700383      31843851       0.03           17.50        10273968.33      0.32
------------
&shard->lock       15731657          [<0000000068c0fd75>] pick_next_task_fair+0x4dd/0x510
&shard->lock       15756516          [<000000001faf84f9>] enqueue_task_fair+0x459/0x530
&shard->lock          21766          [<00000000126ec6ab>] newidle_balance+0x45a/0x650
&shard->lock            772          [<000000002886c365>] dequeue_task_fair+0x4c9/0x540
------------
&shard->lock          23458          [<00000000126ec6ab>] newidle_balance+0x45a/0x650
&shard->lock       16505108          [<000000001faf84f9>] enqueue_task_fair+0x459/0x530
&shard->lock       14981310          [<0000000068c0fd75>] pick_next_task_fair+0x4dd/0x510
&shard->lock            835          [<000000002886c365>] dequeue_task_fair+0x4c9/0x540

No sharding:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name        con-bounces    contentions         waittime-min   waittime-max waittime-total         waittime-avg    acq-bounces   acquisitions   holdtime-min  holdtime-max holdtime-total   holdtime-avg
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&shard->lock:     117868635      118361486           0.09           393.01       1250954097.25          10.57           119345882     119780601      0.05          343.35       38313419.51      0.32
------------
&shard->lock       59169196          [<0000000060507011>] __enqueue_entity+0xdc/0x110
&shard->lock       59084239          [<00000000f1c67316>] __dequeue_entity+0x78/0xa0
&shard->lock         108051          [<00000000084a6193>] newidle_balance+0x45a/0x650
------------
&shard->lock       60028355          [<0000000060507011>] __enqueue_entity+0xdc/0x110
&shard->lock         119882          [<00000000084a6193>] newidle_balance+0x45a/0x650
&shard->lock       58213249          [<00000000f1c67316>] __dequeue_entity+0x78/0xa0

The contention is ~3-4x worse if we don't shard at all. This roughly
matches the fact that we had 3 shards on the host where this was
collected. This could be addressed in future patch sets by adding a
debugfs knob to control the sharding granularity. If we make the shards
even smaller (what's in this patch, i.e. a size of 6), the contention
goes away almost entirely:

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name    	   con-bounces    contentions   waittime-min  waittime-max waittime-total   waittime-avg   acq-bounces   acquisitions   holdtime-min  holdtime-max holdtime-total   holdtime-avg
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&shard->lock:      13839849       13877596      0.08          13.23        5389564.95       0.39           46910241      48069307       0.06          16.40        16534469.35      0.34
------------
&shard->lock           3559          [<00000000ea455dcc>] newidle_balance+0x45a/0x650
&shard->lock        6992418          [<000000002266f400>] __dequeue_entity+0x78/0xa0
&shard->lock        6881619          [<000000002a62f2e0>] __enqueue_entity+0xdc/0x110
------------
&shard->lock        6640140          [<000000002266f400>] __dequeue_entity+0x78/0xa0
&shard->lock           3523          [<00000000ea455dcc>] newidle_balance+0x45a/0x650
&shard->lock        7233933          [<000000002a62f2e0>] __enqueue_entity+0xdc/0x110

Interestingly, SHARED_RUNQ performs worse than NO_SHARED_RUNQ on the schbench
benchmark on Milan, but we contend even more on the rq lock:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name         con-bounces    contentions   waittime-min  waittime-max waittime-total   waittime-avg   acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&rq->__lock:       9617614        9656091       0.10          79.64        69665812.00      7.21           18092700      67652829       0.11           82.38        344524858.87     5.09
-----------
&rq->__lock        6301611          [<000000003e63bf26>] task_rq_lock+0x43/0xe0
&rq->__lock        2530807          [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock         109360          [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
&rq->__lock         178218          [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170
-----------
&rq->__lock        3245506          [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock        1294355          [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170
&rq->__lock        2837804          [<000000003e63bf26>] task_rq_lock+0x43/0xe0
&rq->__lock        1627866          [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10

..................................................................................................................................................................................................

&shard->lock:       7338558       7343244       0.10          35.97        7173949.14       0.98           30200858      32679623       0.08           35.59        16270584.52      0.50
------------
&shard->lock        2004142          [<00000000f8aa2c91>] __dequeue_entity+0x78/0xa0
&shard->lock        2611264          [<00000000473978cc>] newidle_balance+0x45a/0x650
&shard->lock        2727838          [<0000000028f55bb5>] __enqueue_entity+0xdc/0x110
------------
&shard->lock        2737232          [<00000000473978cc>] newidle_balance+0x45a/0x650
&shard->lock        1693341          [<00000000f8aa2c91>] __dequeue_entity+0x78/0xa0
&shard->lock        2912671          [<0000000028f55bb5>] __enqueue_entity+0xdc/0x110

...................................................................................................................................................................................................

If we look at the lock stats with SHARED_RUNQ disabled, the rq lock still
contends the most, but it's significantly less than with it enabled:

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name          con-bounces    contentions   waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&rq->__lock:        791277         791690        0.12           110.54       4889787.63       6.18            1575996       62390275       0.13           112.66       316262440.56     5.07
-----------
&rq->__lock         263343          [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock          19394          [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
&rq->__lock           4143          [<000000003b542e83>] __task_rq_lock+0x51/0xf0
&rq->__lock          51094          [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170
-----------
&rq->__lock          23756          [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
&rq->__lock         379048          [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock            677          [<000000003b542e83>] __task_rq_lock+0x51/0xf0
&rq->__lock          47962          [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170

In general, the takeaway here is that sharding does help with
contention, but it's not necessarily one-size-fits-all, and it's
workload dependent. For now, let's include sharding to try and avoid
contention, and because it doesn't seem to regress CPUs that don't
need it, such as the AMD 7950X.
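
As a small illustration of how the sharded pull behaves (a standalone
userspace sketch, not kernel code): the scan order mirrors
shared_runq_pick_next_task() below, which starts at the pulling CPU's
own shard and wraps around, so a newly idle CPU typically only touches
its own shard's lock when that shard has eligible work:

#include <stdio.h>

int main(void)
{
	unsigned int nr_shards = 6;	/* e.g. a 36-CPU LLC, shard size 6 */
	unsigned int cpu = 17;
	/* Same mapping as shared_runq_shard_idx(): (cpu >> 1) % nr_shards */
	unsigned int start = (cpu >> 1) % nr_shards;
	unsigned int i;

	/* Order in which the newly idle CPU would scan the shards. */
	printf("newly idle cpu %u scans shards in order:", cpu);
	for (i = 0; i < nr_shards; i++)
		printf(" %u", (start + i) % nr_shards);
	printf("\n");
	return 0;
}

For CPU 17 this prints shards 2 3 4 5 0 1, i.e. an O(num_shards) walk
in the worst case.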

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: David Vernet <void@manifault.com>
---
 kernel/sched/fair.c  | 149 ++++++++++++++++++++++++++++++-------------
 kernel/sched/sched.h |   3 +-
 2 files changed, 108 insertions(+), 44 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e740f8da578..d67d86d3bfdf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -143,19 +143,27 @@ __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
  * struct shared_runq - Per-LLC queue structure for enqueuing and migrating
  * runnable tasks within an LLC.
  *
+ * struct shared_runq_shard - A structure containing a task list and a spinlock
+ * for a subset of cores in a struct shared_runq.
+ *
  * WHAT
  * ====
  *
  * This structure enables the scheduler to be more aggressively work
- * conserving, by placing waking tasks on a per-LLC FIFO queue that can then be
- * pulled from when another core in the LLC is going to go idle.
+ * conserving, by placing waking tasks on a per-LLC FIFO queue shard that can
+ * then be pulled from when another core in the LLC is going to go idle.
+ *
+ * struct rq stores two pointers in its struct cfs_rq:
+ *
+ * 1. The per-LLC struct shared_runq which contains one or more shards of
+ *    enqueued tasks.
  *
- * struct rq stores a pointer to its LLC's shared_runq via struct cfs_rq.
- * Waking tasks are enqueued in the calling CPU's struct shared_runq in
- * __enqueue_entity(), and are opportunistically pulled from the shared_runq
- * in newidle_balance(). Tasks enqueued in a shared_runq may be scheduled prior
- * to being pulled from the shared_runq, in which case they're simply dequeued
- * from the shared_runq in __dequeue_entity().
+ * 2. The shard inside of the per-LLC struct shared_runq which contains the
+ *    list of runnable tasks for that shard.
+ *
+ * Waking tasks are enqueued in the calling CPU's struct shared_runq_shard in
+ * __enqueue_entity(), and are opportunistically pulled from the shared_runq in
+ * newidle_balance(). Pulling from shards is an O(# shards) operation.
  *
  * There is currently no task-stealing between shared_runqs in different LLCs,
  * which means that shared_runq is not fully work conserving. This could be
@@ -165,11 +173,12 @@ __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
  * HOW
  * ===
  *
- * A shared_runq is comprised of a list, and a spinlock for synchronization.
- * Given that the critical section for a shared_runq is typically a fast list
- * operation, and that the shared_runq is localized to a single LLC, the
- * spinlock will typically only be contended on workloads that do little else
- * other than hammer the runqueue.
+ * A struct shared_runq_shard is comprised of a list, and a spinlock for
+ * synchronization.  Given that the critical section for a shared_runq is
+ * typically a fast list operation, and that the shared_runq_shard is localized
+ * to a subset of cores on a single LLC (plus other cores in the LLC that pull
+ * from the shard in newidle_balance()), the spinlock will typically only be
+ * contended on workloads that do little else other than hammer the runqueue.
  *
  * WHY
  * ===
@@ -183,11 +192,21 @@ __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
  * it, as well as to strike a balance between work conservation, and L3 cache
  * locality.
  */
-struct shared_runq {
+struct shared_runq_shard {
 	struct list_head list;
 	raw_spinlock_t lock;
 } ____cacheline_aligned;
 
+/* This would likely work better as a configurable knob via debugfs */
+#define SHARED_RUNQ_SHARD_SZ 6
+#define SHARED_RUNQ_MAX_SHARDS \
+	((NR_CPUS / SHARED_RUNQ_SHARD_SZ) + (NR_CPUS % SHARED_RUNQ_SHARD_SZ != 0))
+
+struct shared_runq {
+	unsigned int num_shards;
+	struct shared_runq_shard shards[SHARED_RUNQ_MAX_SHARDS];
+} ____cacheline_aligned;
+
 #ifdef CONFIG_SMP
 
 static DEFINE_PER_CPU(struct shared_runq, shared_runqs);
@@ -197,31 +216,61 @@ static struct shared_runq *rq_shared_runq(struct rq *rq)
 	return rq->cfs.shared_runq;
 }
 
+static struct shared_runq_shard *rq_shared_runq_shard(struct rq *rq)
+{
+	return rq->cfs.shard;
+}
+
+static int shared_runq_shard_idx(const struct shared_runq *runq, int cpu)
+{
+	return (cpu >> 1) % runq->num_shards;
+}
+
 static void shared_runq_reassign_domains(void)
 {
 	int i;
 	struct shared_runq *shared_runq;
 	struct rq *rq;
 	struct rq_flags rf;
+	unsigned int num_shards, shard_idx;
+
+	for_each_possible_cpu(i) {
+		if (per_cpu(sd_llc_id, i) == i) {
+			shared_runq = &per_cpu(shared_runqs, per_cpu(sd_llc_id, i));
+
+			num_shards = per_cpu(sd_llc_size, i) / SHARED_RUNQ_SHARD_SZ;
+			if (per_cpu(sd_llc_size, i) % SHARED_RUNQ_SHARD_SZ)
+				num_shards++;
+			shared_runq->num_shards = num_shards;
+		}
+	}
 
 	for_each_possible_cpu(i) {
 		rq = cpu_rq(i);
 		shared_runq = &per_cpu(shared_runqs, per_cpu(sd_llc_id, i));
 
+		shard_idx = shared_runq_shard_idx(shared_runq, i);
 		rq_lock(rq, &rf);
 		rq->cfs.shared_runq = shared_runq;
+		rq->cfs.shard = &shared_runq->shards[shard_idx];
 		rq_unlock(rq, &rf);
 	}
 }
 
 static void __shared_runq_drain(struct shared_runq *shared_runq)
 {
-	struct task_struct *p, *tmp;
+	unsigned int i;
 
-	raw_spin_lock(&shared_runq->lock);
-	list_for_each_entry_safe(p, tmp, &shared_runq->list, shared_runq_node)
-		list_del_init(&p->shared_runq_node);
-	raw_spin_unlock(&shared_runq->lock);
+	for (i = 0; i < shared_runq->num_shards; i++) {
+		struct shared_runq_shard *shard;
+		struct task_struct *p, *tmp;
+
+		shard = &shared_runq->shards[i];
+		raw_spin_lock(&shard->lock);
+		list_for_each_entry_safe(p, tmp, &shard->list, shared_runq_node)
+			list_del_init(&p->shared_runq_node);
+		raw_spin_unlock(&shard->lock);
+	}
 }
 
 static void update_domains_fair(void)
@@ -272,35 +321,32 @@ void shared_runq_toggle(bool enabling)
 	}
 }
 
-static struct task_struct *shared_runq_pop_task(struct rq *rq)
+static struct task_struct *
+shared_runq_pop_task(struct shared_runq_shard *shard, int target)
 {
 	struct task_struct *p;
-	struct shared_runq *shared_runq;
 
-	shared_runq = rq_shared_runq(rq);
-	if (list_empty(&shared_runq->list))
+	if (list_empty(&shard->list))
 		return NULL;
 
-	raw_spin_lock(&shared_runq->lock);
-	p = list_first_entry_or_null(&shared_runq->list, struct task_struct,
+	raw_spin_lock(&shard->lock);
+	p = list_first_entry_or_null(&shard->list, struct task_struct,
 				     shared_runq_node);
-	if (p && is_cpu_allowed(p, cpu_of(rq)))
+	if (p && is_cpu_allowed(p, target))
 		list_del_init(&p->shared_runq_node);
 	else
 		p = NULL;
-	raw_spin_unlock(&shared_runq->lock);
+	raw_spin_unlock(&shard->lock);
 
 	return p;
 }
 
-static void shared_runq_push_task(struct rq *rq, struct task_struct *p)
+static void shared_runq_push_task(struct shared_runq_shard *shard,
+				  struct task_struct *p)
 {
-	struct shared_runq *shared_runq;
-
-	shared_runq = rq_shared_runq(rq);
-	raw_spin_lock(&shared_runq->lock);
-	list_add_tail(&p->shared_runq_node, &shared_runq->list);
-	raw_spin_unlock(&shared_runq->lock);
+	raw_spin_lock(&shard->lock);
+	list_add_tail(&p->shared_runq_node, &shard->list);
+	raw_spin_unlock(&shard->lock);
 }
 
 static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p)
@@ -314,7 +360,7 @@ static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p)
 	if (p->nr_cpus_allowed == 1)
 		return;
 
-	shared_runq_push_task(rq, p);
+	shared_runq_push_task(rq_shared_runq_shard(rq), p);
 }
 
 static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
@@ -322,9 +368,22 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
 	struct task_struct *p = NULL;
 	struct rq *src_rq;
 	struct rq_flags src_rf;
+	struct shared_runq *shared_runq;
+	struct shared_runq_shard *shard;
+	u32 i, starting_idx, curr_idx, num_shards;
 	int ret = -1;
 
-	p = shared_runq_pop_task(rq);
+	shared_runq = rq_shared_runq(rq);
+	num_shards = shared_runq->num_shards;
+	starting_idx = shared_runq_shard_idx(shared_runq, cpu_of(rq));
+	for (i = 0; i < num_shards; i++) {
+		curr_idx = (starting_idx + i) % num_shards;
+		shard = &shared_runq->shards[curr_idx];
+
+		p = shared_runq_pop_task(shard, cpu_of(rq));
+		if (p)
+			break;
+	}
 	if (!p)
 		return 0;
 
@@ -353,11 +412,11 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
 
 static void shared_runq_dequeue_task(struct task_struct *p)
 {
-	struct shared_runq *shared_runq;
+	struct shared_runq_shard *shard;
 
 	if (!list_empty(&p->shared_runq_node)) {
-		shared_runq = rq_shared_runq(task_rq(p));
-		raw_spin_lock(&shared_runq->lock);
+		shard = rq_shared_runq_shard(task_rq(p));
+		raw_spin_lock(&shard->lock);
 		/*
 		 * Need to double-check for the list being empty to avoid
 		 * racing with the list being drained on the domain recreation
@@ -365,7 +424,7 @@ static void shared_runq_dequeue_task(struct task_struct *p)
 		 */
 		if (likely(!list_empty(&p->shared_runq_node)))
 			list_del_init(&p->shared_runq_node);
-		raw_spin_unlock(&shared_runq->lock);
+		raw_spin_unlock(&shard->lock);
 	}
 }
 
@@ -13260,8 +13319,9 @@ void show_numa_stats(struct task_struct *p, struct seq_file *m)
 __init void init_sched_fair_class(void)
 {
 #ifdef CONFIG_SMP
-	int i;
+	int i, j;
 	struct shared_runq *shared_runq;
+	struct shared_runq_shard *shard;
 
 	for_each_possible_cpu(i) {
 		zalloc_cpumask_var_node(&per_cpu(load_balance_mask, i), GFP_KERNEL, cpu_to_node(i));
@@ -13272,8 +13332,11 @@ __init void init_sched_fair_class(void)
 		INIT_LIST_HEAD(&cpu_rq(i)->cfsb_csd_list);
 #endif
 		shared_runq = &per_cpu(shared_runqs, i);
-		INIT_LIST_HEAD(&shared_runq->list);
-		raw_spin_lock_init(&shared_runq->lock);
+		for (j = 0; j < SHARED_RUNQ_MAX_SHARDS; j++) {
+			shard = &shared_runq->shards[j];
+			INIT_LIST_HEAD(&shard->list);
+			raw_spin_lock_init(&shard->lock);
+		}
 	}
 
 	open_softirq(SCHED_SOFTIRQ, run_rebalance_domains);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3665dd935649..b504f8f4416b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -580,7 +580,8 @@ struct cfs_rq {
 #endif
 
 #ifdef CONFIG_SMP
-	struct shared_runq	*shared_runq;
+	struct shared_runq	 *shared_runq;
+	struct shared_runq_shard *shard;
 	/*
 	 * CFS load tracking
 	 */
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 7/7] sched: Shard per-LLC shared runqueues
  2023-08-09 22:12 ` [PATCH v3 7/7] sched: Shard per-LLC shared runqueues David Vernet
@ 2023-08-09 23:46   ` kernel test robot
  2023-08-10  0:12     ` David Vernet
  2023-08-10  7:11   ` kernel test robot
  2023-08-30  6:17   ` Chen Yu
  2 siblings, 1 reply; 52+ messages in thread
From: kernel test robot @ 2023-08-09 23:46 UTC (permalink / raw)
  To: David Vernet, linux-kernel
  Cc: oe-kbuild-all, peterz, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	tj, roman.gushchin, gautham.shenoy, kprateek.nayak, aaron.lu,
	wuyun.abel, kernel-team

Hi David,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tip/sched/core]
[cannot apply to linus/master v6.5-rc5 next-20230809]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/David-Vernet/sched-Expose-move_queued_task-from-core-c/20230810-061611
base:   tip/sched/core
patch link:    https://lore.kernel.org/r/20230809221218.163894-8-void%40manifault.com
patch subject: [PATCH v3 7/7] sched: Shard per-LLC shared runqueues
config: loongarch-allyesconfig (https://download.01.org/0day-ci/archive/20230810/202308100717.LGL1juJR-lkp@intel.com/config)
compiler: loongarch64-linux-gcc (GCC) 12.3.0
reproduce: (https://download.01.org/0day-ci/archive/20230810/202308100717.LGL1juJR-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202308100717.LGL1juJR-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> kernel/sched/fair.c:198: warning: expecting prototype for struct shared_runq. Prototype was for struct shared_runq_shard instead


vim +198 kernel/sched/fair.c

05289b90c2e40a Thara Gopinath 2020-02-21  141  
7cc7fb0f3200dd David Vernet   2023-08-09  142  /**
7cc7fb0f3200dd David Vernet   2023-08-09  143   * struct shared_runq - Per-LLC queue structure for enqueuing and migrating
7cc7fb0f3200dd David Vernet   2023-08-09  144   * runnable tasks within an LLC.
7cc7fb0f3200dd David Vernet   2023-08-09  145   *
54c971b941e0bd David Vernet   2023-08-09  146   * struct shared_runq_shard - A structure containing a task list and a spinlock
54c971b941e0bd David Vernet   2023-08-09  147   * for a subset of cores in a struct shared_runq.
54c971b941e0bd David Vernet   2023-08-09  148   *
7cc7fb0f3200dd David Vernet   2023-08-09  149   * WHAT
7cc7fb0f3200dd David Vernet   2023-08-09  150   * ====
7cc7fb0f3200dd David Vernet   2023-08-09  151   *
7cc7fb0f3200dd David Vernet   2023-08-09  152   * This structure enables the scheduler to be more aggressively work
54c971b941e0bd David Vernet   2023-08-09  153   * conserving, by placing waking tasks on a per-LLC FIFO queue shard that can
54c971b941e0bd David Vernet   2023-08-09  154   * then be pulled from when another core in the LLC is going to go idle.
54c971b941e0bd David Vernet   2023-08-09  155   *
54c971b941e0bd David Vernet   2023-08-09  156   * struct rq stores two pointers in its struct cfs_rq:
54c971b941e0bd David Vernet   2023-08-09  157   *
54c971b941e0bd David Vernet   2023-08-09  158   * 1. The per-LLC struct shared_runq which contains one or more shards of
54c971b941e0bd David Vernet   2023-08-09  159   *    enqueued tasks.
7cc7fb0f3200dd David Vernet   2023-08-09  160   *
54c971b941e0bd David Vernet   2023-08-09  161   * 2. The shard inside of the per-LLC struct shared_runq which contains the
54c971b941e0bd David Vernet   2023-08-09  162   *    list of runnable tasks for that shard.
54c971b941e0bd David Vernet   2023-08-09  163   *
54c971b941e0bd David Vernet   2023-08-09  164   * Waking tasks are enqueued in the calling CPU's struct shared_runq_shard in
54c971b941e0bd David Vernet   2023-08-09  165   * __enqueue_entity(), and are opportunistically pulled from the shared_runq in
54c971b941e0bd David Vernet   2023-08-09  166   * newidle_balance(). Pulling from shards is an O(# shards) operation.
7cc7fb0f3200dd David Vernet   2023-08-09  167   *
7cc7fb0f3200dd David Vernet   2023-08-09  168   * There is currently no task-stealing between shared_runqs in different LLCs,
7cc7fb0f3200dd David Vernet   2023-08-09  169   * which means that shared_runq is not fully work conserving. This could be
7cc7fb0f3200dd David Vernet   2023-08-09  170   * added at a later time, with tasks likely only being stolen across
7cc7fb0f3200dd David Vernet   2023-08-09  171   * shared_runqs on the same NUMA node to avoid violating NUMA affinities.
7cc7fb0f3200dd David Vernet   2023-08-09  172   *
7cc7fb0f3200dd David Vernet   2023-08-09  173   * HOW
7cc7fb0f3200dd David Vernet   2023-08-09  174   * ===
7cc7fb0f3200dd David Vernet   2023-08-09  175   *
54c971b941e0bd David Vernet   2023-08-09  176   * A struct shared_runq_shard is comprised of a list, and a spinlock for
54c971b941e0bd David Vernet   2023-08-09  177   * synchronization.  Given that the critical section for a shared_runq is
54c971b941e0bd David Vernet   2023-08-09  178   * typically a fast list operation, and that the shared_runq_shard is localized
54c971b941e0bd David Vernet   2023-08-09  179   * to a subset of cores on a single LLC (plus other cores in the LLC that pull
54c971b941e0bd David Vernet   2023-08-09  180   * from the shard in newidle_balance()), the spinlock will typically only be
54c971b941e0bd David Vernet   2023-08-09  181   * contended on workloads that do little else other than hammer the runqueue.
7cc7fb0f3200dd David Vernet   2023-08-09  182   *
7cc7fb0f3200dd David Vernet   2023-08-09  183   * WHY
7cc7fb0f3200dd David Vernet   2023-08-09  184   * ===
7cc7fb0f3200dd David Vernet   2023-08-09  185   *
7cc7fb0f3200dd David Vernet   2023-08-09  186   * As mentioned above, the main benefit of shared_runq is that it enables more
7cc7fb0f3200dd David Vernet   2023-08-09  187   * aggressive work conservation in the scheduler. This can benefit workloads
7cc7fb0f3200dd David Vernet   2023-08-09  188   * that benefit more from CPU utilization than from L1/L2 cache locality.
7cc7fb0f3200dd David Vernet   2023-08-09  189   *
7cc7fb0f3200dd David Vernet   2023-08-09  190   * shared_runqs are segmented across LLCs both to avoid contention on the
7cc7fb0f3200dd David Vernet   2023-08-09  191   * shared_runq spinlock by minimizing the number of CPUs that could contend on
7cc7fb0f3200dd David Vernet   2023-08-09  192   * it, as well as to strike a balance between work conservation, and L3 cache
7cc7fb0f3200dd David Vernet   2023-08-09  193   * locality.
7cc7fb0f3200dd David Vernet   2023-08-09  194   */
54c971b941e0bd David Vernet   2023-08-09  195  struct shared_runq_shard {
7cc7fb0f3200dd David Vernet   2023-08-09  196  	struct list_head list;
7cc7fb0f3200dd David Vernet   2023-08-09  197  	raw_spinlock_t lock;
7cc7fb0f3200dd David Vernet   2023-08-09 @198  } ____cacheline_aligned;
7cc7fb0f3200dd David Vernet   2023-08-09  199  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 7/7] sched: Shard per-LLC shared runqueues
  2023-08-09 23:46   ` kernel test robot
@ 2023-08-10  0:12     ` David Vernet
  0 siblings, 0 replies; 52+ messages in thread
From: David Vernet @ 2023-08-10  0:12 UTC (permalink / raw)
  To: kernel test robot
  Cc: linux-kernel, oe-kbuild-all, peterz, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, tj, roman.gushchin, gautham.shenoy,
	kprateek.nayak, aaron.lu, wuyun.abel, kernel-team

On Thu, Aug 10, 2023 at 07:46:37AM +0800, kernel test robot wrote:
> Hi David,
> 
> kernel test robot noticed the following build warnings:
> 
> [auto build test WARNING on tip/sched/core]
> [cannot apply to linus/master v6.5-rc5 next-20230809]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/David-Vernet/sched-Expose-move_queued_task-from-core-c/20230810-061611
> base:   tip/sched/core
> patch link:    https://lore.kernel.org/r/20230809221218.163894-8-void%40manifault.com
> patch subject: [PATCH v3 7/7] sched: Shard per-LLC shared runqueues
> config: loongarch-allyesconfig (https://download.01.org/0day-ci/archive/20230810/202308100717.LGL1juJR-lkp@intel.com/config)
> compiler: loongarch64-linux-gcc (GCC) 12.3.0
> reproduce: (https://download.01.org/0day-ci/archive/20230810/202308100717.LGL1juJR-lkp@intel.com/reproduce)
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202308100717.LGL1juJR-lkp@intel.com/
> 
> All warnings (new ones prefixed by >>):
> 
> >> kernel/sched/fair.c:198: warning: expecting prototype for struct shared_runq. Prototype was for struct shared_runq_shard instead

I'll split this comment up in v4.
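
Roughly, the split means one kernel-doc block per struct, with the longer
WHAT / HOW / WHY discussion kept on the struct it actually describes. A
minimal sketch of that shape (illustrative only, not the actual v4 wording):

	/**
	 * struct shared_runq_shard - A subset of a per-LLC shared runqueue,
	 * containing a task list and a raw spinlock for a group of cores.
	 */
	struct shared_runq_shard {
		struct list_head list;
		raw_spinlock_t lock;
	} ____cacheline_aligned;

	/**
	 * struct shared_runq - Per-LLC queue structure for enqueuing and
	 * migrating runnable tasks within an LLC, made up of one or more
	 * shards.
	 *
	 * (The WHAT / HOW / WHY discussion from the original comment would
	 * live here, since most of it describes the shared_runq as a whole.)
	 */
	struct shared_runq {
		/* ... layout as in the patch ... */
	};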

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 6/7] sched: Implement shared runqueue in CFS
  2023-08-09 22:12 ` [PATCH v3 6/7] sched: Implement shared runqueue in CFS David Vernet
@ 2023-08-10  7:11   ` kernel test robot
  2023-08-10  7:41   ` kernel test robot
  2023-08-30  6:46   ` K Prateek Nayak
  2 siblings, 0 replies; 52+ messages in thread
From: kernel test robot @ 2023-08-10  7:11 UTC (permalink / raw)
  To: David Vernet, linux-kernel
  Cc: oe-kbuild-all, peterz, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	tj, roman.gushchin, gautham.shenoy, kprateek.nayak, aaron.lu,
	wuyun.abel, kernel-team

Hi David,

kernel test robot noticed the following build errors:

[auto build test ERROR on tip/sched/core]
[cannot apply to linus/master v6.5-rc5 next-20230809]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/David-Vernet/sched-Expose-move_queued_task-from-core-c/20230810-061611
base:   tip/sched/core
patch link:    https://lore.kernel.org/r/20230809221218.163894-7-void%40manifault.com
patch subject: [PATCH v3 6/7] sched: Implement shared runqueue in CFS
config: sparc-randconfig-r015-20230809 (https://download.01.org/0day-ci/archive/20230810/202308101517.FuIh97h7-lkp@intel.com/config)
compiler: sparc-linux-gcc (GCC) 12.3.0
reproduce: (https://download.01.org/0day-ci/archive/20230810/202308101517.FuIh97h7-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202308101517.FuIh97h7-lkp@intel.com/

All error/warnings (new ones prefixed by >>):

>> kernel/sched/core.c:9768:6: warning: no previous prototype for 'sched_update_domains' [-Wmissing-prototypes]
    9768 | void sched_update_domains(void)
         |      ^~~~~~~~~~~~~~~~~~~~
--
   In file included from kernel/sched/build_utility.c:89:
   kernel/sched/topology.c: In function 'sched_init_domains':
>> kernel/sched/topology.c:2580:17: error: implicit declaration of function 'sched_update_domains'; did you mean 'sched_update_scaling'? [-Werror=implicit-function-declaration]
    2580 |                 sched_update_domains();
         |                 ^~~~~~~~~~~~~~~~~~~~
         |                 sched_update_scaling
   cc1: some warnings being treated as errors


vim +2580 kernel/sched/topology.c

  2558	
  2559	/*
  2560	 * Set up scheduler domains and groups.  For now this just excludes isolated
  2561	 * CPUs, but could be used to exclude other special cases in the future.
  2562	 */
  2563	int __init sched_init_domains(const struct cpumask *cpu_map)
  2564	{
  2565		int err;
  2566	
  2567		zalloc_cpumask_var(&sched_domains_tmpmask, GFP_KERNEL);
  2568		zalloc_cpumask_var(&sched_domains_tmpmask2, GFP_KERNEL);
  2569		zalloc_cpumask_var(&fallback_doms, GFP_KERNEL);
  2570	
  2571		arch_update_cpu_topology();
  2572		asym_cpu_capacity_scan();
  2573		ndoms_cur = 1;
  2574		doms_cur = alloc_sched_domains(ndoms_cur);
  2575		if (!doms_cur)
  2576			doms_cur = &fallback_doms;
  2577		cpumask_and(doms_cur[0], cpu_map, housekeeping_cpumask(HK_TYPE_DOMAIN));
  2578		err = build_sched_domains(doms_cur[0], NULL);
  2579		if (!err)
> 2580			sched_update_domains();
  2581	
  2582		return err;
  2583	}
  2584	
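
The -Wmissing-prototypes warning in core.c and the implicit-declaration error
in topology.c point at the same thing: sched_update_domains() is defined in
kernel/sched/core.c without a declaration visible to its callers in
kernel/sched/topology.c. A minimal sketch of the kind of declaration that
would address both, assuming kernel/sched/sched.h is the right shared header
(whether a !SMP stub is needed depends on how the callers are guarded):

	/* kernel/sched/sched.h -- sketch of the missing declaration */
	#ifdef CONFIG_SMP
	extern void sched_update_domains(void);
	#else
	static inline void sched_update_domains(void) { }
	#endif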

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 7/7] sched: Shard per-LLC shared runqueues
  2023-08-09 22:12 ` [PATCH v3 7/7] sched: Shard per-LLC shared runqueues David Vernet
  2023-08-09 23:46   ` kernel test robot
@ 2023-08-10  7:11   ` kernel test robot
  2023-08-30  6:17   ` Chen Yu
  2 siblings, 0 replies; 52+ messages in thread
From: kernel test robot @ 2023-08-10  7:11 UTC (permalink / raw)
  To: David Vernet, linux-kernel
  Cc: llvm, oe-kbuild-all, peterz, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	tj, roman.gushchin, gautham.shenoy, kprateek.nayak, aaron.lu,
	wuyun.abel, kernel-team

Hi David,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tip/sched/core]
[cannot apply to linus/master v6.5-rc5 next-20230809]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/David-Vernet/sched-Expose-move_queued_task-from-core-c/20230810-061611
base:   tip/sched/core
patch link:    https://lore.kernel.org/r/20230809221218.163894-8-void%40manifault.com
patch subject: [PATCH v3 7/7] sched: Shard per-LLC shared runqueues
config: hexagon-randconfig-r041-20230809 (https://download.01.org/0day-ci/archive/20230810/202308101540.7XQCJ2ea-lkp@intel.com/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project.git 4a5ac14ee968ff0ad5d2cc1ffa0299048db4c88a)
reproduce: (https://download.01.org/0day-ci/archive/20230810/202308101540.7XQCJ2ea-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202308101540.7XQCJ2ea-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> kernel/sched/fair.c:198: warning: expecting prototype for struct shared_runq. Prototype was for struct shared_runq_shard instead


vim +198 kernel/sched/fair.c

05289b90c2e40ae Thara Gopinath 2020-02-21  141  
7cc7fb0f3200dd3 David Vernet   2023-08-09  142  /**
7cc7fb0f3200dd3 David Vernet   2023-08-09  143   * struct shared_runq - Per-LLC queue structure for enqueuing and migrating
7cc7fb0f3200dd3 David Vernet   2023-08-09  144   * runnable tasks within an LLC.
7cc7fb0f3200dd3 David Vernet   2023-08-09  145   *
54c971b941e0bd0 David Vernet   2023-08-09  146   * struct shared_runq_shard - A structure containing a task list and a spinlock
54c971b941e0bd0 David Vernet   2023-08-09  147   * for a subset of cores in a struct shared_runq.
54c971b941e0bd0 David Vernet   2023-08-09  148   *
7cc7fb0f3200dd3 David Vernet   2023-08-09  149   * WHAT
7cc7fb0f3200dd3 David Vernet   2023-08-09  150   * ====
7cc7fb0f3200dd3 David Vernet   2023-08-09  151   *
7cc7fb0f3200dd3 David Vernet   2023-08-09  152   * This structure enables the scheduler to be more aggressively work
54c971b941e0bd0 David Vernet   2023-08-09  153   * conserving, by placing waking tasks on a per-LLC FIFO queue shard that can
54c971b941e0bd0 David Vernet   2023-08-09  154   * then be pulled from when another core in the LLC is going to go idle.
54c971b941e0bd0 David Vernet   2023-08-09  155   *
54c971b941e0bd0 David Vernet   2023-08-09  156   * struct rq stores two pointers in its struct cfs_rq:
54c971b941e0bd0 David Vernet   2023-08-09  157   *
54c971b941e0bd0 David Vernet   2023-08-09  158   * 1. The per-LLC struct shared_runq which contains one or more shards of
54c971b941e0bd0 David Vernet   2023-08-09  159   *    enqueued tasks.
7cc7fb0f3200dd3 David Vernet   2023-08-09  160   *
54c971b941e0bd0 David Vernet   2023-08-09  161   * 2. The shard inside of the per-LLC struct shared_runq which contains the
54c971b941e0bd0 David Vernet   2023-08-09  162   *    list of runnable tasks for that shard.
54c971b941e0bd0 David Vernet   2023-08-09  163   *
54c971b941e0bd0 David Vernet   2023-08-09  164   * Waking tasks are enqueued in the calling CPU's struct shared_runq_shard in
54c971b941e0bd0 David Vernet   2023-08-09  165   * __enqueue_entity(), and are opportunistically pulled from the shared_runq in
54c971b941e0bd0 David Vernet   2023-08-09  166   * newidle_balance(). Pulling from shards is an O(# shards) operation.
7cc7fb0f3200dd3 David Vernet   2023-08-09  167   *
7cc7fb0f3200dd3 David Vernet   2023-08-09  168   * There is currently no task-stealing between shared_runqs in different LLCs,
7cc7fb0f3200dd3 David Vernet   2023-08-09  169   * which means that shared_runq is not fully work conserving. This could be
7cc7fb0f3200dd3 David Vernet   2023-08-09  170   * added at a later time, with tasks likely only being stolen across
7cc7fb0f3200dd3 David Vernet   2023-08-09  171   * shared_runqs on the same NUMA node to avoid violating NUMA affinities.
7cc7fb0f3200dd3 David Vernet   2023-08-09  172   *
7cc7fb0f3200dd3 David Vernet   2023-08-09  173   * HOW
7cc7fb0f3200dd3 David Vernet   2023-08-09  174   * ===
7cc7fb0f3200dd3 David Vernet   2023-08-09  175   *
54c971b941e0bd0 David Vernet   2023-08-09  176   * A struct shared_runq_shard is comprised of a list, and a spinlock for
54c971b941e0bd0 David Vernet   2023-08-09  177   * synchronization.  Given that the critical section for a shared_runq is
54c971b941e0bd0 David Vernet   2023-08-09  178   * typically a fast list operation, and that the shared_runq_shard is localized
54c971b941e0bd0 David Vernet   2023-08-09  179   * to a subset of cores on a single LLC (plus other cores in the LLC that pull
54c971b941e0bd0 David Vernet   2023-08-09  180   * from the shard in newidle_balance()), the spinlock will typically only be
54c971b941e0bd0 David Vernet   2023-08-09  181   * contended on workloads that do little else other than hammer the runqueue.
7cc7fb0f3200dd3 David Vernet   2023-08-09  182   *
7cc7fb0f3200dd3 David Vernet   2023-08-09  183   * WHY
7cc7fb0f3200dd3 David Vernet   2023-08-09  184   * ===
7cc7fb0f3200dd3 David Vernet   2023-08-09  185   *
7cc7fb0f3200dd3 David Vernet   2023-08-09  186   * As mentioned above, the main benefit of shared_runq is that it enables more
7cc7fb0f3200dd3 David Vernet   2023-08-09  187   * aggressive work conservation in the scheduler. This can benefit workloads
7cc7fb0f3200dd3 David Vernet   2023-08-09  188   * that benefit more from CPU utilization than from L1/L2 cache locality.
7cc7fb0f3200dd3 David Vernet   2023-08-09  189   *
7cc7fb0f3200dd3 David Vernet   2023-08-09  190   * shared_runqs are segmented across LLCs both to avoid contention on the
7cc7fb0f3200dd3 David Vernet   2023-08-09  191   * shared_runq spinlock by minimizing the number of CPUs that could contend on
7cc7fb0f3200dd3 David Vernet   2023-08-09  192   * it, as well as to strike a balance between work conservation, and L3 cache
7cc7fb0f3200dd3 David Vernet   2023-08-09  193   * locality.
7cc7fb0f3200dd3 David Vernet   2023-08-09  194   */
54c971b941e0bd0 David Vernet   2023-08-09  195  struct shared_runq_shard {
7cc7fb0f3200dd3 David Vernet   2023-08-09  196  	struct list_head list;
7cc7fb0f3200dd3 David Vernet   2023-08-09  197  	raw_spinlock_t lock;
7cc7fb0f3200dd3 David Vernet   2023-08-09 @198  } ____cacheline_aligned;
7cc7fb0f3200dd3 David Vernet   2023-08-09  199  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 6/7] sched: Implement shared runqueue in CFS
  2023-08-09 22:12 ` [PATCH v3 6/7] sched: Implement shared runqueue in CFS David Vernet
  2023-08-10  7:11   ` kernel test robot
@ 2023-08-10  7:41   ` kernel test robot
  2023-08-30  6:46   ` K Prateek Nayak
  2 siblings, 0 replies; 52+ messages in thread
From: kernel test robot @ 2023-08-10  7:41 UTC (permalink / raw)
  To: David Vernet, linux-kernel
  Cc: llvm, oe-kbuild-all, peterz, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	tj, roman.gushchin, gautham.shenoy, kprateek.nayak, aaron.lu,
	wuyun.abel, kernel-team

Hi David,

kernel test robot noticed the following build errors:

[auto build test ERROR on tip/sched/core]
[cannot apply to linus/master v6.5-rc5 next-20230809]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/David-Vernet/sched-Expose-move_queued_task-from-core-c/20230810-061611
base:   tip/sched/core
patch link:    https://lore.kernel.org/r/20230809221218.163894-7-void%40manifault.com
patch subject: [PATCH v3 6/7] sched: Implement shared runqueue in CFS
config: hexagon-randconfig-r045-20230809 (https://download.01.org/0day-ci/archive/20230810/202308101547.1n2K9PyC-lkp@intel.com/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project.git 4a5ac14ee968ff0ad5d2cc1ffa0299048db4c88a)
reproduce: (https://download.01.org/0day-ci/archive/20230810/202308101547.1n2K9PyC-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202308101547.1n2K9PyC-lkp@intel.com/

All error/warnings (new ones prefixed by >>):

   In file included from kernel/sched/core.c:9:
   In file included from include/linux/highmem.h:12:
   In file included from include/linux/hardirq.h:11:
   In file included from ./arch/hexagon/include/generated/asm/hardirq.h:1:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/hexagon/include/asm/io.h:334:
   include/asm-generic/io.h:547:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     547 |         val = __raw_readb(PCI_IOBASE + addr);
         |                           ~~~~~~~~~~ ^
   include/asm-generic/io.h:560:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     560 |         val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
         |                                                         ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/little_endian.h:37:51: note: expanded from macro '__le16_to_cpu'
      37 | #define __le16_to_cpu(x) ((__force __u16)(__le16)(x))
         |                                                   ^
   In file included from kernel/sched/core.c:9:
   In file included from include/linux/highmem.h:12:
   In file included from include/linux/hardirq.h:11:
   In file included from ./arch/hexagon/include/generated/asm/hardirq.h:1:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/hexagon/include/asm/io.h:334:
   include/asm-generic/io.h:573:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     573 |         val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
         |                                                         ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/little_endian.h:35:51: note: expanded from macro '__le32_to_cpu'
      35 | #define __le32_to_cpu(x) ((__force __u32)(__le32)(x))
         |                                                   ^
   In file included from kernel/sched/core.c:9:
   In file included from include/linux/highmem.h:12:
   In file included from include/linux/hardirq.h:11:
   In file included from ./arch/hexagon/include/generated/asm/hardirq.h:1:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/hexagon/include/asm/io.h:334:
   include/asm-generic/io.h:584:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     584 |         __raw_writeb(value, PCI_IOBASE + addr);
         |                             ~~~~~~~~~~ ^
   include/asm-generic/io.h:594:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     594 |         __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
         |                                                       ~~~~~~~~~~ ^
   include/asm-generic/io.h:604:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     604 |         __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
         |                                                       ~~~~~~~~~~ ^
>> kernel/sched/core.c:9768:6: warning: no previous prototype for function 'sched_update_domains' [-Wmissing-prototypes]
    9768 | void sched_update_domains(void)
         |      ^
   kernel/sched/core.c:9768:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
    9768 | void sched_update_domains(void)
         | ^
         | static 
   kernel/sched/core.c:2467:20: warning: unused function 'rq_has_pinned_tasks' [-Wunused-function]
    2467 | static inline bool rq_has_pinned_tasks(struct rq *rq)
         |                    ^
   kernel/sched/core.c:5818:20: warning: unused function 'sched_tick_stop' [-Wunused-function]
    5818 | static inline void sched_tick_stop(int cpu) { }
         |                    ^
   kernel/sched/core.c:6519:20: warning: unused function 'sched_core_cpu_deactivate' [-Wunused-function]
    6519 | static inline void sched_core_cpu_deactivate(unsigned int cpu) {}
         |                    ^
   kernel/sched/core.c:6520:20: warning: unused function 'sched_core_cpu_dying' [-Wunused-function]
    6520 | static inline void sched_core_cpu_dying(unsigned int cpu) {}
         |                    ^
   kernel/sched/core.c:9570:20: warning: unused function 'balance_hotplug_wait' [-Wunused-function]
    9570 | static inline void balance_hotplug_wait(void)
         |                    ^
   12 warnings generated.
--
   In file included from kernel/sched/build_utility.c:15:
   In file included from include/linux/sched/isolation.h:6:
   In file included from include/linux/tick.h:8:
   In file included from include/linux/clockchips.h:14:
   In file included from include/linux/clocksource.h:22:
   In file included from arch/hexagon/include/asm/io.h:334:
   include/asm-generic/io.h:547:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     547 |         val = __raw_readb(PCI_IOBASE + addr);
         |                           ~~~~~~~~~~ ^
   include/asm-generic/io.h:560:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     560 |         val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
         |                                                         ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/little_endian.h:37:51: note: expanded from macro '__le16_to_cpu'
      37 | #define __le16_to_cpu(x) ((__force __u16)(__le16)(x))
         |                                                   ^
   In file included from kernel/sched/build_utility.c:15:
   In file included from include/linux/sched/isolation.h:6:
   In file included from include/linux/tick.h:8:
   In file included from include/linux/clockchips.h:14:
   In file included from include/linux/clocksource.h:22:
   In file included from arch/hexagon/include/asm/io.h:334:
   include/asm-generic/io.h:573:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     573 |         val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
         |                                                         ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/little_endian.h:35:51: note: expanded from macro '__le32_to_cpu'
      35 | #define __le32_to_cpu(x) ((__force __u32)(__le32)(x))
         |                                                   ^
   In file included from kernel/sched/build_utility.c:15:
   In file included from include/linux/sched/isolation.h:6:
   In file included from include/linux/tick.h:8:
   In file included from include/linux/clockchips.h:14:
   In file included from include/linux/clocksource.h:22:
   In file included from arch/hexagon/include/asm/io.h:334:
   include/asm-generic/io.h:584:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     584 |         __raw_writeb(value, PCI_IOBASE + addr);
         |                             ~~~~~~~~~~ ^
   include/asm-generic/io.h:594:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     594 |         __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
         |                                                       ~~~~~~~~~~ ^
   include/asm-generic/io.h:604:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     604 |         __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
         |                                                       ~~~~~~~~~~ ^
   In file included from kernel/sched/build_utility.c:89:
>> kernel/sched/topology.c:2580:3: error: call to undeclared function 'sched_update_domains'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    2580 |                 sched_update_domains();
         |                 ^
   kernel/sched/topology.c:2746:2: error: call to undeclared function 'sched_update_domains'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    2746 |         sched_update_domains();
         |         ^
   6 warnings and 2 errors generated.


vim +/sched_update_domains +2580 kernel/sched/topology.c

  2558	
  2559	/*
  2560	 * Set up scheduler domains and groups.  For now this just excludes isolated
  2561	 * CPUs, but could be used to exclude other special cases in the future.
  2562	 */
  2563	int __init sched_init_domains(const struct cpumask *cpu_map)
  2564	{
  2565		int err;
  2566	
  2567		zalloc_cpumask_var(&sched_domains_tmpmask, GFP_KERNEL);
  2568		zalloc_cpumask_var(&sched_domains_tmpmask2, GFP_KERNEL);
  2569		zalloc_cpumask_var(&fallback_doms, GFP_KERNEL);
  2570	
  2571		arch_update_cpu_topology();
  2572		asym_cpu_capacity_scan();
  2573		ndoms_cur = 1;
  2574		doms_cur = alloc_sched_domains(ndoms_cur);
  2575		if (!doms_cur)
  2576			doms_cur = &fallback_doms;
  2577		cpumask_and(doms_cur[0], cpu_map, housekeeping_cpumask(HK_TYPE_DOMAIN));
  2578		err = build_sched_domains(doms_cur[0], NULL);
  2579		if (!err)
> 2580			sched_update_domains();
  2581	
  2582		return err;
  2583	}
  2584	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 0/7] sched: Implement shared runqueue in CFS
  2023-08-09 22:12 [PATCH v3 0/7] sched: Implement shared runqueue in CFS David Vernet
                   ` (6 preceding siblings ...)
  2023-08-09 22:12 ` [PATCH v3 7/7] sched: Shard per-LLC shared runqueues David Vernet
@ 2023-08-17  8:42 ` Gautham R. Shenoy
  2023-08-18  5:03   ` David Vernet
  2023-11-27  8:28 ` [PATCH v3 0/7] sched: Implement shared runqueue in CFS Aboorva Devarajan
  2023-12-04 19:30 ` David Vernet
  9 siblings, 1 reply; 52+ messages in thread
From: Gautham R. Shenoy @ 2023-08-17  8:42 UTC (permalink / raw)
  To: David Vernet
  Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	tj, roman.gushchin, kprateek.nayak, aaron.lu, wuyun.abel,
	kernel-team

Hello David,

On Wed, Aug 09, 2023 at 05:12:11PM -0500, David Vernet wrote:
> Changes
> -------
> 
> This is v3 of the shared runqueue patchset. This patch set is based off
> of commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs
> bandwidth in use") on the sched/core branch of tip.git.


I tested the patches on Zen3 and Zen4 EPYC Servers like last time. I
notice that apart from hackbench, every other benchmark is showing
regressions with this patch series. Quick summary of my observations:

* With shared-runqueue enabled, tbench and netperf both stop scaling
  when we go beyond 32 clients and the scaling issue persists until
  the system is overutilized. When the system is overutilized,
  shared-runqueue is able to recover quite splendidly and outperform
  tip.

* stream doesn't show any significant difference with the
  shared-runqueue as expected.

* schbench shows no major regressions for the requests-per-second and
  the request-latency until the system is completely saturated at
  which point, I do see some improvements with the shared
  runqueue. However, the wakeup-latency is bad when the system is
  moderately utilized.

* mongodb shows 3.5% regression with shared runqueue enabled.

Please find the detailed results at the end of this mail.

Scalability for tbench and netperf
==================================
I want to call out the reason for the scaling issues observed with
tbench and netperf when the number of clients is between 32 and 256. I
will use tbench here to illustrate the analysis.

As I had mentioned, in my response to Aaron's RFC,
(https://lore.kernel.org/lkml/20230816024831.682107-2-aaron.lu@intel.com/#t)
in the aforementioned cases, I could observe a bottleneck with
update_cfs_group() and update_load_avg() which is due to the fact that
we do a lot more task migrations when the shared runqueue is enabled.

  Overhead  Command  Shared Object     Symbol
+   20.54%  tbench   [kernel.vmlinux]  [k] update_cfs_group
+   15.78%  tbench   [kernel.vmlinux]  [k] update_load_avg

Applying Aaron's ratelimiting patch helps improve the
scalability. Previously the throughput values for 32 clients, 64
clients, 128 clients and 256 clients were very close to each other but
with Aaron's patch, that improved. However, the regression still
persisted.

==================================================================
Test          : tbench 
Units         : Normalized throughput 
Interpretation: Higher is better 
Statistic     : AMean 
==================================================================
Clients:  tip[pct imp](CV)       sh_rq_v3[pct imp](CV)    sh_rq_v3_tgload_fix[pct imp](CV)
   32     1.00 [  0.00]( 2.90)     0.44 [-55.53]( 1.44)     0.98 [ -2.23]( 1.72)
   64     1.00 [  0.00]( 1.02)     0.27 [-72.58]( 0.35)     0.74 [-25.64]( 2.43)
  128     1.00 [  0.00]( 0.88)     0.19 [-81.29]( 0.51)     0.52 [-48.47]( 3.92)
  256     1.00 [  0.00]( 0.28)     0.17 [-82.80]( 0.29)     0.88 [-12.23]( 1.76)


With Aaron's fix, perf showed that there were a lot of samples for
update_sd_lb_stats().

Samples: 8M of event 'ibs_op//', Event count (approx.): 28860989545448
  Overhead  Command  Shared Object         Symbol
-   13.00%  tbench   [kernel.vmlinux]      [k] update_sd_lb_stats.constprop.0
   - 7.21% update_sd_lb_stats.constprop.0                                    
      - 7.21% find_busiest_group                                                
           load_balance                                                         
         - newidle_balance                                                      
            + 5.90% pick_next_task_fair                                         
            + 1.31% balance_fair                                                
   - 3.05% cpu_util                                                             
      - 2.63% update_sd_lb_stats.constprop.0                                    
           find_busiest_group                                                   
           load_balance                                                         
         + newidle_balance                                                      
   - 1.67% idle_cpu                                                             
      - 1.36% update_sd_lb_stats.constprop.0                                    
           find_busiest_group                                                   
           load_balance                                                         
         - newidle_balance                                                      
            + 1.11% pick_next_task_fair   

perf annotate shows the hotspot to be a harmless looking "add"
instruction update_sg_lb_stats() which adds a value obtained from
cfs_rq->avg.load_avg to sg->group_load.

       │     cfs_rq_load_avg():
       │     return cfs_rq->avg.load_avg;
  0.31 │       mov    0x220(%r8),%rax
       │     update_sg_lb_stats():
       │     sgs->group_load += load;
 15.90 │       add    %rax,0x8(%r13)
       │     cfs_rq_load_avg():
       │     return cfs_rq->avg.load_avg;
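
For context, the annotated code corresponds roughly to the following
(paraphrased from update_sg_lb_stats() and cpu_load() in kernel/sched/fair.c;
not an exact copy, and details differ across kernel versions). The value
being added comes from reading each CPU's cfs_rq->avg.load_avg for every CPU
in the group being scanned:

	static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)
	{
		return cfs_rq->avg.load_avg;
	}

	static unsigned long cpu_load(struct rq *rq)
	{
		return cfs_rq_load_avg(&rq->cfs);
	}

	/* In update_sg_lb_stats(), for each CPU in the group: */
	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
		struct rq *rq = cpu_rq(i);
		unsigned long load = cpu_load(rq);

		sgs->group_load += load;	/* the hot "add" above */
		/* ... */
	}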

So, I counted the number of times the CPUs call find_busiest_group()
without and with shared_rq, and the difference in the distributions
is quite stark.

=====================================================
per-cpu             :          Number of CPUs       :
find_busiest_group  :----------------:--------------:
count               :  without-sh.rq :  with-sh.rq  :
=====================================:===============
[      0 -  200000) :     77
[ 200000 -  400000) :     41
[ 400000 -  600000) :     64
[ 600000 -  800000) :     63
[ 800000 - 1000000) :     66
[1000000 - 1200000) :     69
[1200000 - 1400000) :     52
[1400000 - 1600000) :     34              5
[1600000 - 1800000) :     17		 31
[1800000 - 2000000) :      6		 59
[2000000 - 2200000) :     13		109
[2200000 - 2400000) :      4		120
[2400000 - 2600000) :      3		157
[2600000 - 2800000) :      1		 29
[2800000 - 3000000) :      1		  2
[9200000 - 9400000) :      1

As you can notice, without the shared-rq, the per-CPU counts of calls
to find_busiest_group() are concentrated at the lower end of the
distribution, which implies fewer calls in total. With shared-rq
enabled, the distribution is normal, but shifted to the right, which
implies a lot more calls to find_busiest_group().
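
One simple way to collect such a per-CPU count is a percpu counter bumped on
entry to find_busiest_group(). The sketch below is illustrative only (the
names are made up, and this is not necessarily how the numbers above were
gathered):

	/*
	 * Illustrative sketch: count per-CPU calls to find_busiest_group().
	 */
	static DEFINE_PER_CPU(u64, fbg_count);

	/* Bumped at the top of find_busiest_group() in kernel/sched/fair.c: */
	this_cpu_inc(fbg_count);

	/* Read out after the benchmark run, e.g. from a debugfs show handler: */
	static void fbg_dump(void)
	{
		int cpu;

		for_each_online_cpu(cpu)
			pr_info("cpu%d: %llu calls to find_busiest_group()\n",
				cpu, (unsigned long long)per_cpu(fbg_count, cpu));
	}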

To investigate further where this is coming from, I reran tbench with
sched-scoreboard (https://github.com/AMDESE/sched-scoreboard), and the
schedstats show that the total wait-time of the tasks on the runqueue
*increases* by a significant amount when shared-rq is enabled.

Further, if you look at the newidle load_balance() attempts at the DIE
and NUMA domains, they are significantly higher when shared-rq is
enabled. So it appears that a lot more time is being spent trying to
do load-balancing when the shared runqueue is enabled, which is
counterintuitive.

----------------------------------------------------------------------------------------------------
Time elapsed (in jiffies)                                  :         39133,        39132           
----------------------------------------------------------------------------------------------------
cpu:  all_cpus (avg) vs cpu:  all_cpus (avg)
----------------------------------------------------------------------------------------------------
sched_yield count                                          :             0,            0           
Legacy counter can be ignored                              :             0,            0           
schedule called                                            :       9112673,      5014567  | -44.97|
schedule left the processor idle                           :       4554145,      2460379  | -45.97|
try_to_wake_up was called                                  :       4556347,      2552974  | -43.97|
try_to_wake_up was called to wake up the local cpu         :          2227,         1350  | -39.38| 
total runtime by tasks on this processor (in ns)           :   41093465125,  33816591424  | -17.71|
total waittime by tasks on this processor (in ns)          :      21832848,   3382037232  |15390.59| <======
total timeslices run on this cpu                           :       4558524,      2554181  | -43.97|

----------------------------------------------------------------------------------------------------
domain:  SMT (NO_SHARED_RUNQ vs SHARED_RUNQ)
----------------------------------------------------------------------------------------------------
< ----------------------------------------  Category:  newidle ---------------------------------------- >
load_balance count on cpu newly idle                       :        964585,       619463  | -35.78|
load_balance found balanced on cpu newly idle              :        964573,       619303  | -35.80|
  ->load_balance failed to find bsy q on cpu newly idle    :             0,            0           
  ->load_balance failed to find bsy grp on cpu newly idle  :        964423,       617603  | -35.96|
load_balance move task failed on cpu newly idle            :             5,          110  |2100.00|
*load_balance success cnt on cpu newidle                   :             7,           50  | 614.29|
pull_task count on cpu newly idle                          :             6,           48  | 700.00|
*avg task pulled per successfull lb attempt (cpu newidle)  :       0.85714,      0.96000  |  12.00|
  ->pull_task whn target task was cache-hot on cpu newidle :             0,            0           

----------------------------------------------------------------------------------------------------
domain:  MC (NO_SHARED_RUNQ vs SHARED_RUNQ)
----------------------------------------------------------------------------------------------------
< ----------------------------------------  Category:  newidle ---------------------------------------- >
load_balance count on cpu newly idle                       :        803080,       615613  | -23.34|
load_balance found balanced on cpu newly idle              :        641630,       568818  | -11.35|
  ->load_balance failed to find bsy q on cpu newly idle    :           178,          616  | 246.07|
  ->load_balance failed to find bsy grp on cpu newly idle  :        641446,       568082  | -11.44|
load_balance move task failed on cpu newly idle            :        161448,        46296  | -71.32|
*load_balance success cnt on cpu newidle                   :             2,          499  |24850.00|
pull_task count on cpu newly idle                          :             1,          498  |49700.00|
*avg task pulled per successfull lb attempt (cpu newidle)  :       0.50000,      0.99800  |  99.60|
  ->pull_task whn target task was cache-hot on cpu newidle :             0,            0           

----------------------------------------------------------------------------------------------------
domain:  DIE cpus = all_cpus (avg) vs domain:  DIE cpus = all_cpus (avg)
----------------------------------------------------------------------------------------------------
< ----------------------------------------  Category:  newidle ---------------------------------------- >
load_balance count on cpu newly idle                       :          2761,       566824  |20429.66| <======
load_balance found balanced on cpu newly idle              :          1737,       284232  |16263.39|
  ->load_balance failed to find bsy q on cpu newly idle    :             0,          537            
  ->load_balance failed to find bsy grp on cpu newly idle  :          1736,       283427  |16226.44|
load_balance move task failed on cpu newly idle            :          1023,       282021  |27468.04|
*load_balance success cnt on cpu newidle                   :             1,          571  |57000.00|
pull_task count on cpu newly idle                          :             0,          571           
*avg task pulled per successfull lb attempt (cpu newidle)  :             0,            1           
  ->pull_task whn target task was cache-hot on cpu newidle :             0,            0           

----------------------------------------------------------------------------------------------------
domain:  NUMA cpus = all_cpus (avg) vs domain:  NUMA cpus = all_cpus (avg)
----------------------------------------------------------------------------------------------------
< ----------------------------------------  Category:  newidle ---------------------------------------- >
load_balance count on cpu newly idle                       :            38,        47936  |126047.37| <======
load_balance found balanced on cpu newly idle              :            20,        26628  |133040.00|
  ->load_balance failed to find bsy q on cpu newly idle    :             0,            0             
  ->load_balance failed to find bsy grp on cpu newly idle  :            20,        26531  |132555.00|
load_balance move task failed on cpu newly idle            :            18,        21167  |117494.44|
*load_balance success cnt on cpu newidle                   :             0,          141             
pull_task count on cpu newly idle                          :             0,          140           
*avg task pulled per successfull lb attempt (cpu newidle)  :             0,      0.99291           
  ->pull_task whn target task was cache-hot on cpu newidle :             0,            0           

< ----------------------------------------  Wakeup info:  ---------------------------------------- >
Wakeups on same                          CPU (avg)      :          2227,         1350  | -39.38|
Wakeups on same         SMT cpus = all_cpus (avg)       :         85553,        30942  | -63.83|
Wakeups on same         MC cpus = all_cpus (avg)        :       4468548,      2520585  | -43.59|
Wakeups on same         DIE cpus = all_cpus (avg)       :             9,           60  | 566.67|
Wakeups on same         NUMA cpus = all_cpus (avg)      :             8,           35  | 337.50|

Affine wakeups on same  SMT cpus = all_cpus (avg)       :         85484,        18848  | -77.95|
Affine wakeups on same  MC cpus = all_cpus (avg)        :       4465108,      1511225  | -66.15| <======
Affine wakeups on same  DIE cpus = all_cpus (avg)       :             1,          569  |56800.00:
Affine wakeups on same  NUMA cpus = all_cpus (avg)      :             0,          140           



Detailed Results are as follows:
=============================================================
Test Machine : 2 Socket Zen4 with 128 cores per socket, SMT enabled.

tip                 : commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs
                      bandwidth in use")
sh_rq_v3            : This patchset with SHARED_RUNQ feature enabled.

sh_rq_v3_tgload_fix : This patchset along with Aaron's patch
                      (https://lore.kernel.org/lkml/20230816024831.682107-2-aaron.lu@intel.com/)

The trend is similar on a 2 Socket Zen3 with 64 cores per socket, SMT
enabled. So, I am omitting it.

==================================================================
Test          : hackbench 
Units         : Normalized time in seconds 
Interpretation: Lower is better 
Statistic     : AMean 
==================================================================
Case:        tip[pct imp](CV)        sh_rq_v3[pct imp](CV)    sh_rq_v3_tgload_fix[pct imp](CV)
 1-groups     1.00 [  0.00]( 8.41)     0.96 [  3.63]( 6.04)     0.94 [  6.48]( 9.16)
 2-groups     1.00 [  0.00](12.96)     0.96 [  4.46]( 9.76)     0.89 [ 11.02]( 8.28)
 4-groups     1.00 [  0.00]( 2.90)     0.85 [ 14.77]( 9.18)     0.86 [ 14.35](13.26)
 8-groups     1.00 [  0.00]( 1.06)     0.91 [  8.96]( 2.83)     0.94 [  6.34]( 2.02)
16-groups     1.00 [  0.00]( 0.57)     1.19 [-18.91]( 2.82)     0.74 [ 26.02]( 1.33)


==================================================================
Test          : tbench 
Units         : Normalized throughput 
Interpretation: Higher is better 
Statistic     : AMean 
==================================================================
Clients:   tip[pct imp](CV)      sh_rq_v3[pct imp](CV)    sh_rq_v3_tgload_fix[pct imp](CV)
    1     1.00 [  0.00]( 0.26)     0.99 [ -1.25]( 0.13)     0.98 [ -2.15]( 0.49)
    2     1.00 [  0.00]( 0.37)     0.98 [ -2.33]( 0.88)     0.98 [ -2.21]( 0.53)
    4     1.00 [  0.00]( 0.66)     0.99 [ -1.32]( 0.91)     0.98 [ -2.12]( 0.79)
    8     1.00 [  0.00]( 2.14)     0.99 [ -0.53]( 2.45)     1.00 [ -0.23]( 2.18)
   16     1.00 [  0.00]( 1.08)     0.97 [ -3.37]( 2.12)     0.95 [ -5.28]( 1.92)
   32     1.00 [  0.00]( 2.90)     0.44 [-55.53]( 1.44)     0.98 [ -2.23]( 1.72)
   64     1.00 [  0.00]( 1.02)     0.27 [-72.58]( 0.35)     0.74 [-25.64]( 2.43)
  128     1.00 [  0.00]( 0.88)     0.19 [-81.29]( 0.51)     0.52 [-48.47]( 3.92)
  256     1.00 [  0.00]( 0.28)     0.17 [-82.80]( 0.29)     0.88 [-12.23]( 1.76)
  512     1.00 [  0.00]( 2.78)     1.33 [ 33.50]( 4.12)     1.22 [ 22.33]( 2.59)
 1024     1.00 [  0.00]( 0.46)     1.34 [ 34.27]( 0.37)     1.31 [ 31.36]( 1.65)
 2048     1.00 [  0.00]( 0.75)     1.40 [ 40.42]( 0.05)     1.20 [ 20.09]( 1.98)


==================================================================
Test          : stream (10 Runs)
Units         : Normalized Bandwidth, MB/s 
Interpretation: Higher is better 
Statistic     : HMean 
==================================================================
Test:     tip[pct imp](CV)      sh_rq_v3[pct imp](CV)    sh_rq_v3_tgload_fix[pct imp](CV)
 Copy     1.00 [  0.00]( 0.84)     1.00 [ -0.22]( 0.59)     1.00 [  0.08]( 0.90)
Scale     1.00 [  0.00]( 0.42)     1.00 [ -0.33]( 0.39)     1.00 [ -0.15]( 0.42)
  Add     1.00 [  0.00]( 0.58)     1.00 [ -0.48]( 0.28)     1.00 [ -0.22]( 0.34)
Triad     1.00 [  0.00]( 0.41)     0.99 [ -0.65]( 0.38)     1.00 [ -0.29]( 0.34)


==================================================================
Test          : stream (100 runs)
Units         : Normalized Bandwidth, MB/s 
Interpretation: Higher is better 
Statistic     : HMean 
==================================================================
Test:     tip[pct imp](CV)       sh_rq_v3[pct imp](CV)    sh_rq_v3_tgload_fix[pct imp](CV)
 Copy     1.00 [  0.00]( 0.52)     1.00 [ -0.16]( 0.45)     1.00 [  0.35]( 0.73)
Scale     1.00 [  0.00]( 0.35)     1.00 [ -0.20]( 0.38)     1.00 [  0.07]( 0.34)
  Add     1.00 [  0.00]( 0.37)     1.00 [ -0.07]( 0.42)     1.00 [  0.07]( 0.46)
Triad     1.00 [  0.00]( 0.57)     1.00 [ -0.22]( 0.45)     1.00 [ -0.04]( 0.49)


==================================================================
Test          : netperf 
Units         : Normalized Throughput 
Interpretation: Higher is better 
Statistic     : AMean 
==================================================================
Clients:      tip[pct imp](CV)        sh_rq_v3[pct imp](CV)    sh_rq_v3_tgload_fix[pct imp](CV)
 1-clients     1.00 [  0.00]( 0.87)     1.00 [  0.08]( 0.17)     0.98 [ -1.64]( 0.34)
 2-clients     1.00 [  0.00]( 1.42)     0.99 [ -0.93]( 0.75)     0.98 [ -2.18]( 0.68)
 4-clients     1.00 [  0.00]( 1.16)     0.97 [ -3.05]( 1.18)     0.96 [ -4.29]( 1.11)
 8-clients     1.00 [  0.00]( 1.41)     0.97 [ -3.18]( 1.04)     0.96 [ -4.04]( 0.98)
16-clients     1.00 [  0.00]( 1.85)     0.95 [ -4.87]( 1.00)     0.96 [ -4.22]( 0.98)
32-clients     1.00 [  0.00]( 2.17)     0.33 [-66.78]( 1.11)     0.95 [ -4.95]( 1.74)
64-clients     1.00 [  0.00]( 2.70)     0.20 [-79.62]( 1.45)     0.45 [-54.66]( 1.79)
128-clients     1.00 [  0.00]( 2.80)     0.13 [-86.68]( 3.15)     0.37 [-62.60]( 1.60)
256-clients     1.00 [  0.00]( 9.14)     0.13 [-86.89]( 8.53)     0.92 [ -8.12]( 1.91)
512-clients     1.00 [  0.00](11.46)     1.18 [ 18.05]( 4.73)     1.12 [ 12.32]( 5.50)


==================================================================
Test          : schbench: requests-per-second 
Units         : Normalized Requests per second 
Interpretation: Higher is better 
Statistic     : Median 
==================================================================
#workers: tip[pct imp](CV)      sh_rq_v3[pct imp](CV)    sh_rq_v3_tgload_fix[pct imp](CV)
  1     1.00 [  0.00]( 0.00)     1.02 [  1.67]( 0.45)     1.01 [  1.34]( 0.45)
  2     1.00 [  0.00]( 0.17)     1.01 [  1.00]( 0.17)     1.01 [  1.33]( 0.17)
  4     1.00 [  0.00]( 0.30)     1.01 [  1.34]( 0.17)     1.01 [  1.34]( 0.00)
  8     1.00 [  0.00]( 0.30)     1.01 [  1.34]( 0.00)     1.01 [  1.34]( 0.00)
 16     1.00 [  0.00]( 0.17)     1.01 [  1.00]( 0.17)     1.01 [  1.00]( 0.00)
 32     1.00 [  0.00]( 0.00)     1.01 [  0.66]( 0.00)     1.01 [  0.66]( 0.17)
 64     1.00 [  0.00]( 0.00)     1.01 [  0.66]( 0.17)     1.01 [  0.66]( 0.17)
128     1.00 [  0.00]( 5.70)     0.96 [ -4.06]( 0.32)     0.95 [ -5.08]( 0.18)
256     1.00 [  0.00]( 0.29)     1.04 [  4.23]( 0.00)     1.04 [  4.23]( 0.00)
512     1.00 [  0.00]( 0.39)     1.00 [  0.00]( 0.19)     1.00 [  0.00]( 0.00)


==================================================================
Test          : schbench: wakeup-latency 
Units         : Normalized 99th percentile latency in us 
Interpretation: Lower is better 
Statistic     : Median 
==================================================================
#workers:  tip[pct imp](CV)    sh_rq_v3[pct imp](CV)    sh_rq_v3_tgload_fix[pct imp](CV)
  1     1.00 [  0.00](12.39)     1.00 [  0.00]( 0.00)     1.11 [-11.11]( 0.00)
  2     1.00 [  0.00]( 5.53)     1.00 [  0.00]( 0.00)     1.11 [-11.11]( 0.00)
  4     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)     1.11 [-11.11]( 5.00)
  8     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)     1.22 [-22.22]( 4.84)
 16     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)     1.22 [-22.22]( 4.84)
 32     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)     1.12 [-12.50]( 0.00)
 64     1.00 [  0.00]( 7.04)     1.29 [-28.57]( 5.96)     1.29 [-28.57]( 0.00)
128     1.00 [  0.00]( 5.53)     1.44 [-44.44]( 0.00)     1.56 [-55.56]( 3.78)
256     1.00 [  0.00](17.11)     7.96 [-696.25]( 4.54)     8.14 [-713.75]( 3.99)
512     1.00 [  0.00]( 2.39)     0.82 [ 17.70]( 7.19)     0.96 [  4.43](10.52)


==================================================================
Test          : schbench: request-latency 
Units         : Normalized 99th percentile latency in us 
Interpretation: Lower is better 
Statistic     : Median 
==================================================================
#workers: tip[pct imp](CV)    sh_rq_v3[pct imp](CV)    sh_rq_v3_tgload_fix[pct imp](CV)
  1     1.00 [  0.00]( 0.21)     0.98 [  1.63]( 0.92)     1.00 [  0.23]( 0.79)
  2     1.00 [  0.00]( 0.12)     1.00 [  0.23]( 0.00)     1.00 [  0.23]( 0.32)
  4     1.00 [  0.00]( 0.12)     1.00 [  0.00]( 0.24)     1.00 [  0.23]( 0.00)
  8     1.00 [  0.00]( 0.00)     1.00 [  0.23]( 0.12)     1.00 [  0.23]( 0.12)
 16     1.00 [  0.00]( 0.12)     1.00 [  0.00]( 0.00)     1.00 [  0.23]( 0.12)
 32     1.00 [  0.00]( 0.00)     1.00 [  0.23]( 0.00)     1.00 [  0.23]( 0.12)
 64     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.12)     1.00 [  0.00]( 0.12)
128     1.00 [  0.00]( 2.80)     0.99 [  1.50]( 0.35)     0.99 [  1.25]( 0.00)
256     1.00 [  0.00]( 0.11)     0.97 [  3.44]( 0.23)     0.97 [  2.80]( 0.34)
512     1.00 [  0.00]( 1.28)     1.01 [ -0.77]( 9.09)     1.19 [-19.31](14.03)


--
Thanks and Regards
gautham.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 0/7] sched: Implement shared runqueue in CFS
  2023-08-17  8:42 ` [PATCH v3 0/7] sched: Implement shared runqueue in CFS Gautham R. Shenoy
@ 2023-08-18  5:03   ` David Vernet
  2023-08-18  8:49     ` Gautham R. Shenoy
  0 siblings, 1 reply; 52+ messages in thread
From: David Vernet @ 2023-08-18  5:03 UTC (permalink / raw)
  To: Gautham R. Shenoy
  Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	tj, roman.gushchin, kprateek.nayak, aaron.lu, wuyun.abel,
	kernel-team

On Thu, Aug 17, 2023 at 02:12:03PM +0530, Gautham R. Shenoy wrote:
> Hello David,

Hello Gautham,

Thanks a lot as always for running some benchmarks and analyzing these
changes.

> On Wed, Aug 09, 2023 at 05:12:11PM -0500, David Vernet wrote:
> > Changes
> > -------
> > 
> > This is v3 of the shared runqueue patchset. This patch set is based off
> > of commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs
> > bandwidth in use") on the sched/core branch of tip.git.
> 
> 
> I tested the patches on Zen3 and Zen4 EPYC Servers like last time. I
> notice that apart from hackbench, every other benchmark is showing
> regressions with this patch series. Quick summary of my observations:

Just to verify per our prior conversation [0], was this latest set of
benchmarks run with boost disabled? Your analysis below seems to
indicate pretty clearly that v3 regresses some workloads regardless of
boost, but I did want to double check so we can bear it in mind when
looking over the results.

[0]: https://lore.kernel.org/all/ZMn4dLePB45M5CGa@BLR-5CG11610CF.amd.com/

> * With shared-runqueue enabled, tbench and netperf both stop scaling
>   when we go beyond 32 clients and the scaling issue persists until
>   the system is overutilized. When the system is overutilized,
>   shared-runqueue is able to recover quite splendidly and outperform
>   tip.

Hmm, I still don't understand why we perform better when the system is
overutilized. I'd expect vanilla CFS to perform better than shared_runq
in such a scenario in general, as the system will be fully utilized
regardless.

> * stream doesn't show any significant difference with the
>   shared-runqueue as expected.
> 
> * schbench shows no major regressions for the requests-per-second and
>   the request-latency until the system is completely saturated at
>   which point, I do see some improvements with the shared
>   runqueue. However, the wakeup-latency is bad when the system is
>   moderately utilized.
> 
> * mongodb shows 3.5% regression with shared runqueue enabled.

This is indeed surprising. I think I have a theory, as described below.

> 
> Please find the detailed results at the end of this mail.
> 
> Scalability for tbench and netperf
> ==================================
> I want to call out the reason for the scaling issues observed with
> tbench and netperf when the number of clients is between 32 and 256. I
> will use tbench here to illustrate the analysis.
> 
> As I had mentioned, in my response to Aaron's RFC,
> (https://lore.kernel.org/lkml/20230816024831.682107-2-aaron.lu@intel.com/#t)
> in the aforementioned cases, I could observe a bottleneck with
> update_cfs_group() and update_load_avg() which is due to the fact that
> we do a lot more task migrations when the shared runqueue is enabled.
> 
>   Overhead  Command  Shared Object     Symbol
> +   20.54%  tbench   [kernel.vmlinux]  [k] update_cfs_group
> +   15.78%  tbench   [kernel.vmlinux]  [k] update_load_avg
> 
> Applying Aaron's ratelimiting patch helps improve the
> scalability. Previously the throughput values for 32 clients, 64
> clients, 128 clients and 256 clients were very close to each other but
> with Aaron's patch, that improved. However, the regression still
> persisted.
> 
> ==================================================================
> Test          : tbench 
> Units         : Normalized throughput 
> Interpretation: Higher is better 
> Statistic     : AMean 
> ==================================================================
> Clients:  tip[pct imp](CV)       sh_rq_v3[pct imp](CV)    sh_rq_v3_tgload_fix[pct imp](CV)
>    32     1.00 [  0.00]( 2.90)     0.44 [-55.53]( 1.44)     0.98 [ -2.23]( 1.72)
>    64     1.00 [  0.00]( 1.02)     0.27 [-72.58]( 0.35)     0.74 [-25.64]( 2.43)
>   128     1.00 [  0.00]( 0.88)     0.19 [-81.29]( 0.51)     0.52 [-48.47]( 3.92)
>   256     1.00 [  0.00]( 0.28)     0.17 [-82.80]( 0.29)     0.88 [-12.23]( 1.76)

Just to make sure we're on the same page, "CV" here is the coefficient
of variation (i.e. standard deviation / mean), correct?

> With Aaron's fix, perf showed that there were a lot of samples for
> update_sd_lb_stats().
> 
> Samples: 8M of event 'ibs_op//', Event count (approx.): 28860989545448
>   Overhead  Command  Shared Object         Symbol
> -   13.00%  tbench   [kernel.vmlinux]      [k] update_sd_lb_stats.constprop.0
>    - 7.21% update_sd_lb_stats.constprop.0                                    
>       - 7.21% find_busiest_group                                                
>            load_balance                                                         
>          - newidle_balance                                                      
>             + 5.90% pick_next_task_fair                                         
>             + 1.31% balance_fair                                                
>    - 3.05% cpu_util                                                             
>       - 2.63% update_sd_lb_stats.constprop.0                                    
>            find_busiest_group                                                   
>            load_balance                                                         
>          + newidle_balance                                                      
>    - 1.67% idle_cpu                                                             
>       - 1.36% update_sd_lb_stats.constprop.0                                    
>            find_busiest_group                                                   
>            load_balance                                                         
>          - newidle_balance                                                      
>             + 1.11% pick_next_task_fair   
> 
> perf annotate shows the hotspot to be a harmless looking "add"
> instruction in update_sg_lb_stats() which adds a value obtained from
> cfs_rq->avg.load_avg to sg->group_load.
> 
>        │     cfs_rq_load_avg():
>        │     return cfs_rq->avg.load_avg;
>   0.31 │       mov    0x220(%r8),%rax
>        │     update_sg_lb_stats():
>        │     sgs->group_load += load;
>  15.90 │       add    %rax,0x8(%r13)
>        │     cfs_rq_load_avg():
>        │     return cfs_rq->avg.load_avg;
> 
> So, I counted the number of times the CPUs call find_busiest_group()
> without and with shared_rq and the distribution is quite stark.
> 
> =====================================================
> per-cpu             :          Number of CPUs       :
> find_busiest_group  :----------------:--------------:
> count               :  without-sh.rq :  with-sh.rq  :
> =====================================:===============
> [      0 -  200000) :     77
> [ 200000 -  400000) :     41
> [ 400000 -  600000) :     64
> [ 600000 -  800000) :     63
> [ 800000 - 1000000) :     66
> [1000000 - 1200000) :     69
> [1200000 - 1400000) :     52
> [1400000 - 1600000) :     34              5
> [1600000 - 1800000) :     17		 31
> [1800000 - 2000000) :      6		 59
> [2000000 - 2200000) :     13		109
> [2200000 - 2400000) :      4		120
> [2400000 - 2600000) :      3		157
> [2600000 - 2800000) :      1		 29
> [2800000 - 3000000) :      1		  2
> [9200000 - 9400000) :      1
> 
> As you can notice, without the shared-rq, the per-CPU
> find_busiest_group() call counts are clustered at the lower end of the
> distribution, which implies fewer calls in total. With shared-rq
> enabled, the distribution is normal, but shifted to the right, which
> implies a lot more calls to find_busiest_group().

Huh, that's very much unexpected for obvious reasons -- we should be
hitting the load_balance() path _less_ due to scheduling tasks with the
shared_runq.

> To investigate further where this is coming from, I reran tbench with
> sched-scoreboard (https://github.com/AMDESE/sched-scoreboard), and the
> schedstats show that the total wait-time of the tasks on the runqueue
> *increases* by a significant amount when shared-rq is enabled.
> 
> Further, if you notice the newidle load_balance() attempts at the DIE
> and the NUMA domains, they are significantly higher when shared-rq is
> enabled. So it appears that a lot more time is being spent trying to
> do load-balancing when shared runqueue is enabled, which is counter
> intuitive.

Certainly agreed on it being counter intuitive. I wonder if this is due
to this change in the latest v3 revision [1]:

@@ -12093,6 +12308,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	rcu_read_lock();
 	sd = rcu_dereference_check_sched_domain(this_rq->sd);
 
+	/*
+	 * Skip <= LLC domains as they likely won't have any tasks if the
+	 * shared runq is empty.
+	 */
+	if (sched_feat(SHARED_RUNQ)) {
+		sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
+		if (likely(sd))
+			sd = sd->parent;
+	}
+
 	if (!READ_ONCE(this_rq->rd->overload) ||
 	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {

[1]: https://lore.kernel.org/all/20230809221218.163894-7-void@manifault.com/

I originally added this following Peter's suggestion in v2 [2] with the
idea that we'd skip the <= LLC domains when shared_runq is enabled, but
in hindsight, we also aren't walking the shared_runq list until we find
a task that can run on the current core. If there's a task, and it can't
run on the current core, we give up and proceed to the rest of
newidle_balance(). So it's possible that we're incorrectly assuming
there's no task in the current LLC because it wasn't enqueued at the
head of the shared_runq. I think we'd only want to add this improvement
if we walked the list as you tried in v1 of the patch set, or revert it
otherwise.

[2]: https://lore.kernel.org/lkml/20230711094547.GE3062772@hirez.programming.kicks-ass.net/
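
To make this concrete, here is a rough sketch of the "walk the list"
variant I'm describing. This is purely illustrative -- the shard / list
member names are placeholders rather than the exact ones from the patch
set, and the rq locking needed to actually migrate the task is omitted:

	struct task_struct *p, *task = NULL;

	raw_spin_lock(&shard->lock);
	list_for_each_entry(p, &shard->list, shared_runq_node) {
		/* Skip over tasks that cannot run on this CPU. */
		if (is_cpu_allowed(p, cpu_of(rq))) {
			list_del_init(&p->shared_runq_node);
			task = p;
			break;
		}
	}
	raw_spin_unlock(&shard->lock);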

Would you be able to try running your benchmarks again with that change
removed, or with your original idea of walking the list added? I've
tried to reproduce this issue on my 7950x, as well as a single-socket /
CCX 26 core / 52 thread Cooper Lake host, but have been unable to. For
example, if I run funccount.py 'load_balance' -d 30 (from [3]) while
running the below netperf command on the Cooper Lake, this is what I
see:

for i in `seq 128`; do netperf -6 -t UDP_RR -c -C -l 60 & done

NO_SHARED_RUNQ
--------------
FUNC                                    COUNT
b'load_balance'                         39636


SHARED_RUNQ
-----------
FUNC                                    COUNT
b'load_balance'                         32345

[3]: https://github.com/iovisor/bcc/blob/master/tools/funccount.py

The stack traces also don't show us running load_balance() excessively.
Please feel free to share how you're running tbench as well and I can
try that on my end.

> ----------------------------------------------------------------------------------------------------
> Time elapsed (in jiffies)                                  :         39133,        39132           
> ----------------------------------------------------------------------------------------------------
> cpu:  all_cpus (avg) vs cpu:  all_cpus (avg)
> ----------------------------------------------------------------------------------------------------
> sched_yield count                                          :             0,            0           
> Legacy counter can be ignored                              :             0,            0           
> schedule called                                            :       9112673,      5014567  | -44.97|
> schedule left the processor idle                           :       4554145,      2460379  | -45.97|
> try_to_wake_up was called                                  :       4556347,      2552974  | -43.97|
> try_to_wake_up was called to wake up the local cpu         :          2227,         1350  | -39.38| 
> total runtime by tasks on this processor (in ns)           :   41093465125,  33816591424  | -17.71|
> total waittime by tasks on this processor (in ns)          :      21832848,   3382037232  |15390.59| <======
> total timeslices run on this cpu                           :       4558524,      2554181  | -43.97|
> 
> ----------------------------------------------------------------------------------------------------
> domain:  SMT (NO_SHARED_RUNQ vs SHARED_RUNQ)
> ----------------------------------------------------------------------------------------------------
> < ----------------------------------------  Category:  newidle ---------------------------------------- >
> load_balance count on cpu newly idle                       :        964585,       619463  | -35.78|
> load_balance found balanced on cpu newly idle              :        964573,       619303  | -35.80|
>   ->load_balance failed to find bsy q on cpu newly idle    :             0,            0           
>   ->load_balance failed to find bsy grp on cpu newly idle  :        964423,       617603  | -35.96|
> load_balance move task failed on cpu newly idle            :             5,          110  |2100.00|
> *load_balance success cnt on cpu newidle                   :             7,           50  | 614.29|
> pull_task count on cpu newly idle                          :             6,           48  | 700.00|
> *avg task pulled per successfull lb attempt (cpu newidle)  :       0.85714,      0.96000  |  12.00|
>   ->pull_task whn target task was cache-hot on cpu newidle :             0,            0           
> 
> ----------------------------------------------------------------------------------------------------
> domain:  MC (NO_SHARED_RUNQ vs SHARED_RUNQ)
> ----------------------------------------------------------------------------------------------------
> < ----------------------------------------  Category:  newidle ---------------------------------------- >
> load_balance count on cpu newly idle                       :        803080,       615613  | -23.34|
> load_balance found balanced on cpu newly idle              :        641630,       568818  | -11.35|
>   ->load_balance failed to find bsy q on cpu newly idle    :           178,          616  | 246.07|
>   ->load_balance failed to find bsy grp on cpu newly idle  :        641446,       568082  | -11.44|
> load_balance move task failed on cpu newly idle            :        161448,        46296  | -71.32|
> *load_balance success cnt on cpu newidle                   :             2,          499  |24850.00|
> pull_task count on cpu newly idle                          :             1,          498  |49700.00|
> *avg task pulled per successfull lb attempt (cpu newidle)  :       0.50000,      0.99800  |  99.60|
>   ->pull_task whn target task was cache-hot on cpu newidle :             0,            0           
> 
> ----------------------------------------------------------------------------------------------------
> domain:  DIE cpus = all_cpus (avg) vs domain:  DIE cpus = all_cpus (avg)
> ----------------------------------------------------------------------------------------------------
> < ----------------------------------------  Category:  newidle ---------------------------------------- >
> load_balance count on cpu newly idle                       :          2761,       566824  |20429.66| <======
> load_balance found balanced on cpu newly idle              :          1737,       284232  |16263.39|
>   ->load_balance failed to find bsy q on cpu newly idle    :             0,          537            
>   ->load_balance failed to find bsy grp on cpu newly idle  :          1736,       283427  |16226.44|
> load_balance move task failed on cpu newly idle            :          1023,       282021  |27468.04|
> *load_balance success cnt on cpu newidle                   :             1,          571  |57000.00|
> pull_task count on cpu newly idle                          :             0,          571           
> *avg task pulled per successfull lb attempt (cpu newidle)  :             0,            1           
>   ->pull_task whn target task was cache-hot on cpu newidle :             0,            0           
> 
> ----------------------------------------------------------------------------------------------------
> domain:  NUMA cpus = all_cpus (avg) vs domain:  NUMA cpus = all_cpus (avg)
> ----------------------------------------------------------------------------------------------------
> < ----------------------------------------  Category:  newidle ---------------------------------------- >
> load_balance count on cpu newly idle                       :            38,        47936  |126047.37| <======
> load_balance found balanced on cpu newly idle              :            20,        26628  |133040.00|
>   ->load_balance failed to find bsy q on cpu newly idle    :             0,            0             
>   ->load_balance failed to find bsy grp on cpu newly idle  :            20,        26531  |132555.00|
> load_balance move task failed on cpu newly idle            :            18,        21167  |117494.44|
> *load_balance success cnt on cpu newidle                   :             0,          141             
> pull_task count on cpu newly idle                          :             0,          140           
> *avg task pulled per successfull lb attempt (cpu newidle)  :             0,      0.99291           
>   ->pull_task whn target task was cache-hot on cpu newidle :             0,            0           
> 
> < ----------------------------------------  Wakeup info:  ---------------------------------------- >
> Wakeups on same                          CPU (avg)      :          2227,         1350  | -39.38|
> Wakeups on same         SMT cpus = all_cpus (avg)       :         85553,        30942  | -63.83|
> Wakeups on same         MC cpus = all_cpus (avg)        :       4468548,      2520585  | -43.59|
> Wakeups on same         DIE cpus = all_cpus (avg)       :             9,           60  | 566.67|
> Wakeups on same         NUMA cpus = all_cpus (avg)      :             8,           35  | 337.50|
> 
> Affine wakeups on same  SMT cpus = all_cpus (avg)       :         85484,        18848  | -77.95|
> Affine wakeups on same  MC cpus = all_cpus (avg)        :       4465108,      1511225  | -66.15| <======
> Affine wakeups on same  DIE cpus = all_cpus (avg)       :             1,          569  |56800.00:
> Affine wakeups on same  NUMA cpus = all_cpus (avg)      :             0,          140           
> 
> 
> 
> Detailed Results are as follows:
> =============================================================
> Test Machine : 2 Socket Zen4 with 128 cores per socket, SMT enabled.
> 
> tip                 : commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs
>                       bandwidth in use")
> sh_rq_v3            : This patchset with SHARED_RUNQ feature enabled.
> 
> sh_rq_v3_tgload_fix : This patchset along with Aaron's patch
>                       (https://lore.kernel.org/lkml/20230816024831.682107-2-aaron.lu@intel.com/)
> 
> The trend is similar on a 2 Socket Zen3 with 64 cores per socket, SMT
> enabled. So, I am omitting it.
> 
> ==================================================================
> Test          : hackbench 
> Units         : Normalized time in seconds 
> Interpretation: Lower is better 
> Statistic     : AMean 
> ==================================================================
> Case:        tip[pct imp](CV)        sh_rq_v3[pct imp](CV)    sh_rq_v3_tgload_fix[pct imp](CV)
>  1-groups     1.00 [  0.00]( 8.41)     0.96 [  3.63]( 6.04)     0.94 [  6.48]( 9.16)
>  2-groups     1.00 [  0.00](12.96)     0.96 [  4.46]( 9.76)     0.89 [ 11.02]( 8.28)
>  4-groups     1.00 [  0.00]( 2.90)     0.85 [ 14.77]( 9.18)     0.86 [ 14.35](13.26)
>  8-groups     1.00 [  0.00]( 1.06)     0.91 [  8.96]( 2.83)     0.94 [  6.34]( 2.02)
> 16-groups     1.00 [  0.00]( 0.57)     1.19 [-18.91]( 2.82)     0.74 [ 26.02]( 1.33)

Nice, this matches what I observed when running my benchmarks as well.

> ==================================================================
> Test          : tbench 
> Units         : Normalized throughput 
> Interpretation: Higher is better 
> Statistic     : AMean 
> ==================================================================
> Clients:   tip[pct imp](CV)      sh_rq_v3[pct imp](CV)    sh_rq_v3_tgload_fix[pct imp](CV)
>     1     1.00 [  0.00]( 0.26)     0.99 [ -1.25]( 0.13)     0.98 [ -2.15]( 0.49)
>     2     1.00 [  0.00]( 0.37)     0.98 [ -2.33]( 0.88)     0.98 [ -2.21]( 0.53)
>     4     1.00 [  0.00]( 0.66)     0.99 [ -1.32]( 0.91)     0.98 [ -2.12]( 0.79)
>     8     1.00 [  0.00]( 2.14)     0.99 [ -0.53]( 2.45)     1.00 [ -0.23]( 2.18)
>    16     1.00 [  0.00]( 1.08)     0.97 [ -3.37]( 2.12)     0.95 [ -5.28]( 1.92)
>    32     1.00 [  0.00]( 2.90)     0.44 [-55.53]( 1.44)     0.98 [ -2.23]( 1.72)
>    64     1.00 [  0.00]( 1.02)     0.27 [-72.58]( 0.35)     0.74 [-25.64]( 2.43)
>   128     1.00 [  0.00]( 0.88)     0.19 [-81.29]( 0.51)     0.52 [-48.47]( 3.92)
>   256     1.00 [  0.00]( 0.28)     0.17 [-82.80]( 0.29)     0.88 [-12.23]( 1.76)
>   512     1.00 [  0.00]( 2.78)     1.33 [ 33.50]( 4.12)     1.22 [ 22.33]( 2.59)
>  1024     1.00 [  0.00]( 0.46)     1.34 [ 34.27]( 0.37)     1.31 [ 31.36]( 1.65)
>  2048     1.00 [  0.00]( 0.75)     1.40 [ 40.42]( 0.05)     1.20 [ 20.09]( 1.98)
> 
> 
> ==================================================================
> Test          : stream (10 Runs)
> Units         : Normalized Bandwidth, MB/s 
> Interpretation: Higher is better 
> Statistic     : HMean 
> ==================================================================
> Test:     tip[pct imp](CV)      sh_rq_v3[pct imp](CV)    sh_rq_v3_tgload_fix[pct imp](CV)
>  Copy     1.00 [  0.00]( 0.84)     1.00 [ -0.22]( 0.59)     1.00 [  0.08]( 0.90)
> Scale     1.00 [  0.00]( 0.42)     1.00 [ -0.33]( 0.39)     1.00 [ -0.15]( 0.42)
>   Add     1.00 [  0.00]( 0.58)     1.00 [ -0.48]( 0.28)     1.00 [ -0.22]( 0.34)
> Triad     1.00 [  0.00]( 0.41)     0.99 [ -0.65]( 0.38)     1.00 [ -0.29]( 0.34)
> 
> 
> ==================================================================
> Test          : stream (100 runs)
> Units         : Normalized Bandwidth, MB/s 
> Interpretation: Higher is better 
> Statistic     : HMean 
> ==================================================================
> Test:     tip[pct imp](CV)       sh_rq_v3[pct imp](CV)    sh_rq_v3_tgload_fix[pct imp](CV)
>  Copy     1.00 [  0.00]( 0.52)     1.00 [ -0.16]( 0.45)     1.00 [  0.35]( 0.73)
> Scale     1.00 [  0.00]( 0.35)     1.00 [ -0.20]( 0.38)     1.00 [  0.07]( 0.34)
>   Add     1.00 [  0.00]( 0.37)     1.00 [ -0.07]( 0.42)     1.00 [  0.07]( 0.46)
> Triad     1.00 [  0.00]( 0.57)     1.00 [ -0.22]( 0.45)     1.00 [ -0.04]( 0.49)
> 
> 
> ==================================================================
> Test          : netperf 

Could you please share exactly how you're invoking netperf? Is this
with -t UDP_RR (which is what Aaron was running with), or the default?

[...]

As far as I can tell there weren't many changes between v2 and v3 that
could have caused this regression. I strongly suspect the heuristic I
mentioned above, especially with your analysis on us excessively calling
load_balance().

Thanks,
David

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 0/7] sched: Implement shared runqueue in CFS
  2023-08-18  5:03   ` David Vernet
@ 2023-08-18  8:49     ` Gautham R. Shenoy
  2023-08-24 11:14       ` Gautham R. Shenoy
  0 siblings, 1 reply; 52+ messages in thread
From: Gautham R. Shenoy @ 2023-08-18  8:49 UTC (permalink / raw)
  To: David Vernet
  Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	tj, roman.gushchin, kprateek.nayak, aaron.lu, wuyun.abel,
	kernel-team

Hello David,

On Fri, Aug 18, 2023 at 12:03:55AM -0500, David Vernet wrote:
> On Thu, Aug 17, 2023 at 02:12:03PM +0530, Gautham R. Shenoy wrote:
> > Hello David,
> 
> Hello Gautham,
> 
> Thanks a lot as always for running some benchmarks and analyzing these
> changes.
> 
> > On Wed, Aug 09, 2023 at 05:12:11PM -0500, David Vernet wrote:
> > > Changes
> > > -------
> > > 
> > > This is v3 of the shared runqueue patchset. This patch set is based off
> > > of commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs
> > > bandwidth in use") on the sched/core branch of tip.git.
> > 
> > 
> > I tested the patches on Zen3 and Zen4 EPYC Servers like last time. I
> notice that apart from hackbench, every other benchmark is showing
> > regressions with this patch series. Quick summary of my observations:
> 
> Just to verify per our prior conversation [0], was this latest set of
> benchmarks run with boost disabled?

Boost is enabled by default. I will queue a run tonight with boost
disabled.
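
FWIW, the usual way to toggle it globally (assuming the acpi-cpufreq
driver is in use; amd_pstate exposes it differently) is via the cpufreq
sysfs knob:

	# echo 0 > /sys/devices/system/cpu/cpufreq/boost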

> Your analysis below seems to
> indicate pretty clearly that v3 regresses some workloads regardless of
> boost, but I did want to double check so we can bear it in mind when
> looking over the results.
> 
> [0]: https://lore.kernel.org/all/ZMn4dLePB45M5CGa@BLR-5CG11610CF.amd.com/
> 
> > * With shared-runqueue enabled, tbench and netperf both stop scaling
> >   when we go beyond 32 clients and the scaling issue persists until
> >   the system is overutilized. When the system is overutilized,
> >   shared-runqueue is able to recover quite splendidly and outperform
> >   tip.
> 
> Hmm, I still don't understand why we perform better when the system is
> overutilized. I'd expect vanilla CFS to perform better than shared_runq
> in such a scenario in general, as the system will be fully utilized
> regardless.

My hunch is that when every rq is equally loaded, we perhaps don't
have so much time to spend doing newidle load balance at higher levels
because any idle CPU will likely find a task to pull from the shared
runqueue.

IOW, the shared runqueue feature is generally useful, but we need to
figure out why we are doing excessive load-balancing when the system is
moderately utilized.


[..snip..]

> > 
> > ==================================================================
> > Test          : tbench 
> > Units         : Normalized throughput 
> > Interpretation: Higher is better 
> > Statistic     : AMean 
> > ==================================================================
> > Clients:  tip[pct imp](CV)       sh_rq_v3[pct imp](CV)    sh_rq_v3_tgload_fix[pct imp](CV)
> >    32     1.00 [  0.00]( 2.90)     0.44 [-55.53]( 1.44)     0.98 [ -2.23]( 1.72)
> >    64     1.00 [  0.00]( 1.02)     0.27 [-72.58]( 0.35)     0.74 [-25.64]( 2.43)
> >   128     1.00 [  0.00]( 0.88)     0.19 [-81.29]( 0.51)     0.52 [-48.47]( 3.92)
> >   256     1.00 [  0.00]( 0.28)     0.17 [-82.80]( 0.29)     0.88 [-12.23]( 1.76)
> 
> Just to make sure we're on the same page, "CV" here is the coefficient
> of variation (i.e. standard deviation / mean), correct?

Yes, CV is the coefficient of variation: the ratio of the standard
deviation to the mean.


[..snip..]

> > So, I counted the number of times the CPUs call find_busiest_group()
> > without and with shared_rq and the distribution is quite stark.
> > 
> > =====================================================
> > per-cpu             :          Number of CPUs       :
> > find_busiest_group  :----------------:--------------:
> > count               :  without-sh.rq :  with-sh.rq  :
> > =====================================:===============
> > [      0 -  200000) :     77
> > [ 200000 -  400000) :     41
> > [ 400000 -  600000) :     64
> > [ 600000 -  800000) :     63
> > [ 800000 - 1000000) :     66
> > [1000000 - 1200000) :     69
> > [1200000 - 1400000) :     52
> > [1400000 - 1600000) :     34              5
> > [1600000 - 1800000) :     17		 31
> > [1800000 - 2000000) :      6		 59
> > [2000000 - 2200000) :     13		109
> > [2200000 - 2400000) :      4		120
> > [2400000 - 2600000) :      3		157
> > [2600000 - 2800000) :      1		 29
> > [2800000 - 3000000) :      1		  2
> > [9200000 - 9400000) :      1
> > 
> > As you can notice, without the shared-rq, the per-CPU
> > find_busiest_group() call counts are clustered at the lower end of the
> > distribution, which implies fewer calls in total. With shared-rq
> > enabled, the distribution is normal, but shifted to the right, which
> > implies a lot more calls to find_busiest_group().
> 
> Huh, that's very much unexpected for obvious reasons -- we should be
> hitting the load_balance() path _less_ due to scheduling tasks with the
> shared_runq.

I would like to verify what the shared_rq hit-ratio is when the system
is moderately loaded while running short-running tasks such as
tbench/netperf. My hunch is that in the moderately loaded case, the
newly idle CPUs are not finding any task in the shared-runqueue.
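
A rough way to measure that hit-ratio (assuming
shared_runq_pick_next_task() is not inlined, so that a kretprobe can
attach to it) would be to bucket its return values, e.g. with bpftrace:

	bpftrace -e 'kretprobe:shared_runq_pick_next_task { @ret[retval] = count(); }'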


> 
> > To investigate further where this is coming from, I reran tbench with
> > sched-scoreboard (https://github.com/AMDESE/sched-scoreboard), and the
> > schedstats show that the total wait-time of the tasks on the runqueue
> > *increases* by a significant amount when shared-rq is enabled.
> > 
> > Further, if you notice the newidle load_balance() attempts at the DIE
> > and the NUMA domains, they are significantly higher when shared-rq is
> > enabled. So it appears that a lot more time is being spent trying to
> > do load-balancing when shared runqueue is enabled, which is counter
> > intuitive.
> 
> Certainly agreed on it being counter intuitive. I wonder if this is due
> to this change in the latest v3 revision [1]:
> 
> @@ -12093,6 +12308,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  	rcu_read_lock();
>  	sd = rcu_dereference_check_sched_domain(this_rq->sd);
>  
> +	/*
> +	 * Skip <= LLC domains as they likely won't have any tasks if the
> +	 * shared runq is empty.
> +	 */
> +	if (sched_feat(SHARED_RUNQ)) {
> +		sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> +		if (likely(sd))
> +			sd = sd->parent;
> +	}
> +
>  	if (!READ_ONCE(this_rq->rd->overload) ||
>  	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
> 
> [1]: https://lore.kernel.org/all/20230809221218.163894-7-void@manifault.com/
> 
> I originally added this following Peter's suggestion in v2 [2] with the
> idea that we'd skip the <= LLC domains when shared_runq is enabled, but
> in hindsight, we also aren't walking the shared_runq list until we find
> a task that can run on the current core. If there's a task, and it can't
> run on the current core, we give up and proceed to the rest of
> newidle_balance(). So it's possible that we're incorrectly assuming
> there's no task in the current LLC because it wasn't enqueued at the
> head of the shared_runq. I think we'd only want to add this improvement
> if we walked the list as you tried in v1 of the patch set, or revert it
> otherwise.

Yes, the optimization does make sense if we are sure that there are no
tasks to be pulled in the SMT and the MC domain. Since I am not
pinning any tasks, if a newly idle CPU is doing load-balance it is
very likely because the shared-rq is empty. Which implies that the the
SMT and MC domains are not overloaded.

But that also means exploring load-balance at a fairly large DIE
domain much earlier. And once we go past the should_we_balance()
check, we bail out only after doing find_busiest_group()/find_busiest_queue(),
which can take quite a bit of time on a DIE domain that spans 256
threads. By this time, if a task was woken up on the CPU, it would
have to wait for the load-balance to complete.

In any case, this can be easily verified by reverting the
optimization.
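
The revert amounts to dropping the hunk quoted above from
newidle_balance(), i.e. roughly:

-	/*
-	 * Skip <= LLC domains as they likely won't have any tasks if the
-	 * shared runq is empty.
-	 */
-	if (sched_feat(SHARED_RUNQ)) {
-		sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
-		if (likely(sd))
-			sd = sd->parent;
-	}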

> 
> [2]: https://lore.kernel.org/lkml/20230711094547.GE3062772@hirez.programming.kicks-ass.net/
> 
> Would you be able to try running your benchmarks again with that change
> removed, or with your original idea of walking the list added?

I will queue this for later today.

> I've
> tried to reproduce this issue on my 7950x, as well as a single-socket /
> CCX 26 core / 52 thread Cooper Lake host, but have been unable to. For
> example, if I run funccount.py 'load_balance' -d 30 (from [3]) while
> running the below netperf command on the Cooper Lake, this is what I
> see:
> 
> for i in `seq 128`; do netperf -6 -t UDP_RR -c -C -l 60 & done
> 
> NO_SHARED_RUNQ
> --------------
> FUNC                                    COUNT
> b'load_balance'                         39636
> 
> 
> SHARED_RUNQ
> -----------
> FUNC                                    COUNT
> b'load_balance'                         32345
> 
> [3]: https://github.com/iovisor/bcc/blob/master/tools/funccount.py
> 
> The stack traces also don't show us running load_balance() excessively.
> Please feel free to share how you're running tbench as well and I can
> try that on my end.

This is how I am running tbench:

# wget https://www.samba.org/ftp/tridge/dbench/dbench-4.0.tar.gz
# tar xvf dbench-4.0.tar.gz
# cd dbench-4.0
# ./autogen.sh
# ./configure 
# make
# nohup ./tbench_srv 0 &
# ./tbench -t 60 <nr-clients> -c ./client.txt


[..snip..]

> 
> > ==================================================================
> > Test          : hackbench 
> > Units         : Normalized time in seconds 
> > Interpretation: Lower is better 
> > Statistic     : AMean 
> > ==================================================================
> > Case:        tip[pct imp](CV)        sh_rq_v3[pct imp](CV)    sh_rq_v3_tgload_fix[pct imp](CV)
> >  1-groups     1.00 [  0.00]( 8.41)     0.96 [  3.63]( 6.04)     0.94 [  6.48]( 9.16)
> >  2-groups     1.00 [  0.00](12.96)     0.96 [  4.46]( 9.76)     0.89 [ 11.02]( 8.28)
> >  4-groups     1.00 [  0.00]( 2.90)     0.85 [ 14.77]( 9.18)     0.86 [ 14.35](13.26)
> >  8-groups     1.00 [  0.00]( 1.06)     0.91 [  8.96]( 2.83)     0.94 [  6.34]( 2.02)
> > 16-groups     1.00 [  0.00]( 0.57)     1.19 [-18.91]( 2.82)     0.74 [ 26.02]( 1.33)
> 
> Nice, this matches what I observed when running my benchmarks as well.

Yes, hackbench seems to benefit from the shared-runqueue patches, more
so with Aaron's ratelimiting patch.

[..snip..]


> > Test          : netperf 
> 
> Could you please share exactly how you're invoking netperf? Is this
> with -t UDP_RR (which is what Aaron was running with), or the default?

No, this is how I am invoking netperf.

Having started the netserver listening on 127.0.0.1,

I run N copies of the following command, where N is the number of clients.

netperf -H 127.0.0.1 -t TCP_RR -l 100 -- -r 100 -k REQUEST_SIZE,RESPONSE_SIZE,ELAPSED_TIME,THROUGHPUT,THROUGHPUT_UNITS,MIN_LATENCY,MEAN_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MAX_LATENCY,STDDEV_LATENCY

> 
> [...]
> 
> As far as I can tell there weren't many changes between v2 and v3 that
> could have caused this regression. I strongly suspect the heuristic I
> mentioned above, especially with your analysis on us excessively calling
> load_balance().

Sure. I will get back once I have those results.

> 
> Thanks,
> David

--
Thanks and Regards
gautham.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 0/7] sched: Implement shared runqueue in CFS
  2023-08-18  8:49     ` Gautham R. Shenoy
@ 2023-08-24 11:14       ` Gautham R. Shenoy
  2023-08-24 22:51         ` David Vernet
  0 siblings, 1 reply; 52+ messages in thread
From: Gautham R. Shenoy @ 2023-08-24 11:14 UTC (permalink / raw)
  To: David Vernet
  Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	tj, roman.gushchin, kprateek.nayak, aaron.lu, wuyun.abel,
	kernel-team

Hello David,

On Fri, Aug 18, 2023 at 02:19:03PM +0530, Gautham R. Shenoy wrote:
> Hello David,
> 
> On Fri, Aug 18, 2023 at 12:03:55AM -0500, David Vernet wrote:
> > On Thu, Aug 17, 2023 at 02:12:03PM +0530, Gautham R. Shenoy wrote:
> > > Hello David,
> > 
> > Hello Gautham,
> > 
> > Thanks a lot as always for running some benchmarks and analyzing these
> > changes.
> > 
> > > On Wed, Aug 09, 2023 at 05:12:11PM -0500, David Vernet wrote:
> > > > Changes
> > > > -------
> > > > 
> > > > This is v3 of the shared runqueue patchset. This patch set is based off
> > > > of commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs
> > > > bandwidth in use") on the sched/core branch of tip.git.
> > > 
> > > 
> > > I tested the patches on Zen3 and Zen4 EPYC Servers like last time. I
> > > notice that apart from hackbench, every other bechmark is showing
> > > regressions with this patch series. Quick summary of my observations:
> > 
> > Just to verify per our prior conversation [0], was this latest set of
> > benchmarks run with boost disabled?
> 
> Boost is enabled by default. I will queue a run tonight with boost
> disabled.

Apologies for the delay. I didn't see any changes with boost-disabled
and with reverting the optimization to bail out of the
newidle_balance() for SMT and MC domains when there was no task to be
pulled from the shared-runq. I reran the whole thing once again, just
to rule out any possible variance. The results came out the same.

With the boost disabled, and the optimization reverted, the results
don't change much.

It doesn't appear that the optimization is the cause for the increase
in the number of load-balancing attempts at the DIE and the NUMA
domains. I have shared the newidle_balance() counts with and without
SHARED_RUNQ below for tbench, and it can be noticed that the counts are
significantly higher for 64 and 128 clients. I also captured the
counts/s of find_busiest_group() using funccount.py, which tells the
same story. So the drop in performance for tbench with your patches
strongly correlates with the increase in load-balancing attempts.

newidle balance is undertaken only if the overload flag is set and the
expected idle duration is greater than the avg load balancing cost. It
is hard to imagine why the shared runq should cause the overload flag
to be set!
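
For reference, the gate I am referring to looks roughly like this in
newidle_balance() (paraphrased from fair.c):

	if (!READ_ONCE(this_rq->rd->overload) ||
	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
		if (sd)
			update_next_balance(sd, &next_balance);
		rcu_read_unlock();
		goto out;
	}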


Detailed Results are as follows:
=============================================================
Test Machine : 2 Socket Zen4 with 128 cores per socket, SMT enabled.

tip             : commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop
                  when cfs bandwidth in use")
v3              : v3 of the shared_runq patch
v3-tgfix        : v3+ Aaron's RFC v1 patch to ratelimit the updates to tg->load_avg
v3-tgfix-no-opt : v3-tgfix + reverted the optimization to bail out of
                  newidle-balance for SMT and MC domains when there
                  are no tasks in the shared-runq

In the results below, I have chosen the first row, first column in the
table as the baseline so that we get an idea of the scalability issues
as the number of groups/clients/workers increase.

==================================================================
Test          : hackbench 
Units         : Normalized time in seconds 
Interpretation: Lower is better 
Statistic     : AMean 
==================================================================
Case:         tip[pct imp](CV)            v3[pct imp](CV)      v3-tgfix[pct imp](CV)    v3-tgfix-no-opt[pct imp](CV)
 1-groups     1.00 [ -0.00]( 4.22)     0.92 [  7.75]( 9.09)     0.88 [ 11.53](10.61)     0.85 [ 15.31]( 8.20)
 2-groups     0.88 [ -0.00](11.65)     0.85 [  2.95](10.77)     0.88 [ -0.91]( 9.69)     0.88 [ -0.23]( 9.20)
 4-groups     1.08 [ -0.00]( 3.70)     0.93 [ 13.86](11.03)     0.90 [ 16.08]( 9.57)     0.83 [ 22.92]( 6.98)
 8-groups     1.32 [ -0.00]( 0.63)     1.16 [ 12.33]( 9.05)     1.21 [  8.72]( 5.54)     1.17 [ 11.13]( 5.29)
16-groups     1.71 [ -0.00]( 0.63)     1.93 [-12.65]( 4.68)     1.27 [ 25.87]( 1.31)     1.25 [ 27.15]( 1.10)


==================================================================
Test          : tbench 
Units         : Normalized throughput 
Interpretation: Higher is better 
Statistic     : AMean 
==================================================================
Clients:   tip[pct imp](CV)            v3[pct imp](CV)      v3-tgfix[pct imp](CV)    v3-tgfix-no-opt[pct imp](CV)
    1      1.00 [  0.00]( 0.18)      0.99 [ -0.99]( 0.18)      0.98 [ -2.08]( 0.10)      0.98 [ -2.19]( 0.24)
    2      1.95 [  0.00]( 0.65)      1.93 [ -1.04]( 0.72)      1.95 [ -0.37]( 0.31)      1.92 [ -1.73]( 0.39)
    4      3.80 [  0.00]( 0.59)      3.78 [ -0.53]( 0.37)      3.73 [ -1.66]( 0.58)      3.77 [ -0.79]( 0.97)
    8      7.49 [  0.00]( 0.37)      7.41 [ -1.12]( 0.39)      7.24 [ -3.42]( 1.99)      7.39 [ -1.39]( 1.53)
   16     14.78 [  0.00]( 0.84)     14.60 [ -1.24]( 1.51)     14.30 [ -3.28]( 1.28)     14.46 [ -2.18]( 0.78)
   32     28.18 [  0.00]( 1.26)     26.59 [ -5.65]( 0.46)     27.70 [ -1.71]( 0.92)     27.08 [ -3.90]( 0.83)
   64     55.05 [  0.00]( 1.56)     18.25 [-66.85]( 0.25)     48.07 [-12.68]( 1.51)     47.46 [-13.79]( 2.70)
  128    102.26 [  0.00]( 1.03)     21.74 [-78.74]( 0.65)     54.65 [-46.56]( 1.35)     54.69 [-46.52]( 1.16)
  256    156.69 [  0.00]( 0.27)     25.47 [-83.74]( 0.07)    130.85 [-16.49]( 0.57)    125.00 [-20.23]( 0.35)
  512    223.22 [  0.00]( 8.25)    236.98 [  6.17](17.10)    274.47 [ 22.96]( 0.44)    276.95 [ 24.07]( 3.37)
 1024    237.98 [  0.00]( 1.09)    299.72 [ 25.94]( 0.24)    304.89 [ 28.12]( 0.73)    300.37 [ 26.22]( 1.16)
 2048    242.13 [  0.00]( 0.37)    311.38 [ 28.60]( 0.24)    299.82 [ 23.82]( 1.35)    291.32 [ 20.31]( 0.66)


I reran tbench for v3-tgfix-no-opt, to collect the newidle balance
counts via schedstat as well as the find_busiest_group() counts via
funccount.py.
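
The find_busiest_group() counts were collected per second; an
invocation along the following lines (the exact one may differ) does
that, with the wildcard matching compiler-suffixed variants such as
find_busiest_group.constprop.0:

	./funccount.py -i 1 -d 60 'find_busiest_group*'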

Comparison of the newidle balance counts across different
sched-domains for "v3-tgfix-no-opt" kernel with NO_SHARED_RUNQ vs
SHARED_RUNQ. We see a huge blowup for the DIE and the NUMA domains
when the number of clients is 64 or 128. The value within |xx.yy|
indicates the percentage change when the difference is significant.

============== SMT load_balance with CPU_NEWLY_IDLE ===============================
   1 clients: count : 1986, 1960 
   2 clients: count : 5777, 6543     |  13.26|
   4 clients: count : 16775, 15274   |  -8.95|
   8 clients: count : 37086, 32715   | -11.79|
  16 clients: count : 69627, 65652   |  -5.71|
  32 clients: count : 152288, 42723  | -71.95|
  64 clients: count : 216396, 169545 | -21.65|
 128 clients: count : 219570, 649880 | 195.98|
 256 clients: count : 443595, 951933 | 114.60|
 512 clients: count : 5498, 1949     | -64.55|
1024 clients: count : 60, 3          | -95.00|
================ MC load_balance with CPU_NEWLY_IDLE ===============================
   1 clients: count : 1954, 1943
   2 clients: count : 5775, 6541      |  13.26|
   4 clients: count : 15468, 15087 
   8 clients: count : 31941, 32140 
  16 clients: count : 57312, 62553    |   9.14|
  32 clients: count : 125791, 34386   | -72.66|
  64 clients: count : 181406, 133978  | -26.14|
 128 clients: count : 191143, 607594  | 217.87|
 256 clients: count : 388696, 584568  |  50.39| 
 512 clients: count : 2677, 218       | -91.86|
1024 clients: count : 22, 3           | -86.36|
=============== DIE load_balance with CPU_NEWLY_IDLE ===============================
   1 clients: count : 10, 15          |   50.00|
   2 clients: count : 15, 56          |  273.33|
   4 clients: count : 65, 149         |  129.23|
   8 clients: count : 242, 412        |   70.25|
  16 clients: count : 509, 1235       |  142.63|
  32 clients: count : 909, 1371       |   50.83|
  64 clients: count : 1288, 59596     | 4527.02| <===
 128 clients: count : 666, 281426     |42156.16| <===
 256 clients: count : 213, 1463       |  586.85|
 512 clients: count : 28, 23          |  -17.86|
1024 clients: count : 10, 3           |  -70.00|
============== NUMA load_balance with CPU_NEWLY_IDLE ===============================
   1 clients: count : 9, 9 
   2 clients: count : 13, 14
   4 clients: count : 21, 21
   8 clients: count : 27, 29
  16 clients: count : 29, 50         |   72.41|
  32 clients: count : 29, 67         |  131.03|
  64 clients: count : 28, 9138       |32535.71|  <===
 128 clients: count : 25, 24234      |96836.00|  <===
 256 clients: count : 12, 11
 512 clients: count : 7, 3  
1024 clients: count : 4, 3 


Further, collected the find_busiest_group() count/s using
funccount.py.

Notice that with 128 clients, most samples with SHARED_RUNQ fall into
the bucket which is > 2x of the buckets where we have most of the
samples of NO_SHARED_RUNQ runs.

128 clients: find_busiest_group() count/s
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fbg count bucket       NO_SHARED_RUNQ   SHARED_RUNQ
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[2000000 - 2500000) :     23
[2500000 - 3000000) :     19               
[3000000 - 3500000) :     19               1
[3500000 - 4000000) :      3               3
[7500000 - 8000000) :                      5
[8000000 - 8500000) :                     54   <===

With 1024 clients, there is not a whole lot of difference in the
find_busiest_group() distribution with and without the SHARED_RUNQ.

1024 clients: find_busiest_group() count/s
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fbg count bucket       NO_SHARED_RUNQ   SHARED_RUNQ
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[  4000 -   5000) :      1
[  7000 -   8000) :      2                  2
[  8000 -   9000) :      1                  2
[  9000 -  10000) :     57                 44  <===
[ 10000 -  11000) :      3                 13
[ 18000 -  19000) :      1                  1



==================================================================
Test          : stream (10  Runs)
Units         : Normalized Bandwidth, MB/s 
Interpretation: Higher is better 
Statistic     : HMean 
==================================================================
Test:     tip[pct imp](CV)            v3[pct imp](CV)      v3-tgfix[pct imp](CV)    v3-tgfix-no-opt[pct imp](CV)
 Copy     1.00 [  0.00]( 0.53)     1.00 [  0.01]( 0.77)     1.00 [ -0.22]( 0.55)     1.00 [  0.12]( 0.71)
Scale     0.95 [  0.00]( 0.23)     0.95 [  0.21]( 0.63)     0.95 [  0.13]( 0.22)     0.95 [  0.02]( 0.87)
  Add     0.97 [  0.00]( 0.27)     0.98 [  0.40]( 0.59)     0.98 [  0.52]( 0.31)     0.98 [  0.16]( 0.85)
Triad     0.98 [  0.00]( 0.28)     0.98 [  0.33]( 0.55)     0.98 [  0.34]( 0.29)     0.98 [  0.05]( 0.96)


==================================================================
Test          : stream (100 Runs)
Units         : Normalized Bandwidth, MB/s 
Interpretation: Higher is better 
Statistic     : HMean 
==================================================================
Test:     tip[pct imp](CV)            v3[pct imp](CV)      v3-tgfix[pct imp](CV)    v3-tgfix-no-opt[pct imp](CV)
 Copy     1.00 [  0.00]( 1.01)     1.00 [ -0.38]( 0.34)     1.00 [  0.08]( 1.19)     1.00 [ -0.18]( 0.38)
Scale     0.95 [  0.00]( 0.46)     0.95 [ -0.39]( 0.52)     0.94 [ -0.72]( 0.34)     0.94 [ -0.66]( 0.40)
  Add     0.98 [  0.00]( 0.16)     0.98 [ -0.40]( 0.53)     0.97 [ -0.80]( 0.26)     0.97 [ -0.79]( 0.34)
Triad     0.98 [  0.00]( 0.14)     0.98 [ -0.35]( 0.54)     0.97 [ -0.79]( 0.17)     0.97 [ -0.79]( 0.28)


==================================================================
Test          : netperf 
Units         : Normalized Througput per client
Interpretation: Higher is better 
Statistic     : AMean 
==================================================================
Clients:        tip[pct imp](CV)            v3[pct imp](CV)      v3-tgfix[pct imp](CV)    v3-tgfix-no-opt[pct imp](CV)
 1-clients      1.00 [  0.00]( 0.84)     0.99 [ -0.64]( 0.10)     0.97 [ -2.61]( 0.29)     0.98 [ -2.24]( 0.16)
 2-clients      1.00 [  0.00]( 0.47)     0.99 [ -1.07]( 0.42)     0.98 [ -2.27]( 0.33)     0.97 [ -2.75]( 0.24)
 4-clients      1.01 [  0.00]( 0.45)     0.99 [ -1.41]( 0.39)     0.98 [ -2.82]( 0.31)     0.97 [ -3.23]( 0.23)
 8-clients      1.00 [  0.00]( 0.39)     0.99 [ -1.95]( 0.29)     0.98 [ -2.78]( 0.25)     0.97 [ -3.62]( 0.39)
16-clients      1.00 [  0.00]( 1.81)     0.97 [ -2.77]( 0.41)     0.97 [ -3.26]( 0.35)     0.96 [ -3.99]( 1.45)
32-clients      1.00 [  0.00]( 1.87)     0.39 [-60.63]( 1.29)     0.95 [ -4.68]( 1.45)     0.95 [ -4.89]( 1.41)
64-clients      0.98 [  0.00]( 2.70)     0.24 [-75.29]( 1.26)     0.66 [-33.23]( 0.99)     0.65 [-34.05]( 2.39)
128-clients     0.90 [  0.00]( 2.48)     0.14 [-84.47]( 3.63)     0.36 [-60.00]( 1.37)     0.36 [-60.36]( 1.54)
256-clients     0.67 [  0.00]( 2.91)     0.08 [-87.79]( 9.27)     0.54 [-20.38]( 3.69)     0.52 [-22.94]( 3.81)
512-clients     0.36 [  0.00]( 8.11)     0.51 [ 39.96]( 4.92)     0.38 [  5.12]( 6.24)     0.39 [  5.88]( 6.13)


==================================================================
Test          : schbench throughput
Units         : Normalized Requests per second 
Interpretation: Higher is better 
Statistic     : Median 
==================================================================
#workers: tip[pct imp](CV)          v3[pct imp](CV)      v3-tgfix[pct imp](CV)    v3-tgfix-no-opt[pct imp](CV)
  1      1.00 [  0.00]( 0.24)      1.01 [  0.93]( 0.00)      1.01 [  0.93]( 0.24)      1.00 [  0.47]( 0.24)
  2      2.01 [  0.00]( 0.12)      2.03 [  0.93]( 0.00)      2.03 [  1.16]( 0.00)      2.01 [  0.00]( 0.12)
  4      4.03 [  0.00]( 0.12)      4.06 [  0.70]( 0.00)      4.07 [  0.93]( 0.00)      4.02 [ -0.23]( 0.24)
  8      8.05 [  0.00]( 0.00)      8.12 [  0.93]( 0.00)      8.14 [  1.16]( 0.00)      8.07 [  0.23]( 0.00)
 16     16.17 [  0.00]( 0.12)     16.24 [  0.46]( 0.12)     16.28 [  0.69]( 0.00)     16.17 [  0.00]( 0.12)
 32     32.34 [  0.00]( 0.12)     32.49 [  0.46]( 0.00)     32.56 [  0.69]( 0.00)     32.34 [  0.00]( 0.00)
 64     64.52 [  0.00]( 0.12)     64.82 [  0.46]( 0.00)     64.97 [  0.70]( 0.00)     64.52 [  0.00]( 0.00)
128    127.25 [  0.00]( 1.48)    121.57 [ -4.47]( 0.38)    120.37 [ -5.41]( 0.13)    120.07 [ -5.64]( 0.34)
256    135.33 [  0.00]( 0.11)    136.52 [  0.88]( 0.11)    136.22 [  0.66]( 0.11)    136.52 [  0.88]( 0.11)
512    107.81 [  0.00]( 0.29)    109.91 [  1.94]( 0.92)    109.91 [  1.94]( 0.14)    109.91 [  1.94]( 0.14)


==================================================================
Test          : schbench wakeup-latency 
Units         : Normalized 99th percentile latency in us 
Interpretation: Lower is better 
Statistic     : Median 
==================================================================

#workers: tip[pct imp](CV)          v3[pct imp](CV)      v3-tgfix[pct imp](CV)    v3-tgfix-no-opt[pct imp](CV)
  1       1.00 [ -0.00](14.08)       0.80 [ 20.00](11.92)       1.00 [ -0.00]( 9.68)       1.40 [-40.00](18.75)
  2       1.20 [ -0.00]( 4.43)       1.10 [  8.33]( 4.84)       1.10 [  8.33]( 0.00)       1.10 [  8.33]( 4.56)
  4       1.10 [ -0.00]( 0.00)       1.10 [ -0.00]( 4.56)       1.10 [ -0.00]( 0.00)       1.10 [ -0.00]( 0.00)
  8       1.10 [ -0.00]( 0.00)       1.10 [ -0.00]( 4.56)       1.10 [ -0.00]( 0.00)       1.10 [ -0.00]( 0.00)
 16       1.10 [ -0.00]( 4.84)       1.20 [ -9.09]( 0.00)       1.10 [ -0.00]( 0.00)       1.10 [ -0.00]( 0.00)
 32       1.00 [ -0.00]( 0.00)       1.10 [-10.00]( 0.00)       1.10 [-10.00]( 0.00)       1.00 [ -0.00]( 0.00)
 64       1.00 [ -0.00]( 5.34)       1.10 [-10.00]( 0.00)       1.10 [-10.00]( 0.00)       1.10 [-10.00]( 0.00)
128       1.20 [ -0.00]( 4.19)       2.10 [-75.00]( 2.50)       2.10 [-75.00]( 2.50)       2.10 [-75.00]( 0.00)
256       5.90 [ -0.00]( 0.00)      12.10 [-105.08](14.03)     11.10 [-88.14]( 4.53)      12.70 [-115.25]( 5.17)
512    2627.20 [ -0.00]( 1.21)    2288.00 [ 12.91]( 9.76)    2377.60 [  9.50]( 2.40)    2281.60 [ 13.15]( 0.77)


==================================================================
Test          : schbench request-latency 
Units         : Normalized 99th percentile latency in us 
Interpretation: Lower is better 
Statistic     : Median 
==================================================================
#workers: tip[pct imp](CV)          v3[pct imp](CV)      v3-tgfix[pct imp](CV)    v3-tgfix-no-opt[pct imp](CV)
  1     1.00 [ -0.00]( 0.35)     1.00 [  0.34]( 0.17)     0.99 [  0.67]( 0.30)     1.00 [ -0.34]( 0.00)
  2     1.00 [ -0.00]( 0.17)     1.00 [  0.34]( 0.00)     0.99 [  1.01]( 0.00)     1.00 [ -0.34]( 0.17)
  4     1.00 [ -0.00]( 0.00)     1.00 [  0.34]( 0.00)     0.99 [  1.01]( 0.00)     1.00 [ -0.00]( 0.17)
  8     1.00 [ -0.00]( 0.17)     1.00 [  0.34]( 0.17)     0.99 [  1.34]( 0.18)     1.00 [  0.34]( 0.17)
 16     1.00 [ -0.00]( 0.00)     1.00 [  0.67]( 0.17)     0.99 [  1.34]( 0.35)     1.00 [ -0.00]( 0.00)
 32     1.00 [ -0.00]( 0.00)     1.00 [  0.67]( 0.00)     0.99 [  1.34]( 0.00)     1.00 [ -0.00]( 0.00)
 64     1.00 [ -0.00]( 0.00)     1.00 [  0.34]( 0.17)     1.00 [  0.67]( 0.00)     1.00 [ -0.00]( 0.17)
128     1.82 [ -0.00]( 0.83)     1.85 [ -1.48]( 0.00)     1.85 [ -1.85]( 0.37)     1.85 [ -1.85]( 0.19)
256     1.94 [ -0.00]( 0.18)     1.96 [ -1.04]( 0.36)     1.95 [ -0.69]( 0.18)     1.95 [ -0.35]( 0.18)
512    13.27 [ -0.00]( 5.00)    16.32 [-23.00]( 8.33)    16.16 [-21.78]( 1.05)    15.46 [-16.51]( 0.89)

 
 --
 Thanks and Regards
 gautham.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 0/7] sched: Implement shared runqueue in CFS
  2023-08-24 11:14       ` Gautham R. Shenoy
@ 2023-08-24 22:51         ` David Vernet
  2023-08-30  9:56           ` K Prateek Nayak
  0 siblings, 1 reply; 52+ messages in thread
From: David Vernet @ 2023-08-24 22:51 UTC (permalink / raw)
  To: Gautham R. Shenoy
  Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	tj, roman.gushchin, kprateek.nayak, aaron.lu, wuyun.abel,
	kernel-team

On Thu, Aug 24, 2023 at 04:44:19PM +0530, Gautham R. Shenoy wrote:
> Hello David,
> 
> On Fri, Aug 18, 2023 at 02:19:03PM +0530, Gautham R. Shenoy wrote:
> > Hello David,
> > 
> > On Fri, Aug 18, 2023 at 12:03:55AM -0500, David Vernet wrote:
> > > On Thu, Aug 17, 2023 at 02:12:03PM +0530, Gautham R. Shenoy wrote:
> > > > Hello David,
> > > 
> > > Hello Gautham,
> > > 
> > > Thanks a lot as always for running some benchmarks and analyzing these
> > > changes.
> > > 
> > > > On Wed, Aug 09, 2023 at 05:12:11PM -0500, David Vernet wrote:
> > > > > Changes
> > > > > -------
> > > > > 
> > > > > This is v3 of the shared runqueue patchset. This patch set is based off
> > > > > of commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs
> > > > > bandwidth in use") on the sched/core branch of tip.git.
> > > > 
> > > > 
> > > > I tested the patches on Zen3 and Zen4 EPYC Servers like last time. I
> > > > notice that apart from hackbench, every other bechmark is showing
> > > > regressions with this patch series. Quick summary of my observations:
> > > 
> > > Just to verify per our prior conversation [0], was this latest set of
> > > benchmarks run with boost disabled?
> > 
> > Boost is enabled by default. I will queue a run tonight with boost
> > disabled.
> 
> Apologies for the delay. I didn't see any changes with boost-disabled
> and with reverting the optimization to bail out of the
> newidle_balance() for SMT and MC domains when there was no task to be
> pulled from the shared-runq. I reran the whole thing once again, just
> to rule out any possible variance. The results came out the same.

Thanks a lot for taking the time to run more benchmarks.

> With the boost disabled, and the optimization reverted, the results
> don't change much.

Hmmm, I see. So, that was the only real substantive "change" between v2
-> v3. The other changes were supporting hotplug / domain recreation,
optimizing locking a bit, and fixing small bugs like the return value
from shared_runq_pick_next_task(), draining the queue when the feature
is disabled, and fixing the lkp errors.

With all that said, it seems very possible that the regression is due to
changes in sched/core between commit ebb83d84e49b ("sched/core: Avoid
multiple calling update_rq_clock() in __cfsb_csd_unthrottle()") in v2,
and commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs
bandwidth in use") in v3. EEVDF was merged in that window, so that could
be one explanation for the context switch rate being so much higher.
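
For example, something like the following against tip.git should list
the scheduler changes in that window:

	git log --oneline ebb83d84e49b..88c56cfeaec4 -- kernel/sched/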

> It doesn't appear that the optimization is the cause for the increase
> in the number of load-balancing attempts at the DIE and the NUMA
> domains. I have shared the newidle_balance() counts with and without
> SHARED_RUNQ below for tbench, and it can be noticed that the counts are
> significantly higher for 64 and 128 clients. I also captured the
> counts/s of find_busiest_group() using funccount.py, which tells the
> same story. So the drop in performance for tbench with your patches
> strongly correlates with the increase in load-balancing attempts.
> 
> newidle balance is undertaken only if the overload flag is set and the
> expected idle duration is greater than the avg load balancing cost. It
> is hard to imagine why the shared runq should cause the overload flag
> to be set!

Yeah, I'm not sure either about how or why shared_runq would cause this.
This is purely hypothetical, but is it possible that shared_runq causes
idle cores to on average _stay_ idle longer due to other cores pulling
tasks that would have otherwise been load balanced to those cores?

Meaning -- say CPU0 is idle, and there are tasks on other rqs which
could be load balanced. Without shared_runq, CPU0 might be woken up to
run a task from a periodic load balance. With shared_runq, any active
core that would otherwise have gone idle could pull the task, keeping
CPU0 idle.

What do you think? I could be totally off here.

From my perspective, I'm not too worried about this given that we're
seeing gains in other areas such as kernel compile as I showed in [0],
though I definitely would like to better understand it.

[0]: https://lore.kernel.org/all/20230809221218.163894-1-void@manifault.com/

> Detailed Results are as follows:
> =============================================================
> Test Machine : 2 Socket Zen4 with 128 cores per socket, SMT enabled.
> 
> tip             : commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop
>                   when cfs bandwidth in use")
> v3              : v3 of the shared_runq patch
> v3-tgfix        : v3+ Aaron's RFC v1 patch to ratelimit the updates to tg->load_avg
> v3-tgfix-no-opt : v3-tgfix + reverted the optimization to bail out of
>                   newidle-balance for SMT and MC domains when there
>                   are no tasks in the shared-runq
> 
> In the results below, I have chosen the first row, first column in the
> table as the baseline so that we get an idea of the scalability issues
> as the number of groups/clients/workers increase.
> 
> ==================================================================
> Test          : hackbench 
> Units         : Normalized time in seconds 
> Interpretation: Lower is better 
> Statistic     : AMean 
> ==================================================================
> Case:         tip[pct imp](CV)            v3[pct imp](CV)      v3-tgfix[pct imp](CV)    v3-tgfix-no-opt[pct imp](CV)
>  1-groups     1.00 [ -0.00]( 4.22)     0.92 [  7.75]( 9.09)     0.88 [ 11.53](10.61)     0.85 [ 15.31]( 8.20)
>  2-groups     0.88 [ -0.00](11.65)     0.85 [  2.95](10.77)     0.88 [ -0.91]( 9.69)     0.88 [ -0.23]( 9.20)
>  4-groups     1.08 [ -0.00]( 3.70)     0.93 [ 13.86](11.03)     0.90 [ 16.08]( 9.57)     0.83 [ 22.92]( 6.98)
>  8-groups     1.32 [ -0.00]( 0.63)     1.16 [ 12.33]( 9.05)     1.21 [  8.72]( 5.54)     1.17 [ 11.13]( 5.29)
> 16-groups     1.71 [ -0.00]( 0.63)     1.93 [-12.65]( 4.68)     1.27 [ 25.87]( 1.31)     1.25 [ 27.15]( 1.10)

Great, looks like Aaron's patch really helps.

> ==================================================================
> Test          : tbench 
> Units         : Normalized throughput 
> Interpretation: Higher is better 
> Statistic     : AMean 
> ==================================================================
> Clients:   tip[pct imp](CV)            v3[pct imp](CV)      v3-tgfix[pct imp](CV)    v3-tgfix-no-opt[pct imp](CV)
>     1      1.00 [  0.00]( 0.18)      0.99 [ -0.99]( 0.18)      0.98 [ -2.08]( 0.10)      0.98 [ -2.19]( 0.24)
>     2      1.95 [  0.00]( 0.65)      1.93 [ -1.04]( 0.72)      1.95 [ -0.37]( 0.31)      1.92 [ -1.73]( 0.39)
>     4      3.80 [  0.00]( 0.59)      3.78 [ -0.53]( 0.37)      3.73 [ -1.66]( 0.58)      3.77 [ -0.79]( 0.97)
>     8      7.49 [  0.00]( 0.37)      7.41 [ -1.12]( 0.39)      7.24 [ -3.42]( 1.99)      7.39 [ -1.39]( 1.53)
>    16     14.78 [  0.00]( 0.84)     14.60 [ -1.24]( 1.51)     14.30 [ -3.28]( 1.28)     14.46 [ -2.18]( 0.78)
>    32     28.18 [  0.00]( 1.26)     26.59 [ -5.65]( 0.46)     27.70 [ -1.71]( 0.92)     27.08 [ -3.90]( 0.83)
>    64     55.05 [  0.00]( 1.56)     18.25 [-66.85]( 0.25)     48.07 [-12.68]( 1.51)     47.46 [-13.79]( 2.70)
>   128    102.26 [  0.00]( 1.03)     21.74 [-78.74]( 0.65)     54.65 [-46.56]( 1.35)     54.69 [-46.52]( 1.16)
>   256    156.69 [  0.00]( 0.27)     25.47 [-83.74]( 0.07)    130.85 [-16.49]( 0.57)    125.00 [-20.23]( 0.35)
>   512    223.22 [  0.00]( 8.25)    236.98 [  6.17](17.10)    274.47 [ 22.96]( 0.44)    276.95 [ 24.07]( 3.37)
>  1024    237.98 [  0.00]( 1.09)    299.72 [ 25.94]( 0.24)    304.89 [ 28.12]( 0.73)    300.37 [ 26.22]( 1.16)
>  2048    242.13 [  0.00]( 0.37)    311.38 [ 28.60]( 0.24)    299.82 [ 23.82]( 1.35)    291.32 [ 20.31]( 0.66)
> 
> 
> I reran tbench for v3-tgfix-no-opt, to collect the newidle balance
> counts via schedstat as well as the find_busiest_group() counts via
> funccount.py.
> 
> Comparison of the newidle balance counts across different
> sched-domains for "v3-tgfix-no-opt" kernel with NO_SHARED_RUNQ vs
> SHARED_RUNQ. We see a huge blowup for the DIE and the NUMA domains
> when the number of clients is 64 or 128. The value within |xx.yy|
> indicates the percentage increase when the difference is significant.
> 
> ============== SMT load_balance with CPU_NEWLY_IDLE ===============================
>    1 clients: count : 1986, 1960 
>    2 clients: count : 5777, 6543     |  13.26|
>    4 clients: count : 16775, 15274   |  -8.95|
>    8 clients: count : 37086, 32715   | -11.79|
>   16 clients: count : 69627, 65652   |  -5.71|
>   32 clients: count : 152288, 42723  | -71.95|
>   64 clients: count : 216396, 169545 | -21.65|
>  128 clients: count : 219570, 649880 | 195.98|
>  256 clients: count : 443595, 951933 | 114.60|
>  512 clients: count : 5498, 1949     | -64.55|
> 1024 clients: count : 60, 3          | -95.00|
> ================ MC load_balance with CPU_NEWLY_IDLE ===============================
>    1 clients: count : 1954, 1943
>    2 clients: count : 5775, 6541      |  13.26|
>    4 clients: count : 15468, 15087 
>    8 clients: count : 31941, 32140 
>   16 clients: count : 57312, 62553    |   9.14|
>   32 clients: count : 125791, 34386   | -72.66|
>   64 clients: count : 181406, 133978  | -26.14|
>  128 clients: count : 191143, 607594  | 217.87|
>  256 clients: count : 388696, 584568  |  50.39| 
>  512 clients: count : 2677, 218       | -91.86|
> 1024 clients: count : 22, 3           | -86.36|
> =============== DIE load_balance with CPU_NEWLY_IDLE ===============================
>    1 clients: count : 10, 15          |   50.00|
>    2 clients: count : 15, 56          |  273.33|
>    4 clients: count : 65, 149         |  129.23|
>    8 clients: count : 242, 412        |   70.25|
>   16 clients: count : 509, 1235       |  142.63|
>   32 clients: count : 909, 1371       |   50.83|
>   64 clients: count : 1288, 59596     | 4527.02| <===
>  128 clients: count : 666, 281426     |42156.16| <===
>  256 clients: count : 213, 1463       |  586.85|
>  512 clients: count : 28, 23          |  -17.86|
> 1024 clients: count : 10, 3           |  -70.00|
> ============== NUMA load_balance with CPU_NEWLY_IDLE ===============================
>    1 clients: count : 9, 9 
>    2 clients: count : 13, 14
>    4 clients: count : 21, 21
>    8 clients: count : 27, 29
>   16 clients: count : 29, 50         |   72.41|
>   32 clients: count : 29, 67         |  131.03|
>   64 clients: count : 28, 9138       |32535.71|  <===
>  128 clients: count : 25, 24234      |96836.00|  <===
>  256 clients: count : 12, 11
>  512 clients: count : 7, 3  
> 1024 clients: count : 4, 3 
> 
> 
> Further, I collected the find_busiest_group() count/s using
> funccount.py.
> 
> Notice that with 128 clients, most of the SHARED_RUNQ samples fall into
> a bucket that is > 2x the buckets which hold most of the
> NO_SHARED_RUNQ samples.
> 
> 128 clients: find_busiest_group() count/s
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> fbg count bucket       NO_SHARED_RUNQ   SHARED_RUNQ
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> [2000000 - 2500000) :     23
> [2500000 - 3000000) :     19               
> [3000000 - 3500000) :     19               1
> [3500000 - 4000000) :      3               3
> [7500000 - 8000000) :                      5
> [8000000 - 8500000) :                     54   <===
> 
> With 1024 clients, there is not a whole lot of difference in the
> find_busiest_group() distribution with and without the SHARED_RUNQ.
> 
> 1024 clients: find_busiest_group() count/s
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> fbg count bucket       NO_SHARED_RUNQ   SHARED_RUNQ
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> [  4000 -   5000) :      1
> [  7000 -   8000) :      2                  2
> [  8000 -   9000) :      1                  2
> [  9000 -  10000) :     57                 44  <===
> [ 10000 -  11000) :      3                 13
> [ 18000 -  19000) :      1                  1
> 
> 
> 
> ==================================================================
> Test          : stream (10  Runs)
> Units         : Normalized Bandwidth, MB/s 
> Interpretation: Higher is better 
> Statistic     : HMean 
> ==================================================================
> Test:     tip[pct imp](CV)            v3[pct imp](CV)      v3-tgfix[pct imp](CV)    v3-tgfix-no-opt[pct imp](CV)
>  Copy     1.00 [  0.00]( 0.53)     1.00 [  0.01]( 0.77)     1.00 [ -0.22]( 0.55)     1.00 [  0.12]( 0.71)
> Scale     0.95 [  0.00]( 0.23)     0.95 [  0.21]( 0.63)     0.95 [  0.13]( 0.22)     0.95 [  0.02]( 0.87)
>   Add     0.97 [  0.00]( 0.27)     0.98 [  0.40]( 0.59)     0.98 [  0.52]( 0.31)     0.98 [  0.16]( 0.85)
> Triad     0.98 [  0.00]( 0.28)     0.98 [  0.33]( 0.55)     0.98 [  0.34]( 0.29)     0.98 [  0.05]( 0.96)
> 
> 
> ==================================================================
> Test          : stream (100 Runs)
> Units         : Normalized Bandwidth, MB/s 
> Interpretation: Higher is better 
> Statistic     : HMean 
> ==================================================================
> Test:     tip[pct imp](CV)            v3[pct imp](CV)      v3-tgfix[pct imp](CV)    v3-tgfix-no-opt[pct imp](CV)
>  Copy     1.00 [  0.00]( 1.01)     1.00 [ -0.38]( 0.34)     1.00 [  0.08]( 1.19)     1.00 [ -0.18]( 0.38)
> Scale     0.95 [  0.00]( 0.46)     0.95 [ -0.39]( 0.52)     0.94 [ -0.72]( 0.34)     0.94 [ -0.66]( 0.40)
>   Add     0.98 [  0.00]( 0.16)     0.98 [ -0.40]( 0.53)     0.97 [ -0.80]( 0.26)     0.97 [ -0.79]( 0.34)
> Triad     0.98 [  0.00]( 0.14)     0.98 [ -0.35]( 0.54)     0.97 [ -0.79]( 0.17)     0.97 [ -0.79]( 0.28)
> 
> 
> ==================================================================
> Test          : netperf 
> Units         : Normalized Throughput per client
> Interpretation: Higher is better 
> Statistic     : AMean 
> ==================================================================
> Clients:        tip[pct imp](CV)            v3[pct imp](CV)      v3-tgfix[pct imp](CV)    v3-tgfix-no-opt[pct imp](CV)
>  1-clients      1.00 [  0.00]( 0.84)     0.99 [ -0.64]( 0.10)     0.97 [ -2.61]( 0.29)     0.98 [ -2.24]( 0.16)
>  2-clients      1.00 [  0.00]( 0.47)     0.99 [ -1.07]( 0.42)     0.98 [ -2.27]( 0.33)     0.97 [ -2.75]( 0.24)
>  4-clients      1.01 [  0.00]( 0.45)     0.99 [ -1.41]( 0.39)     0.98 [ -2.82]( 0.31)     0.97 [ -3.23]( 0.23)
>  8-clients      1.00 [  0.00]( 0.39)     0.99 [ -1.95]( 0.29)     0.98 [ -2.78]( 0.25)     0.97 [ -3.62]( 0.39)
> 16-clients      1.00 [  0.00]( 1.81)     0.97 [ -2.77]( 0.41)     0.97 [ -3.26]( 0.35)     0.96 [ -3.99]( 1.45)
> 32-clients      1.00 [  0.00]( 1.87)     0.39 [-60.63]( 1.29)     0.95 [ -4.68]( 1.45)     0.95 [ -4.89]( 1.41)
> 64-clients      0.98 [  0.00]( 2.70)     0.24 [-75.29]( 1.26)     0.66 [-33.23]( 0.99)     0.65 [-34.05]( 2.39)
> 128-clients     0.90 [  0.00]( 2.48)     0.14 [-84.47]( 3.63)     0.36 [-60.00]( 1.37)     0.36 [-60.36]( 1.54)
> 256-clients     0.67 [  0.00]( 2.91)     0.08 [-87.79]( 9.27)     0.54 [-20.38]( 3.69)     0.52 [-22.94]( 3.81)
> 512-clients     0.36 [  0.00]( 8.11)     0.51 [ 39.96]( 4.92)     0.38 [  5.12]( 6.24)     0.39 [  5.88]( 6.13)
> 
> 
> ==================================================================
> Test          : schbench throughput
> Units         : Normalized Requests per second 
> Interpretation: Higher is better 
> Statistic     : Median 
> ==================================================================
> #workers: tip[pct imp](CV)          v3[pct imp](CV)      v3-tgfix[pct imp](CV)    v3-tgfix-no-opt[pct imp](CV)
>   1      1.00 [  0.00]( 0.24)      1.01 [  0.93]( 0.00)      1.01 [  0.93]( 0.24)      1.00 [  0.47]( 0.24)
>   2      2.01 [  0.00]( 0.12)      2.03 [  0.93]( 0.00)      2.03 [  1.16]( 0.00)      2.01 [  0.00]( 0.12)
>   4      4.03 [  0.00]( 0.12)      4.06 [  0.70]( 0.00)      4.07 [  0.93]( 0.00)      4.02 [ -0.23]( 0.24)
>   8      8.05 [  0.00]( 0.00)      8.12 [  0.93]( 0.00)      8.14 [  1.16]( 0.00)      8.07 [  0.23]( 0.00)
>  16     16.17 [  0.00]( 0.12)     16.24 [  0.46]( 0.12)     16.28 [  0.69]( 0.00)     16.17 [  0.00]( 0.12)
>  32     32.34 [  0.00]( 0.12)     32.49 [  0.46]( 0.00)     32.56 [  0.69]( 0.00)     32.34 [  0.00]( 0.00)
>  64     64.52 [  0.00]( 0.12)     64.82 [  0.46]( 0.00)     64.97 [  0.70]( 0.00)     64.52 [  0.00]( 0.00)
> 128    127.25 [  0.00]( 1.48)    121.57 [ -4.47]( 0.38)    120.37 [ -5.41]( 0.13)    120.07 [ -5.64]( 0.34)
> 256    135.33 [  0.00]( 0.11)    136.52 [  0.88]( 0.11)    136.22 [  0.66]( 0.11)    136.52 [  0.88]( 0.11)
> 512    107.81 [  0.00]( 0.29)    109.91 [  1.94]( 0.92)    109.91 [  1.94]( 0.14)    109.91 [  1.94]( 0.14)
> 
> 
> ==================================================================
> Test          : schbench wakeup-latency 
> Units         : Normalized 99th percentile latency in us 
> Interpretation: Lower is better 
> Statistic     : Median 
> ==================================================================
> 
> #workers: tip[pct imp](CV)          v3[pct imp](CV)      v3-tgfix[pct imp](CV)    v3-tgfix-no-opt[pct imp](CV)
>   1       1.00 [ -0.00](14.08)       0.80 [ 20.00](11.92)       1.00 [ -0.00]( 9.68)       1.40 [-40.00](18.75)
>   2       1.20 [ -0.00]( 4.43)       1.10 [  8.33]( 4.84)       1.10 [  8.33]( 0.00)       1.10 [  8.33]( 4.56)
>   4       1.10 [ -0.00]( 0.00)       1.10 [ -0.00]( 4.56)       1.10 [ -0.00]( 0.00)       1.10 [ -0.00]( 0.00)
>   8       1.10 [ -0.00]( 0.00)       1.10 [ -0.00]( 4.56)       1.10 [ -0.00]( 0.00)       1.10 [ -0.00]( 0.00)
>  16       1.10 [ -0.00]( 4.84)       1.20 [ -9.09]( 0.00)       1.10 [ -0.00]( 0.00)       1.10 [ -0.00]( 0.00)
>  32       1.00 [ -0.00]( 0.00)       1.10 [-10.00]( 0.00)       1.10 [-10.00]( 0.00)       1.00 [ -0.00]( 0.00)
>  64       1.00 [ -0.00]( 5.34)       1.10 [-10.00]( 0.00)       1.10 [-10.00]( 0.00)       1.10 [-10.00]( 0.00)
> 128       1.20 [ -0.00]( 4.19)       2.10 [-75.00]( 2.50)       2.10 [-75.00]( 2.50)       2.10 [-75.00]( 0.00)
> 256       5.90 [ -0.00]( 0.00)      12.10 [-105.08](14.03)     11.10 [-88.14]( 4.53)      12.70 [-115.25]( 5.17)
> 512    2627.20 [ -0.00]( 1.21)    2288.00 [ 12.91]( 9.76)    2377.60 [  9.50]( 2.40)    2281.60 [ 13.15]( 0.77)
> 
> 
> ==================================================================
> Test          : schbench request-latency 
> Units         : Normalized 99th percentile latency in us 
> Interpretation: Lower is better 
> Statistic     : Median 
> ==================================================================
> #workers: tip[pct imp](CV)          v3[pct imp](CV)      v3-tgfix[pct imp](CV)    v3-tgfix-no-opt[pct imp](CV)
>   1     1.00 [ -0.00]( 0.35)     1.00 [  0.34]( 0.17)     0.99 [  0.67]( 0.30)     1.00 [ -0.34]( 0.00)
>   2     1.00 [ -0.00]( 0.17)     1.00 [  0.34]( 0.00)     0.99 [  1.01]( 0.00)     1.00 [ -0.34]( 0.17)
>   4     1.00 [ -0.00]( 0.00)     1.00 [  0.34]( 0.00)     0.99 [  1.01]( 0.00)     1.00 [ -0.00]( 0.17)
>   8     1.00 [ -0.00]( 0.17)     1.00 [  0.34]( 0.17)     0.99 [  1.34]( 0.18)     1.00 [  0.34]( 0.17)
>  16     1.00 [ -0.00]( 0.00)     1.00 [  0.67]( 0.17)     0.99 [  1.34]( 0.35)     1.00 [ -0.00]( 0.00)
>  32     1.00 [ -0.00]( 0.00)     1.00 [  0.67]( 0.00)     0.99 [  1.34]( 0.00)     1.00 [ -0.00]( 0.00)
>  64     1.00 [ -0.00]( 0.00)     1.00 [  0.34]( 0.17)     1.00 [  0.67]( 0.00)     1.00 [ -0.00]( 0.17)
> 128     1.82 [ -0.00]( 0.83)     1.85 [ -1.48]( 0.00)     1.85 [ -1.85]( 0.37)     1.85 [ -1.85]( 0.19)
> 256     1.94 [ -0.00]( 0.18)     1.96 [ -1.04]( 0.36)     1.95 [ -0.69]( 0.18)     1.95 [ -0.35]( 0.18)
> 512    13.27 [ -0.00]( 5.00)    16.32 [-23.00]( 8.33)    16.16 [-21.78]( 1.05)    15.46 [-16.51]( 0.89)

So as I said above, I definitely would like to better understand why
we're hammering load_balance() so hard in a few different contexts. I'll
try to repro the issue with tbench on a few different configurations. If
I'm able to, the next step would be for me to investigate my theory,
likely by doing something like measuring rq->avg_idle at wakeup time and
in newidle_balance() using bpftrace. If avg_idle is a lot higher for
waking cores, maybe my theory isn't too far fetched?

With all that said, it's been pretty clear from early on in the patch
set that there were going to be tradeoffs to enabling SHARED_RUNQ. It's
not surprising to me that there are some configurations that really
don't tolerate it well, and others that benefit from it a lot. Hackbench
and kernel compile seem to be two such examples; hackbench especially.
At Meta, we get really nice gains from it on a few of our biggest
services. So my hope is that we don't have to tweak every possible use
case in order for the patch set to be merged, as we've already done a
lot of due diligence relative to other sched features.

I would appreciate hearing what others think as well.

Thanks,
David

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 7/7] sched: Shard per-LLC shared runqueues
  2023-08-09 22:12 ` [PATCH v3 7/7] sched: Shard per-LLC shared runqueues David Vernet
  2023-08-09 23:46   ` kernel test robot
  2023-08-10  7:11   ` kernel test robot
@ 2023-08-30  6:17   ` Chen Yu
  2023-08-31  0:01     ` David Vernet
  2 siblings, 1 reply; 52+ messages in thread
From: Chen Yu @ 2023-08-30  6:17 UTC (permalink / raw)
  To: David Vernet
  Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	tj, roman.gushchin, gautham.shenoy, kprateek.nayak, aaron.lu,
	wuyun.abel, kernel-team, tim.c.chen

Hi David,

On 2023-08-09 at 17:12:18 -0500, David Vernet wrote:
> The SHARED_RUNQ scheduler feature creates a FIFO queue per LLC that
> tasks are put into on enqueue, and pulled from when a core in that LLC
> would otherwise go idle. For CPUs with large LLCs, this can sometimes
> cause significant contention, as illustrated in [0].
> 
> [0]: https://lore.kernel.org/all/c8419d9b-2b31-2190-3058-3625bdbcb13d@meta.com/
> 
> So as to try and mitigate this contention, we can instead shard the
> per-LLC runqueue into multiple per-LLC shards.
> 
> While this doesn't outright prevent all contention, it does somewhat mitigate it.
>
 
Thanks for this proposal to make idle load balance more efficient. As we
discussed previously, I launched hackbench on Intel Sapphire Rapids
and I have some findings.

This platform has 2 sockets, each socket has 56C/112T. To avoid the
run-to-run variance, only 1 socket is online, the cpufreq governor is set to
performance, the turbo is disabled, and C-states deeper than C1 are disabled.

hackbench
=========
case                    load            baseline(std%)  compare%( std%)
process-pipe            1-groups         1.00 (  1.09)   +0.55 (  0.20)
process-pipe            2-groups         1.00 (  0.60)   +3.57 (  0.28)
process-pipe            4-groups         1.00 (  0.30)   +5.22 (  0.26)
process-pipe            8-groups         1.00 (  0.10)  +43.96 (  0.26)
process-sockets         1-groups         1.00 (  0.18)   -1.56 (  0.34)
process-sockets         2-groups         1.00 (  1.06)  -12.37 (  0.11)
process-sockets         4-groups         1.00 (  0.29)   +0.21 (  0.19)
process-sockets         8-groups         1.00 (  0.06)   +3.59 (  0.39)

The 8 groups pipe mode has an impressive improvement, while the 2 groups sockets
mode did see some regressions.

The possible reason for the regression is at the end of this reply, in
case you want to see the conclusion directly : )

To investigate the regression, I did a slight hack on hackbench, renaming
the workloads to sender and receiver.

When it is in 2 groups mode, there are 2 groups of senders and receivers.
Each group has 14 senders and 14 receivers, so there are 56 tasks in total running
on 112 CPUs. In each group, sender_i sends packets to receiver_j (i, j belong to [1,14]).


1. Firstly use 'top' to monitor the CPU utilization:

   When shared_runqueue is disabled, many CPUs are 100%, while the other
   CPUs remain 0%.
   When shared_runqueue is enabled, most CPUs are busy and the utilization is
   in the 40%~60% range.

   This means that shared_runqueue works as expected.

2. Then the bpf wakeup latency is monitored:

tracepoint:sched:sched_wakeup,
tracepoint:sched:sched_wakeup_new
{
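        // Record the timestamp at which a sender/receiver task is woken up.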
        if (args->comm == "sender") {
                @qstime[args->pid] = nsecs;
        }
        if (args->comm == "receiver") {
                @qrtime[args->pid] = nsecs;
        }
}

tracepoint:sched:sched_switch
{
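        // At the context switch into the wakee, report the wakeup-to-run
        // latency (in us) since the timestamp recorded above.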
        if (args->next_comm == "sender") {
                $ns = @qstime[args->next_pid];
                if ($ns) {
                        @sender_wakeup_lat = hist((nsecs - $ns) / 1000);
                        delete(@qstime[args->next_pid]);
                }
        }
        if (args->next_comm == "receiver") {
                $ns = @qrtime[args->next_pid];
                if ($ns) {
                        @receiver_wakeup_lat = hist((nsecs - $ns) / 1000);
                        delete(@qrtime[args->next_pid]);
                }
        }
}


It shows that the wakeup latency of the receiver has increased a little
bit. But considering that this symptom is the same when hackbench is in pipe mode,
and there is no regression in pipe mode, the wakeup latency overhead might not be
the cause of the regression.

3. Then FlameGraph is used to compare the bottlenecks.
There is still no obvious difference noticed. One obvious bottleneck is the atomic
write to a memory cgroup page count (and runqueue lock contention is not observed).
The backtrace:
The backtrace:

obj_cgroup_charge_pages;obj_cgroup_charge;__kmem_cache_alloc_node;
__kmalloc_node_track_caller;kmalloc_reserve;__alloc_skb;
alloc_skb_with_frags;sock_alloc_send_pskb;unix_stream_sendmsg

However there is no obvious ratio difference between with/without shared runqueue
enabled. So this one might not be the cause.

4. Check the wakeup task migration count

Borrow the script from Aaron:
kretfunc:select_task_rq_fair
{
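        // Compare the CPU returned by select_task_rq_fair() (retval) with the
        // task's previous CPU to count wakeup migrations vs. wakeups on the
        // previous CPU.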
        $p = (struct task_struct *)args->p;
        if ($p->comm == "sender") {
                if ($p->thread_info.cpu != retval) {
                        @wakeup_migrate_sender = count();
                } else {
                        @wakeup_prev_sender = count();
                }
        }
        if ($p->comm == "receiver") {
                if ($p->thread_info.cpu != retval) {
                        @wakeup_migrate_receiver = count();
                } else {
                        @wakeup_prev_receiver = count();
                }
        }
}

Without shared_runqueue enabled, the wakee tasks are mostly woken up on their
previous running CPUs.
With shared_runqueue enabled, the wakee tasks are mostly woken up on
completely different idle CPUs.

This reminds me: is it possible the regression was caused by broken
cache locality?


5. Check the L2 cache miss rate.
perf stat -e l2_rqsts.references,l2_request.miss sleep 10
The results show that the L2 cache miss rate is nearly the same with/without
shared_runqueue enabled.

I did not check the L3 miss rate, because:
   5.1 there is only 1 socket of CPUs online
   5.2 the working set of hackbench is 56 * 100 * 300000, which is nearly
       the same as the LLC cache size.

6. As mentioned in step 3, the bottleneck is an atomic write to a global
   variable. Then perf c2c is used to check if there is any false/true sharing.

   According to the result, the total number and average cycles of local HITM
   are low. So this might indicate that this is not a false sharing or true
   sharing issue.


7. Then use perf topdown to dig into the detail. The methodology is at
   https://perf.wiki.kernel.org/index.php/Top-Down_Analysis


   When shared runqueue is disabled:

    #     65.2 %  tma_backend_bound
    #      2.9 %  tma_bad_speculation
    #     13.1 %  tma_frontend_bound
    #     18.7 %  tma_retiring



   When shared runqueue is enabled:

    #     52.4 %  tma_backend_bound
    #      3.3 %  tma_bad_speculation
    #     20.5 %  tma_frontend_bound
    #     23.8 %  tma_retiring
    

We can see that the ratio of frontend_bound has increased from 13.1% to
20.5%. As a comparison, this ratio does not increase when hackbench
is in pipe mode.

Then dig further into the deeper levels of frontend_bound:

When shared runqueue is disabled:
#      6.9 %  tma_fetch_latency   ---->  #      7.3 %  tma_ms_switches
                                  |
                                  ---->  #      7.1 %  tma_dsb_switches


When shared runqueue is enabled:
#     11.6 %  tma_fetch_latency   ----> #      6.7 %  tma_ms_switches
                                  |
                                  ----> #      7.8 %  tma_dsb_switches


1. The DSB (Decode Stream Buffer) switch count increases
   from 13.1% * 6.9% * 7.1% to 20.5% * 11.6% * 7.8%

2. The MS (Microcode Sequencer) switch count increases
   from 13.1% * 6.9% * 7.3% to 20.5% * 11.6% * 6.7%
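
   Multiplying those levels out, the estimated fraction of slots spent on DSB
   switches grows from roughly 0.06% to roughly 0.19% (about 2.9x), and the MS
   switch fraction from roughly 0.07% to roughly 0.16% (about 2.4x).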

The DSB holds the cached decoded uops, which is similar to the L1 icache,
except that the icache holds the original instructions while the DSB holds the
decoded ones. The DSB reflects the instruction footprint. The increase
in DSB switches means that the cached buffer has been thrashed a lot.

The MS is used to decode complex instructions; an increase in the MS switch
counter usually means that the workload is running complex instructions.

In summary:

So the scenario I'm thinking of that causes this issue is:
task migration increases the DSB switch count. With shared_runqueue enabled,
the task could be migrated to different CPUs more often, and it has to fill its
new uops into the DSB, but that DSB has already been filled by the old task's
uops. So DSB switches are triggered to decode the new macro ops. This is usually
not a problem if the workload runs simple instructions. However, if
the workload's instruction footprint increases, task migration might break
the DSB uops locality, which is similar to L1/L2 cache locality.

I wonder if SHARED_RUNQ could consider that: if a task is a long-duration one,
say, p->avg_runtime >= sysctl_migration_cost, maybe we should not put such a task
on the per-LLC shared runqueue? In this way it will not be migrated too often,
so as to keep its locality (both in terms of L1/L2 cache and DSB).
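
Something like the following rough, untested sketch is what I have in mind, on
top of shared_runq_enqueue_task() from patch 6/7. p->avg_runtime is hypothetical
here (any metric tracking the task's average runtime per run would do), and
sysctl_sched_migration_cost is the existing migration cost knob:

static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p)
{
	/*
	 * Only enqueue the task in the shared runqueue if:
	 *
	 * - SHARED_RUNQ is enabled
	 * - The task isn't pinned to a specific CPU
	 * - The task is not a long-duration task (hypothetical
	 *   p->avg_runtime), so that it keeps its L1/L2 cache and DSB
	 *   locality instead of being migrated too often.
	 */
	if (p->nr_cpus_allowed == 1)
		return;

	if (p->avg_runtime >= sysctl_sched_migration_cost)
		return;

	shared_runq_push_task(rq, p);
}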

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 6/7] sched: Implement shared runqueue in CFS
  2023-08-09 22:12 ` [PATCH v3 6/7] sched: Implement shared runqueue in CFS David Vernet
  2023-08-10  7:11   ` kernel test robot
  2023-08-10  7:41   ` kernel test robot
@ 2023-08-30  6:46   ` K Prateek Nayak
  2023-08-31  1:34     ` David Vernet
  2 siblings, 1 reply; 52+ messages in thread
From: K Prateek Nayak @ 2023-08-30  6:46 UTC (permalink / raw)
  To: David Vernet, linux-kernel
  Cc: peterz, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot, vschneid, tj, roman.gushchin,
	gautham.shenoy, aaron.lu, wuyun.abel, kernel-team

Hello David,

On 8/10/2023 3:42 AM, David Vernet wrote:
> [..snip..]
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 2aab7be46f7e..8238069fd852 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -769,6 +769,8 @@ struct task_struct {
>  	unsigned long			wakee_flip_decay_ts;
>  	struct task_struct		*last_wakee;
>  
> +	struct list_head		shared_runq_node;
> +
>  	/*
>  	 * recent_used_cpu is initially set as the last CPU used by a task
>  	 * that wakes affine another task. Waker/wakee relationships can
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 385c565da87f..fb7e71d3dc0a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4529,6 +4529,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
>  #ifdef CONFIG_SMP
>  	p->wake_entry.u_flags = CSD_TYPE_TTWU;
>  	p->migration_pending = NULL;
> +	INIT_LIST_HEAD(&p->shared_runq_node);
>  #endif
>  	init_sched_mm_cid(p);
>  }
> @@ -9764,6 +9765,18 @@ int sched_cpu_deactivate(unsigned int cpu)
>  	return 0;
>  }
>  
> +void sched_update_domains(void)
> +{
> +	const struct sched_class *class;
> +
> +	update_sched_domain_debugfs();
> +
> +	for_each_class(class) {
> +		if (class->update_domains)
> +			class->update_domains();
> +	}
> +}
> +
>  static void sched_rq_cpu_starting(unsigned int cpu)
>  {
>  	struct rq *rq = cpu_rq(cpu);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9c23e3b948fc..6e740f8da578 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -139,20 +139,235 @@ static int __init setup_sched_thermal_decay_shift(char *str)
>  }
>  __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
>  
> +/**
> + * struct shared_runq - Per-LLC queue structure for enqueuing and migrating
> + * runnable tasks within an LLC.
> + *
> + * WHAT
> + * ====
> + *
> + * This structure enables the scheduler to be more aggressively work
> + * conserving, by placing waking tasks on a per-LLC FIFO queue that can then be
> + * pulled from when another core in the LLC is going to go idle.
> + *
> + * struct rq stores a pointer to its LLC's shared_runq via struct cfs_rq.
> + * Waking tasks are enqueued in the calling CPU's struct shared_runq in
> + * __enqueue_entity(), and are opportunistically pulled from the shared_runq
> + * in newidle_balance(). Tasks enqueued in a shared_runq may be scheduled prior
> + * to being pulled from the shared_runq, in which case they're simply dequeued
> + * from the shared_runq in __dequeue_entity().
> + *
> + * There is currently no task-stealing between shared_runqs in different LLCs,
> + * which means that shared_runq is not fully work conserving. This could be
> + * added at a later time, with tasks likely only being stolen across
> + * shared_runqs on the same NUMA node to avoid violating NUMA affinities.
> + *
> + * HOW
> + * ===
> + *
> + * A shared_runq is comprised of a list, and a spinlock for synchronization.
> + * Given that the critical section for a shared_runq is typically a fast list
> + * operation, and that the shared_runq is localized to a single LLC, the
> + * spinlock will typically only be contended on workloads that do little else
> + * other than hammer the runqueue.
> + *
> + * WHY
> + * ===
> + *
> + * As mentioned above, the main benefit of shared_runq is that it enables more
> + * aggressive work conservation in the scheduler. This can benefit workloads
> + * that benefit more from CPU utilization than from L1/L2 cache locality.
> + *
> + * shared_runqs are segmented across LLCs both to avoid contention on the
> + * shared_runq spinlock by minimizing the number of CPUs that could contend on
> + * it, as well as to strike a balance between work conservation, and L3 cache
> + * locality.
> + */
> +struct shared_runq {
> +	struct list_head list;
> +	raw_spinlock_t lock;
> +} ____cacheline_aligned;
> +
>  #ifdef CONFIG_SMP
> +
> +static DEFINE_PER_CPU(struct shared_runq, shared_runqs);
> +
> +static struct shared_runq *rq_shared_runq(struct rq *rq)
> +{
> +	return rq->cfs.shared_runq;
> +}
> +
> +static void shared_runq_reassign_domains(void)
> +{
> +	int i;
> +	struct shared_runq *shared_runq;
> +	struct rq *rq;
> +	struct rq_flags rf;
> +
> +	for_each_possible_cpu(i) {
> +		rq = cpu_rq(i);
> +		shared_runq = &per_cpu(shared_runqs, per_cpu(sd_llc_id, i));
> +
> +		rq_lock(rq, &rf);
> +		rq->cfs.shared_runq = shared_runq;
> +		rq_unlock(rq, &rf);
> +	}
> +}
> +
> +static void __shared_runq_drain(struct shared_runq *shared_runq)
> +{
> +	struct task_struct *p, *tmp;
> +
> +	raw_spin_lock(&shared_runq->lock);
> +	list_for_each_entry_safe(p, tmp, &shared_runq->list, shared_runq_node)
> +		list_del_init(&p->shared_runq_node);
> +	raw_spin_unlock(&shared_runq->lock);
> +}
> +
> +static void update_domains_fair(void)
> +{
> +	int i;
> +	struct shared_runq *shared_runq;
> +
> +	/* Avoid racing with SHARED_RUNQ enable / disable. */
> +	lockdep_assert_cpus_held();
> +
> +	shared_runq_reassign_domains();
> +
> +	/* Ensure every core sees its updated shared_runq pointers. */
> +	synchronize_rcu();
> +
> +	/*
> +	 * Drain all tasks from all shared_runq's to ensure there are no stale
> +	 * tasks in any prior domain runq. This can cause us to drain live
> +	 * tasks that would otherwise have been safe to schedule, but this
> +	 * isn't a practical problem given how infrequently domains are
> +	 * rebuilt.
> +	 */
> +	for_each_possible_cpu(i) {
> +		shared_runq = &per_cpu(shared_runqs, i);
> +		__shared_runq_drain(shared_runq);
> +	}
> +}
> +
>  void shared_runq_toggle(bool enabling)
> -{}
> +{
> +	int cpu;
> +
> +	if (enabling)
> +		return;
> +
> +	/* Avoid racing with hotplug. */
> +	lockdep_assert_cpus_held();
> +
> +	/* Ensure all cores have stopped enqueueing / dequeuing tasks. */
> +	synchronize_rcu();
> +
> +	for_each_possible_cpu(cpu) {
> +		int sd_id;
> +
> +		sd_id = per_cpu(sd_llc_id, cpu);
> +		if (cpu == sd_id)
> +			__shared_runq_drain(rq_shared_runq(cpu_rq(cpu)));
> +	}
> +}
> +
> +static struct task_struct *shared_runq_pop_task(struct rq *rq)
> +{
> +	struct task_struct *p;
> +	struct shared_runq *shared_runq;
> +
> +	shared_runq = rq_shared_runq(rq);
> +	if (list_empty(&shared_runq->list))
> +		return NULL;
> +
> +	raw_spin_lock(&shared_runq->lock);
> +	p = list_first_entry_or_null(&shared_runq->list, struct task_struct,
> +				     shared_runq_node);
> +	if (p && is_cpu_allowed(p, cpu_of(rq)))
> +		list_del_init(&p->shared_runq_node);

I wonder if we should remove the task from the list if
"is_cpu_allowed()" return false.

Consider the following scenario: a task that does not sleep is pinned
to a single CPU. Since it is now at the head of the list and cannot be
moved, we leave it there, but since the task also never sleeps, it'll
stay there, thus preventing the queue from doing its job.

Further implication ...  

> +	else
> +		p = NULL;
> +	raw_spin_unlock(&shared_runq->lock);
> +
> +	return p;
> +}
> +
> +static void shared_runq_push_task(struct rq *rq, struct task_struct *p)
> +{
> +	struct shared_runq *shared_runq;
> +
> +	shared_runq = rq_shared_runq(rq);
> +	raw_spin_lock(&shared_runq->lock);
> +	list_add_tail(&p->shared_runq_node, &shared_runq->list);
> +	raw_spin_unlock(&shared_runq->lock);
> +}
>  
>  static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p)
> -{}
> +{
> +	/*
> +	 * Only enqueue the task in the shared runqueue if:
> +	 *
> +	 * - SHARED_RUNQ is enabled
> +	 * - The task isn't pinned to a specific CPU
> +	 */
> +	if (p->nr_cpus_allowed == 1)
> +		return;
> +
> +	shared_runq_push_task(rq, p);
> +}
>  
>  static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
>  {
> -	return 0;
> +	struct task_struct *p = NULL;
> +	struct rq *src_rq;
> +	struct rq_flags src_rf;
> +	int ret = -1;
> +
> +	p = shared_runq_pop_task(rq);
> +	if (!p)
> +		return 0;

...

Since we return 0 here in such a scenario, we'll take the old
newidle_balance() path but ...

> +
> +	rq_unpin_lock(rq, rf);
> +	raw_spin_rq_unlock(rq);
> +
> +	src_rq = task_rq_lock(p, &src_rf);
> +
> +	if (task_on_rq_queued(p) && !task_on_cpu(src_rq, p)) {
> +		update_rq_clock(src_rq);
> +		src_rq = move_queued_task(src_rq, &src_rf, p, cpu_of(rq));
> +		ret = 1;
> +	}
> +
> +	if (src_rq != rq) {
> +		task_rq_unlock(src_rq, p, &src_rf);
> +		raw_spin_rq_lock(rq);
> +	} else {
> +		rq_unpin_lock(rq, &src_rf);
> +		raw_spin_unlock_irqrestore(&p->pi_lock, src_rf.flags);
> +	}
> +	rq_repin_lock(rq, rf);
> +
> +	return ret;
>  }
>  
>  static void shared_runq_dequeue_task(struct task_struct *p)
> -{}
> +{
> +	struct shared_runq *shared_runq;
> +
> +	if (!list_empty(&p->shared_runq_node)) {
> +		shared_runq = rq_shared_runq(task_rq(p));
> +		raw_spin_lock(&shared_runq->lock);
> +		/*
> +		 * Need to double-check for the list being empty to avoid
> +		 * racing with the list being drained on the domain recreation
> +		 * or SHARED_RUNQ feature enable / disable path.
> +		 */
> +		if (likely(!list_empty(&p->shared_runq_node)))
> +			list_del_init(&p->shared_runq_node);
> +		raw_spin_unlock(&shared_runq->lock);
> +	}
> +}
>  
>  /*
>   * For asym packing, by default the lower numbered CPU has higher priority.
> @@ -12093,6 +12308,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  	rcu_read_lock();
>  	sd = rcu_dereference_check_sched_domain(this_rq->sd);
>  
> +	/*
> +	 * Skip <= LLC domains as they likely won't have any tasks if the
> +	 * shared runq is empty.
> +	 */

... now we skip all the way past the MC domain, overlooking any
imbalance that might still exist within the SMT and MC groups,
since the shared runq is not exactly empty.

Let me know if I've got something wrong!

> +	if (sched_feat(SHARED_RUNQ)) {
> +		sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> +		if (likely(sd))
> +			sd = sd->parent;
> +	}

Speaking of skipping past the MC domain, I don't think this actually
works, since the domain traversal uses the "for_each_domain" macro,
which is defined as:

#define for_each_domain(cpu, __sd) \
	for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); \
			__sd; __sd = __sd->parent)

The traversal starts from rq->sd overwriting your initialized value
here. This is why we see "load_balance count on cpu newly idle" in
Gautham's first report
(https://lore.kernel.org/lkml/ZN3dW5Gvcb0LFWjs@BLR-5CG11610CF.amd.com/)
to be non-zero.

One way to do this would be as follows:

static int newidle_balance() {

	...
	for_each_domain(this_cpu, sd) {

		...
		/* Skip balancing until the LLC domain */
		if (sched_feat(SHARED_RUNQ) &&
		    (sd->flags & SD_SHARE_PKG_RESOURCES))
			continue;

		...
	}
	...
}

With this I see the newidle balance count for SMT and MC domain
to be zero:

< ----------------------------------------  Category:  newidle (SMT)  ---------------------------------------- >
load_balance cnt on cpu newly idle                         :          0    $      0.000 $    [    0.00000 ]
--
< ----------------------------------------  Category:  newidle (MC)   ---------------------------------------- >
load_balance cnt on cpu newly idle                         :          0    $      0.000 $    [    0.00000 ]
--
< ----------------------------------------  Category:  newidle (DIE)  ---------------------------------------- >
load_balance cnt on cpu newly idle                         :       2170    $      9.319 $    [   17.42832 ]
--
< ----------------------------------------  Category:  newidle (NUMA) ---------------------------------------- >
load_balance cnt on cpu newly idle                         :         30    $    674.067 $    [    0.24094 ]
--

Let me know if I'm missing something here :)

Note: The lb counts for DIE and NUMA are down since I'm currently
experimenting with the implementation. I'll update the thread with any
new findings.

> +
>  	if (!READ_ONCE(this_rq->rd->overload) ||
>  	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>  
> [..snip..]
 
--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 0/7] sched: Implement shared runqueue in CFS
  2023-08-24 22:51         ` David Vernet
@ 2023-08-30  9:56           ` K Prateek Nayak
  2023-08-31  2:32             ` David Vernet
  2023-08-31 10:45             ` [RFC PATCH 0/3] DO NOT MERGE: Breaking down the experimantal diff K Prateek Nayak
  0 siblings, 2 replies; 52+ messages in thread
From: K Prateek Nayak @ 2023-08-30  9:56 UTC (permalink / raw)
  To: David Vernet, Gautham R. Shenoy
  Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	tj, roman.gushchin, aaron.lu, wuyun.abel, kernel-team,
	kprateek.nayak

Hello David,

Short update based on some of my experimentation.

Disclaimer: I've only been looking at the tbench 128-client case on a dual
socket 3rd Generation EPYC system (2x 64C/128T). Wider results may
vary, but I have some information that may help with the debugging and
with proceeding further.

On 8/25/2023 4:21 AM, David Vernet wrote:
> On Thu, Aug 24, 2023 at 04:44:19PM +0530, Gautham R. Shenoy wrote:
>> Hello David,
>>
>> On Fri, Aug 18, 2023 at 02:19:03PM +0530, Gautham R. Shenoy wrote:
>>> Hello David,
>>>
>>> On Fri, Aug 18, 2023 at 12:03:55AM -0500, David Vernet wrote:
>>>> On Thu, Aug 17, 2023 at 02:12:03PM +0530, Gautham R. Shenoy wrote:
>>>>> Hello David,
>>>>
>>>> Hello Gautham,
>>>>
>>>> Thanks a lot as always for running some benchmarks and analyzing these
>>>> changes.
>>>>
>>>>> On Wed, Aug 09, 2023 at 05:12:11PM -0500, David Vernet wrote:
>>>>>> Changes
>>>>>> -------
>>>>>>
>>>>>> This is v3 of the shared runqueue patchset. This patch set is based off
>>>>>> of commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs
>>>>>> bandwidth in use") on the sched/core branch of tip.git.
>>>>>
>>>>>
>>>>> I tested the patches on Zen3 and Zen4 EPYC Servers like last time. I
>>>>> notice that apart from hackbench, every other bechmark is showing
>>>>> regressions with this patch series. Quick summary of my observations:
>>>>
>>>> Just to verify per our prior conversation [0], was this latest set of
>>>> benchmarks run with boost disabled?
>>>
>>> Boost is enabled by default. I will queue a run tonight with boost
>>> disabled.
>>
>> Apologies for the delay. I didn't see any changes with boost-disabled
>> and with reverting the optimization to bail out of the
>> newidle_balance() for SMT and MC domains when there was no task to be
>> pulled from the shared-runq. I reran the whole thing once again, just
>> to rule out any possible variance. The results came out the same.
> 
> Thanks a lot for taking the time to run more benchmarks.
> 
>> With the boost disabled, and the optimization reverted, the results
>> don't change much.
> 
> Hmmm, I see. So, that was the only real substantive "change" between v2
> -> v3. The other changes were supporting hotplug / domain recreation,
> optimizing locking a bit, and fixing small bugs like the return value
> from shared_runq_pick_next_task(), draining the queue when the feature
> is disabled, and fixing the lkp errors.
> 
> With all that said, it seems very possible that the regression is due to
> changes in sched/core between commit ebb83d84e49b ("sched/core: Avoid
> multiple calling update_rq_clock() in __cfsb_csd_unthrottle()") in v2,
> and commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs
> bandwidth in use") in v3. EEVDF was merged in that window, so that could
> be one explanation for the context switch rate being so much higher.
> 
>> It doesn't appear that the optimization is the cause for increase in
>> the number of load-balancing attempts at the DIE and the NUMA
>> domains. I have shared the counts of the newidle_balance with and
>> without SHARED_RUNQ below for tbench and it can be noticed that the
>> counts are significantly higher for the 64 clients and 128 clients. I
>> also captured the counts/s of find_busiest_group() using funccount.py
>> which tells the same story. So the drop in the performance for tbench
>> with your patches strongly correlates with the increase in
>> load-balancing attempts.
>>
>> newidle balance is undertaken only if the overload flag is set and the
>> expected idle duration is greater than the avg load balancing cost. It
>> is hard to imagine why the shared runq should cause the overload flag
>> to be set!
> 
> Yeah, I'm not sure either about how or why shared_runq would cause this.
> This is purely hypothetical, but is it possible that shared_runq causes
> idle cores to on average _stay_ idle longer due to other cores pulling
> tasks that would have otherwise been load balanced to those cores?
> 
> Meaning -- say CPU0 is idle, and there are tasks on other rqs which
> could be load balanced. Without shared_runq, CPU0 might be woken up to
> run a task from a periodic load balance. With shared_runq, any active
> core that would otherwise have gone idle could pull the task, keeping
> CPU0 idle.
> 
> What do you think? I could be totally off here.
> 
> From my perspective, I'm not too worried about this given that we're
> seeing gains in other areas such as kernel compile as I showed in [0],
> though I definitely would like to better understand it.

Let me paste a cumulative diff containing everything I've tried since
it'll be easy to explain.

o Performance numbers for tbench 128 clients:

tip			: 1.00 (Var: 0.57%)
tip + vanilla v3	: 0.39 (var: 1.15%) (%diff: -60.74%)
tip + v3 + diff		: 0.99 (var: 0.61%) (%diff: -00.24%)

tip is at commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when
cfs bandwidth in use"), same as what Gautham used, so no EEVDF yet.

o Cumulative Diff

Should apply cleanly on top of tip at above commit + this series as is.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d67d86d3bfdf..f1e64412fd48 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -198,7 +198,7 @@ struct shared_runq_shard {
 } ____cacheline_aligned;
 
 /* This would likely work better as a configurable knob via debugfs */
-#define SHARED_RUNQ_SHARD_SZ 6
+#define SHARED_RUNQ_SHARD_SZ 16
 #define SHARED_RUNQ_MAX_SHARDS \
 	((NR_CPUS / SHARED_RUNQ_SHARD_SZ) + (NR_CPUS % SHARED_RUNQ_SHARD_SZ != 0))
 
@@ -322,20 +322,36 @@ void shared_runq_toggle(bool enabling)
 }
 
 static struct task_struct *
-shared_runq_pop_task(struct shared_runq_shard *shard, int target)
+shared_runq_pop_task(struct shared_runq_shard *shard, struct rq *rq)
 {
+	int target = cpu_of(rq);
 	struct task_struct *p;
 
 	if (list_empty(&shard->list))
 		return NULL;
 
 	raw_spin_lock(&shard->lock);
+again:
 	p = list_first_entry_or_null(&shard->list, struct task_struct,
 				     shared_runq_node);
-	if (p && is_cpu_allowed(p, target))
+
+	/* If we find a task, delete it from the list regardless */
+	if (p) {
 		list_del_init(&p->shared_runq_node);
-	else
-		p = NULL;
+
+		if (!task_on_rq_queued(p) ||
+		    task_on_cpu(task_rq(p), p) ||
+		    !is_cpu_allowed(p, target)) {
+			if (rq->ttwu_pending) {
+				p = NULL;
+				goto out;
+			}
+
+			goto again;
+		}
+	}
+
+out:
 	raw_spin_unlock(&shard->lock);
 
 	return p;
@@ -380,9 +396,12 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
 		curr_idx = (starting_idx + i) % num_shards;
 		shard = &shared_runq->shards[curr_idx];
 
-		p = shared_runq_pop_task(shard, cpu_of(rq));
+		p = shared_runq_pop_task(shard, rq);
 		if (p)
 			break;
+
+		if (rq->ttwu_pending)
+			return 0;
 	}
 	if (!p)
 		return 0;
@@ -395,17 +414,16 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
 	if (task_on_rq_queued(p) && !task_on_cpu(src_rq, p)) {
 		update_rq_clock(src_rq);
 		src_rq = move_queued_task(src_rq, &src_rf, p, cpu_of(rq));
-		ret = 1;
 	}
 
 	if (src_rq != rq) {
 		task_rq_unlock(src_rq, p, &src_rf);
 		raw_spin_rq_lock(rq);
 	} else {
+		ret = 1;
 		rq_unpin_lock(rq, &src_rf);
 		raw_spin_unlock_irqrestore(&p->pi_lock, src_rf.flags);
 	}
-	rq_repin_lock(rq, rf);
 
 	return ret;
 }
@@ -12344,50 +12362,59 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	if (!cpu_active(this_cpu))
 		return 0;
 
-	if (sched_feat(SHARED_RUNQ)) {
-		pulled_task = shared_runq_pick_next_task(this_rq, rf);
-		if (pulled_task)
-			return pulled_task;
-	}
-
 	/*
 	 * We must set idle_stamp _before_ calling idle_balance(), such that we
 	 * measure the duration of idle_balance() as idle time.
 	 */
 	this_rq->idle_stamp = rq_clock(this_rq);
 
-	/*
-	 * This is OK, because current is on_cpu, which avoids it being picked
-	 * for load-balance and preemption/IRQs are still disabled avoiding
-	 * further scheduler activity on it and we're being very careful to
-	 * re-start the picking loop.
-	 */
-	rq_unpin_lock(this_rq, rf);
-
 	rcu_read_lock();
-	sd = rcu_dereference_check_sched_domain(this_rq->sd);
-
-	/*
-	 * Skip <= LLC domains as they likely won't have any tasks if the
-	 * shared runq is empty.
-	 */
-	if (sched_feat(SHARED_RUNQ)) {
+	if (sched_feat(SHARED_RUNQ))
 		sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
-		if (likely(sd))
-			sd = sd->parent;
-	}
+	else
+		sd = rcu_dereference_check_sched_domain(this_rq->sd);
 
 	if (!READ_ONCE(this_rq->rd->overload) ||
 	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
 
-		if (sd)
+		while (sd) {
 			update_next_balance(sd, &next_balance);
+			sd = sd->child;
+		}
+
 		rcu_read_unlock();
 
 		goto out;
 	}
 	rcu_read_unlock();
 
+	t0 = sched_clock_cpu(this_cpu);
+	if (sched_feat(SHARED_RUNQ)) {
+		pulled_task = shared_runq_pick_next_task(this_rq, rf);
+		if (pulled_task) {
+			curr_cost = sched_clock_cpu(this_cpu) - t0;
+			update_newidle_cost(sd, curr_cost);
+			goto out_swq;
+		}
+	}
+
+	/* Check again for pending wakeups */
+	if (this_rq->ttwu_pending)
+		return 0;
+
+	t1 = sched_clock_cpu(this_cpu);
+	curr_cost += t1 - t0;
+
+	if (sd)
+		update_newidle_cost(sd, curr_cost);
+
+	/*
+	 * This is OK, because current is on_cpu, which avoids it being picked
+	 * for load-balance and preemption/IRQs are still disabled avoiding
+	 * further scheduler activity on it and we're being very careful to
+	 * re-start the picking loop.
+	 */
+	rq_unpin_lock(this_rq, rf);
 	raw_spin_rq_unlock(this_rq);
 
 	t0 = sched_clock_cpu(this_cpu);
@@ -12400,6 +12427,13 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 
 		update_next_balance(sd, &next_balance);
 
+		/*
+		 * Skip <= LLC domains as they likely won't have any tasks if the
+		 * shared runq is empty.
+		 */
+		if (sched_feat(SHARED_RUNQ) && (sd->flags & SD_SHARE_PKG_RESOURCES))
+			continue;
+
 		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
 			break;
 
@@ -12429,6 +12463,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 
 	raw_spin_rq_lock(this_rq);
 
+out_swq:
 	if (curr_cost > this_rq->max_idle_balance_cost)
 		this_rq->max_idle_balance_cost = curr_cost;
 
--

o Breakdown

I'll proceed to annotate a copy of the diff with the reasoning behind the changes:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d67d86d3bfdf..f1e64412fd48 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -198,7 +198,7 @@ struct shared_runq_shard {
 } ____cacheline_aligned;
 
 /* This would likely work better as a configurable knob via debugfs */
-#define SHARED_RUNQ_SHARD_SZ 6
+#define SHARED_RUNQ_SHARD_SZ 16
 #define SHARED_RUNQ_MAX_SHARDS \
 	((NR_CPUS / SHARED_RUNQ_SHARD_SZ) + (NR_CPUS % SHARED_RUNQ_SHARD_SZ != 0))

--

	Here I'm setting the SHARED_RUNQ_SHARD_SZ to sd_llc_size for
	my machine. I played around with this and did not see any
	contention on the shared_rq lock while running tbench.

--

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d67d86d3bfdf..f1e64412fd48 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -322,20 +322,36 @@ void shared_runq_toggle(bool enabling)
 }
 
 static struct task_struct *
-shared_runq_pop_task(struct shared_runq_shard *shard, int target)
+shared_runq_pop_task(struct shared_runq_shard *shard, struct rq *rq)
 {
+	int target = cpu_of(rq);
 	struct task_struct *p;
 
 	if (list_empty(&shard->list))
 		return NULL;
 
 	raw_spin_lock(&shard->lock);
+again:
 	p = list_first_entry_or_null(&shard->list, struct task_struct,
 				     shared_runq_node);
-	if (p && is_cpu_allowed(p, target))
+
+	/* If we find a task, delete it from the list regardless */
+	if (p) {
 		list_del_init(&p->shared_runq_node);
-	else
-		p = NULL;
+
+		if (!task_on_rq_queued(p) ||
+		    task_on_cpu(task_rq(p), p) ||
+		    !is_cpu_allowed(p, target)) {
+			if (rq->ttwu_pending) {
+				p = NULL;
+				goto out;
+			}
+
+			goto again;
+		}
+	}
+
+out:
 	raw_spin_unlock(&shard->lock);
 
 	return p;
--

	Context: When running perf with IBS, I saw the following lock
	contention:

-   12.17%  swapper          [kernel.vmlinux]          [k] native_queued_spin_lock_slowpath
   - 10.48% native_queued_spin_lock_slowpath
      - 10.30% _raw_spin_lock
         - 9.11% __schedule
              schedule_idle
              do_idle
            + cpu_startup_entry
         - 0.86% task_rq_lock
              newidle_balance
              pick_next_task_fair
              __schedule
              schedule_idle
              do_idle
            + cpu_startup_entry

	So I imagined that newidle_balance() is contending with another
	runqueue going idle when pulling a task. Hence, I moved some
	checks from shared_runq_pick_next_task() to here.

	I was not sure if the task's rq lock needs to be held to do this
	to get an accurate result, so I've left the original checks in
	shared_runq_pick_next_task() as they are.

	Since retrying may be costly, I'm using "rq->ttwu_pending" as a
	bail-out condition. Maybe there are better alternatives using
	the lb_cost and rq->avg_idle, but this was simpler for now.

	(Realizing as I write this that this will cause more contention
	with enqueue/dequeue in a busy system. I'll check if that is the
	case)

	P.S. This did not affect the ~60% regression I was seeing one
	bit so the problem was deeper.

--
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d67d86d3bfdf..f1e64412fd48 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -380,9 +396,12 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
 		curr_idx = (starting_idx + i) % num_shards;
 		shard = &shared_runq->shards[curr_idx];
 
-		p = shared_runq_pop_task(shard, cpu_of(rq));
+		p = shared_runq_pop_task(shard, rq);
 		if (p)
 			break;
+
+		if (rq->ttwu_pending)
+			return 0;
 	}
 	if (!p)
 		return 0;
--

	More bailout logic.

--
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d67d86d3bfdf..f1e64412fd48 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -395,17 +414,16 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
 	if (task_on_rq_queued(p) && !task_on_cpu(src_rq, p)) {
 		update_rq_clock(src_rq);
 		src_rq = move_queued_task(src_rq, &src_rf, p, cpu_of(rq));
-		ret = 1;
 	}
 
 	if (src_rq != rq) {
 		task_rq_unlock(src_rq, p, &src_rf);
 		raw_spin_rq_lock(rq);
 	} else {
+		ret = 1;
 		rq_unpin_lock(rq, &src_rf);
 		raw_spin_unlock_irqrestore(&p->pi_lock, src_rf.flags);
 	}
-	rq_repin_lock(rq, rf);
 
 	return ret;
 }
--

	Only return 1 if a task is actually pulled, else return -1,
	signifying the path has released and re-acquired the lock.

	Also leave the rq_repin_lock() part to the caller, i.e.,
	newidle_balance(), since it makes for a nicer flow (see
	below).

--
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d67d86d3bfdf..f1e64412fd48 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12344,50 +12362,59 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	if (!cpu_active(this_cpu))
 		return 0;
 
-	if (sched_feat(SHARED_RUNQ)) {
-		pulled_task = shared_runq_pick_next_task(this_rq, rf);
-		if (pulled_task)
-			return pulled_task;
-	}
-
 	/*
 	 * We must set idle_stamp _before_ calling idle_balance(), such that we
 	 * measure the duration of idle_balance() as idle time.
 	 */
 	this_rq->idle_stamp = rq_clock(this_rq);
 
-	/*
-	 * This is OK, because current is on_cpu, which avoids it being picked
-	 * for load-balance and preemption/IRQs are still disabled avoiding
-	 * further scheduler activity on it and we're being very careful to
-	 * re-start the picking loop.
-	 */
-	rq_unpin_lock(this_rq, rf);
-
 	rcu_read_lock();
-	sd = rcu_dereference_check_sched_domain(this_rq->sd);
-
-	/*
-	 * Skip <= LLC domains as they likely won't have any tasks if the
-	 * shared runq is empty.
-	 */
-	if (sched_feat(SHARED_RUNQ)) {
+	if (sched_feat(SHARED_RUNQ))
 		sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
-		if (likely(sd))
-			sd = sd->parent;
-	}
+	else
+		sd = rcu_dereference_check_sched_domain(this_rq->sd);
 
 	if (!READ_ONCE(this_rq->rd->overload) ||
 	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
 
-		if (sd)
+		while (sd) {
 			update_next_balance(sd, &next_balance);
+			sd = sd->child;
+		}
+
 		rcu_read_unlock();
 
 		goto out;
 	}
 	rcu_read_unlock();
 
+	t0 = sched_clock_cpu(this_cpu);
+	if (sched_feat(SHARED_RUNQ)) {
+		pulled_task = shared_runq_pick_next_task(this_rq, rf);
+		if (pulled_task) {
+			curr_cost = sched_clock_cpu(this_cpu) - t0;
+			update_newidle_cost(sd, curr_cost);
+			goto out_swq;
+		}
+	}
+
+	/* Check again for pending wakeups */
+	if (this_rq->ttwu_pending)
+		return 0;
+
+	t1 = sched_clock_cpu(this_cpu);
+	curr_cost += t1 - t0;
+
+	if (sd)
+		update_newidle_cost(sd, curr_cost);
+
+	/*
+	 * This is OK, because current is on_cpu, which avoids it being picked
+	 * for load-balance and preemption/IRQs are still disabled avoiding
+	 * further scheduler activity on it and we're being very careful to
+	 * re-start the picking loop.
+	 */
+	rq_unpin_lock(this_rq, rf);
 	raw_spin_rq_unlock(this_rq);
 
 	t0 = sched_clock_cpu(this_cpu);
--

	This hunk does a few things:

	1. If a task is successfully pulled from the shared rq, or if the rq
	   lock had been released and re-acquired, jump to the
	   very end where we check a bunch of conditions and return
	   accordingly.

	2. Move the shared rq picking after the "rd->overload" and
	   checks against "rq->avg_idle".

	   P.S. This recovered half the performance that was lost.

	3. Update the newidle_balance_cost via update_newidle_cost()
	   since that is also used to determine the previous bailout
	   threshold.

	4. A bunch of update_next_balance().

	5. Move rq_unpin_lock() below. I do not know the implications of
	   this; the kernel is not complaining so far (mind you, I'm on
	   x86 and I do not have lockdep enabled).

	A combination of 3 and 4 seemed to give back the other half of
	tbench performance.

--
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d67d86d3bfdf..f1e64412fd48 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12400,6 +12427,13 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 
 		update_next_balance(sd, &next_balance);
 
+		/*
+		 * Skip <= LLC domains as they likely won't have any tasks if the
+		 * shared runq is empty.
+		 */
+		if (sched_feat(SHARED_RUNQ) && (sd->flags & SD_SHARE_PKG_RESOURCES))
+			continue;
+
 		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
 			break;
 
--

	This was based on my suggestion in the parallel thread.

	P.S. This alone, without the changes in the previous hunk, showed no
	difference in performance, with results the same as vanilla v3.

--
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d67d86d3bfdf..f1e64412fd48 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12429,6 +12463,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 
 	raw_spin_rq_lock(this_rq);
 
+out_swq:
 	if (curr_cost > this_rq->max_idle_balance_cost)
 		this_rq->max_idle_balance_cost = curr_cost;
 
--

	The last part of newidle_balance() does a bunch of accounting
	which is relevant after the above changes. Also the
	rq_repin_lock() I had removed now happens here.

--

Again, most of this is lightly tested with just one workload, but I would
like to hear your thoughts, especially on the significance of
"rd->overload", "max_newidle_lb_cost", and "update_next_balance()".
However, I'm afraid these may be the bits that led to the drop in
utilization you mentioned in the first place.

Most of the experimentation (except for rq lock contention using IBS)
was done by reading the newidle_balance() code.

Finally a look at newidle_balance counts (tip vs tip + v3 + diff) for
128-clients of tbench on the test machine:


< ----------------------------------------  Category:  newidle (SMT)  ---------------------------------------- >
load_balance cnt on cpu newly idle                         :     921871,   0	(diff: -100.00%)
--
< ----------------------------------------  Category:  newidle (MC)   ---------------------------------------- >
load_balance cnt on cpu newly idle                         :     472412,   0	(diff: -100.00%)
--
< ----------------------------------------  Category:  newidle (DIE)  ---------------------------------------- >
load_balance cnt on cpu newly idle                         :        114, 279	(diff: +144.74%)
--
< ----------------------------------------  Category:  newidle (NUMA) ---------------------------------------- >
load_balance cnt on cpu newly idle                         :          9,   9	(diff: +00.00%)
--

Let me know if you have any queries. I'll go back and try to bisect the
diff to see if only a couple of the changes that I thought were important
are good enough to yield back the lost performance. I'll do wider
testing after hearing your thoughts.

> 
> [..snip..]
> 
> So as I said above, I definitely would like to better understand why
> we're hammering load_balance() so hard in a few different contexts. I'll
> try to repro the issue with tbench on a few different configurations. If
> I'm able to, the next step would be for me to investigate my theory,
> likely by doing something like measuring rq->avg_idle at wakeup time and
> in newidle_balance() using bpftrace. If avg_idle is a lot higher for
> waking cores, maybe my theory isn't too far fetched?
> 
> With all that said, it's been pretty clear from early on in the patch
> set that there were going to be tradeoffs to enabling SHARED_RUNQ. It's
> not surprising to me that there are some configurations that really
> don't tolerate it well, and others that benefit from it a lot. Hackbench
> and kernel compile seem to be two such examples; hackbench especially.
> At Meta, we get really nice gains from it on a few of our biggest
> services. So my hope is that we don't have to tweak every possible use
> case in order for the patch set to be merged, as we've already done a
> lot of due diligence relative to other sched features.
> 
> I would appreciate hearing what others think as well.
> 
> Thanks,
> David

--
Thanks and Regards,
Prateek

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 7/7] sched: Shard per-LLC shared runqueues
  2023-08-30  6:17   ` Chen Yu
@ 2023-08-31  0:01     ` David Vernet
  2023-08-31 10:45       ` Chen Yu
  0 siblings, 1 reply; 52+ messages in thread
From: David Vernet @ 2023-08-31  0:01 UTC (permalink / raw)
  To: Chen Yu
  Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	tj, roman.gushchin, gautham.shenoy, kprateek.nayak, aaron.lu,
	wuyun.abel, kernel-team, tim.c.chen

On Wed, Aug 30, 2023 at 02:17:09PM +0800, Chen Yu wrote:
> Hi David,

Hi Chenyu,

Thank you for running these tests, and for your very in-depth analysis
and explanation of the performance you were observing.

> On 2023-08-09 at 17:12:18 -0500, David Vernet wrote:
> > The SHARED_RUNQ scheduler feature creates a FIFO queue per LLC that
> > tasks are put into on enqueue, and pulled from when a core in that LLC
> > would otherwise go idle. For CPUs with large LLCs, this can sometimes
> > cause significant contention, as illustrated in [0].
> > 
> > [0]: https://lore.kernel.org/all/c8419d9b-2b31-2190-3058-3625bdbcb13d@meta.com/
> > 
> > So as to try and mitigate this contention, we can instead shard the
> > per-LLC runqueue into multiple per-LLC shards.
> > 
> > While this doesn't outright prevent all contention, it does somewhat mitigate it.
> >
>  
> Thanks for this proposal to make idle load balancing more efficient. As we
> discussed previously, I launched hackbench on Intel Sapphire Rapids
> and I have some findings.
> 
> This platform has 2 sockets, each socket has 56C/112T. To avoid the
> run-run variance, only 1 socket is online, the cpufreq governor is set to
> performance, the turbo is disabled, and C-states deeper than C1 are disabled.
> 
> hackbench
> =========
> case                    load            baseline(std%)  compare%( std%)
> process-pipe            1-groups         1.00 (  1.09)   +0.55 (  0.20)
> process-pipe            2-groups         1.00 (  0.60)   +3.57 (  0.28)
> process-pipe            4-groups         1.00 (  0.30)   +5.22 (  0.26)
> process-pipe            8-groups         1.00 (  0.10)  +43.96 (  0.26)
> process-sockets         1-groups         1.00 (  0.18)   -1.56 (  0.34)
> process-sockets         2-groups         1.00 (  1.06)  -12.37 (  0.11)
> process-sockets         4-groups         1.00 (  0.29)   +0.21 (  0.19)
> process-sockets         8-groups         1.00 (  0.06)   +3.59 (  0.39)
> 
> The 8 groups pipe mode has an impressive improvement, while the 2 groups sockets
> mode did see some regressions.
> 
> The possible reason for the regression is at the end of this reply, in
> case you want to see the conclusion directly : )

I read through everything, and it all made sense. I'll reply to your
conclusion below.

> To investigate the regression, I did a slight hack on hackbench, renaming
> the workload tasks to sender and receiver.
> 
> When it is in 2 groups mode, there would be 2 groups of senders and receivers.
> Each group has 14 senders and 14 receivers. So there are 56 tasks in total running
> on 112 CPUs. In each group, sender_i sends packets to receiver_j (i, j belong to [1,14]).
> 
> 
> 1. Firstly use 'top' to monitor the CPU utilization:
> 
>    When shared_runqueue is disabled, many CPUs are 100%, while the other
>    CPUs remain 0%.
>    When shared_runqueue is enabled, most CPUs are busy and the utilization is
>    in 40%~60%.
> 
>    This means that shared_runqueue works as expected.
> 
> 2. Then the bpf wakeup latency is monitored:
> 
> tracepoint:sched:sched_wakeup,
> tracepoint:sched:sched_wakeup_new
> {
>         if (args->comm == "sender") {
>                 @qstime[args->pid] = nsecs;
>         }
>         if (args->comm == "receiver") {
>                 @qrtime[args->pid] = nsecs;
>         }
> }
> 
> tracepoint:sched:sched_switch
> {
>         if (args->next_comm == "sender") {
>                 $ns = @qstime[args->next_pid];
>                 if ($ns) {
>                         @sender_wakeup_lat = hist((nsecs - $ns) / 1000);
>                         delete(@qstime[args->next_pid]);
>                 }
>         }
>         if (args->next_comm == "receiver") {
>                 $ns = @qrtime[args->next_pid];
>                 if ($ns) {
>                         @receiver_wakeup_lat = hist((nsecs - $ns) / 1000);
>                         delete(@qrtime[args->next_pid]);
>                 }
>         }
> }
> 
> 
> It shows that the wakeup latency of the receiver has increased a little
> bit. But considering that this symptom is the same when hackbench is in pipe mode,
> and there is no regression in pipe mode, the wakeup latency overhead might not be
> the cause of the regression.
> 
> 3. Then FlameGraph is used to compare the bottleneck.
> There is still no obvious difference noticed. One obvious bottleneck is the atomic
> write to a memory cgroup page count (and runqueue lock contention is not observed).
> The backtrace:
> 
> obj_cgroup_charge_pages;obj_cgroup_charge;__kmem_cache_alloc_node;
> __kmalloc_node_track_caller;kmalloc_reserve;__alloc_skb;
> alloc_skb_with_frags;sock_alloc_send_pskb;unix_stream_sendmsg
> 
> However there is no obvious ratio difference between with/without shared runqueue
> enabled. So this one might not be the cause.
> 
> 4. Check the wakeup task migration count
> 
> Borrow the script from Aaron:
> kretfunc:select_task_rq_fair
> {
>         $p = (struct task_struct *)args->p;
>         if ($p->comm == "sender") {
>                 if ($p->thread_info.cpu != retval) {
>                         @wakeup_migrate_sender = count();
>                 } else {
>                         @wakeup_prev_sender = count();
>                 }
>         }
>         if ($p->comm == "receiver") {
>                 if ($p->thread_info.cpu != retval) {
>                         @wakeup_migrate_receiver = count();
>                 } else {
>                         @wakeup_prev_receiver = count();
>                 }
>         }
> }
> 
> Without shared_runqueue enabled, the wakee tasks are mostly woken up on their
> previous running CPUs.
> With shared_runqueue enabled, the wakee tasks are mostly woken up on
> completely different idle CPUs.
> 
> This reminds me: is it possible the regression was caused by broken
> cache locality?
> 
> 
> 5. Check the L2 cache miss rate.
> perf stat -e l2_rqsts.references,l2_request.miss sleep 10
> The results show that the L2 cache miss rate is nearly the same with/without
> shared_runqueue enabled.

As mentioned below, I expect it would be interesting to also collect
icache / iTLB numbers. In my experience, poor uop cache locality will
also result in poor icache locality, though of course that depends on a
lot of other factors like alignment, how many (un)conditional branches
you have within some byte window, etc. If alignment, etc were the issue
though, we'd likely observe this also without SHARED_RUNQ.
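
For a first look at those, something like the following should do (these are
the generic perf cache events; exact event names vary by platform):

perf stat -e L1-icache-load-misses,iTLB-loads,iTLB-load-misses -a sleep 10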

> I did not check the L3 miss rate, because:
>    5.1 there is only 1 socket of CPUs online
>    5.2 the working set of the hackbench run is 56 * 100 * 300000, which is nearly
>        the same as the LLC cache size.
> 
> 6. As mentioned in step 3, the bottleneck is an atomic write to a global
>    variable. Then use perf c2c to check if there is any false/true sharing.
> 
>    According to the result, the total number and average cycles of local HITM
>    are low. So this might indicate that this is not a false sharing or true
>    sharing issue.
> 
> 
> 7. Then use perf topdown to dig into the detail. The methodology is at
>    https://perf.wiki.kernel.org/index.php/Top-Down_Analysis
> 
> 
>    When shared runqueue is disabled:
> 
>     #     65.2 %  tma_backend_bound
>     #      2.9 %  tma_bad_speculation
>     #     13.1 %  tma_frontend_bound
>     #     18.7 %  tma_retiring
> 
> 
> 
>    When shared runqueue is enabled:
> 
>     #     52.4 %  tma_backend_bound
>     #      3.3 %  tma_bad_speculation
>     #     20.5 %  tma_frontend_bound
>     #     23.8 %  tma_retiring
>     
> 
> We can see that, the ratio of frontend_bound has increased from 13.1% to
> 20.5%.  As a comparison, this ratio does not increase when the hackbench
> is in pipe mode.
> 
> Then further dig into the deeper level of frontend_bound:
> 
> When shared runqueue is disabled:
> #      6.9 %  tma_fetch_latency   ---->  #      7.3 %  tma_ms_switches
>                                   |
>                                   ---->  #      7.1 %  tma_dsb_switches
> 
> 
> When shared runqueue is enabled:
> #     11.6 %  tma_fetch_latency   ----> #      6.7 %  tma_ms_switches
>                                   |
>                                   ----> #      7.8 %  tma_dsb_switches
> 
> 
> 1. The DSB (Decode Stream Buffer) switch count increases
>    from 13.1% * 6.9% * 7.1% to 20.5% * 11.6% * 7.8%

Indeed, these switches are quite costly from what I understand.

> 2. The MS (Microcode Sequencer) switch count increases
>    from 13.1% * 6.9% * 7.3% to 20.5% * 11.6% * 6.7%
> 
> The DSB holds the cached decoded uops, which is similar to the L1 icache,
> except that the icache has the original instructions, while the DSB has the
> decoded ones. The DSB reflects the instruction footprint. The increase
> in DSB switches means that the cached buffer has been thrashed a lot.
> 
> The MS is used to decode complex instructions; an increase in the MS switch
> counter usually means that the workload is running complex instructions.
> 
> In summary:
> 
> So the scenario I'm thinking of that causes this issue is:
> Task migration increases the DSB switch count. With shared_runqueue enabled,
> the task could be migrated to different CPUs more often. And it has to fill its
> new uops into the DSB, but that DSB has already been filled by the old task's
> uops. So a DSB switch is triggered to decode the new macro ops. This is usually
> not a problem if the workload runs only simple instructions. However, if
> this workload's instruction footprint increases, task migration might break
> the DSB uops locality, which is similar to L1/L2 cache locality.

Interesting. As mentioned above, I expect we also see an increase in
iTLB and icache misses?

This is something we deal with in HHVM. Like any other JIT engine /
compiler, it is heavily front-end CPU bound, and has very poor icache,
iTLB, and uop cache locality (also lots of branch resteers, etc).
SHARED_RUNQ actually helps this workload quite a lot, as explained in
the cover letter for the patch series. It makes sense that it would: uop
locality is really bad even without increasing CPU util. So we have no
reason not to migrate the task and hop on a CPU.

> I wonder if SHARED_RUNQ could consider that: if a task is a long-duration one,
> say, p->avg_runtime >= sysctl_migration_cost, maybe we should not put such a task
> on the per-LLC shared runqueue? In this way it will not be migrated too often,
> so as to keep its locality (both in terms of L1/L2 cache and DSB).
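
Concretely, the suggestion would presumably look something like this in
shared_runq_enqueue_task() (sketch only: task_avg_runtime() is a stand-in for
whatever averaged-runtime metric ends up being used, and
sysctl_sched_migration_cost is the closest existing knob):

	static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p)
	{
		if (p->nr_cpus_allowed == 1)
			return;

		/*
		 * Hypothetical heuristic: don't queue long-running tasks, so
		 * that they keep their L1/L2 and DSB locality instead of
		 * being pulled to another CPU.
		 */
		if (task_avg_runtime(p) >= sysctl_sched_migration_cost)
			return;

		shared_runq_push_task(rq, p);
	}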

I'm hesitant to apply such heuristics to the feature. As mentioned
above, SHARED_RUNQ works very well on HHVM, despite its potential hit to
icache / iTLB / DSB locality. Those hhvmworker tasks run for a very long
time, sometimes upwards of 20+ms. They also tend to have poor L1 cache
locality in general even when they're scheduled on the same core they
were on before they were descheduled, so we observe better performance
if the task is migrated to a fully idle core rather than e.g. its
hypertwin if it's available. That's not something we can guarantee with
SHARED_RUNQ, but it hopefully illustrates the point that it's an example
of a workload that would suffer with such a heuristic.

Another point to consider is that performance implications that are a
result of Intel micro architectural details don't necessarily apply to
everyone. I'm not as familiar with the instruction decode pipeline on
AMD chips like Zen4. I'm sure they have a uop cache, but the size of
that cache, alignment requirements, the way that cache interfaces with
e.g. their version of the MITE / decoder, etc, are all going to be quite
different.

In general, I think it's difficult for heuristics like this to suit all
possible workloads or situations (not that you're claiming it is). My
preference is to keep it as is so that it's easier for users to build a
mental model of what outcome they should expect if they use the feature.
Put another way: As a user of this feature, I'd be a lot more surprised
to see that I enabled it and CPU util stayed low, vs. enabling it and
seeing higher CPU util, but also degraded icache / iTLB locality.

Let me know what you think, and thanks again for investing your time
into this.

Thanks,
David

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 6/7] sched: Implement shared runqueue in CFS
  2023-08-30  6:46   ` K Prateek Nayak
@ 2023-08-31  1:34     ` David Vernet
  2023-08-31  3:47       ` K Prateek Nayak
  0 siblings, 1 reply; 52+ messages in thread
From: David Vernet @ 2023-08-31  1:34 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	tj, roman.gushchin, gautham.shenoy, aaron.lu, wuyun.abel,
	kernel-team

On Wed, Aug 30, 2023 at 12:16:17PM +0530, K Prateek Nayak wrote:
> Hello David,

Hello Prateek,

> 
> On 8/10/2023 3:42 AM, David Vernet wrote:
> > [..snip..]
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 2aab7be46f7e..8238069fd852 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -769,6 +769,8 @@ struct task_struct {
> >  	unsigned long			wakee_flip_decay_ts;
> >  	struct task_struct		*last_wakee;
> >  
> > +	struct list_head		shared_runq_node;
> > +
> >  	/*
> >  	 * recent_used_cpu is initially set as the last CPU used by a task
> >  	 * that wakes affine another task. Waker/wakee relationships can
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 385c565da87f..fb7e71d3dc0a 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4529,6 +4529,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
> >  #ifdef CONFIG_SMP
> >  	p->wake_entry.u_flags = CSD_TYPE_TTWU;
> >  	p->migration_pending = NULL;
> > +	INIT_LIST_HEAD(&p->shared_runq_node);
> >  #endif
> >  	init_sched_mm_cid(p);
> >  }
> > @@ -9764,6 +9765,18 @@ int sched_cpu_deactivate(unsigned int cpu)
> >  	return 0;
> >  }
> >  
> > +void sched_update_domains(void)
> > +{
> > +	const struct sched_class *class;
> > +
> > +	update_sched_domain_debugfs();
> > +
> > +	for_each_class(class) {
> > +		if (class->update_domains)
> > +			class->update_domains();
> > +	}
> > +}
> > +
> >  static void sched_rq_cpu_starting(unsigned int cpu)
> >  {
> >  	struct rq *rq = cpu_rq(cpu);
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 9c23e3b948fc..6e740f8da578 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -139,20 +139,235 @@ static int __init setup_sched_thermal_decay_shift(char *str)
> >  }
> >  __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
> >  
> > +/**
> > + * struct shared_runq - Per-LLC queue structure for enqueuing and migrating
> > + * runnable tasks within an LLC.
> > + *
> > + * WHAT
> > + * ====
> > + *
> > + * This structure enables the scheduler to be more aggressively work
> > + * conserving, by placing waking tasks on a per-LLC FIFO queue that can then be
> > + * pulled from when another core in the LLC is going to go idle.
> > + *
> > + * struct rq stores a pointer to its LLC's shared_runq via struct cfs_rq.
> > + * Waking tasks are enqueued in the calling CPU's struct shared_runq in
> > + * __enqueue_entity(), and are opportunistically pulled from the shared_runq
> > + * in newidle_balance(). Tasks enqueued in a shared_runq may be scheduled prior
> > + * to being pulled from the shared_runq, in which case they're simply dequeued
> > + * from the shared_runq in __dequeue_entity().
> > + *
> > + * There is currently no task-stealing between shared_runqs in different LLCs,
> > + * which means that shared_runq is not fully work conserving. This could be
> > + * added at a later time, with tasks likely only being stolen across
> > + * shared_runqs on the same NUMA node to avoid violating NUMA affinities.
> > + *
> > + * HOW
> > + * ===
> > + *
> > + * A shared_runq is comprised of a list, and a spinlock for synchronization.
> > + * Given that the critical section for a shared_runq is typically a fast list
> > + * operation, and that the shared_runq is localized to a single LLC, the
> > + * spinlock will typically only be contended on workloads that do little else
> > + * other than hammer the runqueue.
> > + *
> > + * WHY
> > + * ===
> > + *
> > + * As mentioned above, the main benefit of shared_runq is that it enables more
> > + * aggressive work conservation in the scheduler. This can benefit workloads
> > + * that benefit more from CPU utilization than from L1/L2 cache locality.
> > + *
> > + * shared_runqs are segmented across LLCs both to avoid contention on the
> > + * shared_runq spinlock by minimizing the number of CPUs that could contend on
> > + * it, as well as to strike a balance between work conservation, and L3 cache
> > + * locality.
> > + */
> > +struct shared_runq {
> > +	struct list_head list;
> > +	raw_spinlock_t lock;
> > +} ____cacheline_aligned;
> > +
> >  #ifdef CONFIG_SMP
> > +
> > +static DEFINE_PER_CPU(struct shared_runq, shared_runqs);
> > +
> > +static struct shared_runq *rq_shared_runq(struct rq *rq)
> > +{
> > +	return rq->cfs.shared_runq;
> > +}
> > +
> > +static void shared_runq_reassign_domains(void)
> > +{
> > +	int i;
> > +	struct shared_runq *shared_runq;
> > +	struct rq *rq;
> > +	struct rq_flags rf;
> > +
> > +	for_each_possible_cpu(i) {
> > +		rq = cpu_rq(i);
> > +		shared_runq = &per_cpu(shared_runqs, per_cpu(sd_llc_id, i));
> > +
> > +		rq_lock(rq, &rf);
> > +		rq->cfs.shared_runq = shared_runq;
> > +		rq_unlock(rq, &rf);
> > +	}
> > +}
> > +
> > +static void __shared_runq_drain(struct shared_runq *shared_runq)
> > +{
> > +	struct task_struct *p, *tmp;
> > +
> > +	raw_spin_lock(&shared_runq->lock);
> > +	list_for_each_entry_safe(p, tmp, &shared_runq->list, shared_runq_node)
> > +		list_del_init(&p->shared_runq_node);
> > +	raw_spin_unlock(&shared_runq->lock);
> > +}
> > +
> > +static void update_domains_fair(void)
> > +{
> > +	int i;
> > +	struct shared_runq *shared_runq;
> > +
> > +	/* Avoid racing with SHARED_RUNQ enable / disable. */
> > +	lockdep_assert_cpus_held();
> > +
> > +	shared_runq_reassign_domains();
> > +
> > +	/* Ensure every core sees its updated shared_runq pointers. */
> > +	synchronize_rcu();
> > +
> > +	/*
> > +	 * Drain all tasks from all shared_runq's to ensure there are no stale
> > +	 * tasks in any prior domain runq. This can cause us to drain live
> > +	 * tasks that would otherwise have been safe to schedule, but this
> > +	 * isn't a practical problem given how infrequently domains are
> > +	 * rebuilt.
> > +	 */
> > +	for_each_possible_cpu(i) {
> > +		shared_runq = &per_cpu(shared_runqs, i);
> > +		__shared_runq_drain(shared_runq);
> > +	}
> > +}
> > +
> >  void shared_runq_toggle(bool enabling)
> > -{}
> > +{
> > +	int cpu;
> > +
> > +	if (enabling)
> > +		return;
> > +
> > +	/* Avoid racing with hotplug. */
> > +	lockdep_assert_cpus_held();
> > +
> > +	/* Ensure all cores have stopped enqueueing / dequeuing tasks. */
> > +	synchronize_rcu();
> > +
> > +	for_each_possible_cpu(cpu) {
> > +		int sd_id;
> > +
> > +		sd_id = per_cpu(sd_llc_id, cpu);
> > +		if (cpu == sd_id)
> > +			__shared_runq_drain(rq_shared_runq(cpu_rq(cpu)));
> > +	}
> > +}
> > +
> > +static struct task_struct *shared_runq_pop_task(struct rq *rq)
> > +{
> > +	struct task_struct *p;
> > +	struct shared_runq *shared_runq;
> > +
> > +	shared_runq = rq_shared_runq(rq);
> > +	if (list_empty(&shared_runq->list))
> > +		return NULL;
> > +
> > +	raw_spin_lock(&shared_runq->lock);
> > +	p = list_first_entry_or_null(&shared_runq->list, struct task_struct,
> > +				     shared_runq_node);
> > +	if (p && is_cpu_allowed(p, cpu_of(rq)))
> > +		list_del_init(&p->shared_runq_node);
> 
> I wonder if we should remove the task from the list if
> "is_cpu_allowed()" return false.
> 
> Consider the following scenario: A task that does not sleep, is pinned
> to single CPU. Since this is now at the head of the list, and cannot be
> moved, we leave it there, but since the task also never sleeps, it'll
> stay there, thus preventing the queue from doing its job.

Hmm, sorry, I may not be understanding your suggestion. If a task was
pinned to a single CPU, it would be dequeued from the shared_runq before
being pinned (see __set_cpus_allowed_ptr_locked()), and then would not
be added back to the shard in shared_runq_enqueue_task() because of
p->nr_cpus_allowed == 1. The task would also be dequeued from the shard
before it started running (unless I'm misunderstanding what you mean by
"a task that does not sleep"). Please let me know if I'm missing
something.
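
Roughly, the flow for the pinned case is (sketch, not a patch):

	/*
	 * sched_setaffinity()
	 *   __set_cpus_allowed_ptr_locked()  <-- task is dequeued from the shard
	 * ...
	 * shared_runq_enqueue_task()
	 *   if (p->nr_cpus_allowed == 1)
	 *           return;                   <-- and never re-added while pinned
	 */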

> Further implication ...  
> 
> > +	else
> > +		p = NULL;
> > +	raw_spin_unlock(&shared_runq->lock);
> > +
> > +	return p;
> > +}
> > +
> > +static void shared_runq_push_task(struct rq *rq, struct task_struct *p)
> > +{
> > +	struct shared_runq *shared_runq;
> > +
> > +	shared_runq = rq_shared_runq(rq);
> > +	raw_spin_lock(&shared_runq->lock);
> > +	list_add_tail(&p->shared_runq_node, &shared_runq->list);
> > +	raw_spin_unlock(&shared_runq->lock);
> > +}
> >  
> >  static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p)
> > -{}
> > +{
> > +	/*
> > +	 * Only enqueue the task in the shared runqueue if:
> > +	 *
> > +	 * - SHARED_RUNQ is enabled
> > +	 * - The task isn't pinned to a specific CPU
> > +	 */
> > +	if (p->nr_cpus_allowed == 1)
> > +		return;
> > +
> > +	shared_runq_push_task(rq, p);
> > +}
> >  
> >  static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
> >  {
> > -	return 0;
> > +	struct task_struct *p = NULL;
> > +	struct rq *src_rq;
> > +	struct rq_flags src_rf;
> > +	int ret = -1;
> > +
> > +	p = shared_runq_pop_task(rq);
> > +	if (!p)
> > +		return 0;
> 
> ...
> 
> Since we return 0 here in such a scenario, we'll take the old
> newidle_balance() path but ...
> 
> > +
> > +	rq_unpin_lock(rq, rf);
> > +	raw_spin_rq_unlock(rq);
> > +
> > +	src_rq = task_rq_lock(p, &src_rf);
> > +
> > +	if (task_on_rq_queued(p) && !task_on_cpu(src_rq, p)) {
> > +		update_rq_clock(src_rq);
> > +		src_rq = move_queued_task(src_rq, &src_rf, p, cpu_of(rq));
> > +		ret = 1;
> > +	}
> > +
> > +	if (src_rq != rq) {
> > +		task_rq_unlock(src_rq, p, &src_rf);
> > +		raw_spin_rq_lock(rq);
> > +	} else {
> > +		rq_unpin_lock(rq, &src_rf);
> > +		raw_spin_unlock_irqrestore(&p->pi_lock, src_rf.flags);
> > +	}
> > +	rq_repin_lock(rq, rf);
> > +
> > +	return ret;
> >  }
> >  
> >  static void shared_runq_dequeue_task(struct task_struct *p)
> > -{}
> > +{
> > +	struct shared_runq *shared_runq;
> > +
> > +	if (!list_empty(&p->shared_runq_node)) {
> > +		shared_runq = rq_shared_runq(task_rq(p));
> > +		raw_spin_lock(&shared_runq->lock);
> > +		/*
> > +		 * Need to double-check for the list being empty to avoid
> > +		 * racing with the list being drained on the domain recreation
> > +		 * or SHARED_RUNQ feature enable / disable path.
> > +		 */
> > +		if (likely(!list_empty(&p->shared_runq_node)))
> > +			list_del_init(&p->shared_runq_node);
> > +		raw_spin_unlock(&shared_runq->lock);
> > +	}
> > +}
> >  
> >  /*
> >   * For asym packing, by default the lower numbered CPU has higher priority.
> > @@ -12093,6 +12308,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> >  	rcu_read_lock();
> >  	sd = rcu_dereference_check_sched_domain(this_rq->sd);
> >  
> > +	/*
> > +	 * Skip <= LLC domains as they likely won't have any tasks if the
> > +	 * shared runq is empty.
> > +	 */
> 
> ... now we skip all the way ahead of MC domain, overlooking any
> imbalance that might still exist within the SMT and MC groups
> since shared runq is not exactly empty.
> 
> Let me know if I've got something wrong!

Yep, I mentioned this to Gautham as well in [0].

[0]: https://lore.kernel.org/all/20230818050355.GA5718@maniforge/

I agree that I think we should remove this heuristic from v4. Either
that, or add logic to iterate over the shared_runq until a viable task
is found.
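
Roughly, reusing the v3 names (untested sketch):

	static struct task_struct *shared_runq_pop_task(struct rq *rq)
	{
		struct task_struct *p, *tmp, *ret = NULL;
		struct shared_runq *shared_runq = rq_shared_runq(rq);

		if (list_empty(&shared_runq->list))
			return NULL;

		raw_spin_lock(&shared_runq->lock);
		/* Walk past tasks this CPU can't run instead of giving up at the head. */
		list_for_each_entry_safe(p, tmp, &shared_runq->list, shared_runq_node) {
			if (is_cpu_allowed(p, cpu_of(rq))) {
				list_del_init(&p->shared_runq_node);
				ret = p;
				break;
			}
		}
		raw_spin_unlock(&shared_runq->lock);

		return ret;
	}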

> 
> > +	if (sched_feat(SHARED_RUNQ)) {
> > +		sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> > +		if (likely(sd))
> > +			sd = sd->parent;
> > +	}
> 
> Speaking of skipping ahead of MC domain, I don't think this actually
> works since the domain traversal uses the "for_each_domain" macro
> which is defined as:

*blinks*

Uhhh, yeah, wow. Good catch!

> #define for_each_domain(cpu, __sd) \
> 	for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); \
> 			__sd; __sd = __sd->parent)
> 
> The traversal starts from rq->sd overwriting your initialized value
> here. This is why we see "load_balance count on cpu newly idle" in
> Gautham's first report
> (https://lore.kernel.org/lkml/ZN3dW5Gvcb0LFWjs@BLR-5CG11610CF.amd.com/)
> to be non-zero.
>
> One way to do this would be as follows:
> 
> static int newidle_balance() {
> 
> 	...
> 	for_each_domain(this_cpu, sd) {
> 
> 		...
> 		/* Skip balancing until LLc domain */
> 		if (sched_feat(SHARED_RUNQ) &&
> 		    (sd->flags & SD_SHARE_PKG_RESOURCES))
> 			continue;
> 
> 		...
> 	}
> 	...
> }

Yep, I think this makes sense to do.

> With this I see the newidle balance count for SMT and MC domain
> to be zero:

And indeed, I think this was the intention. Thanks again for catching
this. I'm excited to try this out when running benchmarks for v4.

> < ----------------------------------------  Category:  newidle (SMT)  ---------------------------------------- >
> load_balance cnt on cpu newly idle                         :          0    $      0.000 $    [    0.00000 ]
> --
> < ----------------------------------------  Category:  newidle (MC)   ---------------------------------------- >
> load_balance cnt on cpu newly idle                         :          0    $      0.000 $    [    0.00000 ]
> --
> < ----------------------------------------  Category:  newidle (DIE)  ---------------------------------------- >
> load_balance cnt on cpu newly idle                         :       2170    $      9.319 $    [   17.42832 ]
> --
> < ----------------------------------------  Category:  newidle (NUMA) ---------------------------------------- >
> load_balance cnt on cpu newly idle                         :         30    $    674.067 $    [    0.24094 ]
> --
> 
> Let me know if I'm missing something here :)

No, I think you're correct, we should be doing this. Assuming we want to
keep this heuristic, I think the block above is also correct so that we
properly account sd->max_newidle_lb_cost and rq->next_balance. Does that
make sense to you too?

> 
> Note: The lb counts for DIE and NUMA are down since I'm experimenting
> with the implementation currently. I'll update of any new findings on
> the thread.

Ack, thank you for doing that.

Just FYI, I'll be on vacation for about 1.5 weeks starting tomorrow
afternoon. If I'm slow to respond, that's why.

Thanks,
David

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 0/7] sched: Implement shared runqueue in CFS
  2023-08-30  9:56           ` K Prateek Nayak
@ 2023-08-31  2:32             ` David Vernet
  2023-08-31  4:21               ` K Prateek Nayak
  2023-08-31 10:45             ` [RFC PATCH 0/3] DO NOT MERGE: Breaking down the experimantal diff K Prateek Nayak
  1 sibling, 1 reply; 52+ messages in thread
From: David Vernet @ 2023-08-31  2:32 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Gautham R. Shenoy, linux-kernel, peterz, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, tj, roman.gushchin, aaron.lu, wuyun.abel,
	kernel-team

On Wed, Aug 30, 2023 at 03:26:40PM +0530, K Prateek Nayak wrote:
> Hello David,

Hello Prateek,

> 
> Short update based on some of my experimentation.
> 
> Disclaimer: I've been only looking at tbench 128 client case on a dual
> socket 3rd Generation EPYC system (2x 64C/128T). Wider results may
> vary but I have some information that may help with the debug and to
> proceed further.
> 
> On 8/25/2023 4:21 AM, David Vernet wrote:
> > On Thu, Aug 24, 2023 at 04:44:19PM +0530, Gautham R. Shenoy wrote:
> >> Hello David,
> >>
> >> On Fri, Aug 18, 2023 at 02:19:03PM +0530, Gautham R. Shenoy wrote:
> >>> Hello David,
> >>>
> >>> On Fri, Aug 18, 2023 at 12:03:55AM -0500, David Vernet wrote:
> >>>> On Thu, Aug 17, 2023 at 02:12:03PM +0530, Gautham R. Shenoy wrote:
> >>>>> Hello David,
> >>>>
> >>>> Hello Gautham,
> >>>>
> >>>> Thanks a lot as always for running some benchmarks and analyzing these
> >>>> changes.
> >>>>
> >>>>> On Wed, Aug 09, 2023 at 05:12:11PM -0500, David Vernet wrote:
> >>>>>> Changes
> >>>>>> -------
> >>>>>>
> >>>>>> This is v3 of the shared runqueue patchset. This patch set is based off
> >>>>>> of commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs
> >>>>>> bandwidth in use") on the sched/core branch of tip.git.
> >>>>>
> >>>>>
> >>>>> I tested the patches on Zen3 and Zen4 EPYC Servers like last time. I
> >>>>> notice that apart from hackbench, every other bechmark is showing
> >>>>> regressions with this patch series. Quick summary of my observations:
> >>>>
> >>>> Just to verify per our prior conversation [0], was this latest set of
> >>>> benchmarks run with boost disabled?
> >>>
> >>> Boost is enabled by default. I will queue a run tonight with boost
> >>> disabled.
> >>
> >> Apologies for the delay. I didn't see any changes with boost-disabled
> >> and with reverting the optimization to bail out of the
> >> newidle_balance() for SMT and MC domains when there was no task to be
> >> pulled from the shared-runq. I reran the whole thing once again, just
> >> to rule out any possible variance. The results came out the same.
> > 
> > Thanks a lot for taking the time to run more benchmarks.
> > 
> >> With the boost disabled, and the optimization reverted, the results
> >> don't change much.
> > 
> > Hmmm, I see. So, that was the only real substantive "change" between v2
> > -> v3. The other changes were supporting hotplug / domain recreation,
> > optimizing locking a bit, and fixing small bugs like the return value
> > from shared_runq_pick_next_task(), draining the queue when the feature
> > is disabled, and fixing the lkp errors.
> > 
> > With all that said, it seems very possible that the regression is due to
> > changes in sched/core between commit ebb83d84e49b ("sched/core: Avoid
> > multiple calling update_rq_clock() in __cfsb_csd_unthrottle()") in v2,
> > and commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs
> > bandwidth in use") in v3. EEVDF was merged in that window, so that could
> > be one explanation for the context switch rate being so much higher.
> > 
> >> It doesn't appear that the optimization is the cause for increase in
> >> the number of load-balancing attempts at the DIE and the NUMA
> >> domains. I have shared the counts of the newidle_balance with and
> >> without SHARED_RUNQ below for tbench and it can be noticed that the
> >> counts are significantly higher for the 64 clients and 128 clients. I
> >> also captured the counts/s of find_busiest_group() using funccount.py
> >> which tells the same story. So the drop in the performance for tbench
> >> with your patches strongly correlates with the increase in
> >> load-balancing attempts.
> >>
> >> newidle balance is undertaken only if the overload flag is set and the
> >> expected idle duration is greater than the avg load balancing cost. It
> >> is hard to imagine why should the shared runq cause the overload flag
> >> to be set!
> > 
> > Yeah, I'm not sure either about how or why shared_runq would cause this.
> > This is purely hypothetical, but is it possible that shared_runq causes
> > idle cores to on average _stay_ idle longer due to other cores pulling
> > tasks that would have otherwise been load balanced to those cores?
> > 
> > Meaning -- say CPU0 is idle, and there are tasks on other rqs which
> > could be load balanced. Without shared_runq, CPU0 might be woken up to
> > run a task from a periodic load balance. With shared_runq, any active
> > core that would otherwise have gone idle could pull the task, keeping
> > CPU0 idle.
> > 
> > What do you think? I could be totally off here.
> > 
> > From my perspective, I'm not too worried about this given that we're
> > seeing gains in other areas such as kernel compile as I showed in [0],
> > though I definitely would like to better understand it.
> 
> Let me paste a cumulative diff containing everything I've tried since
> it'll be easy to explain.
> 
> o Performance numbers for tbench 128 clients:
> 
> tip			: 1.00 (Var: 0.57%)
> tip + vanilla v3	: 0.39 (var: 1.15%) (%diff: -60.74%)
> tip + v3 + diff		: 0.99 (var: 0.61%) (%diff: -00.24%)
> 
> tip is at commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when
> cfs bandwidth in use"), same as what Gautham used, so no EEVDF yet.
> 
> o Cumulative Diff
> 
> Should apply cleanly on top of tip at above commit + this series as is.
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d67d86d3bfdf..f1e64412fd48 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -198,7 +198,7 @@ struct shared_runq_shard {
>  } ____cacheline_aligned;
>  
>  /* This would likely work better as a configurable knob via debugfs */
> -#define SHARED_RUNQ_SHARD_SZ 6
> +#define SHARED_RUNQ_SHARD_SZ 16
>  #define SHARED_RUNQ_MAX_SHARDS \
>  	((NR_CPUS / SHARED_RUNQ_SHARD_SZ) + (NR_CPUS % SHARED_RUNQ_SHARD_SZ != 0))
>  
> @@ -322,20 +322,36 @@ void shared_runq_toggle(bool enabling)
>  }
>  
>  static struct task_struct *
> -shared_runq_pop_task(struct shared_runq_shard *shard, int target)
> +shared_runq_pop_task(struct shared_runq_shard *shard, struct rq *rq)
>  {
> +	int target = cpu_of(rq);
>  	struct task_struct *p;
>  
>  	if (list_empty(&shard->list))
>  		return NULL;
>  
>  	raw_spin_lock(&shard->lock);
> +again:
>  	p = list_first_entry_or_null(&shard->list, struct task_struct,
>  				     shared_runq_node);
> -	if (p && is_cpu_allowed(p, target))
> +
> +	/* If we find a task, delete it from the list regardless */
> +	if (p) {
>  		list_del_init(&p->shared_runq_node);
> -	else
> -		p = NULL;
> +
> +		if (!task_on_rq_queued(p) ||
> +		    task_on_cpu(task_rq(p), p) ||

Have you observed !task_on_rq_queued() or task_on_cpu() returning true
here? The task should have removed itself from the shard when
__dequeue_entity() is called from set_next_entity() when it's scheduled
in pick_next_task_fair(). The reason we have to check in
shared_runq_pick_next_task() is that between dequeuing the task from the
shared_runq and getting its rq lock, it could have been scheduled on its
current rq. But if the task was scheduled first, it should have removed
itself from the shard.
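
For reference, the ordering being described is roughly:

	/*
	 * pick_next_task_fair()
	 *   set_next_entity()
	 *     __dequeue_entity()
	 *       shared_runq_dequeue_task()  <-- task removes itself from the shard
	 *
	 * so by the time the task is visibly running, it should already be
	 * off the list.
	 */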

> +		    !is_cpu_allowed(p, target)) {
> +			if (rq->ttwu_pending) {
> +				p = NULL;
> +				goto out;
> +			}

Have you observed this as well? If the task is enqueued on the ttwu
queue wakelist, it isn't enqueued on the waking CPU, so it shouldn't be
added to the shared_runq right?

> +
> +			goto again;
> +		}
> +	}
> +
> +out:
>  	raw_spin_unlock(&shard->lock);
>  
>  	return p;
> @@ -380,9 +396,12 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
>  		curr_idx = (starting_idx + i) % num_shards;
>  		shard = &shared_runq->shards[curr_idx];
>  
> -		p = shared_runq_pop_task(shard, cpu_of(rq));
> +		p = shared_runq_pop_task(shard, rq);
>  		if (p)
>  			break;
> +
> +		if (rq->ttwu_pending)
> +			return 0;

Same here r.e. rq->ttwu_pending. This should be handled in the

if (task_on_rq_queued(p) && !task_on_cpu(src_rq, p))

check below, no? Note that task_on_rq_queued(p) should only return true
if the task has made it to ttwu_do_activate(), and if it hasn't, I don't
think it should be in the shard in the first place.
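
i.e., roughly:

	/*
	 * try_to_wake_up()
	 *   ttwu_queue_wakelist()               <-- sets rq->ttwu_pending; task is
	 *                                           not enqueued yet
	 *       ... later, on the target CPU ...
	 *   sched_ttwu_pending()
	 *     ttwu_do_activate()
	 *       activate_task()
	 *         enqueue_task_fair()
	 *           __enqueue_entity()
	 *             shared_runq_enqueue_task() <-- only now can it be in a shard
	 */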

>  	}
>  	if (!p)
>  		return 0;
> @@ -395,17 +414,16 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
>  	if (task_on_rq_queued(p) && !task_on_cpu(src_rq, p)) {
>  		update_rq_clock(src_rq);
>  		src_rq = move_queued_task(src_rq, &src_rf, p, cpu_of(rq));
> -		ret = 1;
>  	}
>  
>  	if (src_rq != rq) {
>  		task_rq_unlock(src_rq, p, &src_rf);
>  		raw_spin_rq_lock(rq);
>  	} else {
> +		ret = 1;
>  		rq_unpin_lock(rq, &src_rf);
>  		raw_spin_unlock_irqrestore(&p->pi_lock, src_rf.flags);
>  	}
> -	rq_repin_lock(rq, rf);

Huh, wouldn't this cause a WARN to be issued the next time we invoke
rq_clock() in newidle_balance() if we weren't able to find a task? Or
was it because we moved the SHARED_RUNQ logic to below where we check
rq_clock()? In general though, I don't think this should be removed. At
the very least, it should be tested with lockdep.
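
(For reference, the usual lockdep options should be enough for such a run,
e.g.:

	CONFIG_PROVE_LOCKING=y
	CONFIG_DEBUG_SPINLOCK=y
	CONFIG_DEBUG_LOCKDEP=y
)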

>  	return ret;
>  }
> @@ -12344,50 +12362,59 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  	if (!cpu_active(this_cpu))
>  		return 0;
>  
> -	if (sched_feat(SHARED_RUNQ)) {
> -		pulled_task = shared_runq_pick_next_task(this_rq, rf);
> -		if (pulled_task)
> -			return pulled_task;
> -	}
> -
>  	/*
>  	 * We must set idle_stamp _before_ calling idle_balance(), such that we
>  	 * measure the duration of idle_balance() as idle time.
>  	 */
>  	this_rq->idle_stamp = rq_clock(this_rq);
>  
> -	/*
> -	 * This is OK, because current is on_cpu, which avoids it being picked
> -	 * for load-balance and preemption/IRQs are still disabled avoiding
> -	 * further scheduler activity on it and we're being very careful to
> -	 * re-start the picking loop.
> -	 */
> -	rq_unpin_lock(this_rq, rf);
> -
>  	rcu_read_lock();
> -	sd = rcu_dereference_check_sched_domain(this_rq->sd);
> -
> -	/*
> -	 * Skip <= LLC domains as they likely won't have any tasks if the
> -	 * shared runq is empty.
> -	 */
> -	if (sched_feat(SHARED_RUNQ)) {
> +	if (sched_feat(SHARED_RUNQ))
>  		sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> -		if (likely(sd))
> -			sd = sd->parent;
> -	}
> +	else
> +		sd = rcu_dereference_check_sched_domain(this_rq->sd);
>  
>  	if (!READ_ONCE(this_rq->rd->overload) ||
>  	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>  
> -		if (sd)
> +		while (sd) {
>  			update_next_balance(sd, &next_balance);
> +			sd = sd->child;
> +		}
> +
>  		rcu_read_unlock();
>  
>  		goto out;
>  	}
>  	rcu_read_unlock();
>  
> +	t0 = sched_clock_cpu(this_cpu);
> +	if (sched_feat(SHARED_RUNQ)) {
> +		pulled_task = shared_runq_pick_next_task(this_rq, rf);
> +		if (pulled_task) {
> +			curr_cost = sched_clock_cpu(this_cpu) - t0;
> +			update_newidle_cost(sd, curr_cost);
> +			goto out_swq;
> +		}
> +	}

Hmmm, why did you move this further down in newidle_balance()? We don't
want to skip trying to get a task from the shared_runq if rq->avg_idle <
sd->max_newidle_lb_cost.

> +
> +	/* Check again for pending wakeups */
> +	if (this_rq->ttwu_pending)
> +		return 0;
> +
> +	t1 = sched_clock_cpu(this_cpu);
> +	curr_cost += t1 - t0;
> +
> +	if (sd)
> +		update_newidle_cost(sd, curr_cost);
> +
> +	/*
> +	 * This is OK, because current is on_cpu, which avoids it being picked
> +	 * for load-balance and preemption/IRQs are still disabled avoiding
> +	 * further scheduler activity on it and we're being very careful to
> +	 * re-start the picking loop.
> +	 */
> +	rq_unpin_lock(this_rq, rf);
>  	raw_spin_rq_unlock(this_rq);
>  
>  	t0 = sched_clock_cpu(this_cpu);
> @@ -12400,6 +12427,13 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  
>  		update_next_balance(sd, &next_balance);
>  
> +		/*
> +		 * Skip <= LLC domains as they likely won't have any tasks if the
> +		 * shared runq is empty.
> +		 */
> +		if (sched_feat(SHARED_RUNQ) && (sd->flags & SD_SHARE_PKG_RESOURCES))
> +			continue;

This makes sense to me, good call.

> +
>  		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
>  			break;
>  
> @@ -12429,6 +12463,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  
>  	raw_spin_rq_lock(this_rq);
>  
> +out_swq:
>  	if (curr_cost > this_rq->max_idle_balance_cost)
>  		this_rq->max_idle_balance_cost = curr_cost;
>  
> --
> 
> o Breakdown
> 
> I'll proceed to annotate a copy of diff with reasoning behind the changes:

Ah, ok, you provided explanations :-) I'll leave my questions above
regardless for posterity.

> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d67d86d3bfdf..f1e64412fd48 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -198,7 +198,7 @@ struct shared_runq_shard {
>  } ____cacheline_aligned;
>  
>  /* This would likely work better as a configurable knob via debugfs */
> -#define SHARED_RUNQ_SHARD_SZ 6
> +#define SHARED_RUNQ_SHARD_SZ 16
>  #define SHARED_RUNQ_MAX_SHARDS \
>  	((NR_CPUS / SHARED_RUNQ_SHARD_SZ) + (NR_CPUS % SHARED_RUNQ_SHARD_SZ != 0))
> 
> --
> 
> 	Here I'm setting the SHARED_RUNQ_SHARD_SZ to sd_llc_size for
> 	my machine. I played around with this and did not see any
> 	contention for shared_rq lock while running tbench.

I don't really mind making the shard bigger because it will never be one
size fits all, but for what it's worth, I saw less contention in netperf
with a size of 6, but everything else performed fine with a larger
shard. This is one of those "there's no right answer" things, I'm
afraid. I think it will be inevitable to make this configurable at some
point, if we find that it's really causing inefficiencies.
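
Purely as an illustration of what such a knob could look like (the name and
registration point are made up; debugfs_create_u32() is the existing helper,
and debugfs_sched stands in for the scheduler's debugfs directory):

	static u32 shared_runq_shard_sz = SHARED_RUNQ_SHARD_SZ;

	/* e.g. somewhere in the sched debugfs init path */
	debugfs_create_u32("shared_runq_shard_sz", 0644, debugfs_sched,
			   &shared_runq_shard_sz);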

> --
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d67d86d3bfdf..f1e64412fd48 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -322,20 +322,36 @@ void shared_runq_toggle(bool enabling)
>  }
>  
>  static struct task_struct *
> -shared_runq_pop_task(struct shared_runq_shard *shard, int target)
> +shared_runq_pop_task(struct shared_runq_shard *shard, struct rq *rq)
>  {
> +	int target = cpu_of(rq);
>  	struct task_struct *p;
>  
>  	if (list_empty(&shard->list))
>  		return NULL;
>  
>  	raw_spin_lock(&shard->lock);
> +again:
>  	p = list_first_entry_or_null(&shard->list, struct task_struct,
>  				     shared_runq_node);
> -	if (p && is_cpu_allowed(p, target))
> +
> +	/* If we find a task, delete it from the list regardless */

As I mentioned in my other reply [0], I don't think we should always
have to delete here. Let me know if I'm missing something.

[0]: https://lore.kernel.org/all/20230831013435.GB506447@maniforge/

> +	if (p) {
>  		list_del_init(&p->shared_runq_node);
> -	else
> -		p = NULL;
> +
> +		if (!task_on_rq_queued(p) ||
> +		    task_on_cpu(task_rq(p), p) ||
> +		    !is_cpu_allowed(p, target)) {
> +			if (rq->ttwu_pending) {
> +				p = NULL;
> +				goto out;
> +			}
> +
> +			goto again;
> +		}
> +	}
> +
> +out:
>  	raw_spin_unlock(&shard->lock);
>  
>  	return p;
> --
> 
> 	Context: When running perf with IBS, I saw following lock
> 	contention:
> 
> -   12.17%  swapper          [kernel.vmlinux]          [k] native_queued_spin_lock_slowpath
>    - 10.48% native_queued_spin_lock_slowpath
>       - 10.30% _raw_spin_lock
>          - 9.11% __schedule
>               schedule_idle
>               do_idle
>             + cpu_startup_entry
>          - 0.86% task_rq_lock
>               newidle_balance
>               pick_next_task_fair
>               __schedule
>               schedule_idle
>               do_idle
>             + cpu_startup_entry
> 
> 	So I imagined that newidle_balance() is contending with another
> 	runqueue going idle when pulling a task. Hence, I moved some
> 	of the checks in shared_runq_pick_next_task() to here.

Hmm, so the idea was to avoid contending on the rq lock? As I mentioned
above, I'm not sure these checks actually buy us anything.

> 	I was not sure if the task's rq lock needs to be held to do this
> 	to get an accurate result, so I've left the original checks in
> 	shared_runq_pick_next_task() as they are.

Yep, we need to have the rq lock held for these functions to return
consistent data.

> 	Since retrying may be costly, I'm using "rq->ttwu_pending" as a
> 	bail-out threshold. Maybe there are better alternatives with
> 	the lb_cost and rq->avg_idle, but this was simpler for now.

Hmm, not sure I'm quite understanding this one. As I mentioned above, I
don't _think_ this should ever be set for a task enqueued in a shard.
Were you observing that?

> 	(Realizing as I write this that this will cause more contention
> 	with enqueue/dequeue in a busy system. I'll check if that is the
> 	case)
> 
> 	P.S. This did not affect the ~60% regression I was seeing one
> 	bit so the problem was deeper.
> 
> --
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d67d86d3bfdf..f1e64412fd48 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -380,9 +396,12 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
>  		curr_idx = (starting_idx + i) % num_shards;
>  		shard = &shared_runq->shards[curr_idx];
>  
> -		p = shared_runq_pop_task(shard, cpu_of(rq));
> +		p = shared_runq_pop_task(shard, rq);
>  		if (p)
>  			break;
> +
> +		if (rq->ttwu_pending)
> +			return 0;
>  	}
>  	if (!p)
>  		return 0;
> --
> 
> 	More bailout logic.
> 
> --
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d67d86d3bfdf..f1e64412fd48 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -395,17 +414,16 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
>  	if (task_on_rq_queued(p) && !task_on_cpu(src_rq, p)) {
>  		update_rq_clock(src_rq);
>  		src_rq = move_queued_task(src_rq, &src_rf, p, cpu_of(rq));
> -		ret = 1;
>  	}
>  
>  	if (src_rq != rq) {
>  		task_rq_unlock(src_rq, p, &src_rf);
>  		raw_spin_rq_lock(rq);
>  	} else {
> +		ret = 1;
>  		rq_unpin_lock(rq, &src_rf);
>  		raw_spin_unlock_irqrestore(&p->pi_lock, src_rf.flags);
>  	}
> -	rq_repin_lock(rq, rf);
>  
>  	return ret;
>  }
> --
> 
> 	Only return 1 if a task is actually pulled, else return -1,
> 	signifying the path has released and re-acquired the lock.

Not sure I'm following. If we migrate the task to the current rq, we
want to return 1 to signify that there are new fair tasks present in the
rq, don't we? It doesn't need to have started there originally for it to
be present after we move it.
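
For reference, the return convention the v3 code uses is:

	/*
	 * shared_runq_pick_next_task() return values:
	 *    1: a task was pulled onto this rq
	 *    0: nothing was pulled and the rq lock was never dropped
	 *   -1: nothing was pulled, but the rq lock was dropped and re-acquired
	 */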

> 
> 	Also leave the rq_repin_lock() part to the caller, i.e.,
> 	newidle_balance(), since it makes for a nicer flow (see
> 	below).
> 
> --
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d67d86d3bfdf..f1e64412fd48 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12344,50 +12362,59 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  	if (!cpu_active(this_cpu))
>  		return 0;
>  
> -	if (sched_feat(SHARED_RUNQ)) {
> -		pulled_task = shared_runq_pick_next_task(this_rq, rf);
> -		if (pulled_task)
> -			return pulled_task;
> -	}
> -
>  	/*
>  	 * We must set idle_stamp _before_ calling idle_balance(), such that we
>  	 * measure the duration of idle_balance() as idle time.
>  	 */
>  	this_rq->idle_stamp = rq_clock(this_rq);
>  
> -	/*
> -	 * This is OK, because current is on_cpu, which avoids it being picked
> -	 * for load-balance and preemption/IRQs are still disabled avoiding
> -	 * further scheduler activity on it and we're being very careful to
> -	 * re-start the picking loop.
> -	 */
> -	rq_unpin_lock(this_rq, rf);
> -
>  	rcu_read_lock();
> -	sd = rcu_dereference_check_sched_domain(this_rq->sd);
> -
> -	/*
> -	 * Skip <= LLC domains as they likely won't have any tasks if the
> -	 * shared runq is empty.
> -	 */
> -	if (sched_feat(SHARED_RUNQ)) {
> +	if (sched_feat(SHARED_RUNQ))
>  		sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> -		if (likely(sd))
> -			sd = sd->parent;
> -	}
> +	else
> +		sd = rcu_dereference_check_sched_domain(this_rq->sd);
>  
>  	if (!READ_ONCE(this_rq->rd->overload) ||
>  	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>  
> -		if (sd)
> +		while (sd) {
>  			update_next_balance(sd, &next_balance);
> +			sd = sd->child;
> +		}
> +
>  		rcu_read_unlock();
>  
>  		goto out;
>  	}
>  	rcu_read_unlock();
>  
> +	t0 = sched_clock_cpu(this_cpu);
> +	if (sched_feat(SHARED_RUNQ)) {
> +		pulled_task = shared_runq_pick_next_task(this_rq, rf);
> +		if (pulled_task) {
> +			curr_cost = sched_clock_cpu(this_cpu) - t0;
> +			update_newidle_cost(sd, curr_cost);
> +			goto out_swq;
> +		}
> +	}
> +
> +	/* Check again for pending wakeups */
> +	if (this_rq->ttwu_pending)
> +		return 0;
> +
> +	t1 = sched_clock_cpu(this_cpu);
> +	curr_cost += t1 - t0;
> +
> +	if (sd)
> +		update_newidle_cost(sd, curr_cost);
> +
> +	/*
> +	 * This is OK, because current is on_cpu, which avoids it being picked
> +	 * for load-balance and preemption/IRQs are still disabled avoiding
> +	 * further scheduler activity on it and we're being very careful to
> +	 * re-start the picking loop.
> +	 */
> +	rq_unpin_lock(this_rq, rf);
>  	raw_spin_rq_unlock(this_rq);
>  
>  	t0 = sched_clock_cpu(this_cpu);
> --
> 
> 	This hunk does a few things:
> 
> 	1. If a task is successfully pulled from shared rq, or if the rq
> 	   lock had been released and re-acquired, jump to the
> 	   very end where we check a bunch of conditions and return
> 	   accordingly.
> 
> 	2. Move the shared rq picking to after the "rd->overload" and
> 	   "rq->avg_idle" checks.
> 
> 	   P.S. This recovered half the performance that was lost.

Sorry, which performance are you referring to? I'm not thrilled about
this part because it's another heuristic for users to have to reason
about. _Maybe_ it makes sense to keep the rd->overload check? I don't
think we should keep the rq->avg_idle check though unless it's
absolutely necessary, and I'd have to think more about the rq->overload
check.

> 	3. Update the newidle_balance_cost via update_newidle_cost()
> 	   since that is also used to determine the previous bailout
> 	   threshold.

I think this makes sense as well, but need to think about it more.

> 	4. A bunch of update_next_balance().

I guess this should be OK, though I would expect us to have to load
balance less with SHARED_RUNQ.

> 	5. Move rq_unpin_lock() below. I do not know the implications of
> 	   this; the kernel is not complaining so far (mind you, I'm on
> 	   x86 and I do not have lockdep enabled).

If you move rq_unpin_lock(), you should probably run with lockdep to see
what happens :-) There are also implications for tracking whether it's
safe to look at the rq clock.

> 
> 	A combination of 3 and 4 seemed to give back the other half of
> 	tbench performance.
> 
> --
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d67d86d3bfdf..f1e64412fd48 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12400,6 +12427,13 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  
>  		update_next_balance(sd, &next_balance);
>  
> +		/*
> +		 * Skip <= LLC domains as they likely won't have any tasks if the
> +		 * shared runq is empty.
> +		 */
> +		if (sched_feat(SHARED_RUNQ) && (sd->flags & SD_SHARE_PKG_RESOURCES))
> +			continue;
> +

Agreed on this.

>  		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
>  			break;
>  
> --
> 
> 	This was based on my suggestion in the parallel thread.
> 
> 	P.S. This alone, without the changes in the previous hunk, showed no
> 	difference in performance, with results the same as vanilla v3.
> 
> --
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d67d86d3bfdf..f1e64412fd48 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12429,6 +12463,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  
>  	raw_spin_rq_lock(this_rq);
>  
> +out_swq:
>  	if (curr_cost > this_rq->max_idle_balance_cost)
>  		this_rq->max_idle_balance_cost = curr_cost;
>  
> --
> 
> 	The last part of newidle_balance() does a bunch of accounting
> 	which is relevant after the above changes. Also the
> 	rq_repin_lock() I had removed now happens here.
> 
> --
> 
> Again, most of this is lightly tested with just one workload, but I would
> like to hear your thoughts, especially on the significance of
> "rd->overload", "max_newidle_lb_cost", and "update_next_balance()".
> However, I'm afraid these may be the bits that led to the drop in
> utilization you mentioned in the first place.

Exactly :-( I think your proposal to fix how we skip load balancing on
the LLC if SHARED_RUNQ is enabled makes sense, but I'd really prefer to
avoid adding these heuristics to avoid contention for specific
workloads. The goal of SHARED_RUNQ is really to drive up CPU util. I
don't think we're really doing the user many favors if we try to guess
for them when they actually want that to happen.

If there's a time where we _really_ don't want or need to do it then
absolutely, let's skip it. But I would really like to see this go in
without checks on max_newidle_lb_cost, etc. The whole point of
SHARED_RUNQ is that it should be less costly than iterating over O(n)
cores to find tasks, so we _want_ to be more aggressive in doing so.

> Most of the experimentation (except for rq lock contention using IBS)
> was done by reading the newidle_balance() code.

And kudos for doing so!

> Finally a look at newidle_balance counts (tip vs tip + v3 + diff) for
> 128-clients of tbench on the test machine:
> 
> 
> < ----------------------------------------  Category:  newidle (SMT)  ---------------------------------------- >
> load_balance cnt on cpu newly idle                         :     921871,   0	(diff: -100.00%)
> --
> < ----------------------------------------  Category:  newidle (MC)   ---------------------------------------- >
> load_balance cnt on cpu newly idle                         :     472412,   0	(diff: -100.00%)
> --
> < ----------------------------------------  Category:  newidle (DIE)  ---------------------------------------- >
> load_balance cnt on cpu newly idle                         :        114, 279	(diff: +144.74%)
> --
> < ----------------------------------------  Category:  newidle (NUMA) ---------------------------------------- >
> load_balance cnt on cpu newly idle                         :          9,   9	(diff: +00.00%)
> --
> 
> Let me know if you have any queries. I'll go back and try to bisect the
> diff to see if only a couple of the changes that I thought were important
> are good enough to yield back the lost performance. I'll do wider
> testing after hearing your thoughts.

Hopefully my feedback above gives you enough context to bisect and
find the changes that we really think are most helpful. To reiterate: I
definitely think your change to avoid iterating over the LLC sd is
correct and makes sense. Others possibly make sense as well, such as checking
rd->overload (though I'm not 100% sure), but others, such as the
max_newidle_lb_cost checks, I would strongly prefer to avoid.

Prateek -- thank you for doing all of this work, it's very much
appreciated.

As I mentioned on the other email, I'll be on vacation for about a week
and a half starting tomorrow afternoon, so I may be slow to respond in
that time.

Thanks,
David

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 6/7] sched: Implement shared runqueue in CFS
  2023-08-31  1:34     ` David Vernet
@ 2023-08-31  3:47       ` K Prateek Nayak
  0 siblings, 0 replies; 52+ messages in thread
From: K Prateek Nayak @ 2023-08-31  3:47 UTC (permalink / raw)
  To: David Vernet
  Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	tj, roman.gushchin, gautham.shenoy, aaron.lu, wuyun.abel,
	kernel-team

Hello David,

On 8/31/2023 7:04 AM, David Vernet wrote:
> On Wed, Aug 30, 2023 at 12:16:17PM +0530, K Prateek Nayak wrote:
>> Hello David,
> 
> Hello Prateek,
> 
>>
>> On 8/10/2023 3:42 AM, David Vernet wrote:
>>> [..snip..]
>>> +	if (p && is_cpu_allowed(p, cpu_of(rq)))
>>> +		list_del_init(&p->shared_runq_node);
>>
>> I wonder if we should remove the task from the list if
>> "is_cpu_allowed()" returns false.
>>
>> Consider the following scenario: A task that does not sleep is pinned
>> to a single CPU. Since this is now at the head of the list, and cannot be
>> moved, we leave it there, but since the task also never sleeps, it'll
>> stay there, thus preventing the queue from doing its job.
> 
> Hmm, sorry, I may not be understanding your suggestion. If a task was
> pinned to a single CPU, it would be dequeued from the shared_runq before
> being pinned (see __set_cpus_allowed_ptr_locked()), and then would not
> be added back to the shard in shared_runq_enqueue_task() because of
> p->nr_cpus_allowed == 1. The task would also be dequeued from the shard
> before it started running (unless I'm misunderstanding what you mean by
> "a task that does not sleep"). Please let me know if I'm missing
> something.

Ah! My bad. Completely missed that. Thank you for clarifying.

> 
>> Further implication ...  
>>
>>> +	else
>>> +		p = NULL;
>>> +	raw_spin_unlock(&shared_runq->lock);
>>> +
>>> +	return p;
>>> +}
>>> +
>>> +static void shared_runq_push_task(struct rq *rq, struct task_struct *p)
>>> +{
>>> +	struct shared_runq *shared_runq;
>>> +
>>> +	shared_runq = rq_shared_runq(rq);
>>> +	raw_spin_lock(&shared_runq->lock);
>>> +	list_add_tail(&p->shared_runq_node, &shared_runq->list);
>>> +	raw_spin_unlock(&shared_runq->lock);
>>> +}
>>>  
>>>  static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p)
>>> -{}
>>> +{
>>> +	/*
>>> +	 * Only enqueue the task in the shared runqueue if:
>>> +	 *
>>> +	 * - SHARED_RUNQ is enabled
>>> +	 * - The task isn't pinned to a specific CPU
>>> +	 */
>>> +	if (p->nr_cpus_allowed == 1)
>>> +		return;
>>> +
>>> +	shared_runq_push_task(rq, p);
>>> +}
>>>  
>>>  static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
>>>  {
>>> -	return 0;
>>> +	struct task_struct *p = NULL;
>>> +	struct rq *src_rq;
>>> +	struct rq_flags src_rf;
>>> +	int ret = -1;
>>> +
>>> +	p = shared_runq_pop_task(rq);
>>> +	if (!p)
>>> +		return 0;
>>
>> ...
>>
>> Since we return 0 here in such a scenario, we'll take the old
>> newidle_balance() path but ...
>>
>>> +
>>> +	rq_unpin_lock(rq, rf);
>>> +	raw_spin_rq_unlock(rq);
>>> +
>>> +	src_rq = task_rq_lock(p, &src_rf);
>>> +
>>> +	if (task_on_rq_queued(p) && !task_on_cpu(src_rq, p)) {
>>> +		update_rq_clock(src_rq);
>>> +		src_rq = move_queued_task(src_rq, &src_rf, p, cpu_of(rq));
>>> +		ret = 1;
>>> +	}
>>> +
>>> +	if (src_rq != rq) {
>>> +		task_rq_unlock(src_rq, p, &src_rf);
>>> +		raw_spin_rq_lock(rq);
>>> +	} else {
>>> +		rq_unpin_lock(rq, &src_rf);
>>> +		raw_spin_unlock_irqrestore(&p->pi_lock, src_rf.flags);
>>> +	}
>>> +	rq_repin_lock(rq, rf);
>>> +
>>> +	return ret;
>>>  }
>>>  
>>>  static void shared_runq_dequeue_task(struct task_struct *p)
>>> -{}
>>> +{
>>> +	struct shared_runq *shared_runq;
>>> +
>>> +	if (!list_empty(&p->shared_runq_node)) {
>>> +		shared_runq = rq_shared_runq(task_rq(p));
>>> +		raw_spin_lock(&shared_runq->lock);
>>> +		/*
>>> +		 * Need to double-check for the list being empty to avoid
>>> +		 * racing with the list being drained on the domain recreation
>>> +		 * or SHARED_RUNQ feature enable / disable path.
>>> +		 */
>>> +		if (likely(!list_empty(&p->shared_runq_node)))
>>> +			list_del_init(&p->shared_runq_node);
>>> +		raw_spin_unlock(&shared_runq->lock);
>>> +	}
>>> +}
>>>  
>>>  /*
>>>   * For asym packing, by default the lower numbered CPU has higher priority.
>>> @@ -12093,6 +12308,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>>>  	rcu_read_lock();
>>>  	sd = rcu_dereference_check_sched_domain(this_rq->sd);
>>>  
>>> +	/*
>>> +	 * Skip <= LLC domains as they likely won't have any tasks if the
>>> +	 * shared runq is empty.
>>> +	 */
>>
>> ... now we skip all the way ahead of MC domain, overlooking any
>> imbalance that might still exist within the SMT and MC groups
>> since shared runq is not exactly empty.
>>
>> Let me know if I've got something wrong!
> 
> Yep, I mentioned this to Gautham as well in [0].
> 
> [0]: https://lore.kernel.org/all/20230818050355.GA5718@maniforge/
> 
> I agree that I think we should remove this heuristic from v4. Either
> that, or add logic to iterate over the shared_runq until a viable task
> is found.
> 
>>
>>> +	if (sched_feat(SHARED_RUNQ)) {
>>> +		sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>>> +		if (likely(sd))
>>> +			sd = sd->parent;
>>> +	}
>>
>> Speaking of skipping ahead of MC domain, I don't think this actually
>> works since the domain traversal uses the "for_each_domain" macro
>> which is defined as:
> 
> *blinks*
> 
> Uhhh, yeah, wow. Good catch!
> 
>> #define for_each_domain(cpu, __sd) \
>> 	for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); \
>> 			__sd; __sd = __sd->parent)
>>
>> The traversal starts from rq->sd overwriting your initialized value
>> here. This is why we see "load_balance count on cpu newly idle" in
>> Gautham's first report
>> (https://lore.kernel.org/lkml/ZN3dW5Gvcb0LFWjs@BLR-5CG11610CF.amd.com/)
>> to be non-zero.
>>
>> One way to do this would be as follows:
>>
>> static int newidle_balance() {
>>
>> 	...
>> 	for_each_domain(this_cpu, sd) {
>>
>> 		...
>> 		/* Skip balancing until LLc domain */
>> 		if (sched_feat(SHARED_RUNQ) &&
>> 		    (sd->flags & SD_SHARE_PKG_RESOURCES))
>> 			continue;
>>
>> 		...
>> 	}
>> 	...
>> }
> 
> Yep, I think this makes sense to do.
> 
>> With this I see the newidle balance count for SMT and MC domain
>> to be zero:
> 
> And indeed, I think this was the intention. Thanks again for catching
> this. I'm excited to try this out when running benchmarks for v4.
> 
>> < ----------------------------------------  Category:  newidle (SMT)  ---------------------------------------- >
>> load_balance cnt on cpu newly idle                         :          0    $      0.000 $    [    0.00000 ]
>> --
>> < ----------------------------------------  Category:  newidle (MC)   ---------------------------------------- >
>> load_balance cnt on cpu newly idle                         :          0    $      0.000 $    [    0.00000 ]
>> --
>> < ----------------------------------------  Category:  newidle (DIE)  ---------------------------------------- >
>> load_balance cnt on cpu newly idle                         :       2170    $      9.319 $    [   17.42832 ]
>> --
>> < ----------------------------------------  Category:  newidle (NUMA) ---------------------------------------- >
>> load_balance cnt on cpu newly idle                         :         30    $    674.067 $    [    0.24094 ]
>> --
>>
>> Let me know if I'm missing something here :)
> 
> No, I think you're correct, we should be doing this. Assuming we want to
> keep this heuristic, I think the block above is also correct so that we
> properly account sd->max_newidle_lb_cost and rq->next_balance. Does that
> make sense to you too?

Yup, that makes sense!

> 
>>
>> Note: The lb counts for DIE and NUMA are down since I'm experimenting
>> with the implementation currently. I'll update the thread with any new
>> findings.
> 
> Ack, thank you for doing that.
> 
> Just FYI, I'll be on vacation for about 1.5 weeks starting tomorrow
> afternoon. If I'm slow to respond, that's why.

Enjoy your vacation :)

I'll keep experimenting meanwhile and update the report in the
parallel thread.

> 
> Thanks,
> David

--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 0/7] sched: Implement shared runqueue in CFS
  2023-08-31  2:32             ` David Vernet
@ 2023-08-31  4:21               ` K Prateek Nayak
  0 siblings, 0 replies; 52+ messages in thread
From: K Prateek Nayak @ 2023-08-31  4:21 UTC (permalink / raw)
  To: David Vernet
  Cc: Gautham R. Shenoy, linux-kernel, peterz, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, tj, roman.gushchin, aaron.lu, wuyun.abel,
	kernel-team

Hello David,

Thank you for taking a look at the report.

On 8/31/2023 8:02 AM, David Vernet wrote:
> On Wed, Aug 30, 2023 at 03:26:40PM +0530, K Prateek Nayak wrote:
>> Hello David,
> 
> Hello Prateek,
> 
>>
>> Short update based on some of my experimentation.
>>
>> Disclaimer: I've only been looking at the tbench 128-client case on a
>> dual socket 3rd Generation EPYC system (2x 64C/128T). Wider results may
>> vary, but I have some information that may help with the debugging and
>> with how to proceed further.
>>
>> On 8/25/2023 4:21 AM, David Vernet wrote:
>>> On Thu, Aug 24, 2023 at 04:44:19PM +0530, Gautham R. Shenoy wrote:
>>>> Hello David,
>>>>
>>>> On Fri, Aug 18, 2023 at 02:19:03PM +0530, Gautham R. Shenoy wrote:
>>>>> Hello David,
>>>>>
>>>>> On Fri, Aug 18, 2023 at 12:03:55AM -0500, David Vernet wrote:
>>>>>> On Thu, Aug 17, 2023 at 02:12:03PM +0530, Gautham R. Shenoy wrote:
>>>>>>> Hello David,
>>>>>>
>>>>>> Hello Gautham,
>>>>>>
>>>>>> Thanks a lot as always for running some benchmarks and analyzing these
>>>>>> changes.
>>>>>>
>>>>>>> On Wed, Aug 09, 2023 at 05:12:11PM -0500, David Vernet wrote:
>>>>>>>> Changes
>>>>>>>> -------
>>>>>>>>
>>>>>>>> This is v3 of the shared runqueue patchset. This patch set is based off
>>>>>>>> of commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs
>>>>>>>> bandwidth in use") on the sched/core branch of tip.git.
>>>>>>>
>>>>>>>
>>>>>>> I tested the patches on Zen3 and Zen4 EPYC Servers like last time. I
>>>>>>> notice that apart from hackbench, every other bechmark is showing
>>>>>>> regressions with this patch series. Quick summary of my observations:
>>>>>>
>>>>>> Just to verify per our prior conversation [0], was this latest set of
>>>>>> benchmarks run with boost disabled?
>>>>>
>>>>> Boost is enabled by default. I will queue a run tonight with boost
>>>>> disabled.
>>>>
>>>> Apologies for the delay. I didn't see any changes with boost-disabled
>>>> and with reverting the optimization to bail out of the
>>>> newidle_balance() for SMT and MC domains when there was no task to be
>>>> pulled from the shared-runq. I reran the whole thing once again, just
>>>> to rule out any possible variance. The results came out the same.
>>>
>>> Thanks a lot for taking the time to run more benchmarks.
>>>
>>>> With the boost disabled, and the optimization reverted, the results
>>>> don't change much.
>>>
>>> Hmmm, I see. So, that was the only real substantive "change" between v2
>>> -> v3. The other changes were supporting hotplug / domain recreation,
>>> optimizing locking a bit, and fixing small bugs like the return value
>>> from shared_runq_pick_next_task(), draining the queue when the feature
>>> is disabled, and fixing the lkp errors.
>>>
>>> With all that said, it seems very possible that the regression is due to
>>> changes in sched/core between commit ebb83d84e49b ("sched/core: Avoid
>>> multiple calling update_rq_clock() in __cfsb_csd_unthrottle()") in v2,
>>> and commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs
>>> bandwidth in use") in v3. EEVDF was merged in that window, so that could
>>> be one explanation for the context switch rate being so much higher.
>>>
>>>> It doesn't appear that the optimization is the cause for increase in
>>>> the number of load-balancing attempts at the DIE and the NUMA
>>>> domains. I have shared the counts of the newidle_balance with and
>>>> without SHARED_RUNQ below for tbench and it can be noticed that the
>>>> counts are significantly higher for the 64 clients and 128 clients. I
>>>> also captured the counts/s of find_busiest_group() using funccount.py
>>>> which tells the same story. So the drop in the performance for tbench
>>>> with your patches strongly correlates with the increase in
>>>> load-balancing attempts.
>>>>
>>>> newidle balance is undertaken only if the overload flag is set and the
>>>> expected idle duration is greater than the avg load balancing cost. It
>>>> is hard to imagine why the shared runq should cause the overload flag
>>>> to be set!
>>>
>>> Yeah, I'm not sure either about how or why shared_runq would cause this.
>>> This is purely hypothetical, but is it possible that shared_runq causes
>>> idle cores to on average _stay_ idle longer due to other cores pulling
>>> tasks that would have otherwise been load balanced to those cores?
>>>
>>> Meaning -- say CPU0 is idle, and there are tasks on other rqs which
>>> could be load balanced. Without shared_runq, CPU0 might be woken up to
>>> run a task from a periodic load balance. With shared_runq, any active
>>> core that would otherwise have gone idle could pull the task, keeping
>>> CPU0 idle.
>>>
>>> What do you think? I could be totally off here.
>>>
>>> From my perspective, I'm not too worried about this given that we're
>>> seeing gains in other areas such as kernel compile as I showed in [0],
>>> though I definitely would like to better understand it.
>>
>> Let me paste a cumulative diff containing everything I've tried since
>> it'll be easy to explain.
>>
>> o Performance numbers for tbench 128 clients:
>>
>> tip			: 1.00 (Var: 0.57%)
>> tip + vanilla v3	: 0.39 (var: 1.15%) (%diff: -60.74%)
>> tip + v3 + diff		: 0.99 (var: 0.61%) (%diff: -00.24%)
>>
>> tip is at commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when
>> cfs bandwidth in use"), same as what Gautham used, so no EEVDF yet.
>>
>> o Cumulative Diff
>>
>> Should apply cleanly on top of tip at above commit + this series as is.
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d67d86d3bfdf..f1e64412fd48 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -198,7 +198,7 @@ struct shared_runq_shard {
>>  } ____cacheline_aligned;
>>  
>>  /* This would likely work better as a configurable knob via debugfs */
>> -#define SHARED_RUNQ_SHARD_SZ 6
>> +#define SHARED_RUNQ_SHARD_SZ 16
>>  #define SHARED_RUNQ_MAX_SHARDS \
>>  	((NR_CPUS / SHARED_RUNQ_SHARD_SZ) + (NR_CPUS % SHARED_RUNQ_SHARD_SZ != 0))
>>  
>> @@ -322,20 +322,36 @@ void shared_runq_toggle(bool enabling)
>>  }
>>  
>>  static struct task_struct *
>> -shared_runq_pop_task(struct shared_runq_shard *shard, int target)
>> +shared_runq_pop_task(struct shared_runq_shard *shard, struct rq *rq)
>>  {
>> +	int target = cpu_of(rq);
>>  	struct task_struct *p;
>>  
>>  	if (list_empty(&shard->list))
>>  		return NULL;
>>  
>>  	raw_spin_lock(&shard->lock);
>> +again:
>>  	p = list_first_entry_or_null(&shard->list, struct task_struct,
>>  				     shared_runq_node);
>> -	if (p && is_cpu_allowed(p, target))
>> +
>> +	/* If we find a task, delete if from list regardless */
>> +	if (p) {
>>  		list_del_init(&p->shared_runq_node);
>> -	else
>> -		p = NULL;
>> +
>> +		if (!task_on_rq_queued(p) ||
>> +		    task_on_cpu(task_rq(p), p) ||
> 
> Have you observed !task_on_rq_queued() or task_on_cpu() returning true
> here? The task should have removed itself from the shard when
> __dequeue_entity() is called from set_next_entity() when it's scheduled
> in pick_next_task_fair(). The reason we have to check in
> shared_runq_pick_next_task() is that between dequeuing the task from the
> shared_runq and getting its rq lock, it could have been scheduled on its
> current rq. But if the task was scheduled first, it should have removed
> itself from the shard.

Ah! Thank you for clarifying. This is just a copy-paste of the bailout
in shared_runq_pick_next_task(). I see "!task_on_rq_queued()" cannot be
true here since this is with the shared rq lock held. Thank you for
pointing this out.

> 
>> +		    !is_cpu_allowed(p, target)) {
>> +			if (rq->ttwu_pending) {
>> +				p = NULL;
>> +				goto out;
>> +			}
> 
> Have you observed this as well? If the task is enqueued on the ttwu
> queue wakelist, it isn't enqueued on the waking CPU, so it shouldn't be
> added to the shared_runq right?

This is a bailout on the retry logic. Since the shared_rq could contain
many tasks, I did not want to keep retrying until the queue went empty
while there was a possible pending wakeup. ... [1]

> 
>> +
>> +			goto again;
>> +		}
>> +	}
>> +
>> +out:
>>  	raw_spin_unlock(&shard->lock);
>>  
>>  	return p;
>> @@ -380,9 +396,12 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
>>  		curr_idx = (starting_idx + i) % num_shards;
>>  		shard = &shared_runq->shards[curr_idx];
>>  
>> -		p = shared_runq_pop_task(shard, cpu_of(rq));
>> +		p = shared_runq_pop_task(shard, rq);
>>  		if (p)
>>  			break;
>> +
>> +		if (rq->ttwu_pending)
>> +			return 0;
> 
> Same here r.e. rq->ttwu_pending. This should be handled in the
> 
> if (task_on_rq_queued(p) && !task_on_cpu(src_rq, p))
> 
> check below, no? Note that task_on_rq_queued(p) should only return true
> if the task has made it to ttwu_do_activate(), and if it hasn't, I don't
> think it should be in the shard in the first place.

Noted! Thank you for clarifying again. 

> 
>>  	}
>>  	if (!p)
>>  		return 0;
>> @@ -395,17 +414,16 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
>>  	if (task_on_rq_queued(p) && !task_on_cpu(src_rq, p)) {
>>  		update_rq_clock(src_rq);
>>  		src_rq = move_queued_task(src_rq, &src_rf, p, cpu_of(rq));
>> -		ret = 1;
>>  	}
>>  
>>  	if (src_rq != rq) {
>>  		task_rq_unlock(src_rq, p, &src_rf);
>>  		raw_spin_rq_lock(rq);
>>  	} else {
>> +		ret = 1;
>>  		rq_unpin_lock(rq, &src_rf);
>>  		raw_spin_unlock_irqrestore(&p->pi_lock, src_rf.flags);
>>  	}
>> -	rq_repin_lock(rq, rf);
> 
> Huh, wouldn't this cause a WARN to be issued the next time we invoke
> rq_clock() in newidle_balance() if we weren't able to find a task? Or
> was it because we moved the SHARED_RUNQ logic to below where we check
> rq_clock()? In general though, I don't think this should be removed. At
> the very least, it should be tested with lockdep.

So beyond this point, ret != 0, which will now jump to the "out_swq" label
in newidle_balance(), which does a "rq_repin_lock(this_rq, rf)" just
before returning.

I'll check if my surgery is right with lockdep enabled.

> 
>>  	return ret;
>>  }
>> @@ -12344,50 +12362,59 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>>  	if (!cpu_active(this_cpu))
>>  		return 0;
>>  
>> -	if (sched_feat(SHARED_RUNQ)) {
>> -		pulled_task = shared_runq_pick_next_task(this_rq, rf);
>> -		if (pulled_task)
>> -			return pulled_task;
>> -	}
>> -
>>  	/*
>>  	 * We must set idle_stamp _before_ calling idle_balance(), such that we
>>  	 * measure the duration of idle_balance() as idle time.
>>  	 */
>>  	this_rq->idle_stamp = rq_clock(this_rq);
>>  
>> -	/*
>> -	 * This is OK, because current is on_cpu, which avoids it being picked
>> -	 * for load-balance and preemption/IRQs are still disabled avoiding
>> -	 * further scheduler activity on it and we're being very careful to
>> -	 * re-start the picking loop.
>> -	 */
>> -	rq_unpin_lock(this_rq, rf);
>> -
>>  	rcu_read_lock();
>> -	sd = rcu_dereference_check_sched_domain(this_rq->sd);
>> -
>> -	/*
>> -	 * Skip <= LLC domains as they likely won't have any tasks if the
>> -	 * shared runq is empty.
>> -	 */
>> -	if (sched_feat(SHARED_RUNQ)) {
>> +	if (sched_feat(SHARED_RUNQ))
>>  		sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>> -		if (likely(sd))
>> -			sd = sd->parent;
>> -	}
>> +	else
>> +		sd = rcu_dereference_check_sched_domain(this_rq->sd);
>>  
>>  	if (!READ_ONCE(this_rq->rd->overload) ||
>>  	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>>  
>> -		if (sd)
>> +		while (sd) {
>>  			update_next_balance(sd, &next_balance);
>> +			sd = sd->child;
>> +		}
>> +
>>  		rcu_read_unlock();
>>  
>>  		goto out;
>>  	}
>>  	rcu_read_unlock();
>>  
>> +	t0 = sched_clock_cpu(this_cpu);
>> +	if (sched_feat(SHARED_RUNQ)) {
>> +		pulled_task = shared_runq_pick_next_task(this_rq, rf);
>> +		if (pulled_task) {
>> +			curr_cost = sched_clock_cpu(this_cpu) - t0;
>> +			update_newidle_cost(sd, curr_cost);
>> +			goto out_swq;
>> +		}
>> +	}
> 
> Hmmm, why did you move this further down in newidle_balance()? We don't
> want to skip trying to get a task from the shared_runq if rq->avg_idle <
> sd->max_newidle_lb_cost.

I'll check if the rd->overload check alone is sufficient.

> 
>> +
>> +	/* Check again for pending wakeups */
>> +	if (this_rq->ttwu_pending)
>> +		return 0;
>> +
>> +	t1 = sched_clock_cpu(this_cpu);
>> +	curr_cost += t1 - t0;
>> +
>> +	if (sd)
>> +		update_newidle_cost(sd, curr_cost);
>> +
>> +	/*
>> +	 * This is OK, because current is on_cpu, which avoids it being picked
>> +	 * for load-balance and preemption/IRQs are still disabled avoiding
>> +	 * further scheduler activity on it and we're being very careful to
>> +	 * re-start the picking loop.
>> +	 */
>> +	rq_unpin_lock(this_rq, rf);
>>  	raw_spin_rq_unlock(this_rq);
>>  
>>  	t0 = sched_clock_cpu(this_cpu);
>> @@ -12400,6 +12427,13 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>>  
>>  		update_next_balance(sd, &next_balance);
>>  
>> +		/*
>> +		 * Skip <= LLC domains as they likely won't have any tasks if the
>> +		 * shared runq is empty.
>> +		 */
>> +		if (sched_feat(SHARED_RUNQ) && (sd->flags & SD_SHARE_PKG_RESOURCES))
>> +			continue;
> 
> This makes sense to me, good call.
> 
>> +
>>  		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
>>  			break;
>>  
>> @@ -12429,6 +12463,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>>  
>>  	raw_spin_rq_lock(this_rq);
>>  
>> +out_swq:
>>  	if (curr_cost > this_rq->max_idle_balance_cost)
>>  		this_rq->max_idle_balance_cost = curr_cost;
>>  
>> --
>>
>> o Breakdown
>>
>> I'll proceed to annotate a copy of diff with reasoning behind the changes:
> 
> Ah, ok, you provided explanations :-) I'll leave my questions above
> regardless for posterity.

And I'll refer to answers above ;)

> 
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d67d86d3bfdf..f1e64412fd48 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -198,7 +198,7 @@ struct shared_runq_shard {
>>  } ____cacheline_aligned;
>>  
>>  /* This would likely work better as a configurable knob via debugfs */
>> -#define SHARED_RUNQ_SHARD_SZ 6
>> +#define SHARED_RUNQ_SHARD_SZ 16
>>  #define SHARED_RUNQ_MAX_SHARDS \
>>  	((NR_CPUS / SHARED_RUNQ_SHARD_SZ) + (NR_CPUS % SHARED_RUNQ_SHARD_SZ != 0))
>>
>> --
>>
>> 	Here I'm setting the SHARED_RUNQ_SHARD_SZ to sd_llc_size for
>> 	my machine. I played around with this and did not see any
>> 	contention for shared_rq lock while running tbench.
> 
> I don't really mind making the shard bigger because it will never be one
> size fits all, but for what it's worth, I saw less contention in netperf
> with a size of 6, but everything else performed fine with a larger
> shard. This is one of those "there's no right answer" things, I'm
> afraid. I think it will be inevitable to make this configurable at some
> point, if we find that it's really causing inefficiencies.

Agreed! For tbench at least, this did not lead to any problems.
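
If/when it does become configurable, something along these lines is
roughly what I was imagining -- purely an illustrative sketch on my end
(the debugfs directory, file name and variable are all my own
assumptions, not something from this series), assuming it lives next to
the shard definitions in kernel/sched/fair.c:

  #include <linux/debugfs.h>
  #include <linux/init.h>

  /* Illustrative only: runtime-tunable shard size instead of the #define */
  static unsigned int sysctl_shared_runq_shard_sz = SHARED_RUNQ_SHARD_SZ;

  static int __init shared_runq_debugfs_init(void)
  {
  	/* Made-up location: /sys/kernel/debug/shared_runq/shard_sz */
  	struct dentry *dir = debugfs_create_dir("shared_runq", NULL);

  	debugfs_create_u32("shard_sz", 0644, dir,
  			   &sysctl_shared_runq_shard_sz);
  	return 0;
  }
  late_initcall(shared_runq_debugfs_init);

The value would then only be consulted when the shared_runq is (re)built
on domain recreation, so changing it at runtime would stay cheap.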

> 
>> --
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d67d86d3bfdf..f1e64412fd48 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -322,20 +322,36 @@ void shared_runq_toggle(bool enabling)
>>  }
>>  
>>  static struct task_struct *
>> -shared_runq_pop_task(struct shared_runq_shard *shard, int target)
>> +shared_runq_pop_task(struct shared_runq_shard *shard, struct rq *rq)
>>  {
>> +	int target = cpu_of(rq);
>>  	struct task_struct *p;
>>  
>>  	if (list_empty(&shard->list))
>>  		return NULL;
>>  
>>  	raw_spin_lock(&shard->lock);
>> +again:
>>  	p = list_first_entry_or_null(&shard->list, struct task_struct,
>>  				     shared_runq_node);
>> -	if (p && is_cpu_allowed(p, target))
>> +
>> +	/* If we find a task, delete it from the list regardless */
> 
> As I mentioned in my other reply [0], I don't think we should always
> have to delete here. Let me know if I'm missing something.

I overlooked that condition at enqueue. I'll go back to original
implementation here.

> 
> [0]: https://lore.kernel.org/all/20230831013435.GB506447@maniforge/
> 
>> +	if (p) {
>>  		list_del_init(&p->shared_runq_node);
>> -	else
>> -		p = NULL;
>> +
>> +		if (!task_on_rq_queued(p) ||
>> +		    task_on_cpu(task_rq(p), p) ||
>> +		    !is_cpu_allowed(p, target)) {
>> +			if (rq->ttwu_pending) {
>> +				p = NULL;
>> +				goto out;
>> +			}
>> +
>> +			goto again;
>> +		}
>> +	}
>> +
>> +out:
>>  	raw_spin_unlock(&shard->lock);
>>  
>>  	return p;
>> --
>>
>> 	Context: When running perf with IBS, I saw following lock
>> 	contention:
>>
>> -   12.17%  swapper          [kernel.vmlinux]          [k] native_queued_spin_lock_slowpath
>>    - 10.48% native_queued_spin_lock_slowpath
>>       - 10.30% _raw_spin_lock
>>          - 9.11% __schedule
>>               schedule_idle
>>               do_idle
>>             + cpu_startup_entry
>>          - 0.86% task_rq_lock
>>               newidle_balance
>>               pick_next_task_fair
>>               __schedule
>>               schedule_idle
>>               do_idle
>>             + cpu_startup_entry
>>
>> 	So I imagined newidle_balance() is contending with another
>> 	runqueue going idle when pulling a task. Hence, I moved some
>> 	checks from shared_runq_pick_next_task() to here.
> 
> Hmm, so the idea was to avoid contending on the rq lock? As I mentioned
> above, I'm not sure these checks actually buy us anything.

Yup! I think skipping newidle_balance() when rd->overload is not set
already reduces this contention on its own.

> 
>> 	I was not sure if the task's rq lock needs to be held to do this
>> 	to get an accurate result so I've left the original checks in
>> 	shared_runq_pick_next_task() as it is.
> 
> Yep, we need to have the rq lock held for these functions to return
> consistent data.

Noted! Will fall back to your implementation. Also I did not see
any perf improvement for tbench with this alone.

> 
>> 	Since retry may be costly, I'm using "rq->ttwu_pending" as a
>> 	bail out threshold. Maybe there are better alternates with
>> 	the lb_cost and rq->avg_idle but this was simpler for now.
> 
> Hmm, not sure I'm quite understanding this one. As I mentioned above, I
> don't _think_ this should ever be set for a task enqueued in a shard.
> Were you observing that?

Explained in [1] above. Let me get rid of that whole retry logic
because I see it leading to shared_rq lock contention at enqueue and
dequeue already.

> 
>> 	(Realizing as I write this that this will cause more contention
>> 	with enqueue/dequeue in a busy system. I'll check if that is the
>> 	case)
>>
>> 	P.S. This did not affect the ~60% regression I was seeing one
>> 	bit so the problem was deeper.
>>
>> --
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d67d86d3bfdf..f1e64412fd48 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -380,9 +396,12 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
>>  		curr_idx = (starting_idx + i) % num_shards;
>>  		shard = &shared_runq->shards[curr_idx];
>>  
>> -		p = shared_runq_pop_task(shard, cpu_of(rq));
>> +		p = shared_runq_pop_task(shard, rq);
>>  		if (p)
>>  			break;
>> +
>> +		if (rq->ttwu_pending)
>> +			return 0;
>>  	}
>>  	if (!p)
>>  		return 0;
>> --
>>
>> 	More bailout logic.
>>
>> --
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d67d86d3bfdf..f1e64412fd48 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -395,17 +414,16 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
>>  	if (task_on_rq_queued(p) && !task_on_cpu(src_rq, p)) {
>>  		update_rq_clock(src_rq);
>>  		src_rq = move_queued_task(src_rq, &src_rf, p, cpu_of(rq));
>> -		ret = 1;
>>  	}
>>  
>>  	if (src_rq != rq) {
>>  		task_rq_unlock(src_rq, p, &src_rf);
>>  		raw_spin_rq_lock(rq);
>>  	} else {
>> +		ret = 1;
>>  		rq_unpin_lock(rq, &src_rf);
>>  		raw_spin_unlock_irqrestore(&p->pi_lock, src_rf.flags);
>>  	}
>> -	rq_repin_lock(rq, rf);
>>  
>>  	return ret;
>>  }
>> --
>>
>> 	Only return 1 if the task is actually pulled, else return -1
>> 	signifying the path has released and re-acquired the lock.
> 
> Not sure I'm following. If we migrate the task to the current rq, we
> want to return 1 to signify that there are new fair tasks present in the
> rq don't we? It doesn't need to have started there originally for it to
> be present after we move it.

Above move_queued_task(), I see the following comment

	Returns (locked) new rq. Old rq's lock is released.

so wouldn't "src_rq != rq" signify a failed migration? I'm probably
missing something here.

> 
>>
>> 	Also leave the rq_repin_lock() part to caller, i.e.,
>> 	newidle_balance() since it makes up for a nice flow (see
>> 	below).
>>
>> --
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d67d86d3bfdf..f1e64412fd48 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -12344,50 +12362,59 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>>  	if (!cpu_active(this_cpu))
>>  		return 0;
>>  
>> -	if (sched_feat(SHARED_RUNQ)) {
>> -		pulled_task = shared_runq_pick_next_task(this_rq, rf);
>> -		if (pulled_task)
>> -			return pulled_task;
>> -	}
>> -
>>  	/*
>>  	 * We must set idle_stamp _before_ calling idle_balance(), such that we
>>  	 * measure the duration of idle_balance() as idle time.
>>  	 */
>>  	this_rq->idle_stamp = rq_clock(this_rq);
>>  
>> -	/*
>> -	 * This is OK, because current is on_cpu, which avoids it being picked
>> -	 * for load-balance and preemption/IRQs are still disabled avoiding
>> -	 * further scheduler activity on it and we're being very careful to
>> -	 * re-start the picking loop.
>> -	 */
>> -	rq_unpin_lock(this_rq, rf);
>> -
>>  	rcu_read_lock();
>> -	sd = rcu_dereference_check_sched_domain(this_rq->sd);
>> -
>> -	/*
>> -	 * Skip <= LLC domains as they likely won't have any tasks if the
>> -	 * shared runq is empty.
>> -	 */
>> -	if (sched_feat(SHARED_RUNQ)) {
>> +	if (sched_feat(SHARED_RUNQ))
>>  		sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>> -		if (likely(sd))
>> -			sd = sd->parent;
>> -	}
>> +	else
>> +		sd = rcu_dereference_check_sched_domain(this_rq->sd);
>>  
>>  	if (!READ_ONCE(this_rq->rd->overload) ||
>>  	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>>  
>> -		if (sd)
>> +		while (sd) {
>>  			update_next_balance(sd, &next_balance);
>> +			sd = sd->child;
>> +		}
>> +
>>  		rcu_read_unlock();
>>  
>>  		goto out;
>>  	}
>>  	rcu_read_unlock();
>>  
>> +	t0 = sched_clock_cpu(this_cpu);
>> +	if (sched_feat(SHARED_RUNQ)) {
>> +		pulled_task = shared_runq_pick_next_task(this_rq, rf);
>> +		if (pulled_task) {
>> +			curr_cost = sched_clock_cpu(this_cpu) - t0;
>> +			update_newidle_cost(sd, curr_cost);
>> +			goto out_swq;
>> +		}
>> +	}
>> +
>> +	/* Check again for pending wakeups */
>> +	if (this_rq->ttwu_pending)
>> +		return 0;
>> +
>> +	t1 = sched_clock_cpu(this_cpu);
>> +	curr_cost += t1 - t0;
>> +
>> +	if (sd)
>> +		update_newidle_cost(sd, curr_cost);
>> +
>> +	/*
>> +	 * This is OK, because current is on_cpu, which avoids it being picked
>> +	 * for load-balance and preemption/IRQs are still disabled avoiding
>> +	 * further scheduler activity on it and we're being very careful to
>> +	 * re-start the picking loop.
>> +	 */
>> +	rq_unpin_lock(this_rq, rf);
>>  	raw_spin_rq_unlock(this_rq);
>>  
>>  	t0 = sched_clock_cpu(this_cpu);
>> --
>>
>> 	This hunk does a few things:
>>
>> 	1. If a task is successfully pulled from shared rq, or if the rq
>> 	   lock had been released and re-acquired, jump to the
>> 	   very end where we check a bunch of conditions and return
>> 	   accordingly.
>>
>> 	2. Move the shared rq picking after the "rd->overload" and
>> 	   checks against "rq->avg_idle".
>>
>> 	   P.S. This recovered half the performance that was lost.
> 
> Sorry, which performance are you referring to? I'm not thrilled about
> this part because it's another heuristic for users to have to reason
> about. _Maybe_ it makes sense to keep the rd->overload check? I don't
> think we should keep the rq->avg_idle check though unless it's
> absolutely necessary, and I'd have to think more about the rq->overload
> check.

Performance everywhere means the tbench 128-clients run on the test
machine, since I was very focused on that alone. I understand why you
are "not thrilled" here. Let me go check which one of those conditions
is more relevant.

(P.S. Anna-Maria did some experiments around avg_idle and tbench in
https://lore.kernel.org/lkml/80956e8f-761e-b74-1c7a-3966f9e8d934@linutronix.de/) 

> 
>> 	3. Update the newidle_balance_cost via update_newidle_cost()
>> 	   since that is also used to determine the previous bailout
>> 	   threshold.
> 
> I think this makes sense as well, but need to think about it more.
> 
>> 	4. A bunch of update_next_balance().
> 
> I guess this should be OK, though I would expect us to have to load
> balance less with SHARED_RUNQ.
> 
>> 	5. Move rq_unpin_lock() below. I do not know the implication of
>> 	   this the kernel is not complaining so far (mind you I'm on
>> 	   x86 and I do not have lockdep enabled)
> 
> If you move rq_unpin_lock(), you should probably run with lockdep to see
> what happens :-) There are also implications for tracking whether it's
> safe to look at the rq clock.

Yup! I traced the path in code and it looked okay but this is just me
doing naive experiments. Lockdep should set me straight :)

> 
>>
>> 	A combination of 3 and 4 seemed to give back the other half of
>> 	tbench performance.
>>
>> --
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d67d86d3bfdf..f1e64412fd48 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -12400,6 +12427,13 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>>  
>>  		update_next_balance(sd, &next_balance);
>>  
>> +		/*
>> +		 * Skip <= LLC domains as they likely won't have any tasks if the
>> +		 * shared runq is empty.
>> +		 */
>> +		if (sched_feat(SHARED_RUNQ) && (sd->flags & SD_SHARE_PKG_RESOURCES))
>> +			continue;
>> +
> 
> Agreed on this.
> 
>>  		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
>>  			break;
>>  
>> --
>>
>> 	This was based on my suggestion in the parallel thread.
>>
>> 	P.S. This alone, without the changes in previous hunk showed no
>> 	difference in performance with results same as vanilla v3.
>>
>> --
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d67d86d3bfdf..f1e64412fd48 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -12429,6 +12463,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>>  
>>  	raw_spin_rq_lock(this_rq);
>>  
>> +out_swq:
>>  	if (curr_cost > this_rq->max_idle_balance_cost)
>>  		this_rq->max_idle_balance_cost = curr_cost;
>>  
>> --
>>
>> 	The last part of newidle_balance() does a bunch of accounting
>> 	which is relevant after the above changes. Also the
>> 	rq_repin_lock() I had removed now happens here.
>>
>> --
>>
>> Again most of this is lightly tested with just one workload but I would
>> like to hear your thoughts, especially with the significance of
>> "rd->overload", "max_newidle_lb_cost", and "update_next_balance()".
>> However, I'm afraid these may be the bits that led to the drop in
>> utilization you mentioned in the first place.
> 
> Exactly :-( I think your proposal to fix how we skip load balancing on
> the LLC if SHARED_RUNQ is enabled makes sense, but I'd really prefer to
> avoid adding these heuristics to avoid contention for specific
> workloads. The goal of SHARED_RUNQ is really to drive up CPU util. I
> don't think we're really doing the user many favors if we try to guess
> for them when they actually want that to happen.
> 
> If there's a time where we _really_ don't want or need to do it then
> absolutely, let's skip it. But I would really like to see this go in
> without checks on max_newidle_lb_cost, etc. The whole point of
> SHARED_RUNQ is that it should be less costly than iterating over O(n)
> cores to find tasks, so we _want_ to be more aggressive in doing so.

Agreed! I'll try to get a more palatable diff by the time you are
back from vacation.

> 
>> Most of the experimentation (except for rq lock contention using IBS)
>> was done by reading the newidle_balance() code.
> 
> And kudos for doing so!
> 
>> Finally a look at newidle_balance counts (tip vs tip + v3 + diff) for
>> 128-clients of tbench on the test machine:
>>
>>
>> < ----------------------------------------  Category:  newidle (SMT)  ---------------------------------------- >
>> load_balance cnt on cpu newly idle                         :     921871,   0	(diff: -100.00%)
>> --
>> < ----------------------------------------  Category:  newidle (MC)   ---------------------------------------- >
>> load_balance cnt on cpu newly idle                         :     472412,   0	(diff: -100.00%)
>> --
>> < ----------------------------------------  Category:  newidle (DIE)  ---------------------------------------- >
>> load_balance cnt on cpu newly idle                         :        114, 279	(diff: +144.74%)
>> --
>> < ----------------------------------------  Category:  newidle (NUMA) ---------------------------------------- >
>> load_balance cnt on cpu newly idle                         :          9,   9	(diff: +00.00%)
>> --
>>
>> Let me know if you have any queries. I'll go back and try to bisect the
>> diff to see if only a couple of changes that I thought were important
>> are good enough to yield back the lost performance. I'll do wider
>> testing after hearing your thoughts.
> 
> Hopefully my feedback above should give you enough context to bisect and
> find the changes that we really think are most helpful? To reiterate: I
> definitely think your change to avoid iterating over the LLC sd is
> correct and makes sense. Others possibly do as well such as checking
> rd->overload (though not 100% sure), but others such as the
> max_newidle_lb_cost checks I would strongly prefer to avoid.
> 
> Prateek -- thank you for doing all of this work, it's very much
> appreciated.

And thank you for patiently going through it all and clarifying all the
bits I have been missing :)

> 
> As I mentioned on the other email, I'll be on vacation for about a week
> and a half starting tomorrow afternoon, so I may be slow to respond in
> that time.

Enjoy your vacation! I'll keep the thread updated for you to read when
you get back :)

> 
> Thanks,
> David

--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [RFC PATCH 0/3] DO NOT MERGE: Breaking down the experimental diff
  2023-08-30  9:56           ` K Prateek Nayak
  2023-08-31  2:32             ` David Vernet
@ 2023-08-31 10:45             ` K Prateek Nayak
  2023-08-31 10:45               ` [RFC PATCH 1/3] sched/fair: Move SHARED_RUNQ related structs and definitions into sched.h K Prateek Nayak
                                 ` (2 more replies)
  1 sibling, 3 replies; 52+ messages in thread
From: K Prateek Nayak @ 2023-08-31 10:45 UTC (permalink / raw)
  To: void
  Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	tj, roman.gushchin, gautham.shenoy, aaron.lu, wuyun.abel,
	kernel-team, kprateek.nayak

Since the diff is a concoction of a bunch of things that somehow work,
this series tries to clean it up. I've dropped a number of changes based
on David's suggestions [1], [2] and added some new logic on top that is
covered in Patch 3.

Breakdown is as follows:

- Patch 1 moves struct definition to sched.h

- Patch 2 is the above diff but more palatable with changes based on
  David's comments.

- Patch 3 adds a bailout mechanism on top, since I saw the same amount
  of regression with Patch 2.

With these changes, following are the results for tbench 128-clients:

tip				: 1.00 (var: 1.00%)
tip + v3 + series till patch 2	: 0.41 (var: 1.15%) (diff: -58.81%)
tip + v3 + full series		: 1.01 (var: 0.36%) (diff: +00.92%)

Disclaimer: All the testing was done hyper-focused on the tbench
128-clients case on a dual socket 3rd Generation EPYC system
(2 x 64C/128T). The series should apply cleanly on top of tip at commit
88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs bandwidth in
use") + v3 of the shared_runq series (this series).

The SHARED_RUNQ_SHARD_SZ was set to 16 throughout the testing since that
matches the sd_llc_size on the system.
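
Purely as an illustration of where the 16 comes from (not part of the
series, and the helper name below is made up), the shard size is simply
the LLC span reported by the existing per-CPU sd_llc_size, assuming this
would sit in kernel/sched/fair.c:

  /*
   * Illustration only: derive the shard size from the LLC span of @cpu
   * rather than hard-coding SHARED_RUNQ_SHARD_SZ. sd_llc_size is the
   * existing per-CPU LLC span; the helper name is made up.
   */
  static unsigned int shared_runq_shard_sz_for(int cpu)
  {
  	int llc = per_cpu(sd_llc_size, cpu);

  	return llc > 0 ? llc : SHARED_RUNQ_SHARD_SZ;
  }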

P.S. I finally got to enabling lockdep and I saw the following splat
early during boot, but nothing after that (so I think everything is
alright?):

  ================================
  WARNING: inconsistent lock state
  6.5.0-rc2-shared-wq-v3-fix+ #681 Not tainted
  --------------------------------
  inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
  swapper/0/1 [HC0[0]:SC0[0]:HE1:SE1] takes:
  ffff95f6bb24d818 (&rq->__lock){?.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x15/0x30
  {IN-HARDIRQ-W} state was registered at:
    lock_acquire+0xcc/0x2c0
    _raw_spin_lock_nested+0x2e/0x40
    scheduler_tick+0x5c/0x350
    update_process_times+0x83/0x90
    tick_periodic+0x27/0xe0
    tick_handle_periodic+0x24/0x70
    timer_interrupt+0x18/0x30
    __handle_irq_event_percpu+0x8b/0x240
    handle_irq_event+0x38/0x80
    handle_level_irq+0x90/0x170
    __common_interrupt+0x4f/0x110
    common_interrupt+0x7f/0xa0
    asm_common_interrupt+0x26/0x40
    __x86_return_thunk+0x0/0x40
    console_flush_all+0x2e3/0x590
    console_unlock+0x56/0x100
    vprintk_emit+0x153/0x350
    _printk+0x5c/0x80
    apic_intr_mode_init+0x85/0x110
    x86_late_time_init+0x24/0x40
    start_kernel+0x5e1/0x7a0
    x86_64_start_reservations+0x18/0x30
    x86_64_start_kernel+0x92/0xa0
    secondary_startup_64_no_verify+0x17e/0x18b
  irq event stamp: 65081
  hardirqs last  enabled at (65081): [<ffffffff857723c1>] _raw_spin_unlock_irqrestore+0x31/0x60
  hardirqs last disabled at (65080): [<ffffffff857720d3>] _raw_spin_lock_irqsave+0x63/0x70
  softirqs last  enabled at (64284): [<ffffffff848ccb7b>] __irq_exit_rcu+0x7b/0xa0
  softirqs last disabled at (64269): [<ffffffff848ccb7b>] __irq_exit_rcu+0x7b/0xa0
 
  other info that might help us debug this:
   Possible unsafe locking scenario:
 
         CPU0
         ----
    lock(&rq->__lock);
    <Interrupt>
      lock(&rq->__lock);
 
   *** DEADLOCK ***
 
  1 lock held by swapper/0/1:
   #0: ffffffff8627eec8 (sched_domains_mutex){+.+.}-{4:4}, at: sched_init_smp+0x3f/0xd0
 
  stack backtrace:
  CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.5.0-rc2-shared-wq-v3-fix+ #681
  Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
  Call Trace:
   <TASK>
   dump_stack_lvl+0x5c/0x90
   mark_lock.part.0+0x755/0x930
   ? __lock_acquire+0x3e7/0x21d0
   ? __lock_acquire+0x2f0/0x21d0
   __lock_acquire+0x3ab/0x21d0
   ? lock_is_held_type+0xaa/0x130
   lock_acquire+0xcc/0x2c0
   ? raw_spin_rq_lock_nested+0x15/0x30
   ? free_percpu+0x245/0x4a0
   _raw_spin_lock_nested+0x2e/0x40
   ? raw_spin_rq_lock_nested+0x15/0x30
   raw_spin_rq_lock_nested+0x15/0x30
   update_domains_fair+0xf1/0x220
   sched_update_domains+0x32/0x50
   sched_init_domains+0xd9/0x100
   sched_init_smp+0x4b/0xd0
   ? stop_machine+0x32/0x40
   kernel_init_freeable+0x2d3/0x540
   ? __pfx_kernel_init+0x10/0x10
   kernel_init+0x1a/0x1c0
   ret_from_fork+0x34/0x50
   ? __pfx_kernel_init+0x10/0x10
   ret_from_fork_asm+0x1b/0x30
  RIP: 0000:0x0
  Code: Unable to access opcode bytes at 0xffffffffffffffd6.
  RSP: 0000:0000000000000000 EFLAGS: 00000000 ORIG_RAX: 0000000000000000
  RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
  RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
  R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
   </TASK>

References:

[1] https://lore.kernel.org/all/20230831013435.GB506447@maniforge/
[2] https://lore.kernel.org/all/20230831023254.GC506447@maniforge/

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [RFC PATCH 1/3] sched/fair: Move SHARED_RUNQ related structs and definitions into sched.h
  2023-08-31 10:45             ` [RFC PATCH 0/3] DO NOT MERGE: Breaking down the experimental diff K Prateek Nayak
@ 2023-08-31 10:45               ` K Prateek Nayak
  2023-08-31 10:45               ` [RFC PATCH 2/3] sched/fair: Improve integration of SHARED_RUNQ feature within newidle_balance K Prateek Nayak
  2023-08-31 10:45               ` [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag K Prateek Nayak
  2 siblings, 0 replies; 52+ messages in thread
From: K Prateek Nayak @ 2023-08-31 10:45 UTC (permalink / raw)
  To: void
  Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	tj, roman.gushchin, gautham.shenoy, aaron.lu, wuyun.abel,
	kernel-team, kprateek.nayak

Move struct shared_runq_shard, struct shared_runq, SHARED_RUNQ_SHARD_SZ
and SHARED_RUNQ_MAX_SHARDS definitions into sched.h

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/fair.c  | 68 --------------------------------------------
 kernel/sched/sched.h | 68 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 68 insertions(+), 68 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d67d86d3bfdf..bf844ffa79c2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -139,74 +139,6 @@ static int __init setup_sched_thermal_decay_shift(char *str)
 }
 __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
 
-/**
- * struct shared_runq - Per-LLC queue structure for enqueuing and migrating
- * runnable tasks within an LLC.
- *
- * struct shared_runq_shard - A structure containing a task list and a spinlock
- * for a subset of cores in a struct shared_runq.
- *
- * WHAT
- * ====
- *
- * This structure enables the scheduler to be more aggressively work
- * conserving, by placing waking tasks on a per-LLC FIFO queue shard that can
- * then be pulled from when another core in the LLC is going to go idle.
- *
- * struct rq stores two pointers in its struct cfs_rq:
- *
- * 1. The per-LLC struct shared_runq which contains one or more shards of
- *    enqueued tasks.
- *
- * 2. The shard inside of the per-LLC struct shared_runq which contains the
- *    list of runnable tasks for that shard.
- *
- * Waking tasks are enqueued in the calling CPU's struct shared_runq_shard in
- * __enqueue_entity(), and are opportunistically pulled from the shared_runq in
- * newidle_balance(). Pulling from shards is an O(# shards) operation.
- *
- * There is currently no task-stealing between shared_runqs in different LLCs,
- * which means that shared_runq is not fully work conserving. This could be
- * added at a later time, with tasks likely only being stolen across
- * shared_runqs on the same NUMA node to avoid violating NUMA affinities.
- *
- * HOW
- * ===
- *
- * A struct shared_runq_shard is comprised of a list, and a spinlock for
- * synchronization.  Given that the critical section for a shared_runq is
- * typically a fast list operation, and that the shared_runq_shard is localized
- * to a subset of cores on a single LLC (plus other cores in the LLC that pull
- * from the shard in newidle_balance()), the spinlock will typically only be
- * contended on workloads that do little else other than hammer the runqueue.
- *
- * WHY
- * ===
- *
- * As mentioned above, the main benefit of shared_runq is that it enables more
- * aggressive work conservation in the scheduler. This can benefit workloads
- * that benefit more from CPU utilization than from L1/L2 cache locality.
- *
- * shared_runqs are segmented across LLCs both to avoid contention on the
- * shared_runq spinlock by minimizing the number of CPUs that could contend on
- * it, as well as to strike a balance between work conservation, and L3 cache
- * locality.
- */
-struct shared_runq_shard {
-	struct list_head list;
-	raw_spinlock_t lock;
-} ____cacheline_aligned;
-
-/* This would likely work better as a configurable knob via debugfs */
-#define SHARED_RUNQ_SHARD_SZ 6
-#define SHARED_RUNQ_MAX_SHARDS \
-	((NR_CPUS / SHARED_RUNQ_SHARD_SZ) + (NR_CPUS % SHARED_RUNQ_SHARD_SZ != 0))
-
-struct shared_runq {
-	unsigned int num_shards;
-	struct shared_runq_shard shards[SHARED_RUNQ_MAX_SHARDS];
-} ____cacheline_aligned;
-
 #ifdef CONFIG_SMP
 
 static DEFINE_PER_CPU(struct shared_runq, shared_runqs);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b504f8f4416b..f50176f720b1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -545,6 +545,74 @@ do {									\
 # define u64_u32_load(var)      u64_u32_load_copy(var, var##_copy)
 # define u64_u32_store(var, val) u64_u32_store_copy(var, var##_copy, val)
 
+/**
+ * struct shared_runq - Per-LLC queue structure for enqueuing and migrating
+ * runnable tasks within an LLC.
+ *
+ * struct shared_runq_shard - A structure containing a task list and a spinlock
+ * for a subset of cores in a struct shared_runq.
+ *
+ * WHAT
+ * ====
+ *
+ * This structure enables the scheduler to be more aggressively work
+ * conserving, by placing waking tasks on a per-LLC FIFO queue shard that can
+ * then be pulled from when another core in the LLC is going to go idle.
+ *
+ * struct rq stores two pointers in its struct cfs_rq:
+ *
+ * 1. The per-LLC struct shared_runq which contains one or more shards of
+ *    enqueued tasks.
+ *
+ * 2. The shard inside of the per-LLC struct shared_runq which contains the
+ *    list of runnable tasks for that shard.
+ *
+ * Waking tasks are enqueued in the calling CPU's struct shared_runq_shard in
+ * __enqueue_entity(), and are opportunistically pulled from the shared_runq in
+ * newidle_balance(). Pulling from shards is an O(# shards) operation.
+ *
+ * There is currently no task-stealing between shared_runqs in different LLCs,
+ * which means that shared_runq is not fully work conserving. This could be
+ * added at a later time, with tasks likely only being stolen across
+ * shared_runqs on the same NUMA node to avoid violating NUMA affinities.
+ *
+ * HOW
+ * ===
+ *
+ * A struct shared_runq_shard is comprised of a list, and a spinlock for
+ * synchronization.  Given that the critical section for a shared_runq is
+ * typically a fast list operation, and that the shared_runq_shard is localized
+ * to a subset of cores on a single LLC (plus other cores in the LLC that pull
+ * from the shard in newidle_balance()), the spinlock will typically only be
+ * contended on workloads that do little else other than hammer the runqueue.
+ *
+ * WHY
+ * ===
+ *
+ * As mentioned above, the main benefit of shared_runq is that it enables more
+ * aggressive work conservation in the scheduler. This can benefit workloads
+ * that benefit more from CPU utilization than from L1/L2 cache locality.
+ *
+ * shared_runqs are segmented across LLCs both to avoid contention on the
+ * shared_runq spinlock by minimizing the number of CPUs that could contend on
+ * it, as well as to strike a balance between work conservation, and L3 cache
+ * locality.
+ */
+struct shared_runq_shard {
+	struct list_head list;
+	raw_spinlock_t lock;
+} ____cacheline_aligned;
+
+/* This would likely work better as a configurable knob via debugfs */
+#define SHARED_RUNQ_SHARD_SZ 6
+#define SHARED_RUNQ_MAX_SHARDS \
+	((NR_CPUS / SHARED_RUNQ_SHARD_SZ) + (NR_CPUS % SHARED_RUNQ_SHARD_SZ != 0))
+
+struct shared_runq {
+	unsigned int num_shards;
+	struct shared_runq_shard shards[SHARED_RUNQ_MAX_SHARDS];
+} ____cacheline_aligned;
+
 /* CFS-related fields in a runqueue */
 struct cfs_rq {
 	struct load_weight	load;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [RFC PATCH 2/3] sched/fair: Improve integration of SHARED_RUNQ feature within newidle_balance
  2023-08-31 10:45             ` [RFC PATCH 0/3] DO NOT MERGE: Breaking down the experimental diff K Prateek Nayak
  2023-08-31 10:45               ` [RFC PATCH 1/3] sched/fair: Move SHARED_RUNQ related structs and definitions into sched.h K Prateek Nayak
@ 2023-08-31 10:45               ` K Prateek Nayak
  2023-08-31 18:45                 ` David Vernet
  2023-08-31 10:45               ` [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag K Prateek Nayak
  2 siblings, 1 reply; 52+ messages in thread
From: K Prateek Nayak @ 2023-08-31 10:45 UTC (permalink / raw)
  To: void
  Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	tj, roman.gushchin, gautham.shenoy, aaron.lu, wuyun.abel,
	kernel-team, kprateek.nayak

This patch takes the relevant optimizations from [1] in
newidle_balance(). Following is the breakdown:

- Check "rq->rd->overload" before jumping into newidle_balance(), even
  with the SHARED_RUNQ feature enabled.

- Call update_next_balance() for all the domains up to the MC domain
  when the SHARED_RUNQ path is taken.

- Account cost from shared_runq_pick_next_task() and update
  curr_cost and sd->max_newidle_lb_cost accordingly.

- Move the initial rq_unpin_lock() logic around. Also, the caller of
  shared_runq_pick_next_task() is now responsible for calling
  rq_repin_lock() if the return value is non-zero. (It still needs to be
  verified that everything is right with LOCKDEP.)

- Include a fix to skip straight to the domains above the LLC when
  calling load_balance() in newidle_balance().

All other surgery from [1] has been removed.

Link: https://lore.kernel.org/all/31aeb639-1d66-2d12-1673-c19fed0ab33a@amd.com/ [1]
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/fair.c | 94 ++++++++++++++++++++++++++++++++-------------
 1 file changed, 67 insertions(+), 27 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bf844ffa79c2..446ffdad49e1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -337,7 +337,6 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
 		rq_unpin_lock(rq, &src_rf);
 		raw_spin_unlock_irqrestore(&p->pi_lock, src_rf.flags);
 	}
-	rq_repin_lock(rq, rf);
 
 	return ret;
 }
@@ -12276,50 +12275,83 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	if (!cpu_active(this_cpu))
 		return 0;
 
-	if (sched_feat(SHARED_RUNQ)) {
-		pulled_task = shared_runq_pick_next_task(this_rq, rf);
-		if (pulled_task)
-			return pulled_task;
-	}
-
 	/*
 	 * We must set idle_stamp _before_ calling idle_balance(), such that we
 	 * measure the duration of idle_balance() as idle time.
 	 */
 	this_rq->idle_stamp = rq_clock(this_rq);
 
-	/*
-	 * This is OK, because current is on_cpu, which avoids it being picked
-	 * for load-balance and preemption/IRQs are still disabled avoiding
-	 * further scheduler activity on it and we're being very careful to
-	 * re-start the picking loop.
-	 */
-	rq_unpin_lock(this_rq, rf);
-
 	rcu_read_lock();
-	sd = rcu_dereference_check_sched_domain(this_rq->sd);
-
-	/*
-	 * Skip <= LLC domains as they likely won't have any tasks if the
-	 * shared runq is empty.
-	 */
-	if (sched_feat(SHARED_RUNQ)) {
+	if (sched_feat(SHARED_RUNQ))
 		sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
-		if (likely(sd))
-			sd = sd->parent;
-	}
+	else
+		sd = rcu_dereference_check_sched_domain(this_rq->sd);
 
 	if (!READ_ONCE(this_rq->rd->overload) ||
-	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
+	    /* Look at rq->avg_idle iff SHARED_RUNQ is disabled */
+	    (!sched_feat(SHARED_RUNQ) && sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
 
-		if (sd)
+		while (sd) {
 			update_next_balance(sd, &next_balance);
+			sd = sd->child;
+		}
+
 		rcu_read_unlock();
 
 		goto out;
 	}
+
+	if (sched_feat(SHARED_RUNQ)) {
+		struct sched_domain *tmp = sd;
+
+		t0 = sched_clock_cpu(this_cpu);
+
+		/* Do update_next_balance() for all domains within LLC */
+		while (tmp) {
+			update_next_balance(tmp, &next_balance);
+			tmp = tmp->child;
+		}
+
+		pulled_task = shared_runq_pick_next_task(this_rq, rf);
+		if (pulled_task) {
+			if (sd) {
+				curr_cost = sched_clock_cpu(this_cpu) - t0;
+				/*
+				 * Will help bail out of scans of higer domains
+				 * slightly earlier.
+				 */
+				update_newidle_cost(sd, curr_cost);
+			}
+
+			rcu_read_unlock();
+			goto out_swq;
+		}
+
+		if (sd) {
+			t1 = sched_clock_cpu(this_cpu);
+			curr_cost += t1 - t0;
+			update_newidle_cost(sd, curr_cost);
+		}
+
+		/*
+		 * Since shared_runq_pick_next_task() can take a while,
+		 * check if the CPU was targeted for a wakeup in the
+		 * meantime.
+		 */
+		if (this_rq->ttwu_pending) {
+			rcu_read_unlock();
+			return 0;
+		}
+	}
 	rcu_read_unlock();
 
+	/*
+	 * This is OK, because current is on_cpu, which avoids it being picked
+	 * for load-balance and preemption/IRQs are still disabled avoiding
+	 * further scheduler activity on it and we're being very careful to
+	 * re-start the picking loop.
+	 */
+	rq_unpin_lock(this_rq, rf);
 	raw_spin_rq_unlock(this_rq);
 
 	t0 = sched_clock_cpu(this_cpu);
@@ -12335,6 +12367,13 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
 			break;
 
+		/*
+		 * Skip <= LLC domains as they likely won't have any tasks if the
+		 * shared runq is empty.
+		 */
+		if (sched_feat(SHARED_RUNQ) && (sd->flags & SD_SHARE_PKG_RESOURCES))
+			continue;
+
 		if (sd->flags & SD_BALANCE_NEWIDLE) {
 
 			pulled_task = load_balance(this_cpu, this_rq,
@@ -12361,6 +12400,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 
 	raw_spin_rq_lock(this_rq);
 
+out_swq:
 	if (curr_cost > this_rq->max_idle_balance_cost)
 		this_rq->max_idle_balance_cost = curr_cost;
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag
  2023-08-31 10:45             ` [RFC PATCH 0/3] DO NOT MERGE: Breaking down the experimantal diff K Prateek Nayak
  2023-08-31 10:45               ` [RFC PATCH 1/3] sched/fair: Move SHARED_RUNQ related structs and definitions into sched.h K Prateek Nayak
  2023-08-31 10:45               ` [RFC PATCH 2/3] sched/fair: Improve integration of SHARED_RUNQ feature within newidle_balance K Prateek Nayak
@ 2023-08-31 10:45               ` K Prateek Nayak
  2023-08-31 19:11                 ` David Vernet
  2 siblings, 1 reply; 52+ messages in thread
From: K Prateek Nayak @ 2023-08-31 10:45 UTC (permalink / raw)
  To: void
  Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	tj, roman.gushchin, gautham.shenoy, aaron.lu, wuyun.abel,
	kernel-team, kprateek.nayak

Even with the two patches, I still observe the following lock
contention when profiling the tbench 128-clients run with IBS:

  -   12.61%  swapper          [kernel.vmlinux]         [k] native_queued_spin_lock_slowpath
     - 10.94% native_queued_spin_lock_slowpath
        - 10.73% _raw_spin_lock
           - 9.57% __schedule
                schedule_idle
                do_idle
              + cpu_startup_entry
           - 0.82% task_rq_lock
                newidle_balance
                pick_next_task_fair
                __schedule
                schedule_idle
                do_idle
              + cpu_startup_entry

Since David mentioned that the rq->avg_idle check is probably not the
right step towards the solution, this experiment introduces a per-shard
"overload" flag. Similar to "rq->rd->overload", the per-shard overload
flag indicates the possibility that one or more of the rqs covered by
the shard's domain has a queued task. The shard's overload flag is set
at the same time as "rq->rd->overload", and is cleared when the shard's
list is found to be empty.
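
In code form, the cheap check this adds on the pop path looks roughly
as follows (condensed from the shared_runq_pop_task() hunk below and
shown here only for readability):

	/*
	 * Unlocked fast path: skip taking shard->lock when the flag
	 * says there is very likely nothing pullable.
	 */
	if (!READ_ONCE(shard->overload))
		return NULL;

	if (list_empty(&shard->list)) {
		/* Lazily clear the flag once the list has drained. */
		WRITE_ONCE(shard->overload, 0);
		return NULL;
	}
	/* Only now pay for raw_spin_lock(&shard->lock). */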

With these changes, following are the results for tbench 128-clients:

tip				: 1.00 (var: 1.00%)
tip + v3 + series till patch 2	: 0.41 (var: 1.15%) (diff: -58.81%)
tip + v3 + full series		: 1.01 (var: 0.36%) (diff: +00.92%)

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/fair.c  | 13 +++++++++++--
 kernel/sched/sched.h | 17 +++++++++++++++++
 2 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 446ffdad49e1..31fe109fdaf0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -186,6 +186,7 @@ static void shared_runq_reassign_domains(void)
 		rq->cfs.shared_runq = shared_runq;
 		rq->cfs.shard = &shared_runq->shards[shard_idx];
 		rq_unlock(rq, &rf);
+		WRITE_ONCE(rq->cfs.shard->overload, 0);
 	}
 }
 
@@ -202,6 +203,7 @@ static void __shared_runq_drain(struct shared_runq *shared_runq)
 		list_for_each_entry_safe(p, tmp, &shard->list, shared_runq_node)
 			list_del_init(&p->shared_runq_node);
 		raw_spin_unlock(&shard->lock);
+		WRITE_ONCE(shard->overload, 0);
 	}
 }
 
@@ -258,13 +260,20 @@ shared_runq_pop_task(struct shared_runq_shard *shard, int target)
 {
 	struct task_struct *p;
 
-	if (list_empty(&shard->list))
+	if (!READ_ONCE(shard->overload))
 		return NULL;
 
+	if (list_empty(&shard->list)) {
+		WRITE_ONCE(shard->overload, 0);
+		return NULL;
+	}
+
 	raw_spin_lock(&shard->lock);
 	p = list_first_entry_or_null(&shard->list, struct task_struct,
 				     shared_runq_node);
-	if (p && is_cpu_allowed(p, target))
+	if (!p)
+		WRITE_ONCE(shard->overload, 0);
+	else if (is_cpu_allowed(p, target))
 		list_del_init(&p->shared_runq_node);
 	else
 		p = NULL;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f50176f720b1..e8d4d948f742 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -601,6 +601,20 @@ do {									\
 struct shared_runq_shard {
 	struct list_head list;
 	raw_spinlock_t lock;
+	/*
+	 * shared_runq_shard can contain running tasks.
+	 * In such cases where all the tasks are running,
+	 * it is futile to attempt to pull tasks from the
+	 * list. Overload flag is used to indicate case
+	 * where one or more rq in the shard domain may
+	 * have a queued task. If the flag is 0, it is
+	 * very likely that all tasks in the shard are
+	 * running and cannot be migrated. This is not
+	 * guarded by the shard lock, and since it may
+	 * be updated often, it is placed into its own
+	 * cacheline.
+	 */
+	int overload ____cacheline_aligned;
 } ____cacheline_aligned;
 
 /* This would likely work better as a configurable knob via debugfs */
@@ -2585,6 +2599,9 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
 	if (prev_nr < 2 && rq->nr_running >= 2) {
 		if (!READ_ONCE(rq->rd->overload))
 			WRITE_ONCE(rq->rd->overload, 1);
+
+		if (rq->cfs.shard && !READ_ONCE(rq->cfs.shard->overload))
+			WRITE_ONCE(rq->cfs.shard->overload, 1);
 	}
 #endif
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 7/7] sched: Shard per-LLC shared runqueues
  2023-08-31  0:01     ` David Vernet
@ 2023-08-31 10:45       ` Chen Yu
  2023-08-31 19:14         ` David Vernet
  0 siblings, 1 reply; 52+ messages in thread
From: Chen Yu @ 2023-08-31 10:45 UTC (permalink / raw)
  To: David Vernet
  Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	tj, roman.gushchin, gautham.shenoy, kprateek.nayak, aaron.lu,
	wuyun.abel, kernel-team, tim.c.chen

On 2023-08-30 at 19:01:47 -0500, David Vernet wrote:
> On Wed, Aug 30, 2023 at 02:17:09PM +0800, Chen Yu wrote:
> > 
> > 5. Check the L2 cache miss rate.
> > perf stat -e l2_rqsts.references,l2_request.miss sleep 10
> > The results show that the L2 cache miss rate is nearly the same with/without
> > shared_runqueue enabled.
> 
> As mentioned below, I expect it would be interesting to also collect
> icache / iTLB numbers. In my experience, poor uop cache locality will
> also result in poor icache locality, though of course that depends on a
> lot of other factors like alignment, how many (un)conditional branches
> you have within some byte window, etc. If alignment, etc were the issue
> though, we'd likely observe this also without SHARED_RUNQ.
>

[snip...] 

> 
> Interesting. As mentioned above, I expect we also see an increase in
> iTLB and icache misses?
> 

This is a good point, according to the perf topdown:

SHARED_RUNQ is disabled:

     13.0 %  tma_frontend_bound
      6.7 %  tma_fetch_latency
       0.3 %  tma_itlb_misses
       0.7 %  tma_icache_misses

itlb miss ratio is 13.0% * 6.7% * 0.3%
icache miss ratio is 13.0% * 6.7% * 0.7%

SHARED_RUNQ is enabled:
     20.0 %  tma_frontend_bound
      11.6 %  tma_fetch_latency
       0.9 %  tma_itlb_misses
       0.5 %  tma_icache_misses

itlb miss ratio is 20.0% * 11.6% * 0.9%
icache miss ratio is 20.0% * 11.6% * 0.5%
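
Multiplying these out for reference:

itlb miss ratio:   ~0.0026% (disabled) vs. ~0.021% (enabled), roughly 8x
icache miss ratio: ~0.0061% (disabled) vs. ~0.012% (enabled), roughly 2x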

So both the icache and itlb miss ratios increase, and the itlb miss ratio
increases more, although the bottleneck is neither the icache nor the itlb.
And as you mentioned below, it depends on other factors, including the hardware
settings, icache size, TLB size, DSB size, etc.

> This is something we deal with in HHVM. Like any other JIT engine /
> compiler, it is heavily front-end CPU bound, and has very poor icache,
> iTLB, and uop cache locality (also lots of branch resteers, etc).
> SHARED_RUNQ actually helps this workload quite a lot, as explained in
> the cover letter for the patch series. It makes sense that it would: uop
> locality is really bad even without increasing CPU util. So we have no
> reason not to migrate the task and hop on a CPU.
>

I see, this makes sense.
 
> > I wonder, if SHARED_RUNQ can consider that, if a task is a long duration one,
> > say, p->avg_runtime >= sysctl_migration_cost, maybe we should not put such task
> > on the per-LLC shared runqueue? In this way it will not be migrated too offen
> > so as to keep its locality(both in terms of L1/L2 cache and DSB).
> 
> I'm hesitant to apply such heuristics to the feature. As mentioned
> above, SHARED_RUNQ works very well on HHVM, despite its potential hit to
> icache / iTLB / DSB locality. Those hhvmworker tasks run for a very long
> time, sometimes upwards of 20+ms. They also tend to have poor L1 cache
> locality in general even when they're scheduled on the same core they
> were on before they were descheduled, so we observe better performance
> if the task is migrated to a fully idle core rather than e.g. its
> hypertwin if it's available. That's not something we can guarantee with
> SHARED_RUNQ, but it hopefully illustrates the point that it's an example
> of a workload that would suffer with such a heuristic.
>

OK, hackbench is just a microbenchmark to help us evaluate
what impact SHARED_RUNQ could bring. If such a heuristic heals
hackbench but hurts the real workload, then we can consider
another direction.
 
> Another point to consider is that performance implications that are a
> result of Intel micro architectural details don't necessarily apply to
> everyone. I'm not as familiar with the instruction decode pipeline on
> AMD chips like Zen4. I'm sure they have a uop cache, but the size of
> that cache, alignment requirements, the way that cache interfaces with
> e.g. their version of the MITE / decoder, etc, are all going to be quite
> different.
>

Yes, this is true.
 
> In general, I think it's difficult for heuristics like this to suit all
> possible workloads or situations (not that you're claiming it is). My
> preference is to keep it as is so that it's easier for users to build a
> mental model of what outcome they should expect if they use the feature.
> Put another way: As a user of this feature, I'd be a lot more surprised
> to see that I enabled it and CPU util stayed low, vs. enabling it and
> seeing higher CPU util, but also degraded icache / iTLB locality.
>

Understand.
 
> Let me know what you think, and thanks again for investing your time
> into this.
> 

Let me run other benchmarks to see if others are sensitive to
the resource locality.

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC PATCH 2/3] sched/fair: Improve integration of SHARED_RUNQ feature within newidle_balance
  2023-08-31 10:45               ` [RFC PATCH 2/3] sched/fair: Improve integration of SHARED_RUNQ feature within newidle_balance K Prateek Nayak
@ 2023-08-31 18:45                 ` David Vernet
  2023-08-31 19:47                   ` K Prateek Nayak
  0 siblings, 1 reply; 52+ messages in thread
From: David Vernet @ 2023-08-31 18:45 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	tj, roman.gushchin, gautham.shenoy, aaron.lu, wuyun.abel,
	kernel-team

On Thu, Aug 31, 2023 at 04:15:07PM +0530, K Prateek Nayak wrote:
> This patch takes the relevant optimizations from [1] in
> newidle_balance(). Following is the breakdown:

Thanks for working on this. I think the fix you added for skipping <=
LLC domains makes sense. The others possibly as well -- left some
comments below!

> 
> - Check "rq->rd->overload" before jumping into newidle_balance, even
>   with SHARED_RQ feat enabled.

Out of curiosity -- did you observe this making a material difference in
your tests? After thinking about it some more, though I see the argument
for why it would be logical to check if we're overloaded, I'm still
thinking that it's more ideal to just always check the SHARED_RUNQ.
rd->overload is only set in find_busiest_group() when we load balance,
so I worry that having SHARED_RUNQ follow rd->overload may just end up
making it redundant with normal load balancing in many cases.

So yeah, while I certainly understand the idea (and would like to better
understand what kind of difference it made in your tests), I still feel
pretty strongly that SHARED_RUNQ makes the most sense as a feature when
it ignores all of these heuristics and just tries to maximize work
conservation.

What do you think?

> - Call update_next_balance() for all the domains up to the MC domain
>   when the SHARED_RQ path is taken.

I _think_ this makes sense. Though even in this case, I feel that it may
be slightly confusing and/or incorrect to push back the balance time
just because we didn't find a task in our current CCX's shared_runq.
Maybe we should avoid mucking with load balancing? Not sure, but I am
leaning towards what you're proposing here as a better approach.

> - Account cost from shared_runq_pick_next_task() and update
>   curr_cost and sd->max_newidle_lb_cost accordingly.

Yep, I think this is the correct thing to do.

> 
> - Move the initial rq_unpin_lock() logic around. Also, the caller of
>   shared_runq_pick_next_task() is responsible for calling
>   rq_repin_lock() if the return value is non zero. (Needs to be verified
>   everything is right with LOCKDEP)

Still need to think more about this, but it's purely just tactical and
can easily be fixed if we need.

> 
> - Includes a fix to skip directly above the LLC domain when calling the
>   load_balance() in newidle_balance()

Big fix, thanks again for noticing it.

> All other surgery from [1] has been removed.
> 
> Link: https://lore.kernel.org/all/31aeb639-1d66-2d12-1673-c19fed0ab33a@amd.com/ [1]
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
>  kernel/sched/fair.c | 94 ++++++++++++++++++++++++++++++++-------------
>  1 file changed, 67 insertions(+), 27 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bf844ffa79c2..446ffdad49e1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -337,7 +337,6 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
>  		rq_unpin_lock(rq, &src_rf);
>  		raw_spin_unlock_irqrestore(&p->pi_lock, src_rf.flags);
>  	}
> -	rq_repin_lock(rq, rf);
>  
>  	return ret;
>  }
> @@ -12276,50 +12275,83 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  	if (!cpu_active(this_cpu))
>  		return 0;
>  
> -	if (sched_feat(SHARED_RUNQ)) {
> -		pulled_task = shared_runq_pick_next_task(this_rq, rf);
> -		if (pulled_task)
> -			return pulled_task;
> -	}
> -
>  	/*
>  	 * We must set idle_stamp _before_ calling idle_balance(), such that we
>  	 * measure the duration of idle_balance() as idle time.
>  	 */
>  	this_rq->idle_stamp = rq_clock(this_rq);
>  
> -	/*
> -	 * This is OK, because current is on_cpu, which avoids it being picked
> -	 * for load-balance and preemption/IRQs are still disabled avoiding
> -	 * further scheduler activity on it and we're being very careful to
> -	 * re-start the picking loop.
> -	 */
> -	rq_unpin_lock(this_rq, rf);
> -
>  	rcu_read_lock();
> -	sd = rcu_dereference_check_sched_domain(this_rq->sd);
> -
> -	/*
> -	 * Skip <= LLC domains as they likely won't have any tasks if the
> -	 * shared runq is empty.
> -	 */
> -	if (sched_feat(SHARED_RUNQ)) {
> +	if (sched_feat(SHARED_RUNQ))
>  		sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> -		if (likely(sd))
> -			sd = sd->parent;
> -	}
> +	else
> +		sd = rcu_dereference_check_sched_domain(this_rq->sd);
>  
>  	if (!READ_ONCE(this_rq->rd->overload) ||
> -	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
> +	    /* Look at rq->avg_idle iff SHARED_RUNQ is disabled */
> +	    (!sched_feat(SHARED_RUNQ) && sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>  
> -		if (sd)
> +		while (sd) {
>  			update_next_balance(sd, &next_balance);
> +			sd = sd->child;
> +		}
> +
>  		rcu_read_unlock();
>  
>  		goto out;
>  	}
> +
> +	if (sched_feat(SHARED_RUNQ)) {
> +		struct sched_domain *tmp = sd;
> +
> +		t0 = sched_clock_cpu(this_cpu);
> +
> +		/* Do update_next_balance() for all domains within LLC */
> +		while (tmp) {
> +			update_next_balance(tmp, &next_balance);
> +			tmp = tmp->child;
> +		}
> +
> +		pulled_task = shared_runq_pick_next_task(this_rq, rf);
> +		if (pulled_task) {
> +			if (sd) {
> +				curr_cost = sched_clock_cpu(this_cpu) - t0;
> +				/*
> +				 * Will help bail out of scans of higher domains
> +				 * slightly earlier.
> +				 */
> +				update_newidle_cost(sd, curr_cost);
> +			}
> +
> +			rcu_read_unlock();
> +			goto out_swq;
> +		}
> +
> +		if (sd) {
> +			t1 = sched_clock_cpu(this_cpu);
> +			curr_cost += t1 - t0;
> +			update_newidle_cost(sd, curr_cost);
> +		}
> +
> +		/*
> +		 * Since shared_runq_pick_next_task() can take a while,
> +		 * check if the CPU was targeted for a wakeup in the
> +		 * meantime.
> +		 */
> +		if (this_rq->ttwu_pending) {
> +			rcu_read_unlock();
> +			return 0;
> +		}

At first I was wondering whether we should do this above
update_newidle_cost(), but I think it makes sense to always call
update_newidle_cost() after we've failed to get a task from
shared_runq_pick_next_task().

> +	}
>  	rcu_read_unlock();
>  
> +	/*
> +	 * This is OK, because current is on_cpu, which avoids it being picked
> +	 * for load-balance and preemption/IRQs are still disabled avoiding
> +	 * further scheduler activity on it and we're being very careful to
> +	 * re-start the picking loop.
> +	 */
> +	rq_unpin_lock(this_rq, rf);

Don't you need to do this before you exit on the rq->ttwu_pending path?

>  	raw_spin_rq_unlock(this_rq);
>  
>  	t0 = sched_clock_cpu(this_cpu);
> @@ -12335,6 +12367,13 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
>  			break;
>  
> +		/*
> +		 * Skip <= LLC domains as they likely won't have any tasks if the
> +		 * shared runq is empty.
> +		 */
> +		if (sched_feat(SHARED_RUNQ) && (sd->flags & SD_SHARE_PKG_RESOURCES))
> +			continue;
> +
>  		if (sd->flags & SD_BALANCE_NEWIDLE) {
>  
>  			pulled_task = load_balance(this_cpu, this_rq,
> @@ -12361,6 +12400,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  
>  	raw_spin_rq_lock(this_rq);
>  
> +out_swq:
>  	if (curr_cost > this_rq->max_idle_balance_cost)
>  		this_rq->max_idle_balance_cost = curr_cost;
>  


Thanks,
David

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag
  2023-08-31 10:45               ` [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag K Prateek Nayak
@ 2023-08-31 19:11                 ` David Vernet
  2023-08-31 20:23                   ` K Prateek Nayak
  2023-09-27  4:23                   ` K Prateek Nayak
  0 siblings, 2 replies; 52+ messages in thread
From: David Vernet @ 2023-08-31 19:11 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	tj, roman.gushchin, gautham.shenoy, aaron.lu, wuyun.abel,
	kernel-team

On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:

Hi Prateek,

> Even with the two patches, I still observe the following lock
> contention when profiling the tbench 128-clients run with IBS:
> 
>   -   12.61%  swapper          [kernel.vmlinux]         [k] native_queued_spin_lock_slowpath
>      - 10.94% native_queued_spin_lock_slowpath
>         - 10.73% _raw_spin_lock
>            - 9.57% __schedule
>                 schedule_idle
>                 do_idle
>               + cpu_startup_entry
>            - 0.82% task_rq_lock
>                 newidle_balance
>                 pick_next_task_fair
>                 __schedule
>                 schedule_idle
>                 do_idle
>               + cpu_startup_entry
> 
> Since David mentioned rq->avg_idle check is probably not the right step
> towards the solution, this experiment introduces a per-shard
> "overload" flag. Similar to "rq->rd->overload", per-shard overload flag
> notifies of the possibility of one or more rq covered in the shard's
> domain having a queued task. shard's overload flag is set at the same
> time as "rq->rd->overload", and is cleared when shard's list is found
> to be empty.

I think this is an interesting idea, but I feel that it's still working
against the core proposition of SHARED_RUNQ, which is to enable work
conservation.

> With these changes, following are the results for tbench 128-clients:

Just to make sure I understand, this is to address the contention we're
observing on tbench with 64 - 256 clients, right?  That's my
understanding from Gautham's reply in [0].

[0]: https://lore.kernel.org/all/ZOc7i7wM0x4hF4vL@BLR-5CG11610CF.amd.com/

If so, are we sure this change won't regress other workloads that would
have benefited from the work conservation?

Also, I assume that you don't see the improved contention without this,
even if you include your fix to the newidle_balance() that has us skip
over the <= LLC domain?

Thanks,
David

P.S. Taking off on vacation now, so any replies will be very delayed.
Thanks again for working on this!

> 
> tip				: 1.00 (var: 1.00%)
> tip + v3 + series till patch 2	: 0.41 (var: 1.15%) (diff: -58.81%)
> tip + v3 + full series		: 1.01 (var: 0.36%) (diff: +00.92%)
> 
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
>  kernel/sched/fair.c  | 13 +++++++++++--
>  kernel/sched/sched.h | 17 +++++++++++++++++
>  2 files changed, 28 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 446ffdad49e1..31fe109fdaf0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -186,6 +186,7 @@ static void shared_runq_reassign_domains(void)
>  		rq->cfs.shared_runq = shared_runq;
>  		rq->cfs.shard = &shared_runq->shards[shard_idx];
>  		rq_unlock(rq, &rf);
> +		WRITE_ONCE(rq->cfs.shard->overload, 0);
>  	}
>  }
>  
> @@ -202,6 +203,7 @@ static void __shared_runq_drain(struct shared_runq *shared_runq)
>  		list_for_each_entry_safe(p, tmp, &shard->list, shared_runq_node)
>  			list_del_init(&p->shared_runq_node);
>  		raw_spin_unlock(&shard->lock);
> +		WRITE_ONCE(shard->overload, 0);
>  	}
>  }
>  
> @@ -258,13 +260,20 @@ shared_runq_pop_task(struct shared_runq_shard *shard, int target)
>  {
>  	struct task_struct *p;
>  
> -	if (list_empty(&shard->list))
> +	if (!READ_ONCE(shard->overload))
>  		return NULL;
>  
> +	if (list_empty(&shard->list)) {
> +		WRITE_ONCE(shard->overload, 0);
> +		return NULL;
> +	}
> +
>  	raw_spin_lock(&shard->lock);
>  	p = list_first_entry_or_null(&shard->list, struct task_struct,
>  				     shared_runq_node);
> -	if (p && is_cpu_allowed(p, target))
> +	if (!p)
> +		WRITE_ONCE(shard->overload, 0);
> +	else if (is_cpu_allowed(p, target))
>  		list_del_init(&p->shared_runq_node);
>  	else
>  		p = NULL;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index f50176f720b1..e8d4d948f742 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -601,6 +601,20 @@ do {									\
>  struct shared_runq_shard {
>  	struct list_head list;
>  	raw_spinlock_t lock;
> +	/*
> +	 * shared_runq_shard can contain running tasks.
> +	 * In such cases where all the tasks are running,
> +	 * it is futile to attempt to pull tasks from the
> +	 * list. Overload flag is used to indicate case
> +	 * where one or more rq in the shard domain may
> +	 * have a queued task. If the flag is 0, it is
> +	 * very likely that all tasks in the shard are
> +	 * running and cannot be migrated. This is not
> +	 * guarded by the shard lock, and since it may
> +	 * be updated often, it is placed into its own
> +	 * cacheline.
> +	 */
> +	int overload ____cacheline_aligned;
>  } ____cacheline_aligned;
>  
>  /* This would likely work better as a configurable knob via debugfs */
> @@ -2585,6 +2599,9 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
>  	if (prev_nr < 2 && rq->nr_running >= 2) {
>  		if (!READ_ONCE(rq->rd->overload))
>  			WRITE_ONCE(rq->rd->overload, 1);
> +
> +		if (rq->cfs.shard && !READ_ONCE(rq->cfs.shard->overload))
> +			WRITE_ONCE(rq->cfs.shard->overload, 1);
>  	}
>  #endif
>  
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 7/7] sched: Shard per-LLC shared runqueues
  2023-08-31 10:45       ` Chen Yu
@ 2023-08-31 19:14         ` David Vernet
  2023-09-23  6:35           ` Chen Yu
  0 siblings, 1 reply; 52+ messages in thread
From: David Vernet @ 2023-08-31 19:14 UTC (permalink / raw)
  To: Chen Yu
  Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	tj, roman.gushchin, gautham.shenoy, kprateek.nayak, aaron.lu,
	wuyun.abel, kernel-team, tim.c.chen

On Thu, Aug 31, 2023 at 06:45:11PM +0800, Chen Yu wrote:
> On 2023-08-30 at 19:01:47 -0500, David Vernet wrote:
> > On Wed, Aug 30, 2023 at 02:17:09PM +0800, Chen Yu wrote:
> > > 
> > > 5. Check the L2 cache miss rate.
> > > perf stat -e l2_rqsts.references,l2_request.miss sleep 10
> > > The results show that the L2 cache miss rate is nearly the same with/without
> > > shared_runqueue enabled.
> > 
> > As mentioned below, I expect it would be interesting to also collect
> > icache / iTLB numbers. In my experience, poor uop cache locality will
> > also result in poor icache locality, though of course that depends on a
> > lot of other factors like alignment, how many (un)conditional branches
> > you have within some byte window, etc. If alignment, etc were the issue
> > though, we'd likely observe this also without SHARED_RUNQ.
> >
> 
> [snip...] 
> 
> > 
> > Interesting. As mentioned above, I expect we also see an increase in
> > iTLB and icache misses?
> > 
> 
> This is a good point, according to the perf topdown:
> 
> SHARED_RUNQ is disabled:
> 
>      13.0 %  tma_frontend_bound
>       6.7 %  tma_fetch_latency
>        0.3 %  tma_itlb_misses
>        0.7 %  tma_icache_misses
> 
> itlb miss ratio is 13.0% * 6.7% * 0.3%
> icache miss ratio is 13.0% * 6.7% * 0.7%
> 
> SHARED_RUNQ is enabled:
>      20.0 %  tma_frontend_bound
>       11.6 %  tma_fetch_latency
>        0.9 %  tma_itlb_misses
>        0.5 %  tma_icache_misses
> 
> itlb miss ratio is 20.0% * 11.6% * 0.9%
> icache miss ratio is 20.0% * 11.6% * 0.5%
> 
> So both icache and itlb miss ratio increase, and itlb miss increases more,
> although the bottleneck is neither icache nor itlb.
> And as you mentioned below, it depends on other factors, including the hardware
> settings, icache size, tlb size, DSB size, eg.

Thanks for collecting these stats. Good to know that things are making
sense and the data we're collecting are telling a consistent story.

> > This is something we deal with in HHVM. Like any other JIT engine /
> > compiler, it is heavily front-end CPU bound, and has very poor icache,
> > iTLB, and uop cache locality (also lots of branch resteers, etc).
> > SHARED_RUNQ actually helps this workload quite a lot, as explained in
> > the cover letter for the patch series. It makes sense that it would: uop
> > locality is really bad even without increasing CPU util. So we have no
> > reason not to migrate the task and hop on a CPU.
> >
> 
> I see, this makes sense.
>  
> > > I wonder, if SHARED_RUNQ can consider that, if a task is a long duration one,
> > > say, p->avg_runtime >= sysctl_migration_cost, maybe we should not put such task
> > > on the per-LLC shared runqueue? In this way it will not be migrated too offen
> > > so as to keep its locality(both in terms of L1/L2 cache and DSB).
> > 
> > I'm hesitant to apply such heuristics to the feature. As mentioned
> > above, SHARED_RUNQ works very well on HHVM, despite its potential hit to
> > icache / iTLB / DSB locality. Those hhvmworker tasks run for a very long
> > time, sometimes upwards of 20+ms. They also tend to have poor L1 cache
> > locality in general even when they're scheduled on the same core they
> > were on before they were descheduled, so we observe better performance
> > if the task is migrated to a fully idle core rather than e.g. its
> > hypertwin if it's available. That's not something we can guarantee with
> > SHARED_RUNQ, but it hopefully illustrates the point that it's an example
> > of a workload that would suffer with such a heuristic.
> >
> 
> OK, the hackbench is just a microbenchmark to help us evaluate
> what the impact SHARED_RUNQ could bring. If such heuristic heals
> hackbench but hurts the real workload then we can consider
> other direction.
>  
> > Another point to consider is that performance implications that are a
> > result of Intel micro architectural details don't necessarily apply to
> > everyone. I'm not as familiar with the instruction decode pipeline on
> > AMD chips like Zen4. I'm sure they have a uop cache, but the size of
> > that cache, alignment requirements, the way that cache interfaces with
> > e.g. their version of the MITE / decoder, etc, are all going to be quite
> > different.
> >
> 
> Yes, this is true.
>  
> > In general, I think it's difficult for heuristics like this to suit all
> > possible workloads or situations (not that you're claiming it is). My
> > preference is to keep it as is so that it's easier for users to build a
> > mental model of what outcome they should expect if they use the feature.
> > Put another way: As a user of this feature, I'd be a lot more surprised
> > to see that I enabled it and CPU util stayed low, vs. enabling it and
> > seeing higher CPU util, but also degraded icache / iTLB locality.
> >
> 
> Understand.
>  
> > Let me know what you think, and thanks again for investing your time
> > into this.
> > 
> 
> Let me run other benchmarks to see if others are sensitive to
> the resource locality.

Great, thank you Chenyu.

FYI, I'll be on vacation for over a week starting later today, so my
responses may be delayed.

Thanks in advance for working on this. Looking forward to seeing the
results when I'm back at work.

Thanks,
David

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC PATCH 2/3] sched/fair: Improve integration of SHARED_RUNQ feature within newidle_balance
  2023-08-31 18:45                 ` David Vernet
@ 2023-08-31 19:47                   ` K Prateek Nayak
  0 siblings, 0 replies; 52+ messages in thread
From: K Prateek Nayak @ 2023-08-31 19:47 UTC (permalink / raw)
  To: David Vernet
  Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	tj, roman.gushchin, gautham.shenoy, aaron.lu, wuyun.abel,
	kernel-team

Hello David,

Thank you for taking a look at this despite being on vacation.

On 9/1/2023 12:15 AM, David Vernet wrote:
> On Thu, Aug 31, 2023 at 04:15:07PM +0530, K Prateek Nayak wrote:
>> This patch takes the relevant optimizations from [1] in
>> newidle_balance(). Following is the breakdown:
> 
> Thanks for working on this. I think the fix you added for skipping <=
> LLC domains makes sense. The others possibly as well

I too am in doubt about some of them, but I left them in since I was
building on top of the cumulative diff.

> -- left some
> comments below!
> 
>>
>> - Check "rq->rd->overload" before jumping into newidle_balance, even
>>   with SHARED_RQ feat enabled.
> 
> Out of curiosity -- did you observe this making a material difference in
> your tests? After thinking about it some more, though I see the argument
> for why it would be logical to check if we're overloaded, I'm still
> thinking that it's more ideal to just always check the SHARED_RUNQ.
> rd->overload is only set in find_busiest_group() when we load balance,
> so I worry that having SHARED_RUNQ follow rd->overload may just end up
> making it redundant with normal load balancing in many cases.
> 
> So yeah, while I certainly understand the idea (and would like to better
> understand what kind of difference it made in your tests), I still feel
> pretty strongly that SHARED_RUNQ makes the most sense as a feature when
> it ignores all of these heuristics and just tries to maximize work
> conservation.
> 
> What do you think?

Actually, as it turns out, it was probably a combination of the
rq->avg_idle check + updating of the cost that got the performance back
during experimenting. In Patch 3, I've given the results with this patch
alone and it makes no difference, for tbench 128-clients at least. There
is the same rq lock contention I mentioned previously, which is why I
introduced the per-shard "overload" flag.

Based on Anna-Maria's observation in [1], we have short idle periods
spread across the system with tbench. Now it is possible we are doing a
newidle_balance() when it would have been better to let the CPU idle for
that short duration and not cause contention for the rq lock.

[1] https://lore.kernel.org/lkml/80956e8f-761e-b74-1c7a-3966f9e8d934@linutronix.de/

> 
>> - Call update_next_balance() for all the domains up to the MC domain
>>   when the SHARED_RQ path is taken.
> 
> I _think_ this makes sense. Though even in this case, I feel that it may
> be slightly confusing and/or incorrect to push back the balance time
> just because we didn't find a task in our current CCX's shared_runq.
> Maybe we should avoid mucking with load balancing? Not sure, but I am
> leaning towards what you're proposing here as a better approach.

This requires a deeper look and more testing, yes.

> 
>> - Account cost from shared_runq_pick_next_task() and update
>>   curr_cost and sd->max_newidle_lb_cost accordingly.
> 
> Yep, I think this is the correct thing to do.
> 
>>
>> - Move the initial rq_unpin_lock() logic around. Also, the caller of
>>   shared_runq_pick_next_task() is responsible for calling
>>   rq_repin_lock() if the return value is non zero. (Needs to be verified
>>   everything is right with LOCKDEP)
> 
> Still need to think more about this, but it's purely just tactical and
> can easily be fixed if we need.

I agree. I'll leave the full picture of this below in
[Locking code movement clarifications] since we seem to keep coming back
to this and it would be good to have more eyes on what is going on in my
mind :)

> 
>>
>> - Includes a fix to skip directly above the LLC domain when calling the
>>   load_balance() in newidle_balance()
> 
> Big fix, thanks again for noticing it.
> 
>> All other surgery from [1] has been removed.
>>
>> Link: https://lore.kernel.org/all/31aeb639-1d66-2d12-1673-c19fed0ab33a@amd.com/ [1]
>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> ---
>>  kernel/sched/fair.c | 94 ++++++++++++++++++++++++++++++++-------------
>>  1 file changed, 67 insertions(+), 27 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index bf844ffa79c2..446ffdad49e1 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -337,7 +337,6 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
>>  		rq_unpin_lock(rq, &src_rf);
>>  		raw_spin_unlock_irqrestore(&p->pi_lock, src_rf.flags);
>>  	}
>> -	rq_repin_lock(rq, rf);
>>  
>>  	return ret;
>>  }
>> @@ -12276,50 +12275,83 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>>  	if (!cpu_active(this_cpu))
>>  		return 0;
>>  
>> -	if (sched_feat(SHARED_RUNQ)) {
>> -		pulled_task = shared_runq_pick_next_task(this_rq, rf);
>> -		if (pulled_task)
>> -			return pulled_task;
>> -	}
>> -
>>  	/*
>>  	 * We must set idle_stamp _before_ calling idle_balance(), such that we
>>  	 * measure the duration of idle_balance() as idle time.
>>  	 */
>>  	this_rq->idle_stamp = rq_clock(this_rq);
>>  
>> -	/*
>> -	 * This is OK, because current is on_cpu, which avoids it being picked
>> -	 * for load-balance and preemption/IRQs are still disabled avoiding
>> -	 * further scheduler activity on it and we're being very careful to
>> -	 * re-start the picking loop.
>> -	 */
>> -	rq_unpin_lock(this_rq, rf);
>> -
>>  	rcu_read_lock();
>> -	sd = rcu_dereference_check_sched_domain(this_rq->sd);
>> -
>> -	/*
>> -	 * Skip <= LLC domains as they likely won't have any tasks if the
>> -	 * shared runq is empty.
>> -	 */
>> -	if (sched_feat(SHARED_RUNQ)) {
>> +	if (sched_feat(SHARED_RUNQ))
>>  		sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>> -		if (likely(sd))
>> -			sd = sd->parent;
>> -	}
>> +	else
>> +		sd = rcu_dereference_check_sched_domain(this_rq->sd);
>>  
>>  	if (!READ_ONCE(this_rq->rd->overload) ||
>> -	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>> +	    /* Look at rq->avg_idle iff SHARED_RUNQ is disabled */
>> +	    (!sched_feat(SHARED_RUNQ) && sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>>  
>> -		if (sd)
>> +		while (sd) {
>>  			update_next_balance(sd, &next_balance);
>> +			sd = sd->child;
>> +		}
>> +
>>  		rcu_read_unlock();
>>  
>>  		goto out;
>>  	}
>> +
>> +	if (sched_feat(SHARED_RUNQ)) {
>> +		struct sched_domain *tmp = sd;
>> +
>> +		t0 = sched_clock_cpu(this_cpu);
>> +
>> +		/* Do update_next_balance() for all domains within LLC */
>> +		while (tmp) {
>> +			update_next_balance(tmp, &next_balance);
>> +			tmp = tmp->child;
>> +		}
>> +
>> +		pulled_task = shared_runq_pick_next_task(this_rq, rf);
>> +		if (pulled_task) {
>> +			if (sd) {
>> +				curr_cost = sched_clock_cpu(this_cpu) - t0;
>> +				/*
>> +				 * Will help bail out of scans of higher domains
>> +				 * slightly earlier.
>> +				 */
>> +				update_newidle_cost(sd, curr_cost);
>> +			}
>> +
>> +			rcu_read_unlock();
>> +			goto out_swq;
>> +		}
>> +
>> +		if (sd) {
>> +			t1 = sched_clock_cpu(this_cpu);
>> +			curr_cost += t1 - t0;
>> +			update_newidle_cost(sd, curr_cost);
>> +		}
>> +
>> +		/*
>> +		 * Since shared_runq_pick_next_task() can take a while,
>> +		 * check if the CPU was targeted for a wakeup in the
>> +		 * meantime.
>> +		 */
>> +		if (this_rq->ttwu_pending) {
>> +			rcu_read_unlock();
>> +			return 0;
>> +		}
> 
> At first I was wondering whether we should do this above
> update_newidle_cost(), but I think it makes sense to always call
> update_newidle_cost() after we've failed to get a task from
> shared_runq_pick_next_task().

Indeed. I think the cost is worth accounting for.

> 
>> +	}
>>  	rcu_read_unlock();
>>  
>> +	/*
>> +	 * This is OK, because current is on_cpu, which avoids it being picked
>> +	 * for load-balance and preemption/IRQs are still disabled avoiding
>> +	 * further scheduler activity on it and we're being very careful to
>> +	 * re-start the picking loop.
>> +	 */
>> +	rq_unpin_lock(this_rq, rf);
> 
> Don't you need to do this before you exit on the rq->ttwu_pending path?

[Locking code movement clarifications]

Okay this is where I'll put all the locking bits I have in my head:

o First, the removal of rq_repin_lock() in shared_runq_pick_next_task()

  Since this is only called from newidle_balance(), it is easy to
  isolate the changes. shared_runq_pick_next_task() can return either
  0, 1, or -1. The interpretation is the same as the return value of
  newidle_balance():

   0: Unsuccessful at pulling a task, but the rq lock was never released
      and reacquired - it was held all the time.

   1: Task was pulled successfully. The rq lock was released and
      reacquired in the process but now, with the above changes, it is
      not pinned.

  -1: Unsuccessful at pulling a task, but the rq lock was released and
      reacquired in the process and now, with the above changes, it is
      not pinned.

  Now the following block:

	pulled_task = shared_runq_pick_next_task(this_rq, rf);
	if (pulled_task) {
		...
		goto out_swq;
	}

  takes care of the cases where the return value is -1 or 1. The "out_swq"
  label is almost at the end of newidle_balance(), and just before
  returning, newidle_balance() does:

	rq_repin_lock(this_rq, rf);

  So this path will repin the lock.

  Now for the case where shared_runq_pick_next_task() returns 0.

o Which brings us to the question you asked above

  newidle_balance() is called with the rq lock held and pinned, and it
  expects the same when newidle_balance() returns. The very first bailout
  check in newidle_balance() is:

	if (this_rq->ttwu_pending)
		return 0;

  so we return without making any changes to the state of the rq lock.

  Coming to the above changes, if we hit the ttwu_pending bailout you
  pointed at, shared_runq_pick_next_task() must have returned 0,
  signifying no modification to the state of the lock or its pinning.
  Then we update the cost and come to the ttwu_pending check. We still
  have the lock held, and it is pinned. Thus we do not need to unpin the
  lock, since newidle_balance() is expected to return with the lock held
  and pinned.

Please let me know if I've missed something.

> 
>>  	raw_spin_rq_unlock(this_rq);
>>  
>>  	t0 = sched_clock_cpu(this_cpu);
>> @@ -12335,6 +12367,13 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>>  		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
>>  			break;
>>  
>> +		/*
>> +		 * Skip <= LLC domains as they likely won't have any tasks if the
>> +		 * shared runq is empty.
>> +		 */
>> +		if (sched_feat(SHARED_RUNQ) && (sd->flags & SD_SHARE_PKG_RESOURCES))
>> +			continue;
>> +
>>  		if (sd->flags & SD_BALANCE_NEWIDLE) {
>>  
>>  			pulled_task = load_balance(this_cpu, this_rq,
>> @@ -12361,6 +12400,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>>  
>>  	raw_spin_rq_lock(this_rq);
>>  
>> +out_swq:
>>  	if (curr_cost > this_rq->max_idle_balance_cost)
>>  		this_rq->max_idle_balance_cost = curr_cost;
>>  
> 
> 
> Thanks,
> David
 
--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag
  2023-08-31 19:11                 ` David Vernet
@ 2023-08-31 20:23                   ` K Prateek Nayak
  2023-09-29 17:01                     ` David Vernet
  2023-09-27  4:23                   ` K Prateek Nayak
  1 sibling, 1 reply; 52+ messages in thread
From: K Prateek Nayak @ 2023-08-31 20:23 UTC (permalink / raw)
  To: David Vernet
  Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	tj, roman.gushchin, gautham.shenoy, aaron.lu, wuyun.abel,
	kernel-team

Hello David,

On 9/1/2023 12:41 AM, David Vernet wrote:
> On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
> 
> Hi Prateek,
> 
>> Even with the two patches, I still observe the following lock
>> contention when profiling the tbench 128-clients run with IBS:
>>
>>   -   12.61%  swapper          [kernel.vmlinux]         [k] native_queued_spin_lock_slowpath
>>      - 10.94% native_queued_spin_lock_slowpath
>>         - 10.73% _raw_spin_lock
>>            - 9.57% __schedule
>>                 schedule_idle
>>                 do_idle
>>               + cpu_startup_entry
>>            - 0.82% task_rq_lock
>>                 newidle_balance
>>                 pick_next_task_fair
>>                 __schedule
>>                 schedule_idle
>>                 do_idle
>>               + cpu_startup_entry
>>
>> Since David mentioned rq->avg_idle check is probably not the right step
>> towards the solution, this experiment introduces a per-shard
>> "overload" flag. Similar to "rq->rd->overload", per-shard overload flag
>> notifies of the possibility of one or more rq covered in the shard's
>> domain having a queued task. shard's overload flag is set at the same
>> time as "rq->rd->overload", and is cleared when shard's list is found
>> to be empty.
> 
> I think this is an interesting idea, but I feel that it's still working
> against the core proposition of SHARED_RUNQ, which is to enable work
> conservation.

I don't think so! Work conservation is possible if there is an
imbalance. Consider the case where we have 15 tasks in the shared_runq
but we have 16 CPUs, 15 of which are running these 15 tasks, and one
going idle. Work is conserved. What we need to worry about is tasks in
the shared_runq that are queued on their respective CPUs. This can only
happen if any one of the rqs has nr_running >= 2, which is also the
point where we set "shard->overload".

Now the situation can change later and all tasks in the shared_runq
might be running on their respective CPUs, but "shard->overload" is only
cleared when the shared_runq becomes empty. If this is too late, maybe
we can clear it when periodic load balancing finds no queuing (somewhere
around the time we update nr_idle_scan).

So the window where we do not go ahead with popping a task from the
shared_runq_shard->list is between the list becoming empty and at least
one of the CPUs associated with the shard reporting nr_running >= 2,
which is when work conservation is needed.

> 
>> With these changes, following are the results for tbench 128-clients:
> 
> Just to make sure I understand, this is to address the contention we're
> observing on tbench with 64 - 256 clients, right?  That's my
> understanding from Gautham's reply in [0].
> 
> [0]: https://lore.kernel.org/all/ZOc7i7wM0x4hF4vL@BLR-5CG11610CF.amd.com/

I'm not sure if Gautham saw the contention with IBS, but he did see an
abnormal blowup in newidle_balance() counts, which he suspected was the
cause of the regression. I noticed the rq lock contention after doing a
fair bit of surgery. Let me go check if that was the case with vanilla
v3 too.

> 
> If so, are we sure this change won't regress other workloads that would
> have benefited from the work conservation?

I don't think we'll regress any workloads, as I explained above, because
the "overload" flag being 0 almost always (the update/access is not
atomic) indicates a case where the tasks cannot be pulled. However, that
needs to be tested since there is a small behavior change in
shared_runq_pick_next_task(): previously, if the task was running on a
CPU, we would have popped it from the shared_runq, done some lock
fiddling before finding out it is running, and some more lock fiddling
before the function returned "-1"; now, with the changes here, it'll
simply return "0". Although that is correct, we have seen some
interesting cases in the past [1] where random lock contention actually
helps certain benchmarks ¯\_(ツ)_/¯

[1] https://lore.kernel.org/all/44428f1e-ca2c-466f-952f-d5ad33f12073@amd.com/ 

> 
> Also, I assume that you don't see the improved contention without this,
> even if you include your fix to the newidle_balance() that has us skip
> over the <= LLC domain?

No improvements! The lock is still very much contended for. I wonder if
it could be because of the unlocking and locking of the rq again in
shared_runq_pick_next_task() even when the task is on a CPU. Also, since
it returns -1 for this case, pick_next_task_fair() will return
RETRY_TASK, which can have further implications.

> 
> Thanks,
> David
> 
> P.S. Taking off on vacation now, so any replies will be very delayed.
> Thanks again for working on this!

Hope you have a great vacation :)

> 
>>
>> tip				: 1.00 (var: 1.00%)
>> tip + v3 + series till patch 2	: 0.41 (var: 1.15%) (diff: -58.81%)
>> tip + v3 + full series		: 1.01 (var: 0.36%) (diff: +00.92%)
>>
>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> ---
>> [..snip..]
>>

--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 7/7] sched: Shard per-LLC shared runqueues
  2023-08-31 19:14         ` David Vernet
@ 2023-09-23  6:35           ` Chen Yu
  0 siblings, 0 replies; 52+ messages in thread
From: Chen Yu @ 2023-09-23  6:35 UTC (permalink / raw)
  To: David Vernet
  Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	tj, roman.gushchin, gautham.shenoy, kprateek.nayak, aaron.lu,
	wuyun.abel, kernel-team, tim.c.chen

On 2023-08-31 at 14:14:44 -0500, David Vernet wrote:
> On Thu, Aug 31, 2023 at 06:45:11PM +0800, Chen Yu wrote:
> > On 2023-08-30 at 19:01:47 -0500, David Vernet wrote:
> > > On Wed, Aug 30, 2023 at 02:17:09PM +0800, Chen Yu wrote:
[snip...]
> > 
> > Let me run other benchmarks to see if others are sensitive to
> > the resource locality.
> 
> Great, thank you Chenyu.
> 
> FYI, I'll be on vacation for over a week starting later today, so my
> responses may be delayed.
> 
> Thanks in advance for working on this. Looking forward to seeing the
> results when I'm back at work.

Sorry for the late result. I applied your latest patch set on top of upstream
6.6-rc2, commit 27bbf45eae9c (I pulled the latest commit from upstream yesterday).
The good news is that there is an overall slight but stable improvement on
tbench, and no obvious regression on other benchmarks is observed on Sapphire
Rapids with 224 CPUs:

tbench throughput
======
case            	load    	baseline(std%)	compare%( std%)
loopback        	56-threads	 1.00 (  0.85)	 +4.35 (  0.23)
loopback        	112-threads	 1.00 (  0.38)	 +0.91 (  0.05)
loopback        	168-threads	 1.00 (  0.03)	 +2.96 (  0.06)
loopback        	224-threads	 1.00 (  0.09)	 +2.95 (  0.05)
loopback        	280-threads	 1.00 (  0.12)	 +2.48 (  0.25)
loopback        	336-threads	 1.00 (  0.23)	 +2.54 (  0.14)
loopback        	392-threads	 1.00 (  0.53)	 +2.91 (  0.04)
loopback        	448-threads	 1.00 (  0.10)	 +2.76 (  0.07)

schbench  99.0th tail latency
========
case            	load    	baseline(std%)	compare%( std%)
normal          	1-mthreads	 1.00 (  0.32)	 +0.68 (  0.32)
normal          	2-mthreads	 1.00 (  1.83)	 +4.48 (  3.31)
normal          	4-mthreads	 1.00 (  0.83)	 -0.59 (  1.80)
normal          	8-mthreads	 1.00 (  4.47)	 -1.07 (  3.49)

netperf  throughput
=======
case            	load    	baseline(std%)	compare%( std%)
TCP_RR          	56-threads	 1.00 (  1.01)	 +1.37 (  1.41)
TCP_RR          	112-threads	 1.00 (  2.44)	 -0.94 (  2.63)
TCP_RR          	168-threads	 1.00 (  2.94)	 +3.22 (  4.63)
TCP_RR          	224-threads	 1.00 (  2.38)	 +2.83 (  3.62)
TCP_RR          	280-threads	 1.00 ( 66.07)	 -7.26 ( 78.95)
TCP_RR          	336-threads	 1.00 ( 21.92)	 -0.50 ( 21.48)
TCP_RR          	392-threads	 1.00 ( 34.31)	 -0.00 ( 33.08)
TCP_RR          	448-threads	 1.00 ( 43.33)	 -0.31 ( 43.82)
UDP_RR          	56-threads	 1.00 (  8.78)	 +3.84 (  9.38)
UDP_RR          	112-threads	 1.00 ( 14.15)	 +1.84 (  8.35)
UDP_RR          	168-threads	 1.00 (  5.10)	 +2.95 (  8.85)
UDP_RR          	224-threads	 1.00 ( 15.13)	 +2.76 ( 14.11)
UDP_RR          	280-threads	 1.00 ( 15.14)	 +2.14 ( 16.75)
UDP_RR          	336-threads	 1.00 ( 25.85)	 +1.64 ( 27.42)
UDP_RR          	392-threads	 1.00 ( 34.34)	 +0.40 ( 34.20)
UDP_RR          	448-threads	 1.00 ( 41.87)	 +1.41 ( 41.22)

We can have a re-run after the latest one is released.

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag
  2023-08-31 19:11                 ` David Vernet
  2023-08-31 20:23                   ` K Prateek Nayak
@ 2023-09-27  4:23                   ` K Prateek Nayak
  2023-09-27  6:59                     ` Chen Yu
  2023-09-27 13:08                     ` David Vernet
  1 sibling, 2 replies; 52+ messages in thread
From: K Prateek Nayak @ 2023-09-27  4:23 UTC (permalink / raw)
  To: David Vernet
  Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	tj, roman.gushchin, gautham.shenoy, aaron.lu, wuyun.abel,
	kernel-team

Hello David,

Some more test results (although this might be slightly irrelevant with
the next version around the corner)

On 9/1/2023 12:41 AM, David Vernet wrote:
> On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
> 
> Hi Prateek,
> 
>> Even with the two patches, I still observe the following lock
>> contention when profiling the tbench 128-clients run with IBS:
>>
>>   -   12.61%  swapper          [kernel.vmlinux]         [k] native_queued_spin_lock_slowpath
>>      - 10.94% native_queued_spin_lock_slowpath
>>         - 10.73% _raw_spin_lock
>>            - 9.57% __schedule
>>                 schedule_idle
>>                 do_idle
>>               + cpu_startup_entry
>>            - 0.82% task_rq_lock
>>                 newidle_balance
>>                 pick_next_task_fair
>>                 __schedule
>>                 schedule_idle
>>                 do_idle
>>               + cpu_startup_entry
>>
>> Since David mentioned rq->avg_idle check is probably not the right step
>> towards the solution, this experiment introduces a per-shard
>> "overload" flag. Similar to "rq->rd->overload", per-shard overload flag
>> notifies of the possibility of one or more rq covered in the shard's
>> domain having a queued task. shard's overload flag is set at the same
>> time as "rq->rd->overload", and is cleared when shard's list is found
>> to be empty.
> 
> I think this is an interesting idea, but I feel that it's still working
> against the core proposition of SHARED_RUNQ, which is to enable work
> conservation.
> 
I have some more numbers. This time I'm accounting the cost of peeking
into the shared-runq and have two variants - one that keeps the current
vanilla flow from your v3, and the other that moves the rq->avg_idle
check before looking at the shared-runq. Following are the results:

-> Without EEVDF

o tl;dr

- With the avg_idle check, the improvements observed with shared-runq
  aren't as large, but they are still noticeable.

- Most regressions are gone and the others aren't as bad with the
  introduction of shared-runq.

o Kernels

base			: tip is at commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs bandwidth in use")
shared_runq		: base + correct time accounting with v3 of the series without any other changes
shared_runq_idle_check	: shared_runq + move the rq->avg_idle check before peeking into the shared_runq
			  (the rd->overload check still remains below the shared_runq access)

o Benchmarks

==================================================================
Test          : hackbench
Units         : Normalized time in seconds
Interpretation: Lower is better
Statistic     : AMean
==================================================================
Case:           base[pct imp](CV)    shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
 1-groups     1.00 [ -0.00]( 2.64)     0.90 [ 10.20]( 8.79)             0.93 [  7.08]( 3.87)
 2-groups     1.00 [ -0.00]( 2.97)     0.85 [ 15.06]( 4.86)             0.96 [  4.47]( 2.22)
 4-groups     1.00 [ -0.00]( 1.84)     0.93 [  7.38]( 2.63)             0.94 [  6.07]( 1.02)
 8-groups     1.00 [ -0.00]( 1.24)     0.97 [  2.83]( 2.69)             0.98 [  1.82]( 1.01)
16-groups     1.00 [ -0.00]( 3.31)     1.03 [ -2.93]( 2.46)             1.02 [ -1.61]( 1.34)


==================================================================
Test          : tbench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:   base[pct imp](CV)     shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
    1     1.00 [  0.00]( 1.08)     0.98 [ -1.89]( 0.48)              0.99 [ -0.73]( 0.70)
    2     1.00 [  0.00]( 0.69)     0.99 [ -1.48]( 0.24)              0.98 [ -1.62]( 0.85)
    4     1.00 [  0.00]( 0.70)     0.97 [ -2.87]( 1.34)              0.98 [ -2.15]( 0.48)
    8     1.00 [  0.00]( 0.85)     0.97 [ -3.17]( 1.56)              0.99 [ -1.32]( 1.09)
   16     1.00 [  0.00]( 2.18)     0.91 [ -8.70]( 0.27)              0.98 [ -2.03]( 1.28)
   32     1.00 [  0.00]( 3.84)     0.51 [-48.53]( 2.52)              1.01 [  1.41]( 3.83)
   64     1.00 [  0.00]( 7.06)     0.38 [-62.49]( 1.89)              1.05 [  5.33]( 4.09)
  128     1.00 [  0.00]( 0.88)     0.41 [-58.92]( 0.28)              1.01 [  0.54]( 1.65)
  256     1.00 [  0.00]( 0.88)     0.97 [ -2.56]( 1.78)              1.00 [ -0.48]( 0.33)
  512     1.00 [  0.00]( 0.07)     1.00 [  0.06]( 0.04)              0.98 [ -1.51]( 0.44)
 1024     1.00 [  0.00]( 0.30)     0.99 [ -1.35]( 0.90)              1.00 [ -0.24]( 0.41)


==================================================================
Test          : stream-10
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:      base[pct imp](CV)      shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
 Copy     1.00 [  0.00]( 8.87)     1.00 [  0.31]( 5.27)                1.09 [  9.11]( 0.58)
Scale     1.00 [  0.00]( 6.80)     0.99 [ -0.81]( 7.20)                1.00 [  0.49]( 5.67)
  Add     1.00 [  0.00]( 7.24)     0.99 [ -1.13]( 7.02)                1.02 [  2.06]( 6.36)
Triad     1.00 [  0.00]( 5.00)     0.96 [ -4.11]( 9.37)                1.03 [  3.46]( 4.41)


==================================================================
Test          : stream-100
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:      base[pct imp](CV)     shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
 Copy     1.00 [  0.00]( 0.45)     1.00 [  0.32]( 1.88)              1.04 [  4.02]( 1.45)
Scale     1.00 [  0.00]( 4.40)     0.98 [ -1.76]( 6.46)              1.01 [  1.28]( 1.00)
  Add     1.00 [  0.00]( 4.97)     0.98 [ -1.85]( 6.01)              1.03 [  2.75]( 0.24)
Triad     1.00 [  0.00]( 0.24)     0.96 [ -3.82]( 6.41)              0.99 [ -1.10]( 4.47)


==================================================================
Test          : netperf
Units         : Normalized Throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:        base[pct imp](CV)      shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
 1-clients     1.00 [  0.00]( 0.46)     0.98 [ -2.37]( 0.08)              0.99 [ -1.32]( 0.37)
 2-clients     1.00 [  0.00]( 0.75)     0.98 [ -2.04]( 0.33)              0.98 [ -1.57]( 0.50)
 4-clients     1.00 [  0.00]( 0.84)     0.97 [ -3.25]( 1.01)              0.99 [ -0.77]( 0.54)
 8-clients     1.00 [  0.00]( 0.78)     0.96 [ -4.18]( 0.68)              0.99 [ -0.77]( 0.63)
16-clients     1.00 [  0.00]( 2.56)     0.84 [-15.71]( 6.33)              1.00 [ -0.35]( 0.58)
32-clients     1.00 [  0.00]( 1.03)     0.35 [-64.92]( 8.90)              0.98 [ -1.92]( 1.67)
64-clients     1.00 [  0.00]( 2.69)     0.26 [-74.05]( 6.56)              0.98 [ -2.46]( 2.42)
128-clients    1.00 [  0.00]( 1.91)     0.25 [-74.81]( 3.67)              0.99 [ -1.50]( 2.15)
256-clients    1.00 [  0.00]( 2.21)     0.92 [ -7.73]( 2.29)              0.98 [ -1.51]( 1.85)
512-clients    1.00 [  0.00](45.18)     0.96 [ -4.06](52.89)              0.98 [ -2.49](49.22)


==================================================================
Test          : schbench
Units         : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic     : Median
==================================================================
#workers:        base[pct imp](CV)     shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
  1             1.00 [ -0.00](12.03)     1.04 [ -4.35](34.64)              1.13 [-13.04]( 2.25)
  2             1.00 [ -0.00]( 9.36)     1.00 [ -0.00]( 4.56)              1.12 [-11.54](12.83)
  4             1.00 [ -0.00]( 1.95)     1.00 [ -0.00](13.36)              0.93 [  6.67]( 9.10)
  8             1.00 [ -0.00]( 9.01)     0.97 [  2.70]( 4.68)              1.03 [ -2.70](12.11)
 16             1.00 [ -0.00]( 3.08)     1.02 [ -2.00]( 3.01)              1.00 [ -0.00]( 7.33)
 32             1.00 [ -0.00]( 0.75)     1.03 [ -2.60]( 8.20)              1.09 [ -9.09]( 4.24)
 64             1.00 [ -0.00]( 2.15)     0.91 [  9.20]( 1.03)              1.01 [ -0.61]( 7.14)
128             1.00 [ -0.00]( 5.18)     1.05 [ -4.57]( 7.74)              1.01 [ -0.57]( 5.62)
256             1.00 [ -0.00]( 4.18)     1.06 [ -5.87](51.02)              1.10 [ -9.51](15.82)
512             1.00 [ -0.00]( 2.10)     1.03 [ -3.36]( 2.88)              1.06 [ -5.87]( 1.10)


==================================================================
Test          : Unixbench
Units         : Various, Throughput
Interpretation: Higher is better
Statistic     : AMean, Hmean (Specified)
==================================================================
                                                base                  shared_runq             shared_runq_idle_check
Hmean     unixbench-dhry2reg-1            41407024.82 (   0.00%)    41211208.57 (  -0.47%)     41354094.87 (  -0.13%)
Hmean     unixbench-dhry2reg-512        6249629291.88 (   0.00%)  6245782129.00 (  -0.06%)   6236514875.97 (  -0.21%)
Amean     unixbench-syscall-1              2922580.63 (   0.00%)     2928021.57 *  -0.19%*      2895742.17 *   0.92%*
Amean     unixbench-syscall-512            7606400.73 (   0.00%)     8390396.33 * -10.31%*      8236409.00 *  -8.28%*
Hmean     unixbench-pipe-1                 2574942.54 (   0.00%)     2610825.75 *   1.39%*      2531492.38 *  -1.69%*
Hmean     unixbench-pipe-512             364489246.31 (   0.00%)   366388360.22 *   0.52%*    360160487.66 *  -1.19%*
Hmean     unixbench-spawn-1                   4428.94 (   0.00%)        4391.20 (  -0.85%)         4541.06 (   2.53%)
Hmean     unixbench-spawn-512                68883.47 (   0.00%)       69143.38 (   0.38%)        69776.01 *   1.30%*
Hmean     unixbench-execl-1                   3878.47 (   0.00%)        3861.63 (  -0.43%)         3873.96 (  -0.12%)
Hmean     unixbench-execl-512                11638.84 (   0.00%)       12758.38 *   9.62%*        14001.23 *  20.30%*


==================================================================
Test          : ycsb-mongodb
Units         : Throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
tip                     : 1.00 (var: 1.41%)
shared_runq             : 0.99 (var: 0.84%)  (diff: -1.40%)
shared_runq_idle_check  : 1.00 (var: 0.79%)  (diff:  0.00%)


==================================================================
Test          : DeathStarBench
Units         : %diff, relative to base
Interpretation: Higher is better
Statistic     : AMean
==================================================================
pinning      scaling   eevdf   shared_runq    shared_runq_idle_check
1CDD            1       0%       -0.39%              -0.09%
2CDD            2       0%       -0.31%              -1.73%
4CDD            4       0%        3.28%               0.60%
8CDD            8       0%        4.35%               2.98% 
 

-> With EEVDF

o tl;dr

- Same as what was observed without EEVDF, but shared_runq now shows
  serious regressions with several more variants of tbench and
  netperf.

o Kernels

eevdf			: tip:sched/core at commit b41bbb33cf75 ("Merge branch 'sched/eevdf' into sched/core")
shared_runq		: eevdf + correct time accounting with v3 of the series without any other changes
shared_runq_idle_check	: shared_runq + move the rq->avg_idle check before peeking into the shared_runq
			  (the rd->overload check still remains below the shared_runq access)

==================================================================
Test          : hackbench
Units         : Normalized time in seconds
Interpretation: Lower is better
Statistic     : AMean
==================================================================
Case:          eevdf[pct imp](CV)    shared_runq[pct imp](CV)  shared_runq_idle_check[pct imp](CV)
 1-groups     1.00 [ -0.00]( 1.89)     0.95 [  4.72]( 8.98)         0.99 [  0.83]( 3.77)
 2-groups     1.00 [ -0.00]( 2.04)     0.86 [ 13.87]( 2.59)         0.95 [  4.92]( 1.98)
 4-groups     1.00 [ -0.00]( 2.38)     0.96 [  4.50]( 3.44)         0.98 [  2.45]( 1.93)
 8-groups     1.00 [ -0.00]( 1.52)     1.01 [ -0.95]( 1.36)         0.96 [  3.97]( 0.89)
16-groups     1.00 [ -0.00]( 3.44)     1.00 [ -0.00]( 1.59)         0.96 [  3.91]( 3.36)


==================================================================
Test          : tbench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:  eevdf[pct imp](CV)    shared_runq[pct imp](CV)  shared_runq_idle_check[pct imp](CV)
    1     1.00 [  0.00]( 0.18)     1.00 [  0.15]( 0.59)         0.98 [ -1.76]( 0.74)
    2     1.00 [  0.00]( 0.63)     0.97 [ -3.44]( 0.91)         0.98 [ -1.93]( 1.27)
    4     1.00 [  0.00]( 0.86)     0.95 [ -4.86]( 0.85)         0.99 [ -1.15]( 0.77)
    8     1.00 [  0.00]( 0.22)     0.94 [ -6.44]( 1.31)         0.99 [ -1.00]( 0.97)
   16     1.00 [  0.00]( 1.99)     0.86 [-13.68]( 0.38)         1.00 [ -0.47]( 0.99)
   32     1.00 [  0.00]( 4.29)     0.48 [-52.21]( 0.53)         1.01 [  1.24]( 6.66)
   64     1.00 [  0.00]( 1.71)     0.35 [-64.68]( 0.44)         0.99 [ -0.66]( 0.70)
  128     1.00 [  0.00]( 0.65)     0.40 [-60.32]( 0.95)         0.98 [ -2.15]( 1.25)
  256     1.00 [  0.00]( 0.19)     0.72 [-28.28]( 1.88)         0.99 [ -1.39]( 2.50)
  512     1.00 [  0.00]( 0.20)     0.79 [-20.59]( 4.40)         1.00 [ -0.42]( 0.38)
 1024     1.00 [  0.00]( 0.29)     0.80 [-20.24]( 0.64)         0.99 [ -0.51]( 0.20)


==================================================================
Test          : stream-10
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:     eevdf[pct imp](CV)    shared_runq[pct imp](CV)   shared_runq_idle_check[pct imp](CV)
 Copy     1.00 [  0.00]( 4.32)     0.94 [ -6.40]( 8.05)          1.01 [  1.39]( 4.58)
Scale     1.00 [  0.00]( 5.21)     0.98 [ -2.15]( 6.79)          0.95 [ -4.54]( 7.35)
  Add     1.00 [  0.00]( 6.25)     0.97 [ -2.64]( 6.47)          0.97 [ -3.08]( 7.49)
Triad     1.00 [  0.00](10.74)     1.01 [  0.92]( 7.40)          1.01 [  1.25]( 8.76)


==================================================================
Test          : stream-100
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:     eevdf[pct imp](CV)    shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
 Copy     1.00 [  0.00]( 0.70)     1.00 [ -0.07]( 0.70)         1.00 [  0.47]( 0.94)
Scale     1.00 [  0.00]( 6.55)     1.02 [  1.72]( 4.83)         1.03 [  2.96]( 1.00)
  Add     1.00 [  0.00]( 6.53)     1.02 [  1.53]( 4.77)         1.03 [  3.19]( 0.90)
Triad     1.00 [  0.00]( 6.66)     1.00 [  0.06]( 6.29)         0.99 [ -0.70]( 5.79)


==================================================================
Test          : netperf
Units         : Normalized Throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:       eevdf[pct imp](CV)    shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
 1-clients     1.00 [  0.00]( 0.46)     1.02 [  1.73]( 0.31)           0.99 [ -0.81]( 0.24)
 2-clients     1.00 [  0.00]( 0.38)     0.99 [ -0.68]( 1.17)           0.99 [ -0.87]( 0.45)
 4-clients     1.00 [  0.00]( 0.72)     0.97 [ -3.38]( 1.38)           0.99 [ -1.26]( 0.47)
 8-clients     1.00 [  0.00]( 0.98)     0.94 [ -6.30]( 1.84)           1.00 [ -0.44]( 0.45)
16-clients     1.00 [  0.00]( 0.70)     0.56 [-44.08]( 5.11)           0.99 [ -0.83]( 0.49)
32-clients     1.00 [  0.00]( 0.74)     0.35 [-64.92]( 1.98)           0.98 [ -2.14]( 2.14)
64-clients     1.00 [  0.00]( 2.24)     0.26 [-73.79]( 5.72)           0.97 [ -2.57]( 2.44)
128-clients    1.00 [  0.00]( 1.72)     0.25 [-74.91]( 6.72)           0.96 [ -3.66]( 1.48)
256-clients    1.00 [  0.00]( 4.44)     0.68 [-31.60]( 5.42)           0.96 [ -3.61]( 3.64)
512-clients    1.00 [  0.00](52.42)     0.67 [-32.81](48.45)           0.96 [ -3.80](55.24)


==================================================================
Test          : schbench
Units         : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic     : Median
==================================================================
#workers:   eevdf[pct imp](CV)    shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
  1         1.00 [ -0.00]( 2.28)     1.00 [ -0.00]( 6.19)          0.84 [ 16.00](20.83)
  2         1.00 [ -0.00]( 6.42)     0.89 [ 10.71]( 2.34)          0.96 [  3.57]( 4.17)
  4         1.00 [ -0.00]( 3.77)     0.97 [  3.33]( 7.35)          1.00 [ -0.00]( 9.12)
  8         1.00 [ -0.00](13.83)     1.03 [ -2.63]( 6.96)          0.95 [  5.26]( 6.93)
 16         1.00 [ -0.00]( 4.37)     1.02 [ -2.13]( 4.17)          1.02 [ -2.13]( 3.53)
 32         1.00 [ -0.00]( 8.69)     0.96 [  3.70]( 5.23)          0.98 [  2.47]( 4.43)
 64         1.00 [ -0.00]( 2.30)     0.96 [  3.85]( 2.34)          0.92 [  7.69]( 4.14)
128         1.00 [ -0.00](12.12)     0.97 [  3.12]( 3.31)          0.93 [  6.53]( 5.31)
256         1.00 [ -0.00](26.04)     1.87 [-86.57](33.02)          1.63 [-62.73](40.63)
512         1.00 [ -0.00]( 5.62)     1.04 [ -3.80]( 0.35)          1.09 [ -8.78]( 2.56)
 
==================================================================
Test          : Unixbench
Units         : Various, Throughput
Interpretation: Higher is better
Statistic     : AMean, Hmean (Specified)
==================================================================

                                             eevdf                   shared_runq             shared_runq_idle_check
Hmean     unixbench-dhry2reg-1            41248390.97 (   0.00%)    41245183.04 (  -0.01%)    41297801.58 (   0.12%)
Hmean     unixbench-dhry2reg-512        6239969914.15 (   0.00%)  6236534715.56 (  -0.06%)  6237356670.12 (  -0.04%)
Amean     unixbench-syscall-1              2968518.27 (   0.00%)     2893792.10 *   2.52%*     2799609.00 *   5.69%*
Amean     unixbench-syscall-512            7790656.20 (   0.00%)     8489302.67 *  -8.97%*     7685974.47 *   1.34%*
Hmean     unixbench-pipe-1                 2535689.01 (   0.00%)     2554662.39 *   0.75%*     2521853.23 *  -0.55%*
Hmean     unixbench-pipe-512             361385055.25 (   0.00%)   365752991.35 *   1.21%*   358310503.28 *  -0.85%*
Hmean     unixbench-spawn-1                   4506.26 (   0.00%)        4566.00 (   1.33%)        4242.52 *  -5.85%*
Hmean     unixbench-spawn-512                69380.09 (   0.00%)       69554.52 (   0.25%)       69413.14 (   0.05%)
Hmean     unixbench-execl-1                   3824.57 (   0.00%)        3782.82 *  -1.09%*        3832.10 (   0.20%)
Hmean     unixbench-execl-512                12288.64 (   0.00%)       13248.40 (   7.81%)       12661.78 (   3.04%)
 
==================================================================
Test          : ycsb-mongodb
Units         : Throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
eevdf                   : 1.00 (var: 1.41%)
shared_runq             : 0.98 (var: 0.84%)  (diff: -2.40%)
shared_runq_idle_check  : 0.97 (var: 0.79%)  (diff: -3.06%)


==================================================================
Test          : DeathStarBench
Units         : %diff, relative to eevdf
Interpretation: Higher is better
Statistic     : AMean
==================================================================
pinning      scaling   eevdf   shared_runq    shared_runq_idle_check
1CDD            1       0%       -0.85%             -1.56%
2CDD            2       0%       -0.60%             -1.22%
4CDD            4       0%        2.87%              0.02%
8CDD            8       0%        0.36%              1.57%


--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag
  2023-09-27  4:23                   ` K Prateek Nayak
@ 2023-09-27  6:59                     ` Chen Yu
  2023-09-27  8:36                       ` K Prateek Nayak
  2023-10-03 21:05                       ` David Vernet
  2023-09-27 13:08                     ` David Vernet
  1 sibling, 2 replies; 52+ messages in thread
From: Chen Yu @ 2023-09-27  6:59 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: David Vernet, linux-kernel, peterz, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, tj, roman.gushchin, gautham.shenoy, aaron.lu,
	wuyun.abel, kernel-team

Hi Prateek,

On 2023-09-27 at 09:53:13 +0530, K Prateek Nayak wrote:
> Hello David,
> 
> Some more test results (although this might be slightly irrelevant with
> next version around the corner)
> 
> On 9/1/2023 12:41 AM, David Vernet wrote:
> > On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
> > 
> -> With EEVDF
> 
> o tl;dr
> 
> - Same as what was observed without EEVDF  but shared_runq shows
>   serious regression with multiple more variants of tbench and
>   netperf now.
> 
> o Kernels
> 
> eevdf			: tip:sched/core at commit b41bbb33cf75 ("Merge branch 'sched/eevdf' into sched/core")
> shared_runq		: eevdf + correct time accounting with v3 of the series without any other changes
> shared_runq_idle_check	: shared_runq + move the rq->avg_idle check before peeking into the shared_runq
> 			  (the rd->overload check still remains below the shared_runq access)
>

I did not see any obvious regression on a Sapphire Rapids server, and the
results on your platform suggest that client/server (C/S) workloads could be
impacted by shared_runq. Meanwhile, some individual workloads like HHVM in
David's environment (no shared resource between tasks, if I understand
correctly) could benefit from shared_runq a lot. This makes me wonder if we
can let shared_runq skip the C/S tasks. The question would be how to define
C/S tasks. At first thought: if A only wakes up B, and B only wakes up A,
then they could be regarded as a C/S pair
 (A->last_wakee == B && B->last_wakee == A &&
  A->wakee_flips <= 1 && B->wakee_flips <= 1)
But for netperf/tbench, this does not apply, because the netperf client
leverages a kernel thread (workqueue) to wake up the netserver, that is,
A wakes up kthread T, then T wakes up B. Unless we have a notion of the
whole wakeup chain, we cannot detect this behavior.
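
Just to make the pair condition above concrete, an illustrative (untested)
helper using the existing wakee-tracking fields in task_struct could look
like this; as noted, it only catches the direct A <-> B pairing, not a
chain through a kthread:

	/* Hypothetical helper, not part of the posted series. */
	static inline bool is_cs_pair(struct task_struct *a, struct task_struct *b)
	{
		return a->last_wakee == b && b->last_wakee == a &&
		       a->wakee_flips <= 1 && b->wakee_flips <= 1;
	}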

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag
  2023-09-27  6:59                     ` Chen Yu
@ 2023-09-27  8:36                       ` K Prateek Nayak
  2023-09-28  8:41                         ` Chen Yu
  2023-10-03 21:05                       ` David Vernet
  1 sibling, 1 reply; 52+ messages in thread
From: K Prateek Nayak @ 2023-09-27  8:36 UTC (permalink / raw)
  To: Chen Yu
  Cc: David Vernet, linux-kernel, peterz, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, tj, roman.gushchin, gautham.shenoy, aaron.lu,
	wuyun.abel, kernel-team

Hello Chenyu,

On 9/27/2023 12:29 PM, Chen Yu wrote:
> Hi Prateek,
> 
> On 2023-09-27 at 09:53:13 +0530, K Prateek Nayak wrote:
>> Hello David,
>>
>> Some more test results (although this might be slightly irrelevant with
>> next version around the corner)
>>
>> On 9/1/2023 12:41 AM, David Vernet wrote:
>>> On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
>>>
>> -> With EEVDF
>>
>> o tl;dr
>>
>> - Same as what was observed without EEVDF  but shared_runq shows
>>   serious regression with multiple more variants of tbench and
>>   netperf now.
>>
>> o Kernels
>>
>> eevdf			: tip:sched/core at commit b41bbb33cf75 ("Merge branch 'sched/eevdf' into sched/core")
>> shared_runq		: eevdf + correct time accounting with v3 of the series without any other changes
>> shared_runq_idle_check	: shared_runq + move the rq->avg_idle check before peeking into the shared_runq
>> 			  (the rd->overload check still remains below the shared_runq access)
>>
> 
> I did not see any obvious regression on a Sapphire Rapids server and it seems that
> the result on your platform suggests that C/S workload could be impacted
> by shared_runq. Meanwhile some individual workloads like HHVM in David's environment
> (no shared resource between tasks if I understand correctly) could benefit from
> shared_runq a lot.

Yup that would be my guess too since HHVM seems to benefit purely from
more aggressive work conservation. (unless it leads to some second order
effect)

> This makes me wonder if we can let shared_runq skip the C/S tasks.
> The question would be how to define C/S tasks. At first thought:
> A only wakes up B, and B only wakes up A, then they could be regarded as a pair
> of C/S
>  (A->last_wakee == B && B->last_wakee == A &&
>   A->wakee_flips <= 1 && B->wakee_flips <= 1)
> But for netperf/tbench, this does not apply, because netperf client leverages kernel
> thread(workqueue) to wake up the netserver, that is A wakes up kthread T, then T
> wakes up B. Unless we have a chain, we can not detect this wakeup behavior.

Yup, unless we have a notion of chain/flow, or until we can somehow
account for the client waking the server via the kthread, this will
be hard to detect.

I can give it a try with the SIS_PAIR condition you shared above. Let
me know.

> 
> thanks,
> Chenyu

--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag
  2023-09-27  4:23                   ` K Prateek Nayak
  2023-09-27  6:59                     ` Chen Yu
@ 2023-09-27 13:08                     ` David Vernet
  1 sibling, 0 replies; 52+ messages in thread
From: David Vernet @ 2023-09-27 13:08 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	tj, roman.gushchin, gautham.shenoy, aaron.lu, wuyun.abel,
	kernel-team, yu.c.chen

On Wed, Sep 27, 2023 at 09:53:13AM +0530, K Prateek Nayak wrote:
> Hello David,

Hi Prateek,

> Some more test results (although this might be slightly irrelevant with
> next version around the corner)

Excellent, thanks for running these tests. The results are certainly
encouraging, and it seems clear that we could really improve the feature
by adding some of the logic you've experimented with to v5 (whether it's
the rq->avg_idle check, the per-shard overload, etc). I know that I owe
at least you and Chenyu more substantive responses on this and a few of
other emails that have been sent over the last week or two. I apologize
for the latency in my responses. I'm still at Kernel Recipes, but plan
to focus on this for the next couple of weeks after I'm back stateside.
I originally intended to revisit this _last_ week after my PTO, but got
caught up in helping with some sched_ext related stuff.

Just wanted to give you and everyone else a heads up that I haven't been
ignoring this intentionally, I've just been preempted a lot with travel
this month.

All of the work you folks are putting in is extremely helpful and
appreciated. I'm excited by the improvements we're making to
SHARED_RUNQ, and think that a lot of this can and should be folded into
v5.

So with that said, please feel free to keep experimenting and
discussing, and I'll rejoin the convo full time as soon as I can (which
should be either Friday or next week).

- David

> 
> On 9/1/2023 12:41 AM, David Vernet wrote:
> > On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
> > 
> > Hi Prateek,
> > 
> >> Even with the two patches, I still observe the following lock
> >> contention when profiling the tbench 128-clients run with IBS:
> >>
> >>   -   12.61%  swapper          [kernel.vmlinux]         [k] native_queued_spin_lock_slowpath
> >>      - 10.94% native_queued_spin_lock_slowpath
> >>         - 10.73% _raw_spin_lock
> >>            - 9.57% __schedule
> >>                 schedule_idle
> >>                 do_idle
> >>               + cpu_startup_entry
> >>            - 0.82% task_rq_lock
> >>                 newidle_balance
> >>                 pick_next_task_fair
> >>                 __schedule
> >>                 schedule_idle
> >>                 do_idle
> >>               + cpu_startup_entry
> >>
> >> Since David mentioned rq->avg_idle check is probably not the right step
> >> towards the solution, this experiment introduces a per-shard
> >> "overload" flag. Similar to "rq->rd->overload", per-shard overload flag
> >> notifies of the possibility of one or more rq covered in the shard's
> >> domain having a queued task. shard's overload flag is set at the same
> >> time as "rq->rd->overload", and is cleared when shard's list is found
> >> to be empty.
> > 
> > I think this is an interesting idea, but I feel that it's still working
> > against the core proposition of SHARED_RUNQ, which is to enable work
> > conservation.
> > 
> I have some more numbers. This time I'm accounting the cost for peeking
> into the shared-runq and have two variants - one that keeps the current
> vanilla flow from your v3 and the other that moves the rq->avg_idle
> check before looking at the shared-runq. Following are the results:
> 
> -> Without EEVDF
> 
> o tl;dr
> 
> - With avg_idle check, the improvements observed with shared-runq
>   aren't as large but they are still noticeable.
> 
> - Most regressions are gone and the others aren't as bad as with the
>   introduction of shared-runq
> 
> o Kernels
> 
> base			: tip is at commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs bandwidth in use")
> shared_runq		: base + correct time accounting with v3 of the series without any other changes
> shared_runq_idle_check	: shared_runq + move the rq->avg_idle check before peeking into the shared_runq
> 			  (the rd->overload check still remains below the shared_runq access)
> 
> o Benchmarks
> 
> ==================================================================
> Test          : hackbench
> Units         : Normalized time in seconds
> Interpretation: Lower is better
> Statistic     : AMean
> ==================================================================
> Case:           base[pct imp](CV)    shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
>  1-groups     1.00 [ -0.00]( 2.64)     0.90 [ 10.20]( 8.79)             0.93 [  7.08]( 3.87)
>  2-groups     1.00 [ -0.00]( 2.97)     0.85 [ 15.06]( 4.86)             0.96 [  4.47]( 2.22)
>  4-groups     1.00 [ -0.00]( 1.84)     0.93 [  7.38]( 2.63)             0.94 [  6.07]( 1.02)
>  8-groups     1.00 [ -0.00]( 1.24)     0.97 [  2.83]( 2.69)             0.98 [  1.82]( 1.01)
> 16-groups     1.00 [ -0.00]( 3.31)     1.03 [ -2.93]( 2.46)             1.02 [ -1.61]( 1.34)
> 
> 
> ==================================================================
> Test          : tbench
> Units         : Normalized throughput
> Interpretation: Higher is better
> Statistic     : AMean
> ==================================================================
> Clients:   base[pct imp](CV)     shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
>     1     1.00 [  0.00]( 1.08)     0.98 [ -1.89]( 0.48)              0.99 [ -0.73]( 0.70)
>     2     1.00 [  0.00]( 0.69)     0.99 [ -1.48]( 0.24)              0.98 [ -1.62]( 0.85)
>     4     1.00 [  0.00]( 0.70)     0.97 [ -2.87]( 1.34)              0.98 [ -2.15]( 0.48)
>     8     1.00 [  0.00]( 0.85)     0.97 [ -3.17]( 1.56)              0.99 [ -1.32]( 1.09)
>    16     1.00 [  0.00]( 2.18)     0.91 [ -8.70]( 0.27)              0.98 [ -2.03]( 1.28)
>    32     1.00 [  0.00]( 3.84)     0.51 [-48.53]( 2.52)              1.01 [  1.41]( 3.83)
>    64     1.00 [  0.00]( 7.06)     0.38 [-62.49]( 1.89)              1.05 [  5.33]( 4.09)
>   128     1.00 [  0.00]( 0.88)     0.41 [-58.92]( 0.28)              1.01 [  0.54]( 1.65)
>   256     1.00 [  0.00]( 0.88)     0.97 [ -2.56]( 1.78)              1.00 [ -0.48]( 0.33)
>   512     1.00 [  0.00]( 0.07)     1.00 [  0.06]( 0.04)              0.98 [ -1.51]( 0.44)
>  1024     1.00 [  0.00]( 0.30)     0.99 [ -1.35]( 0.90)              1.00 [ -0.24]( 0.41)
> 
> 
> ==================================================================
> Test          : stream-10
> Units         : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic     : HMean
> ==================================================================
> Test:      base[pct imp](CV)      shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
>  Copy     1.00 [  0.00]( 8.87)     1.00 [  0.31]( 5.27)                1.09 [  9.11]( 0.58)
> Scale     1.00 [  0.00]( 6.80)     0.99 [ -0.81]( 7.20)                1.00 [  0.49]( 5.67)
>   Add     1.00 [  0.00]( 7.24)     0.99 [ -1.13]( 7.02)                1.02 [  2.06]( 6.36)
> Triad     1.00 [  0.00]( 5.00)     0.96 [ -4.11]( 9.37)                1.03 [  3.46]( 4.41)
> 
> 
> ==================================================================
> Test          : stream-100
> Units         : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic     : HMean
> ==================================================================
> Test:      base[pct imp](CV)     shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
>  Copy     1.00 [  0.00]( 0.45)     1.00 [  0.32]( 1.88)              1.04 [  4.02]( 1.45)
> Scale     1.00 [  0.00]( 4.40)     0.98 [ -1.76]( 6.46)              1.01 [  1.28]( 1.00)
>   Add     1.00 [  0.00]( 4.97)     0.98 [ -1.85]( 6.01)              1.03 [  2.75]( 0.24)
> Triad     1.00 [  0.00]( 0.24)     0.96 [ -3.82]( 6.41)              0.99 [ -1.10]( 4.47)
> 
> 
> ==================================================================
> Test          : netperf
> Units         : Normalized Throughput
> Interpretation: Higher is better
> Statistic     : AMean
> ==================================================================
> Clients:        base[pct imp](CV)      shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
>  1-clients     1.00 [  0.00]( 0.46)     0.98 [ -2.37]( 0.08)              0.99 [ -1.32]( 0.37)
>  2-clients     1.00 [  0.00]( 0.75)     0.98 [ -2.04]( 0.33)              0.98 [ -1.57]( 0.50)
>  4-clients     1.00 [  0.00]( 0.84)     0.97 [ -3.25]( 1.01)              0.99 [ -0.77]( 0.54)
>  8-clients     1.00 [  0.00]( 0.78)     0.96 [ -4.18]( 0.68)              0.99 [ -0.77]( 0.63)
> 16-clients     1.00 [  0.00]( 2.56)     0.84 [-15.71]( 6.33)              1.00 [ -0.35]( 0.58)
> 32-clients     1.00 [  0.00]( 1.03)     0.35 [-64.92]( 8.90)              0.98 [ -1.92]( 1.67)
> 64-clients     1.00 [  0.00]( 2.69)     0.26 [-74.05]( 6.56)              0.98 [ -2.46]( 2.42)
> 128-clients    1.00 [  0.00]( 1.91)     0.25 [-74.81]( 3.67)              0.99 [ -1.50]( 2.15)
> 256-clients    1.00 [  0.00]( 2.21)     0.92 [ -7.73]( 2.29)              0.98 [ -1.51]( 1.85)
> 512-clients    1.00 [  0.00](45.18)     0.96 [ -4.06](52.89)              0.98 [ -2.49](49.22)
> 
> 
> ==================================================================
> Test          : schbench
> Units         : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic     : Median
> ==================================================================
> #workers:        base[pct imp](CV)     shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
>   1             1.00 [ -0.00](12.03)     1.04 [ -4.35](34.64)              1.13 [-13.04]( 2.25)
>   2             1.00 [ -0.00]( 9.36)     1.00 [ -0.00]( 4.56)              1.12 [-11.54](12.83)
>   4             1.00 [ -0.00]( 1.95)     1.00 [ -0.00](13.36)              0.93 [  6.67]( 9.10)
>   8             1.00 [ -0.00]( 9.01)     0.97 [  2.70]( 4.68)              1.03 [ -2.70](12.11)
>  16             1.00 [ -0.00]( 3.08)     1.02 [ -2.00]( 3.01)              1.00 [ -0.00]( 7.33)
>  32             1.00 [ -0.00]( 0.75)     1.03 [ -2.60]( 8.20)              1.09 [ -9.09]( 4.24)
>  64             1.00 [ -0.00]( 2.15)     0.91 [  9.20]( 1.03)              1.01 [ -0.61]( 7.14)
> 128             1.00 [ -0.00]( 5.18)     1.05 [ -4.57]( 7.74)              1.01 [ -0.57]( 5.62)
> 256             1.00 [ -0.00]( 4.18)     1.06 [ -5.87](51.02)              1.10 [ -9.51](15.82)
> 512             1.00 [ -0.00]( 2.10)     1.03 [ -3.36]( 2.88)              1.06 [ -5.87]( 1.10)
> 
> 
> ==================================================================
> Test          : Unixbench
> Units         : Various, Throughput
> Interpretation: Higher is better
> Statistic     : AMean, Hmean (Specified)
> ==================================================================
>                                                 base                  shared_runq             shared_runq_idle_check
> Hmean     unixbench-dhry2reg-1            41407024.82 (   0.00%)    41211208.57 (  -0.47%)     41354094.87 (  -0.13%)
> Hmean     unixbench-dhry2reg-512        6249629291.88 (   0.00%)  6245782129.00 (  -0.06%)   6236514875.97 (  -0.21%)
> Amean     unixbench-syscall-1              2922580.63 (   0.00%)     2928021.57 *  -0.19%*      2895742.17 *   0.92%*
> Amean     unixbench-syscall-512            7606400.73 (   0.00%)     8390396.33 * -10.31%*      8236409.00 *  -8.28%*
> Hmean     unixbench-pipe-1                 2574942.54 (   0.00%)     2610825.75 *   1.39%*      2531492.38 *  -1.69%*
> Hmean     unixbench-pipe-512             364489246.31 (   0.00%)   366388360.22 *   0.52%*    360160487.66 *  -1.19%*
> Hmean     unixbench-spawn-1                   4428.94 (   0.00%)        4391.20 (  -0.85%)         4541.06 (   2.53%)
> Hmean     unixbench-spawn-512                68883.47 (   0.00%)       69143.38 (   0.38%)        69776.01 *   1.30%*
> Hmean     unixbench-execl-1                   3878.47 (   0.00%)        3861.63 (  -0.43%)         3873.96 (  -0.12%)
> Hmean     unixbench-execl-512                11638.84 (   0.00%)       12758.38 *   9.62%*        14001.23 *  20.30%*
> 
> 
> ==================================================================
> Test          : ycsb-mongodb
> Units         : Throughput
> Interpretation: Higher is better
> Statistic     : AMean
> ==================================================================
> tip                     : 1.00 (var: 1.41%)
> shared_runq             : 0.99 (var: 0.84%)  (diff: -1.40%)
> shared_runq_idle_check  : 1.00 (var: 0.79%)  (diff:  0.00%)
> 
> 
> ==================================================================
> Test          : DeathStarBench
> Units         : %diff, relative to base
> Interpretation: Higher is better
> Statistic     : AMean
> ==================================================================
> pinning      scaling   eevdf   shared_runq    shared_runq_idle_check
> 1CDD            1       0%       -0.39%              -0.09%
> 2CDD            2       0%       -0.31%              -1.73%
> 4CDD            4       0%        3.28%               0.60%
> 8CDD            8       0%        4.35%               2.98% 
>  
> 
> -> With EEVDF
> 
> o tl;dr
> 
> - Same as what was observed without EEVDF  but shared_runq shows
>   serious regression with multiple more variants of tbench and
>   netperf now.
> 
> o Kernels
> 
> eevdf			: tip:sched/core at commit b41bbb33cf75 ("Merge branch 'sched/eevdf' into sched/core")
> shared_runq		: eevdf + correct time accounting with v3 of the series without any other changes
> shared_runq_idle_check	: shared_runq + move the rq->avg_idle check before peeking into the shared_runq
> 			  (the rd->overload check still remains below the shared_runq access)
> 
> ==================================================================
> Test          : hackbench
> Units         : Normalized time in seconds
> Interpretation: Lower is better
> Statistic     : AMean
> ==================================================================
> Case:          eevdf[pct imp](CV)    shared_runq[pct imp](CV)  shared_runq_idle_check[pct imp](CV)
>  1-groups     1.00 [ -0.00]( 1.89)     0.95 [  4.72]( 8.98)         0.99 [  0.83]( 3.77)
>  2-groups     1.00 [ -0.00]( 2.04)     0.86 [ 13.87]( 2.59)         0.95 [  4.92]( 1.98)
>  4-groups     1.00 [ -0.00]( 2.38)     0.96 [  4.50]( 3.44)         0.98 [  2.45]( 1.93)
>  8-groups     1.00 [ -0.00]( 1.52)     1.01 [ -0.95]( 1.36)         0.96 [  3.97]( 0.89)
> 16-groups     1.00 [ -0.00]( 3.44)     1.00 [ -0.00]( 1.59)         0.96 [  3.91]( 3.36)
> 
> 
> ==================================================================
> Test          : tbench
> Units         : Normalized throughput
> Interpretation: Higher is better
> Statistic     : AMean
> ==================================================================
> Clients:  eevdf[pct imp](CV)    shared_runq[pct imp](CV)  shared_runq_idle_check[pct imp](CV)
>     1     1.00 [  0.00]( 0.18)     1.00 [  0.15]( 0.59)         0.98 [ -1.76]( 0.74)
>     2     1.00 [  0.00]( 0.63)     0.97 [ -3.44]( 0.91)         0.98 [ -1.93]( 1.27)
>     4     1.00 [  0.00]( 0.86)     0.95 [ -4.86]( 0.85)         0.99 [ -1.15]( 0.77)
>     8     1.00 [  0.00]( 0.22)     0.94 [ -6.44]( 1.31)         0.99 [ -1.00]( 0.97)
>    16     1.00 [  0.00]( 1.99)     0.86 [-13.68]( 0.38)         1.00 [ -0.47]( 0.99)
>    32     1.00 [  0.00]( 4.29)     0.48 [-52.21]( 0.53)         1.01 [  1.24]( 6.66)
>    64     1.00 [  0.00]( 1.71)     0.35 [-64.68]( 0.44)         0.99 [ -0.66]( 0.70)
>   128     1.00 [  0.00]( 0.65)     0.40 [-60.32]( 0.95)         0.98 [ -2.15]( 1.25)
>   256     1.00 [  0.00]( 0.19)     0.72 [-28.28]( 1.88)         0.99 [ -1.39]( 2.50)
>   512     1.00 [  0.00]( 0.20)     0.79 [-20.59]( 4.40)         1.00 [ -0.42]( 0.38)
>  1024     1.00 [  0.00]( 0.29)     0.80 [-20.24]( 0.64)         0.99 [ -0.51]( 0.20)
> 
> 
> ==================================================================
> Test          : stream-10
> Units         : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic     : HMean
> ==================================================================
> Test:     eevdf[pct imp](CV)    shared_runq[pct imp](CV)   shared_runq_idle_check[pct imp](CV)
>  Copy     1.00 [  0.00]( 4.32)     0.94 [ -6.40]( 8.05)          1.01 [  1.39]( 4.58)
> Scale     1.00 [  0.00]( 5.21)     0.98 [ -2.15]( 6.79)          0.95 [ -4.54]( 7.35)
>   Add     1.00 [  0.00]( 6.25)     0.97 [ -2.64]( 6.47)          0.97 [ -3.08]( 7.49)
> Triad     1.00 [  0.00](10.74)     1.01 [  0.92]( 7.40)          1.01 [  1.25]( 8.76)
> 
> 
> ==================================================================
> Test          : stream-100
> Units         : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic     : HMean
> ==================================================================
> Test:     eevdf[pct imp](CV)    shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
>  Copy     1.00 [  0.00]( 0.70)     1.00 [ -0.07]( 0.70)         1.00 [  0.47]( 0.94)
> Scale     1.00 [  0.00]( 6.55)     1.02 [  1.72]( 4.83)         1.03 [  2.96]( 1.00)
>   Add     1.00 [  0.00]( 6.53)     1.02 [  1.53]( 4.77)         1.03 [  3.19]( 0.90)
> Triad     1.00 [  0.00]( 6.66)     1.00 [  0.06]( 6.29)         0.99 [ -0.70]( 5.79)
> 
> 
> ==================================================================
> Test          : netperf
> Units         : Normalized Throughput
> Interpretation: Higher is better
> Statistic     : AMean
> ==================================================================
> Clients:       eevdf[pct imp](CV)    shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
>  1-clients     1.00 [  0.00]( 0.46)     1.02 [  1.73]( 0.31)           0.99 [ -0.81]( 0.24)
>  2-clients     1.00 [  0.00]( 0.38)     0.99 [ -0.68]( 1.17)           0.99 [ -0.87]( 0.45)
>  4-clients     1.00 [  0.00]( 0.72)     0.97 [ -3.38]( 1.38)           0.99 [ -1.26]( 0.47)
>  8-clients     1.00 [  0.00]( 0.98)     0.94 [ -6.30]( 1.84)           1.00 [ -0.44]( 0.45)
> 16-clients     1.00 [  0.00]( 0.70)     0.56 [-44.08]( 5.11)           0.99 [ -0.83]( 0.49)
> 32-clients     1.00 [  0.00]( 0.74)     0.35 [-64.92]( 1.98)           0.98 [ -2.14]( 2.14)
> 64-clients     1.00 [  0.00]( 2.24)     0.26 [-73.79]( 5.72)           0.97 [ -2.57]( 2.44)
> 128-clients    1.00 [  0.00]( 1.72)     0.25 [-74.91]( 6.72)           0.96 [ -3.66]( 1.48)
> 256-clients    1.00 [  0.00]( 4.44)     0.68 [-31.60]( 5.42)           0.96 [ -3.61]( 3.64)
> 512-clients    1.00 [  0.00](52.42)     0.67 [-32.81](48.45)           0.96 [ -3.80](55.24)
> 
> 
> ==================================================================
> Test          : schbench
> Units         : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic     : Median
> ==================================================================
> #workers:   eevdf[pct imp](CV)    shared_runq[pct imp](CV)    shared_runq_idle_check[pct imp](CV)
>   1         1.00 [ -0.00]( 2.28)     1.00 [ -0.00]( 6.19)          0.84 [ 16.00](20.83)
>   2         1.00 [ -0.00]( 6.42)     0.89 [ 10.71]( 2.34)          0.96 [  3.57]( 4.17)
>   4         1.00 [ -0.00]( 3.77)     0.97 [  3.33]( 7.35)          1.00 [ -0.00]( 9.12)
>   8         1.00 [ -0.00](13.83)     1.03 [ -2.63]( 6.96)          0.95 [  5.26]( 6.93)
>  16         1.00 [ -0.00]( 4.37)     1.02 [ -2.13]( 4.17)          1.02 [ -2.13]( 3.53)
>  32         1.00 [ -0.00]( 8.69)     0.96 [  3.70]( 5.23)          0.98 [  2.47]( 4.43)
>  64         1.00 [ -0.00]( 2.30)     0.96 [  3.85]( 2.34)          0.92 [  7.69]( 4.14)
> 128         1.00 [ -0.00](12.12)     0.97 [  3.12]( 3.31)          0.93 [  6.53]( 5.31)
> 256         1.00 [ -0.00](26.04)     1.87 [-86.57](33.02)          1.63 [-62.73](40.63)
> 512         1.00 [ -0.00]( 5.62)     1.04 [ -3.80]( 0.35)          1.09 [ -8.78]( 2.56)
>  
> ==================================================================
> Test          : Unixbench
> Units         : Various, Throughput
> Interpretation: Higher is better
> Statistic     : AMean, Hmean (Specified)
> ==================================================================
> 
>                                              eevdf                   shared_runq             shared_runq_idle_check
> Hmean     unixbench-dhry2reg-1            41248390.97 (   0.00%)    41245183.04 (  -0.01%)    41297801.58 (   0.12%)
> Hmean     unixbench-dhry2reg-512        6239969914.15 (   0.00%)  6236534715.56 (  -0.06%)  6237356670.12 (  -0.04%)
> Amean     unixbench-syscall-1              2968518.27 (   0.00%)     2893792.10 *   2.52%*     2799609.00 *   5.69%*
> Amean     unixbench-syscall-512            7790656.20 (   0.00%)     8489302.67 *  -8.97%*     7685974.47 *   1.34%*
> Hmean     unixbench-pipe-1                 2535689.01 (   0.00%)     2554662.39 *   0.75%*     2521853.23 *  -0.55%*
> Hmean     unixbench-pipe-512             361385055.25 (   0.00%)   365752991.35 *   1.21%*   358310503.28 *  -0.85%*
> Hmean     unixbench-spawn-1                   4506.26 (   0.00%)        4566.00 (   1.33%)        4242.52 *  -5.85%*
> Hmean     unixbench-spawn-512                69380.09 (   0.00%)       69554.52 (   0.25%)       69413.14 (   0.05%)
> Hmean     unixbench-execl-1                   3824.57 (   0.00%)        3782.82 *  -1.09%*        3832.10 (   0.20%)
> Hmean     unixbench-execl-512                12288.64 (   0.00%)       13248.40 (   7.81%)       12661.78 (   3.04%)
>  
> ==================================================================
> Test          : ycsb-mongodb
> Units         : Throughput
> Interpretation: Higher is better
> Statistic     : AMean
> ==================================================================
> eevdf                   : 1.00 (var: 1.41%)
> shared_runq             : 0.98 (var: 0.84%)  (diff: -2.40%)
> shared_runq_idle_check  : 0.97 (var: 0.79%)  (diff: -3.06%)
> 
> 
> ==================================================================
> Test          : DeathStarBench
> Units         : %diff, relative to eevdf
> Interpretation: Higher is better
> Statistic     : AMean
> ==================================================================
> pinning      scaling   eevdf   shared_runq    shared_runq_idle_check
> 1CDD            1       0%       -0.85%             -1.56%
> 2CDD            2       0%       -0.60%             -1.22%
> 4CDD            4       0%        2.87%              0.02%
> 8CDD            8       0%        0.36%              1.57%
> 
> 
> --
> Thanks and Regards,
> Prateek

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag
  2023-09-27  8:36                       ` K Prateek Nayak
@ 2023-09-28  8:41                         ` Chen Yu
  0 siblings, 0 replies; 52+ messages in thread
From: Chen Yu @ 2023-09-28  8:41 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: David Vernet, linux-kernel, peterz, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, tj, roman.gushchin, gautham.shenoy, aaron.lu,
	wuyun.abel, kernel-team

On 2023-09-27 at 14:06:41 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
> 
> On 9/27/2023 12:29 PM, Chen Yu wrote:
> > Hi Prateek,
> > 
> > On 2023-09-27 at 09:53:13 +0530, K Prateek Nayak wrote:
> >> Hello David,
> >>
> >> Some more test results (although this might be slightly irrelevant with
> >> next version around the corner)
> >>
> >> On 9/1/2023 12:41 AM, David Vernet wrote:
> >>> On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
> >>>

[snip]

> > This makes me wonder if we can let shared_runq skip the C/S tasks.
> > The question would be how to define C/S tasks. At first thought:
> > A only wakes up B, and B only wakes up A, then they could be regarded as a pair
> > of C/S
> >  (A->last_wakee == B && B->last_wakee == A &&
> >   A->wakee_flips <= 1 && B->wakee_flips <= 1)
> > But for netperf/tbench, this does not apply, because netperf client leverages kernel
> > thread(workqueue) to wake up the netserver, that is A wakes up kthread T, then T
> > wakes up B. Unless we have a chain, we can not detect this wakeup behavior.
> 
> Yup, unless we have a notion of chain/flow, or until we can somehow
> account the wakeup of client using the kthread to the server, this will
> be hard to detect.
> 
> I can give it a try with the SIS_PAIR condition you shared above. Let
> me know.

Thanks Prateek, but I don't think SIS_PAIR could bring benefit to
netperf/tbench, since SIS_PAIR cannot detect the chain wakeup.

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag
  2023-08-31 20:23                   ` K Prateek Nayak
@ 2023-09-29 17:01                     ` David Vernet
  2023-10-04  4:21                       ` K Prateek Nayak
  0 siblings, 1 reply; 52+ messages in thread
From: David Vernet @ 2023-09-29 17:01 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	tj, roman.gushchin, gautham.shenoy, aaron.lu, wuyun.abel,
	kernel-team

On Fri, Sep 01, 2023 at 01:53:12AM +0530, K Prateek Nayak wrote:
> Hello David,
> 
> On 9/1/2023 12:41 AM, David Vernet wrote:
> > On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
> > 
> > Hi Prateek,
> > 
> >> Even with the two patches, I still observe the following lock
> >> contention when profiling the tbench 128-clients run with IBS:
> >>
> >>   -   12.61%  swapper          [kernel.vmlinux]         [k] native_queued_spin_lock_slowpath
> >>      - 10.94% native_queued_spin_lock_slowpath
> >>         - 10.73% _raw_spin_lock
> >>            - 9.57% __schedule
> >>                 schedule_idle
> >>                 do_idle
> >>               + cpu_startup_entry
> >>            - 0.82% task_rq_lock
> >>                 newidle_balance
> >>                 pick_next_task_fair
> >>                 __schedule
> >>                 schedule_idle
> >>                 do_idle
> >>               + cpu_startup_entry
> >>
> >> Since David mentioned rq->avg_idle check is probably not the right step
> >> towards the solution, this experiment introduces a per-shard
> >> "overload" flag. Similar to "rq->rd->overload", per-shard overload flag
> >> notifies of the possibility of one or more rq covered in the shard's
> >> domain having a queued task. shard's overload flag is set at the same
> >> time as "rq->rd->overload", and is cleared when shard's list is found
> >> to be empty.
> > 
> > I think this is an interesting idea, but I feel that it's still working
> > against the core proposition of SHARED_RUNQ, which is to enable work
> > conservation.
> 
> I don't think so! Work conservation is possible if there is an
> imbalance. Consider the case where we have 15 tasks in the shared_runq but we
> have 16 CPUs, 15 of which are running these 15 tasks, and one going

I'm not sure I'm fully following. Those 15 tasks would not be enqueued
in the shared runq if they were being run. They would be dequeued from
the shared_runq in __dequeue_entity(), which would be called from
set_next_entity() before they were run. In this case, the
shard->overload check should be equivalent to the
!list_empty(&shard->list) check.

Oh, or is the idea that we're not bothering to pull them from the
shared_runq if they're being woken up and enqueued on an idle core that
will immediately run them on the next resched path? If so, I wonder if
we would instead just want to not enqueue the task in the shared_runq at
all? Consider that if another task comes in on an rq with
rq->nr_running >= 2, we still wouldn't want to pull the tasks that
were being woken up on idle cores (nor take the overhead of inserting
and then immediately removing them from the shared_runq).

> idle. Work is conserved. What we need to worry about is tasks being in
> the shared_runq that are queued on their respective CPU. This can only
> happen if any one of the rq has nr_running >= 2, which is also the point
> we are setting "shard->overload".

Assuming this is about the "wakeup / enqueue to idle core" case, ok,
this makes sense. I still think it probably makes more sense to just not
enqueue in the shared_runq for this case though, which would allow us to
instead just rely on list_empty(&shard->list).
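
To make that concrete, a rough sketch of what skipping the enqueue could
look like (hypothetical helper and field names -- rq_shard(),
p->shared_runq_node -- not the actual v3 code):

	static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p)
	{
		/* rq_shard() is a stand-in for however the shard is looked up. */
		struct shared_runq_shard *shard = rq_shard(rq);

		/*
		 * If the rq was idle before this wakeup (assumes this runs
		 * before @p is accounted in rq->nr_running), @p will be picked
		 * there on the next resched anyway, so don't put it in the
		 * shard; this also keeps list_empty(&shard->list) meaningful.
		 */
		if (!rq->nr_running)
			return;

		raw_spin_lock(&shard->lock);
		list_add_tail(&p->shared_runq_node, &shard->list);
		raw_spin_unlock(&shard->lock);
	}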

> Now situation can change later and all tasks in the shared_runq might be
> running on respective CPUs but "shard->overload" is only cleared when
> the shared_runq becomes empty. If this is too late, maybe we can clear
> it if periodic load balancing finds no queuing (somewhere around the
> time we update nr_idle_scan).
> 
> So the window where we do not go ahead with popping a task from the
> shared_runq_shard->list is between the list being empty and at least one
> of the CPU associated with the shard reporting nr_running >= 2, which is
> when work conservation is needed.

So, I misread your patch the first time I reviewed it, and for some
reason thought you were only setting shard->overload on the
load_balance(). That's obviously not the case, and I now understand it
better, modulo my points above being clarified.

> > 
> >> With these changes, following are the results for tbench 128-clients:
> > 
> > Just to make sure I understand, this is to address the contention we're
> > observing on tbench with 64 - 256 clients, right?  That's my
> > understanding from Gautham's reply in [0].
> > 
> > [0]: https://lore.kernel.org/all/ZOc7i7wM0x4hF4vL@BLR-5CG11610CF.amd.com/
> 
> I'm not sure if Gautham saw a contention with IBS but he did see an
> abnormal blowup in newidle_balance() counts which he suspected were the
> cause for the regression. I noticed the rq lock contention after doing a
> fair bit of surgery. Let me go check if that was the case with vanilla
> v3 too.
> 
> > 
> > If so, are we sure this change won't regress other workloads that would
> > have benefited from the work conservation?
> 
> I don't think we'll regress any workloads as I explained above, because
> the "overload" flag being 0 (since the update/access is not atomic)
> almost always indicates a case where the tasks cannot be pulled. However,
> that needs to be tested since there is a small behavior change in
> shared_runq_pick_next_task(). Previously, if the task was running
> on a CPU, we would have popped it from the shared_runq, done some lock
> fiddling before finding out it was running, and some more lock fiddling
> before the function returned "-1"; now, with the changes here, it'll
> simply return a "0", and although that is correct, we have seen some
> interesting cases in the past [1] where random lock contention actually
> helps certain benchmarks ¯\_(ツ)_/¯

I don't think we need to worry about less lock contention possibly
hurting benchmarks :-)

> [1] https://lore.kernel.org/all/44428f1e-ca2c-466f-952f-d5ad33f12073@amd.com/ 
> 
> > 
> > Also, I assume that you don't see the improved contention without this,
> > even if you include your fix to the newidle_balance() that has us skip
> > over the <= LLC domain?
> 
> No improvements! The lock is still very much contended for. I wonder if
> it could be because of the unlocking and locking the rq again in
> shared_runq_pick_next_task() even when the task is on a CPU. Also, since it
> returns -1 for this case, pick_next_task_fair() will return RETRY_TASK
> which can have further implications.

Yeah, I could see it being an issue if we're essentially thrashing tasks
that are just temporarily enqueued in the shared_runq between activation
and a resched pass on an idle core.

Unfortunately, I don't think we have any choice but to drop and
reacquire the rq lock. It's not safe to look at task_cpu(p) without
pi_lock, and we can't safely acquire that without dropping the rq lock.
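
To spell out the constraint, the drop/reacquire dance looks roughly like
this (sketch only, with validity checks and error handling omitted):

	struct rq_flags rf;
	struct rq *src_rq;

	raw_spin_rq_unlock(this_rq);	/* must drop the local rq lock first */
	src_rq = task_rq_lock(p, &rf);	/* takes p->pi_lock, then the task's rq lock */

	/* task_cpu(p) / task_rq(p) are now stable; decide whether to pull @p. */

	task_rq_unlock(src_rq, p, &rf);
	raw_spin_rq_lock(this_rq);	/* reacquire before returning to the core */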

Thanks,
David

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag
  2023-09-27  6:59                     ` Chen Yu
  2023-09-27  8:36                       ` K Prateek Nayak
@ 2023-10-03 21:05                       ` David Vernet
  2023-10-07  2:10                         ` Chen Yu
  1 sibling, 1 reply; 52+ messages in thread
From: David Vernet @ 2023-10-03 21:05 UTC (permalink / raw)
  To: Chen Yu
  Cc: K Prateek Nayak, linux-kernel, peterz, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, tj, roman.gushchin, gautham.shenoy, aaron.lu,
	wuyun.abel, kernel-team

On Wed, Sep 27, 2023 at 02:59:29PM +0800, Chen Yu wrote:
> Hi Prateek,

Hi Chenyu,

> On 2023-09-27 at 09:53:13 +0530, K Prateek Nayak wrote:
> > Hello David,
> > 
> > Some more test results (although this might be slightly irrelevant with
> > next version around the corner)
> > 
> > On 9/1/2023 12:41 AM, David Vernet wrote:
> > > On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
> > > 
> > -> With EEVDF
> > 
> > o tl;dr
> > 
> > - Same as what was observed without EEVDF  but shared_runq shows
> >   serious regression with multiple more variants of tbench and
> >   netperf now.
> > 
> > o Kernels
> > 
> > eevdf			: tip:sched/core at commit b41bbb33cf75 ("Merge branch 'sched/eevdf' into sched/core")
> > shared_runq		: eevdf + correct time accounting with v3 of the series without any other changes
> > shared_runq_idle_check	: shared_runq + move the rq->avg_idle check before peeking into the shared_runq
> > 			  (the rd->overload check still remains below the shared_runq access)
> >
> 
> I did not see any obvious regression on a Sapphire Rapids server and it seems that
> the result on your platform suggests that C/S workload could be impacted
> by shared_runq. Meanwhile some individual workloads like HHVM in David's environment
> (no shared resource between tasks if I understand correctly) could benefit from

Correct, hhvmworkers are largely independent, though they do sometimes
synchronize, and they also sometimes rely on I/O happening in other
tasks.

> shared_runq a lot. This makes me wonder if we can let shared_runq skip the C/S tasks.

I'm also open to this possibility, but I worry that we'd be going down
the same rabbit hole as what fair.c does already, which is to use
heuristics to determine when something should or shouldn't be migrated,
etc. I really do feel that there's value in SHARED_RUNQ providing
consistent and predictable work conservation behavior.

On the other hand, it's clear that there are things we can do to improve
performance for some of these client/server workloads that hammer the
runqueue on larger CCXs / sockets. If we can avoid those regressions
while still having reasonably high confidence that work conservation
won't disproportionately suffer, I'm open to us making some tradeoffs
and/or adding a bit of complexity to avoid some of this unnecessary
contention.

I think it's probably about time for v4 to be sent out. What do you
folks think about including:

1. A few fixes / tweaks from v3, e.g. avoiding using the wrong shard on
   the task_dead_fair() path if the feature is disabled before a dying
   task is dequeued from a shard, fixing the build issues pointed out by
   lkp, etc.
2. Fix the issue that Prateek pointed out in [0] where we're not
   properly skipping the LLC domain due to using the for_each_domain()
   macro (this is also addressed by (3)).
3. Apply Prateek's suggestions (in some form) in [1] and [2]. For [2],
   I'm inclined to just avoid enqueuing a task on a shard if the rq it's
   on has nr_running == 0 (a rough sketch of that idea follows the links
   below). Or, we can just add his patch to the series directly if it
   turns out that just looking at rq->nr_running is insufficient.

[0]: https://lore.kernel.org/all/3e32bec6-5e59-c66a-7676-7d15df2c961c@amd.com/
[1]: https://lore.kernel.org/all/20230831104508.7619-3-kprateek.nayak@amd.com/
[2]: https://lore.kernel.org/all/20230831104508.7619-4-kprateek.nayak@amd.com/
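
For (3), what I have in mind is roughly the below -- just a sketch to
show the shape of the check, not actual code from the series (the
shard-lookup helper and the list-node field name are stand-ins, and the
exact nr_running accounting point depends on where in the enqueue path
this ends up being called):

static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p)
{
	struct shared_runq_shard *shard = rq_shared_runq_shard(rq);	/* stand-in */

	/*
	 * If nothing is queued on this rq, p is going to run here almost
	 * immediately; putting it in the shard just invites a newidle CPU
	 * to take the shard and rq locks for a task it can't use.
	 */
	if (!rq->nr_running)
		return;

	/* The rq lock is held with IRQs off on this path. */
	raw_spin_lock(&shard->lock);
	list_add_tail(&p->shared_runq_node, &shard->list);	/* field name assumed */
	raw_spin_unlock(&shard->lock);
}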

Prateek -- what do you think about this? I want to make sure you get
credit for your contributions to this series, so let me know how you'd
like to apply these changes. [1] essentially just improves much of the
logic from [3], so I'm not sure it would make sense to include it as a
separate patch. I'm happy to include a Co-authored-by tag, or to just
explicitly credit your contributions in the commit summary if you'd
prefer that.

[3]: https://lore.kernel.org/all/20230809221218.163894-7-void@manifault.com/

Thanks,
David

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag
  2023-09-29 17:01                     ` David Vernet
@ 2023-10-04  4:21                       ` K Prateek Nayak
  2023-10-04 17:20                         ` David Vernet
  0 siblings, 1 reply; 52+ messages in thread
From: K Prateek Nayak @ 2023-10-04  4:21 UTC (permalink / raw)
  To: David Vernet
  Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	tj, roman.gushchin, gautham.shenoy, aaron.lu, wuyun.abel,
	kernel-team

Hello David,

Thank you for answering my queries, I'll leave some data below to
answer yours.

On 9/29/2023 10:31 PM, David Vernet wrote:
> On Fri, Sep 01, 2023 at 01:53:12AM +0530, K Prateek Nayak wrote:
>> Hello David,
>>
>> On 9/1/2023 12:41 AM, David Vernet wrote:
>>> On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
>>>
>>> Hi Prateek,
>>>
>>>> Even with the two patches, I still observe the following lock
>>>> contention when profiling the tbench 128-clients run with IBS:
>>>>
>>>>   -   12.61%  swapper          [kernel.vmlinux]         [k] native_queued_spin_lock_slowpath
>>>>      - 10.94% native_queued_spin_lock_slowpath
>>>>         - 10.73% _raw_spin_lock
>>>>            - 9.57% __schedule
>>>>                 schedule_idle
>>>>                 do_idle
>>>>               + cpu_startup_entry
>>>>            - 0.82% task_rq_lock
>>>>                 newidle_balance
>>>>                 pick_next_task_fair
>>>>                 __schedule
>>>>                 schedule_idle
>>>>                 do_idle
>>>>               + cpu_startup_entry
>>>>
>>>> Since David mentioned rq->avg_idle check is probably not the right step
>>>> towards the solution, this experiment introduces a per-shard
>>>> "overload" flag. Similar to "rq->rd->overload", per-shard overload flag
>>>> notifies of the possibility of one or more rq covered in the shard's
>>>> domain having a queued task. shard's overload flag is set at the same
>>>> time as "rq->rd->overload", and is cleared when shard's list is found
>>>> to be empty.
>>>
>>> I think this is an interesting idea, but I feel that it's still working
>>> against the core proposition of SHARED_RUNQ, which is to enable work
>>> conservation.
>>
>> I don't think so! Work conservation is possible if there is an
>> imbalance. Consider the case where we have 15 tasks in the shared_runq but we
>> have 16 CPUs, 15 of which are running these 15 tasks, and one going
> 
> I'm not sure I'm fully following. Those 15 tasks would not be enqueued
> in the shared runq if they were being run. They would be dequeued from
> the shared_runq in __dequeue_entity(), which would be called from
> set_next_entity() before they were run. In this case, the
> shard->overload check should be equivalent to the
> !list_empty(&shard->list) check.
> 
> Oh, or is the idea that we're not bothering to pull them from the
> shared_runq if they're being woken up and enqueued on an idle core that
> will immediately run them on the next resched path? If so, I wonder if
> we would instead just want to not enqueue the task in the shared_runq at
> all? Consider that if another task comes in on an rq with
> rq->nr_running >= 2, that we still wouldn't want to pull the tasks that
> were being woken up on idle cores (nor take the overhead of inserting
> and then immediately removing them from the shared_runq).

So this is the breakdown of outcomes after peeking into the shared_runq
during newidle_balance:

                                                SHARED_RUNQ                     SHARED_RUNQ
                                        + correct cost accounting       + correct cost accounting
                                                                        + rq->avg_idle early bail

tbench throughput (normalized)		:	     1.00			2.47	       (146.84%)

attempts                                :       6,560,413                  2,273,334           (-65.35%)
shared_runq was empty                   :       2,276,307 [34.70%]         1,379,071 [60.66%]  (-39.42%)
successful at pulling task              :       2,557,158 [38.98%]           342,839 [15.08%]  (-86.59%)
unsuccessful despite fetching task      :       1,726,948 [26.32%]           551,424 [24.26%]  (-68.06%)

As you can see, there are more attempts and a greater chance of success
in the case without the rq->avg_idle check upfront. Where I believe the
problem lies is that a task is waiting to be enqueued / has already been
enqueued while we are trying to migrate a task fetched from the
shared_runq. Thus, instead of the CPU just being idle for a short
duration and then running that task, we now make the task wait until we
have fetched another task onto the CPU.

I think the scenario changes as follows with shared_runq:

- Current


      [Short Idling]	[2 tasks]                        [1 task]	[2 tasks]
	+-------+	+-------+                       +-------+	+-------+
	|	|	|	|        wakeup         |	|	|	|
	| CPU 0 |	| CPU 1 |	 on CPU0        | CPU 0 |	| CPU 1 |
	|	|	|	|       -------->       |	|	|	|
	+-------+	+-------+                       +-------+	+-------+

- With shared_runq

      [pull from CPU1]	[2 tasks]                       [2 tasks]	[1 task]
	+-------+	+-------+                       +-------+	+-------+
	|	|	|	|        wakeup         |	|	|	|
	| CPU 0 |	| CPU 1 |	 on CPU0        | CPU 0 |	| CPU 1 |
	|	|	|	|       -------->       |	|	|	|
	+-------+	+-------+                       +-------+	+-------+

We reach a similar final state but with shared_runq we've paid a price
for task migration. Worst case, the following timeline can happen:

        |
  CPU0  | [T0 R, T1 Q] [       T0 R      ] [newidle_balance] [T4 R ...
        |
        |                  pull T1 \             pull T4 /
        |
  CPU1  | [T3 R] [newidle_balance] [T1 R, T4 Q] [       T1 R      ]
        |            [T4 TTWU]
        |

With the rq->avg_idle bailout, it might end up looking like:

        |
  CPU0  | [          T0 R, T1 Q          ] [T1 R ...
        |
        |
  CPU1  | [T3 R] [ I ] [T4 R ...
        |            
        |

If possible, can you check how long the avg_idle is when running your
workload? Meanwhile, I believe there are a few workloads that exhibit
the same behavior as tbench (large-scale idling for a short duration).
Let me go check if I can see a tbench-like issue there.

> 
>> idle. Work is conserved. What we need to worry about is tasks being in
>> the shared_runq that are queued on their respective CPU. This can only
>> happen if any one of the rq has nr_running >= 2, which is also the point
>> we are setting "shard->overload".
> 
> Assuming this is about the "wakeup / enqueue to idle core" case, ok,
> this makes sense. I still think it probably makes more sense to just not
> enqueue in the shared_runq for this case though, which would allow us to
> instead just rely on list_empty(&shard->list).
> 
>> Now situation can change later and all tasks in the shared_runq might be
>> running on respective CPUs but "shard->overload" is only cleared when
>> the shared_runq becomes empty. If this is too late, maybe we can clear
>> it if periodic load balancing finds no queuing (somewhere around the
>> time we update nr_idle_scan).
>>
>> So the window where we do not go ahead with popping a task from the
>> shared_runq_shard->list is between the list being empty and at least one
>> of the CPU associated with the shard reporting nr_running >= 2, which is
>> when work conservation is needed.
> 
> So, I misread your patch the first time I reviewed it, and for some
> reason thought you were only setting shard->overload on the
> load_balance(). That's obviously not the case, and I now understand it
> better, modulo my points above being clarified.
> 
>>>
>>>> With these changes, following are the results for tbench 128-clients:
>>>
>>> Just to make sure I understand, this is to address the contention we're
>>> observing on tbench with 64 - 256 clients, right?  That's my
>>> understanding from Gautham's reply in [0].
>>>
>>> [0]: https://lore.kernel.org/all/ZOc7i7wM0x4hF4vL@BLR-5CG11610CF.amd.com/
>>
>> I'm not sure if Gautham saw a contention with IBS but he did see an
>> abnormal blowup in newidle_balance() counts which he suspected were the
>> cause for the regression. I noticed the rq lock contention after doing a
>> fair bit of surgery. Let me go check if that was the case with vanilla
>> v3 too.
>>
>>>
>>> If so, are we sure this change won't regress other workloads that would
>>> have benefited from the work conservation?
>>
>> I don't think we'll regress any workloads as I explained above because
>> the "overload" flag being 0 almost (since update/access is not atomic)
>> always indicate a case where the tasks cannot be pulled. However, that
>> needs to be tested since there is a small behavior change in
>> shared_runq_pick_next_task(). Where previously if the task is running
>> on CPU, we would have popped it from shared_runq, did some lock
>> fiddling before finding out it is running, some more lock fiddling
>> before the function returned "-1", now with the changes here, it'll
>> simply return a "0" and although that is correct, we have seen some
>> interesting cases in past [1] where a random lock contention actually
>> helps certain benchmarks ¯\_(ツ)_/¯
> 
> I don't think we need to worry about less lock contention possibly
> hurting benchmarks :-)

Yup :)

> 
>> [1] https://lore.kernel.org/all/44428f1e-ca2c-466f-952f-d5ad33f12073@amd.com/ 
>>
>>>
>>> Also, I assume that you don't see the improved contention without this,
>>> even if you include your fix to the newidle_balance() that has us skip
>>> over the <= LLC domain?
>>
>> No improvements! The lock is still very much contended for. I wonder if
>> it could be because of the unlocking and locking the rq again in
>> shared_runq_pick_next_task() even when the task is on the CPU. Also since it
>> returns -1 for this case, pick_next_task_fair() will return RETRY_TASK
>> which can have further implications.
> 
> Yeah, I could see it being an issue if we're essentially thrashing tasks
> in the shared_runq that are just temporarily enqueued in the shared_runq
> between activate and doing a resched pass on an idle core.
> 
> Unfortunately, I don't think we have any choice but to drop and
> reacquire the rq lock. It's not safe to look at task_cpu(p) without
> pi_lock, and we can't safely acquire that without dropping the rq lock.

True that. We wouldn't want to run into a deadlock scenario or cause
more lock contention with double locking :(

> 
> Thanks,
> David

--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag
  2023-10-04  4:21                       ` K Prateek Nayak
@ 2023-10-04 17:20                         ` David Vernet
  2023-10-05  3:50                           ` K Prateek Nayak
  0 siblings, 1 reply; 52+ messages in thread
From: David Vernet @ 2023-10-04 17:20 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	tj, roman.gushchin, gautham.shenoy, aaron.lu, wuyun.abel,
	kernel-team

On Wed, Oct 04, 2023 at 09:51:18AM +0530, K Prateek Nayak wrote:
> Hello David,

Hello Prateek,

> 
> Thank you for answering my queries, I'll leave some data below to
> answer yours.
> 
> On 9/29/2023 10:31 PM, David Vernet wrote:
> > On Fri, Sep 01, 2023 at 01:53:12AM +0530, K Prateek Nayak wrote:
> >> Hello David,
> >>
> >> On 9/1/2023 12:41 AM, David Vernet wrote:
> >>> On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
> >>>
> >>> Hi Prateek,
> >>>
> >>>> Even with the two patches, I still observe the following lock
> >>>> contention when profiling the tbench 128-clients run with IBS:
> >>>>
> >>>>   -   12.61%  swapper          [kernel.vmlinux]         [k] native_queued_spin_lock_slowpath
> >>>>      - 10.94% native_queued_spin_lock_slowpath
> >>>>         - 10.73% _raw_spin_lock
> >>>>            - 9.57% __schedule
> >>>>                 schedule_idle
> >>>>                 do_idle
> >>>>               + cpu_startup_entry
> >>>>            - 0.82% task_rq_lock
> >>>>                 newidle_balance
> >>>>                 pick_next_task_fair
> >>>>                 __schedule
> >>>>                 schedule_idle
> >>>>                 do_idle
> >>>>               + cpu_startup_entry
> >>>>
> >>>> Since David mentioned rq->avg_idle check is probably not the right step
> >>>> towards the solution, this experiment introduces a per-shard
> >>>> "overload" flag. Similar to "rq->rd->overload", per-shard overload flag
> >>>> notifies of the possibility of one or more rq covered in the shard's
> >>>> domain having a queued task. shard's overload flag is set at the same
> >>>> time as "rq->rd->overload", and is cleared when shard's list is found
> >>>> to be empty.
> >>>
> >>> I think this is an interesting idea, but I feel that it's still working
> >>> against the core proposition of SHARED_RUNQ, which is to enable work
> >>> conservation.
> >>
> >> I don't think so! Work conservation is possible if there is an
> >> imbalance. Consider the case where we have 15 tasks in the shared_runq but we
> >> have 16 CPUs, 15 of which are running these 15 tasks, and one going
> > 
> > I'm not sure I'm fully following. Those 15 tasks would not be enqueued
> > in the shared runq if they were being run. They would be dequeued from
> > the shared_runq in __dequeue_entity(), which would be called from
> > set_next_entity() before they were run. In this case, the
> > shard->overload check should be equivalent to the
> > !list_empty(&shard->list) check.
> > 
> > Oh, or is the idea that we're not bothering to pull them from the
> > shared_runq if they're being woken up and enqueued on an idle core that
> > will immediately run them on the next resched path? If so, I wonder if
> > we would instead just want to not enqueue the task in the shared_runq at
> > all? Consider that if another task comes in on an rq with
> > rq->nr_running >= 2, that we still wouldn't want to pull the tasks that
> > were being woken up on idle cores (nor take the overhead of inserting
> > and then immediately removing them from the shared_runq).

Friendly ping on this point. This is the only scenario where I could see
the overload check helping, so I want to make sure I'm understanding it
and am correct in that just avoiding enqueueing the task in the shard in
this scenario would give us the same benefit.

> So this is the breakdown of outcomes after peeking into the shared_runq
> during newidle_balance:
> 
>                                                 SHARED_RUNQ                     SHARED_RUNQ
>                                         + correct cost accounting       + correct cost accounting
>                                                                         + rq->avg_idle early bail
> 
> tbench throughput (normalized)		:	     1.00			2.47	       (146.84%)
> 
> attempts                                :       6,560,413                  2,273,334           (-65.35%)
> shared_runq was empty                   :       2,276,307 [34.70%]         1,379,071 [60.66%]  (-39.42%)
> successful at pulling task              :       2,557,158 [38.98%]           342,839 [15.08%]  (-86.59%)
> unsuccessful despite fetching task      :       1,726,948 [26.32%]           551,424 [24.26%]  (-68.06%)
> 
> As you can see, there are more attempts and a greater chance of success
> in the case without the rq->avg_idle check upfront. Where I believe the
> problem lies is that a task is waiting to be enqueued / has already been
> enqueued while we are trying to migrate a task fetched from the
> shared_runq. Thus, instead of the CPU just being idle for a short
> duration and then running that task, we now make the task wait until we
> have fetched another task onto the CPU.
>
> I think the scenario changes as follows with shared_runq:
> 
> - Current
> 
> 
>       [Short Idling]	[2 tasks]                        [1 task]	[2 tasks]
> 	+-------+	+-------+                       +-------+	+-------+
> 	|	|	|	|        wakeup         |	|	|	|
> 	| CPU 0 |	| CPU 1 |	 on CPU0        | CPU 0 |	| CPU 1 |
> 	|	|	|	|       -------->       |	|	|	|
> 	+-------+	+-------+                       +-------+	+-------+
> 
> - With shared_runq
> 
>       [pull from CPU1]	[2 tasks]                       [2 tasks]	[1 task]
> 	+-------+	+-------+                       +-------+	+-------+
> 	|	|	|	|        wakeup         |	|	|	|
> 	| CPU 0 |	| CPU 1 |	 on CPU0        | CPU 0 |	| CPU 1 |
> 	|	|	|	|       -------->       |	|	|	|
> 	+-------+	+-------+                       +-------+	+-------+
> 
> We reach a similar final state but with shared_runq we've paid a price
> for task migration. Worst case, the following timeline can happen:
> 
>         |
>   CPU0  | [T0 R, T1 Q] [       T0 R      ] [newidle_balance] [T4 R ...
>         |
>         |                  pull T1 \             pull T4 /
>         |
>   CPU1  | [T3 R] [newidle_balance] [T1 R, T4 Q] [       T1 R      ]
>         |            [T4 TTWU]
>         |
> 
> With the rq->avg_idle bailout, it might end up looking like:
> 
>         |
>   CPU0  | [          T0 R, T1 Q          ] [T1 R ...
>         |
>         |
>   CPU1  | [T3 R] [ I ] [T4 R ...
>         |            
>         |

This certainly seems possible, and wouldn't be terribly surprising or
unexpected. Taking a step back here, I want to be clear that I do
understand the motivation for including the rq->avg_idle check for
SHARED_RUNQ; even just conceptually, and regardless of the numbers you
and others have observed for workloads that do these short sleeps. The
whole idea behind that check is that we want to avoid doing
newidle_balance() if the overhead of doing newidle_balance() would
exceed the amount of time that a task was blocked. Makes sense. Why
would you take the overhead of balancing if you have reason to believe
that a task is likely to be idle for less time than it takes to do a
migration?

There's certainly a reasonable argument for why that should also apply
to SHARED_RUNQ. If the overhead of doing a SHARED_RUNQ migration is
greater than the amount of time that an sd is expected to be idle, then
it's not worth bothering with SHARED_RUNQ either. On the other hand, the
claim of SHARED_RUNQ is that it's faster than doing a regular balance
pass, because we're doing an O(# shards) iteration to find tasks (before
sharding it was O(1)), rather than O(# CPUs). So if we also do the
rq->avg_idle check, that basically means that SHARED_RUNQ becomes a
cache for a full load_balance() call.
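
To make the tradeoff concrete, here's roughly where the two checks would
sit in newidle_balance() -- a paraphrased skeleton for illustration (the
early bail is approximately what mainline does today; the SHARED_RUNQ
hook placement and return handling are simplified, not the series code):

static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
{
	int pulled_task = 0;

	/* ... */

	/*
	 * Early bail: if this rq is expected to be idle for less time than
	 * a migration costs, or nothing in the root domain is overloaded,
	 * skip balancing entirely.
	 */
	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
	    !READ_ONCE(this_rq->rd->overload))
		goto out;

	/*
	 * With the SHARED_RUNQ pull sitting below the bail, the shard is
	 * only consulted when a full load_balance() would have been
	 * attempted anyway -- i.e. it becomes a cache for load_balance().
	 */
	if (sched_feat(SHARED_RUNQ)) {
		pulled_task = shared_runq_pick_next_task(this_rq, rf);
		if (pulled_task)
			return pulled_task;
	}

	/* ... regular sched domain walk with load_balance() ... */
out:
	return pulled_task;
}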

Maybe that makes sense and is ultimately the correct design /
implementation for the feature. I'm not fundamentally opposed to that,
but I think we should be cognizant of the tradeoff we're making. If we
don't include this rq->avg_idle check, then some workloads will regress
because we're doing excessive migrations, but if we do check it, then
others will also regress because we're doing insufficient migrations due
to incorrectly assuming that an rq won't be idle for long. On yet
another hand, maybe it's fine to allow users to work around that by
setting sysctl_sched_migration_cost_ns = 0? That only sort of works,
because we ignore that and set rq->max_idle_balance_cost = curr_cost in
newidle_balance() if we end up doing a balance pass. I also know that
Peter and others discourage the use of these debugfs knobs, so I'm not
sure it's even applicable to point that out as a workaround.

And so hopefully the problem starts to become clear. It doesn't take
long for us to get mired in heuristics that make it difficult to
reason about the expected behavior of the feature, and also difficult to
reason about future changes as these heuristics have now all crossed
streams. Maybe that's OK, and is preferable to the alternative. My
personal opinion, however, is that it's preferable to provide users with
knobs that do straightforward things that are independent from existing
heuristics and knobs which were added for other circumstances. I'd
rather have confidence that I understand how a feature is supposed to
work, and can easily reason about when it's stupid (or not) to use it,
vs. have an expectation for it to not regress workloads in any scenario.

Note that this doesn't mean we can't make my patches less dumb. I think
your suggestions to e.g. check the overload flag (or possibly even
better to just not enqueue in a shard if the rq isn't overloaded),
re-check ttwu->pending after failing to find a task in the shard, etc
make complete sense. There's no downside -- we're just avoiding
pointless work. It's the heuristics like checking rq->avg_idle that
really worry me.
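
For the ttwu_pending part, the sketch I have in mind is something like
this inside newidle_balance(), reusing its existing 'out' label
(illustrative only -- the exact placement is an assumption, not code
from the series):

	if (sched_feat(SHARED_RUNQ)) {
		pulled_task = shared_runq_pick_next_task(this_rq, rf);
		if (pulled_task)
			goto out;

		/*
		 * The shard was empty, but a remote CPU has already queued
		 * a wakeup for us via the wake list; a task is about to
		 * land on this rq, so don't start a balance pass that
		 * would only delay it.
		 */
		if (READ_ONCE(this_rq->ttwu_pending))
			goto out;
	}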

Peter -- I think it would be helpful if you could weigh in here just to
provide your thoughts on this more "philosophical" question.

> If possible, can you check how long the avg_idle is when running your
> workload? Meanwhile, I believe there are a few workloads that exhibit
> the same behavior as tbench (large-scale idling for a short duration).
> Let me go check if I can see a tbench-like issue there.

Sure thing, in the meantime I'll test this out on HHVM. I've actually
been working on getting a build + testbed ready for a few days, so
hopefully it won't take much longer to get some results. Even if it
turns out that this works great for HHVM, I'd ideally like to get
Peter's and others' thoughts on the above.

Thanks,
David

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag
  2023-10-04 17:20                         ` David Vernet
@ 2023-10-05  3:50                           ` K Prateek Nayak
  0 siblings, 0 replies; 52+ messages in thread
From: K Prateek Nayak @ 2023-10-05  3:50 UTC (permalink / raw)
  To: David Vernet
  Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	tj, roman.gushchin, gautham.shenoy, aaron.lu, wuyun.abel,
	kernel-team

Hello David,

On 10/4/2023 10:50 PM, David Vernet wrote:
> On Wed, Oct 04, 2023 at 09:51:18AM +0530, K Prateek Nayak wrote:
>> Hello David,
> 
> Hello Prateek,
> 
>>
>> Thank you for answering my queries, I'll leave some data below to
>> answer yours.
>>
>> On 9/29/2023 10:31 PM, David Vernet wrote:
>>> On Fri, Sep 01, 2023 at 01:53:12AM +0530, K Prateek Nayak wrote:
>>>> Hello David,
>>>>
>>>> On 9/1/2023 12:41 AM, David Vernet wrote:
>>>>> On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
>>>>>
>>>>> Hi Prateek,
>>>>>
>>>>>> Even with the two patches, I still observe the following lock
>>>>>> contention when profiling the tbench 128-clients run with IBS:
>>>>>>
>>>>>>   -   12.61%  swapper          [kernel.vmlinux]         [k] native_queued_spin_lock_slowpath
>>>>>>      - 10.94% native_queued_spin_lock_slowpath
>>>>>>         - 10.73% _raw_spin_lock
>>>>>>            - 9.57% __schedule
>>>>>>                 schedule_idle
>>>>>>                 do_idle
>>>>>>               + cpu_startup_entry
>>>>>>            - 0.82% task_rq_lock
>>>>>>                 newidle_balance
>>>>>>                 pick_next_task_fair
>>>>>>                 __schedule
>>>>>>                 schedule_idle
>>>>>>                 do_idle
>>>>>>               + cpu_startup_entry
>>>>>>
>>>>>> Since David mentioned rq->avg_idle check is probably not the right step
>>>>>> towards the solution, this experiment introduces a per-shard
>>>>>> "overload" flag. Similar to "rq->rd->overload", per-shard overload flag
>>>>>> notifies of the possibility of one or more rq covered in the shard's
>>>>>> domain having a queued task. shard's overload flag is set at the same
>>>>>> time as "rq->rd->overload", and is cleared when shard's list is found
>>>>>> to be empty.
>>>>>
>>>>> I think this is an interesting idea, but I feel that it's still working
>>>>> against the core proposition of SHARED_RUNQ, which is to enable work
>>>>> conservation.
>>>>
>>>> I don't think so! Work conservation is possible if there is an
>>>> imbalance. Consider the case where we have 15 tasks in the shared_runq but we
>>>> have 16 CPUs, 15 of which are running these 15 tasks, and one going
>>>
>>> I'm not sure I'm fully following. Those 15 tasks would not be enqueued
>>> in the shared runq if they were being run. They would be dequeued from
>>> the shared_runq in __dequeue_entity(), which would be called from
>>> set_next_entity() before they were run. In this case, the
>>> shard->overload check should be equivalent to the
>>> !list_empty(&shard->list) check.
>>>
>>> Oh, or is the idea that we're not bothering to pull them from the
>>> shared_runq if they're being woken up and enqueued on an idle core that
>>> will immediately run them on the next resched path? If so, I wonder if
>>> we would instead just want to not enqueue the task in the shared_runq at
>>> all? Consider that if another task comes in on an rq with
>>> rq->nr_running >= 2, that we still wouldn't want to pull the tasks that
>>> were being woken up on idle cores (nor take the overhead of inserting
>>> and then immediately removing them from the shared_runq).
> 
> Friendly ping on this point. This is the only scenario where I could see
> the overload check helping, so I want to make sure I'm understanding it
> and am correct in that just avoiding enqueueing the task in the shard in
> this scenario would give us the same benefit.

Woops! Missed answering this. So the original motivation for
'shard->overload' was the rq lock contention, very likely a result of
shared_runq_pick_next_task() trying to grab a remote rq's lock. Looking
at shared_runq_pick_next_task(), the criterion
"!task_on_cpu(src_rq, p)" led me to believe we might end up enqueuing a
task that is running on the CPU, but now that I take a look at
shared_runq_enqueue_task() being called from __enqueue_entity(), this
should be a very rare scenario.

However, if a running task is often left enqueued in a shared_runq, a
cleared 'shard->overload' is an indication that none of the runqueues
covered by the shard are overloaded and hence, peeking into the shard
can be skipped. Let me see if I can grab some more stats to verify what
exactly is happening.
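
For reference, the gist of the RFC is roughly the below (simplified
sketch, not the literal patch; the list-node field name is assumed, and
the real patch also sets the flag from the same spot where
"rq->rd->overload" is set):

struct shared_runq_shard {
	struct list_head	list;
	raw_spinlock_t		lock;
	int			overload;	/* hint: a covered rq has nr_running >= 2 */
} ____cacheline_aligned;

static struct task_struct *shared_runq_shard_pop(struct shared_runq_shard *shard)
{
	struct task_struct *p;

	/* Deliberately racy read: skip the lock if nothing can be pulled. */
	if (!READ_ONCE(shard->overload))
		return NULL;

	raw_spin_lock(&shard->lock);
	p = list_first_entry_or_null(&shard->list, struct task_struct,
				     shared_runq_node);	/* field name assumed */
	if (p)
		list_del_init(&p->shared_runq_node);
	else
		WRITE_ONCE(shard->overload, 0);	/* list is empty: clear the hint */
	raw_spin_unlock(&shard->lock);

	return p;
}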

> 
>> So this is the breakdown of outcomes after peeking into the shared_runq
>> during newidle_balance:
>>
>>                                                 SHARED_RUNQ                     SHARED_RUNQ
>>                                         + correct cost accounting       + correct cost accounting
>>                                                                         + rq->avg_idle early bail
>>
>> tbench throughput (normalized)		:	     1.00			2.47	       (146.84%)
>>
>> attempts                                :       6,560,413                  2,273,334           (-65.35%)
>> shared_runq was empty                   :       2,276,307 [34.70%]         1,379,071 [60.66%]  (-39.42%)
>> successful at pulling task              :       2,557,158 [38.98%]           342,839 [15.08%]  (-86.59%)
>> unsuccessful despite fetching task      :       1,726,948 [26.32%]           551,424 [24.26%]  (-68.06%)
>>
>> As you can see, there are more attempts and a greater chance of success
>> in the case without the rq->avg_idle check upfront. Where I believe the
>> problem lies is that a task is waiting to be enqueued / has already been
>> enqueued while we are trying to migrate a task fetched from the
>> shared_runq. Thus, instead of the CPU just being idle for a short
>> duration and then running that task, we now make the task wait until we
>> have fetched another task onto the CPU.
>>
>> I think the scenario changes as follows with shared_runq:
>>
>> - Current
>>
>>
>>       [Short Idling]	[2 tasks]                        [1 task]	[2 tasks]
>> 	+-------+	+-------+                       +-------+	+-------+
>> 	|	|	|	|        wakeup         |	|	|	|
>> 	| CPU 0 |	| CPU 1 |	 on CPU0        | CPU 0 |	| CPU 1 |
>> 	|	|	|	|       -------->       |	|	|	|
>> 	+-------+	+-------+                       +-------+	+-------+
>>
>> - With shared_runq
>>
>>       [pull from CPU1]	[2 tasks]                       [2 tasks]	[1 task]
>> 	+-------+	+-------+                       +-------+	+-------+
>> 	|	|	|	|        wakeup         |	|	|	|
>> 	| CPU 0 |	| CPU 1 |	 on CPU0        | CPU 0 |	| CPU 1 |
>> 	|	|	|	|       -------->       |	|	|	|
>> 	+-------+	+-------+                       +-------+	+-------+
>>
>> We reach a similar final state but with shared_runq we've paid a price
>> for task migration. Worst case, the following timeline can happen:
>>
>>         |
>>   CPU0  | [T0 R, T1 Q] [       T0 R      ] [newidle_balance] [T4 R ...
>>         |
>>         |                  pull T1 \             pull T4 /
>>         |
>>   CPU1  | [T3 R] [newidle_balance] [T1 R, T4 Q] [       T1 R      ]
>>         |            [T4 TTWU]
>>         |
>>
>> With the rq->avg_idle bailout, it might end up looking like:
>>
>>         |
>>   CPU0  | [          T0 R, T1 Q          ] [T1 R ...
>>         |
>>         |
>>   CPU1  | [T3 R] [ I ] [T4 R ...
>>         |            
>>         |
> 
> This certainly seems possible, and wouldn't be terribly surprising or
> unexpected. Taking a step back here, I want to be clear that I do
> understand the motivation for including the rq->avg_idle check for
> SHARED_RUNQ; even just conceptually, and regardless of the numbers you
> and others have observed for workloads that do these short sleeps. The
> whole idea behind that check is that we want to avoid doing
> newidle_balance() if the overhead of doing newidle_balance() would
> exceed the amount of time that a task was blocked. Makes sense. Why
> would you take the overhead of balancing if you have reason to believe
> that a task is likely to be idle for less time than it takes to do a
> migration?
> 
> There's certainly a reasonable argument for why that should also apply
> to SHARED_RUNQ. If the overhead of doing a SHARED_RUNQ migration is
> greater than the amount of time that an sd is expected to be idle, then
> it's not worth bothering with SHARED_RUNQ either. On the other hand, the
> claim of SHARED_RUNQ is that it's faster than doing a regular balance
> pass, because we're doing an O(# shards) iteration to find tasks (before
> sharding it was O(1)), rather than O(# CPUs). So if we also do the
> rq->avg_idle check, that basically means that SHARED_RUNQ becomes a
> cache for a full load_balance() call.
> 
> Maybe that makes sense and is ultimately the correct design /
> implementation for the feature. I'm not fundamentally opposed to that,
> but I think we should be cognizant of the tradeoff we're making. If we
> don't include this rq->avg_idle check, then some workloads will regress
> because we're doing excessive migrations, but if we do check it, then
> others will also regress because we're doing insufficient migrations due
> to incorrectly assuming that an rq won't be idle for long. On yet
> another hand, maybe it's fine to allow users to work around that by
> setting sysctl_sched_migration_cost_ns = 0? That only sort of works,
> because we ignore that and set rq->max_idle_balance_cost = curr_cost in
> newidle_balance() if we end up doing a balance pass. I also know that
> Peter and others discourage the use of these debugfs knobs, so I'm not
> sure it's even applicable to point that out as a workaround.
> 
> And so hopefully the problem starts to become clear. It doesn't take
> long for us to get mired in heuristics that make it difficult to
> reason about the expected behavior of the feature, and also difficult to
> reason about future changes as these heuristics have now all crossed
> streams. Maybe that's OK, and is preferable to the alternative. My
> personal opinion, however, is that it's preferable to provide users with
> knobs that do straightforward things that are independent from existing
> heuristics and knobs which were added for other circumstances. I'd
> rather have confidence that I understand how a feature is supposed to
> work, and can easily reason about when it's stupid (or not) to use it,
> vs. have an expectation for it to not regress workloads in any scenario.
> 
> Note that this doesn't mean we can't make my patches less dumb. I think
> your suggestions to e.g. check the overload flag (or possibly even
> better to just not enqueue in a shard if the rq isn't overloaded),
> re-check ttwu->pending after failing to find a task in the shard, etc
> make complete sense. There's no downside -- we're just avoiding
> pointless work. It's the heuristics like checking rq->avg_idle that
> really worry me.

I agree since avg_idle is merely a prediction that may or may not be
true.

> 
> Peter -- I think it would be helpful if you could weigh in here just to
> provide your thoughts on this more "philosophical" question.
> 
>> If possible, can you check how long the avg_idle is when running your
>> workload? Meanwhile, I believe there are a few workloads that exhibit
>> the same behavior as tbench (large-scale idling for a short duration).
>> Let me go check if I can see a tbench-like issue there.
> 
> Sure thing, in the meantime I'll test this out on HHVM. I've actually
> been working on getting a build + testbed ready for a few days, so
> hopefully it won't take much longer to get some results. Even if it
> turns out that this works great for HHVM, I'd ideally like to get
> Peter's and others' thoughts on the above.

I'll gather some more data too in the meantime :)

> 
> Thanks,
> David
 
--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag
  2023-10-03 21:05                       ` David Vernet
@ 2023-10-07  2:10                         ` Chen Yu
  0 siblings, 0 replies; 52+ messages in thread
From: Chen Yu @ 2023-10-07  2:10 UTC (permalink / raw)
  To: David Vernet
  Cc: K Prateek Nayak, linux-kernel, peterz, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, tj, roman.gushchin, gautham.shenoy, aaron.lu,
	wuyun.abel, kernel-team

Hi David,

On 2023-10-03 at 16:05:11 -0500, David Vernet wrote:
> On Wed, Sep 27, 2023 at 02:59:29PM +0800, Chen Yu wrote:
> > Hi Prateek,
> 
> Hi Chenyu,
> 
> > On 2023-09-27 at 09:53:13 +0530, K Prateek Nayak wrote:
> > > Hello David,
> > > 
> > > Some more test results (although this might be slightly irrelevant with
> > > next version around the corner)
> > > 
> > > On 9/1/2023 12:41 AM, David Vernet wrote:
> > > > On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
> > > > 
> > > -> With EEVDF
> > > 
> > > o tl;dr
> > > 
> > > - Same as what was observed without EEVDF  but shared_runq shows
> > >   serious regression with multiple more variants of tbench and
> > >   netperf now.
> > > 
> > > o Kernels
> > > 
> > > eevdf			: tip:sched/core at commit b41bbb33cf75 ("Merge branch 'sched/eevdf' into sched/core")
> > > shared_runq		: eevdf + correct time accounting with v3 of the series without any other changes
> > > shared_runq_idle_check	: shared_runq + move the rq->avg_idle check before peeking into the shared_runq
> > > 			  (the rd->overload check still remains below the shared_runq access)
> > >
> > 
> > I did not see any obvious regression on a Sapphire Rapids server and it seems that
> > the result on your platform suggests that C/S workload could be impacted
> > by shared_runq. Meanwhile some individual workloads like HHVM in David's environment
> > (no shared resource between tasks if I understand correctly) could benefit from
> 
> Correct, hhvmworkers are largely independent, though they do sometimes
> synchronize, and they also sometimes rely on I/O happening in other
> tasks.
> 
> > shared_runq a lot. This makes me wonder if we can let shared_runq skip the C/S tasks.
> 
> I'm also open to this possibility, but I worry that we'd be going down
> the same rabbit hole as what fair.c does already, which is to use
> heuristics to determine when something should or shouldn't be migrated,
> etc. I really do feel that there's value in SHARED_RUNQ providing
> consistent and predictable work conservation behavior.
> 
> On the other hand, it's clear that there are things we can do to improve
> performance for some of these client/server workloads that hammer the
> runqueue on larger CCXs / sockets. If we can avoid those regressions
> while still having reasonably high confidence that work conservation
> won't disproportionately suffer, I'm open to us making some tradeoffs
> and/or adding a bit of complexity to avoid some of this unnecessary
> contention.
> 

Since I did not observe any regression(although I did not test hackbench
yet) on the latest version you sent to me, I'm OK with postponing the
client/server optimization to make the patchset simple, and Prateek
has other proposal to deal with the regression.

> I think it's probably about time for v4 to be sent out. What do you
> folks think about including:
>

It's OK for me and I can launch the test once the latest version is released.

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 0/7] sched: Implement shared runqueue in CFS
  2023-08-09 22:12 [PATCH v3 0/7] sched: Implement shared runqueue in CFS David Vernet
                   ` (7 preceding siblings ...)
  2023-08-17  8:42 ` [PATCH v3 0/7] sched: Implement shared runqueue in CFS Gautham R. Shenoy
@ 2023-11-27  8:28 ` Aboorva Devarajan
  2023-11-27 19:49   ` David Vernet
  2023-12-04 19:30 ` David Vernet
  9 siblings, 1 reply; 52+ messages in thread
From: Aboorva Devarajan @ 2023-11-27  8:28 UTC (permalink / raw)
  To: David Vernet
  Cc: peterz, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot, vschneid, tj, roman.gushchin,
	gautham.shenoy, kprateek.nayak, aaron.lu, wuyun.abel,
	kernel-team, linux-kernel

On Wed, 2023-08-09 at 17:12 -0500, David Vernet wrote:

Hi David,

I have been benchmarking the patch-set on a POWER9 machine to understand
its impact. However, I've run into recurring hard-lockups in
newidle_balance, specifically when the SHARED_RUNQ feature is enabled. It
doesn't happen all the time, but it's something worth noting. I wanted
to inform you about this, and I can provide more details if needed.

-----------------------------------------

Some initial information regarding the hard-lockup:

Base Kernel:
-----------

Base kernel is up to commit 88c56cfeaec4 ("sched/fair: Block nohz
tick_stop when cfs bandwidth in use").

Patched Kernel:
-------------

Base Kernel + v3 (shared runqueue patch-set)(
https://lore.kernel.org/all/20230809221218.163894-1-void@manifault.com/
)

The hard-lockup mostly occurs when running the Apache2 benchmarks with
ab (Apache HTTP benchmarking tool) on the patched kernel. However, this
problem is not exclusive to the mentioned benchmark and only occurs
while the SHARED_RUNQ feature is enabled. Disabling the SHARED_RUNQ
feature prevents the occurrence of the lockup.

ab (Apache HTTP benchmarking tool): 
https://httpd.apache.org/docs/2.4/programs/ab.html

Hardlockup with Patched Kernel:
------------------------------

[ 3289.727912][  C123] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 3289.727943][  C123] rcu: 	124-...0: (1 GPs behind) idle=f174/1/0x4000000000000000 softirq=12283/12289 fqs=732
[ 3289.727976][  C123] rcu: 	(detected by 123, t=2103 jiffies, g=127061, q=5517 ncpus=128)
[ 3289.728008][  C123] Sending NMI from CPU 123 to CPUs 124:
[ 3295.182378][  C123] CPU 124 didn't respond to backtrace IPI, inspecting paca.
[ 3295.182403][  C123] irq_soft_mask: 0x01 in_mce: 0 in_nmi: 0 current: 15 (ksoftirqd/124)
[ 3295.182421][  C123] Back trace of paca->saved_r1 (0xc000000de13e79b0) (possibly stale):
[ 3295.182437][  C123] Call Trace:
[ 3295.182456][  C123] [c000000de13e79b0] [c000000de13e7a70] 0xc000000de13e7a70 (unreliable)
[ 3295.182477][  C123] [c000000de13e7ac0] [0000000000000008] 0x8
[ 3295.182500][  C123] [c000000de13e7b70] [c000000de13e7c98] 0xc000000de13e7c98
[ 3295.182519][  C123] [c000000de13e7ba0] [c0000000001da8bc] move_queued_task+0x14c/0x280
[ 3295.182557][  C123] [c000000de13e7c30] [c0000000001f22d8] newidle_balance+0x648/0x940
[ 3295.182602][  C123] [c000000de13e7d30] [c0000000001f26ac] pick_next_task_fair+0x7c/0x680
[ 3295.182647][  C123] [c000000de13e7dd0] [c0000000010f175c] __schedule+0x15c/0x1040
[ 3295.182675][  C123] [c000000de13e7ec0] [c0000000010f26b4] schedule+0x74/0x140
[ 3295.182694][  C123] [c000000de13e7f30] [c0000000001c4994] smpboot_thread_fn+0x244/0x250
[ 3295.182731][  C123] [c000000de13e7f90] [c0000000001bc6e8] kthread+0x138/0x140
[ 3295.182769][  C123] [c000000de13e7fe0] [c00000000000ded8] start_kernel_thread+0x14/0x18
[ 3295.182806][  C123] rcu: rcu_sched kthread starved for 544 jiffies! g127061 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=66
[ 3295.182845][  C123] rcu: 	Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 3295.182878][  C123] rcu: RCU grace-period kthread stack dump:

-----------------------------------------

[ 3943.438625][  C112] watchdog: CPU 112 self-detected hard LOCKUP @ _raw_spin_lock_irqsave+0x4c/0xc0
[ 3943.438631][  C112] watchdog: CPU 112 TB:115060212303626, last heartbeat TB:115054309631589 (11528ms ago)
[ 3943.438673][  C112] CPU: 112 PID: 2090 Comm: kworker/112:2 Tainted: G        W    L     6.5.0-rc2-00028-g7475adccd76b #51
[ 3943.438676][  C112] Hardware name: 8335-GTW POWER9 (raw) 0x4e1203 opal:skiboot-v6.5.3-35-g1851b2a06 PowerNV
[ 3943.438678][  C112] Workqueue:  0x0 (events)
[ 3943.438682][  C112] NIP:  c0000000010ff01c LR: c0000000001d1064 CTR: c0000000001e8580
[ 3943.438684][  C112] REGS: c000007fffb6bd60 TRAP: 0900   Tainted: G        W    L      (6.5.0-rc2-00028-g7475adccd76b)
[ 3943.438686][  C112] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24082222  XER: 00000000
[ 3943.438693][  C112] CFAR: 0000000000000000 IRQMASK: 1 
[ 3943.438693][  C112] GPR00: c0000000001d1064 c000000e16d1fb20 c0000000014e8200 c000000e092fed3c 
[ 3943.438693][  C112] GPR04: c000000e16d1fc58 c000000e092fe3c8 00000000000000e1 fffffffffffe0000 
[ 3943.438693][  C112] GPR08: 0000000000000000 00000000000000e1 0000000000000000 c00000000299ccd8 
[ 3943.438693][  C112] GPR12: 0000000024088222 c000007ffffb8300 c0000000001bc5b8 c000000deb46f740 
[ 3943.438693][  C112] GPR16: 0000000000000008 c000000e092fe280 0000000000000001 c000007ffedd7b00 
[ 3943.438693][  C112] GPR20: 0000000000000001 c0000000029a1280 0000000000000000 0000000000000001 
[ 3943.438693][  C112] GPR24: 0000000000000000 c000000e092fed3c c000000e16d1fdf0 c00000000299ccd8 
[ 3943.438693][  C112] GPR28: c000000e16d1fc58 c0000000021fbf00 c000007ffee6bf00 0000000000000001 
[ 3943.438722][  C112] NIP [c0000000010ff01c] _raw_spin_lock_irqsave+0x4c/0xc0
[ 3943.438725][  C112] LR [c0000000001d1064] task_rq_lock+0x64/0x1b0
[ 3943.438727][  C112] Call Trace:
[ 3943.438728][  C112] [c000000e16d1fb20] [c000000e16d1fb60] 0xc000000e16d1fb60 (unreliable)
[ 3943.438731][  C112] [c000000e16d1fb50] [c000000e16d1fbf0] 0xc000000e16d1fbf0
[ 3943.438733][  C112] [c000000e16d1fbf0] [c0000000001f214c] newidle_balance+0x4bc/0x940
[ 3943.438737][  C112] [c000000e16d1fcf0] [c0000000001f26ac] pick_next_task_fair+0x7c/0x680
[ 3943.438739][  C112] [c000000e16d1fd90] [c0000000010f175c] __schedule+0x15c/0x1040
[ 3943.438743][  C112] [c000000e16d1fe80] [c0000000010f26b4] schedule+0x74/0x140
[ 3943.438747][  C112] [c000000e16d1fef0] [c0000000001afd44] worker_thread+0x134/0x580
[ 3943.438749][  C112] [c000000e16d1ff90] [c0000000001bc6e8] kthread+0x138/0x140
[ 3943.438753][  C112] [c000000e16d1ffe0] [c00000000000ded8] start_kernel_thread+0x14/0x18
[ 3943.438756][  C112] Code: 63e90001 992d0932 a12d0008 3ce0fffe 5529083c 61290001 7d001

-----------------------------------------

System configuration:
--------------------

# lscpu
Architecture:                    ppc64le
Byte Order:                      Little Endian
CPU(s):                          128
On-line CPU(s) list:             0-127
Thread(s) per core:              4
Core(s) per socket:              16
Socket(s):                       2
NUMA node(s):                    8
Model:                           2.3 (pvr 004e 1203)
Model name:                      POWER9 (raw), altivec supported
Frequency boost:                 enabled
CPU max MHz:                     3800.0000
CPU min MHz:                     2300.0000
L1d cache:                       1 MiB
L1i cache:                       1 MiB
NUMA node0 CPU(s):               64-127
NUMA node8 CPU(s):               0-63
NUMA node250 CPU(s):             
NUMA node251 CPU(s):             
NUMA node252 CPU(s):             
NUMA node253 CPU(s):             
NUMA node254 CPU(s):             
NUMA node255 CPU(s):             

# uname -r
6.5.0-rc2-00028-g7475adccd76b

# cat /sys/kernel/debug/sched/features
GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY
CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_HRTICK_DL NO_DOUBLE_TICK
NONTASK_CAPACITY TTWU_QUEUE NO_SIS_PROP SIS_UTIL NO_WARN_DOUBLE_CLOCK
RT_PUSH_IPI NO_RT_RUNTIME_SHARE NO_LB_MIN ATTACH_AGE_LOAD WA_IDLE
WA_WEIGHT WA_BIAS UTIL_EST UTIL_EST_FASTUP NO_LATENCY_WARN ALT_PERIOD
BASE_SLICE HZ_BW SHARED_RUNQ

-----------------------------------------

Please let me know if I've missed anything here. I'll continue
investigating and share any additional information I find.

Thanks and Regards,
Aboorva


> Changes
> -------
> 
> This is v3 of the shared runqueue patchset. This patch set is based
> off
> of commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs
> bandwidth in use") on the sched/core branch of tip.git.
> 
> v1 (RFC): 
> https://lore.kernel.org/lkml/20230613052004.2836135-1-void@manifault.com/
> v2: 
> https://lore.kernel.org/lkml/20230710200342.358255-1-void@manifault.com/
> 
> v2 -> v3 changes:
> - Don't leave stale tasks in the lists when the SHARED_RUNQ feature
> is
>   disabled (Abel Wu)
> 
> - Use raw spin lock instead of spinlock_t (Peter)
> 
> - Fix return value from shared_runq_pick_next_task() to match the
>   semantics expected by newidle_balance() (Gautham, Abel)
> 
> - Fold patch __enqueue_entity() / __dequeue_entity() into previous
> patch
>   (Peter)
> 
> - Skip <= LLC domains in newidle_balance() if SHARED_RUNQ is enabled
>   (Peter)
> 
> - Properly support hotplug and recreating sched domains (Peter)
> 
> - Avoid unnecessary task_rq_unlock() + raw_spin_rq_lock() when src_rq
> ==
>   target_rq in shared_runq_pick_next_task() (Abel)
> 
> - Only issue list_del_init() in shared_runq_dequeue_task() if the
> task
>   is still in the list after acquiring the lock (Aaron Lu)
> 
> - Slightly change shared_runq_shard_idx() to make it more likely to
> keep
>   SMT siblings on the same bucket (Peter)
> 
> v1 -> v2 changes:
> - Change name from swqueue to shared_runq (Peter)
> 
> - Shard per-LLC shared runqueues to avoid contention on scheduler-
> heavy
>   workloads (Peter)
> 
> - Pull tasks from the shared_runq in newidle_balance() rather than in
>   pick_next_task_fair() (Peter and Vincent)
> 
> - Rename a few functions to reflect their actual purpose. For
> example,
>   shared_runq_dequeue_task() instead of swqueue_remove_task() (Peter)
> 
> - Expose move_queued_task() from core.c rather than migrate_task_to()
>   (Peter)
> 
> - Properly check is_cpu_allowed() when pulling a task from a
> shared_runq
>   to ensure it can actually be migrated (Peter and Gautham)
> 
> - Dropped RFC tag
> 
> Overview
> ========
> 
> The scheduler must constantly strike a balance between work
> conservation, and avoiding costly migrations which harm performance
> due
> to e.g. decreased cache locality. The matter is further complicated
> by
> the topology of the system. Migrating a task between cores on the
> same
> LLC may be more optimal than keeping a task local to the CPU, whereas
> migrating a task between LLCs or NUMA nodes may tip the balance in
> the
> other direction.
> 
> With that in mind, while CFS is by and large mostly a work conserving
> scheduler, there are certain instances where the scheduler will
> choose
> to keep a task local to a CPU, when it would have been more optimal
> to
> migrate it to an idle core.
> 
> An example of such a workload is the HHVM / web workload at Meta.
> HHVM
> is a VM that JITs Hack and PHP code in service of web requests. Like
> other JIT / compilation workloads, it tends to be heavily CPU bound,
> and
> exhibit generally poor cache locality. To try and address this, we
> set
> several debugfs (/sys/kernel/debug/sched) knobs on our HHVM
> workloads:
> 
> - migration_cost_ns -> 0
> - latency_ns -> 20000000
> - min_granularity_ns -> 10000000
> - wakeup_granularity_ns -> 12000000
> 
> These knobs are intended both to encourage the scheduler to be as
> work
> conserving as possible (migration_cost_ns -> 0), and also to keep
> tasks
> running for relatively long time slices so as to avoid the overhead
> of
> context switching (the other knobs). Collectively, these knobs
> provide a
> substantial performance win; resulting in roughly a 20% improvement
> in
> throughput. Worth noting, however, is that this improvement is _not_
> at
> full machine saturation.
> 
> That said, even with these knobs, we noticed that CPUs were still
> going
> idle even when the host was overcommitted. In response, we wrote the
> "shared runqueue" (SHARED_RUNQ) feature proposed in this patch set.
> The
> idea behind SHARED_RUNQ is simple: it enables the scheduler to be
> more
> aggressively work conserving by placing a waking task into a sharded
> per-LLC FIFO queue which can then be pulled from by another core in the
> LLC before it goes idle.
> 
> With this simple change, we were able to achieve a 1 - 1.6%
> improvement
> in throughput, as well as a small, consistent improvement in p95 and
> p99
> latencies, in HHVM. These performance improvements were in addition
> to
> the wins from the debugfs knobs mentioned above, and to other
> benchmarks
> outlined below in the Results section.
> 
> Design
> ======
> 
> Note that the design described here reflects sharding, which is the
> implementation added in the final patch of the series (following the
> initial unsharded implementation added in patch 6/7). The design is
> described that way in this commit summary as the benchmarks described
> in
> the results section below all reflect a sharded SHARED_RUNQ.
> 
> The design of SHARED_RUNQ is quite simple. A shared_runq is simply a
> list of struct shared_runq_shard objects, which itself is simply a
> struct list_head of tasks, and a spinlock:
> 
> struct shared_runq_shard {
> 	struct list_head list;
> 	raw_spinlock_t lock;
> } ____cacheline_aligned;
> 
> struct shared_runq {
> 	u32 num_shards;
> 	struct shared_runq_shard shards[];
> } ____cacheline_aligned;
> 
> We create a struct shared_runq per LLC, ensuring they're in their own
> cachelines to avoid false sharing between CPUs on different LLCs, and
> we
> create a number of struct shared_runq_shard objects that are housed
> there.
> 
> When a task first wakes up, it enqueues itself in the
> shared_runq_shard
> of its current LLC at the end of enqueue_task_fair(). Enqueues only
> happen if the task was not manually migrated to the current core by
> select_task_rq(), and is not pinned to a specific CPU.
> 
> A core will pull a task from the shards in its LLC's shared_runq at
> the
> beginning of newidle_balance().
> 
> Difference between SHARED_RUNQ and SIS_NODE
> ===========================================
> 
> In [0] Peter proposed a patch that addresses Tejun's observations
> that
> when workqueues are targeted towards a specific LLC on his Zen2
> machine
> with small CCXs, that there would be significant idle time due to
> select_idle_sibling() not considering anything outside of the current
> LLC.
> 
> This patch (SIS_NODE) is essentially the complement to the proposal
> here. SIS_NODE causes waking tasks to look for idle cores in
> neighboring
> LLCs on the same die, whereas SHARED_RUNQ causes cores about to go
> idle
> to look for enqueued tasks. That said, in its current form, the two
> features are at a different scope, as SIS_NODE searches for idle cores
> between LLCs, while SHARED_RUNQ enqueues tasks within a single LLC.
> 
> The patch was since removed in [1], and we compared the results to
> SHARED_RUNQ (previously called "swqueue") in [2]. SIS_NODE did not
> outperform SHARED_RUNQ on any of the benchmarks, so we elect to not
> compare against it again for this v2 patch set.
> 
> [0]: 
> https://lore.kernel.org/all/20230530113249.GA156198@hirez.programming.kicks-ass.net/
> [1]: 
> https://lore.kernel.org/all/20230605175636.GA4253@hirez.programming.kicks-ass.net/
> [2]: 
> https://lore.kernel.org/lkml/20230613052004.2836135-1-void@manifault.com/
> 
> Worth noting as well is that it was pointed out in [3] that the logic behind
> including SIS_NODE in the first place should apply to SHARED_RUNQ
> (meaning that e.g. very small Zen2 CPUs with only 3/4 cores per LLC
> should benefit from having a single shared_runq stretch across
> multiple
> LLCs). I drafted a patch that implements this by having a minimum LLC
> size for creating a shard, and stretches a shared_runq across
> multiple
> LLCs if they're smaller than that size, and sent it to Tejun to test
> on
> his Zen2. Tejun reported back that SIS_NODE did not seem to make a
> difference:
> 
> [3]: 
> https://lore.kernel.org/lkml/20230711114207.GK3062772@hirez.programming.kicks-ass.net/
> 
> 			    o____________o__________o
> 			    |    mean    | Variance |
> 			    o------------o----------o
> Vanilla:		    | 108.84s    | 0.0057   |
> NO_SHARED_RUNQ:		    | 108.82s    | 0.119s   |
> SHARED_RUNQ:		    | 108.17s    | 0.038s   |
> SHARED_RUNQ w/ SIS_NODE:    | 108.87s    | 0.111s   |
> 			    o------------o----------o
> 
> I similarly tried running kcompile on SHARED_RUNQ with SIS_NODE on my
> 7950X Zen3, but didn't see any gain relative to plain SHARED_RUNQ
> (though
> a gain was observed relative to NO_SHARED_RUNQ, as described below).
> 
> Results
> =======
> 
> Note that the motivation for the shared runqueue feature was
> originally
> arrived at using experiments in the sched_ext framework that's
> currently
> being proposed upstream. The ~1 - 1.6% improvement in HHVM throughput
> is similarly visible using work-conserving sched_ext schedulers (even
> very simple ones like global FIFO).
> 
> In both single and multi socket / CCX hosts, this can measurably
> improve
> performance. In addition to the performance gains observed on our
> internal web workloads, we also observed an improvement in common
> workloads such as kernel compile and hackbench, when running shared
> runqueue.
> 
> On the other hand, some workloads suffer from SHARED_RUNQ. Workloads
> that hammer the runqueue hard, such as netperf UDP_RR or schbench -L
> -m 52 -p 512 -r 10 -t 1, regress with it enabled. This can be
> mitigated somewhat by sharding the shared data structures within a
> CCX, but it doesn't seem to eliminate all contention in every
> scenario. On the positive side, sharding does not appear to materially
> harm the benchmarks run for this patch series, and in fact seems to
> improve some workloads such as kernel compile.
> 
> Note that for the kernel compile workloads below, the compilation was
> done by running make -j$(nproc) built-in.a on several different types
> of hosts configured with make allyesconfig on commit a27648c74210
> ("afs: Fix setting of mtime when creating a file/dir/symlink") on
> Linus' tree (boost and turbo were disabled on all of these hosts when
> the experiments were performed).
> 
> Finally, note that these results were from the patch set built off of
> commit ebb83d84e49b ("sched/core: Avoid multiple calling
> update_rq_clock() in __cfsb_csd_unthrottle()") on the sched/core
> branch
> of tip.git for easy comparison with the v2 patch set results. The
> patches in their final form from this set were rebased onto commit
> 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs bandwidth in
> use") on the sched/core branch of tip.git.
> 
> === Single-socket | 16 core / 32 thread | 2-CCX | AMD 7950X Zen4 ===
> 
> CPU max MHz: 5879.8818
> CPU min MHz: 3000.0000
> 
> Command: make -j$(nproc) built-in.a
> 			    o____________o__________o
> 			    |    mean    | Variance |
> 			    o------------o----------o
> NO_SHARED_RUNQ:		    | 581.95s    | 2.639s   |
> SHARED_RUNQ:		    | 577.02s    | 0.084s   |
> 			    o------------o----------o
> 
> Takeaway: SHARED_RUNQ results in a statistically significant ~.85%
> improvement over NO_SHARED_RUNQ. This suggests that enqueuing tasks
> in
> the shared runqueue on every enqueue improves work conservation, and
> thanks to sharding, does not result in contention.
> 
> Command: hackbench --loops 10000
>                             o____________o__________o
>                             |    mean    | Variance |
>                             o------------o----------o
> NO_SHARED_RUNQ:             | 2.2492s    | .00001s  |
> SHARED_RUNQ:		    | 2.0217s    | .00065s  |
>                             o------------o----------o
> 
> Takeaway: SHARED_RUNQ in both forms performs exceptionally well
> compared to NO_SHARED_RUNQ here, beating it by over 10%. This was a
> surprising result: because hackbench tasks are short lived, sending
> only 10k bytes worth of messages, it would seem advantageous to err on
> the side of avoiding migration, but the results of the benchmark
> suggest that minimizing runqueue delays is preferable.
> 
> Command:
> for i in `seq 128`; do
>     netperf -6 -t UDP_RR -c -C -l $runtime &
> done
>                             o_______________________o
>                             | Throughput | Variance |
>                             o-----------------------o
> NO_SHARED_RUNQ:             | 25037.45   | 2243.44  |
> SHARED_RUNQ:                | 24952.50   | 1268.06  |
>                             o-----------------------o
> 
> Takeaway: No statistical significance, though it is worth noting that
> there is no regression for shared runqueue on the 7950X, while there
> is
> a small regression on the Skylake and Milan hosts for SHARED_RUNQ as
> described below.
> 
> === Single-socket | 18 core / 36 thread | 1-CCX | Intel Skylake ===
> 
> CPU max MHz: 1601.0000
> CPU min MHz: 800.0000
> 
> Command: make -j$(nproc) built-in.a
> 			    o____________o__________o
> 			    |    mean    | Variance |
> 			    o------------o----------o
> NO_SHARED_RUNQ:		    | 1517.44s   | 2.8322s  |
> SHARED_RUNQ:		    | 1516.51s   | 2.9450s  |
> 			    o------------o----------o
> 
> Takeaway: There's no statistically significant gain here. I observed
> what I claimed was a .23% win in v2, but it appears that this is not
> actually statistically significant.
> 
> Command: hackbench --loops 10000
>                             o____________o__________o
>                             |    mean    | Variance |
>                             o------------o----------o
> NO_SHARED_RUNQ:             | 5.3370s    | .0012s   |
> SHARED_RUNQ:		    | 5.2668s    | .0033s   |
>                             o------------o----------o
> 
> Takeaway: SHARED_RUNQ results in a ~1.3% improvement over
> NO_SHARED_RUNQ. This is also statistically significant, but smaller
> than the 10+% improvement observed on the 7950X.
> 
> Command: netperf -n $(nproc) -l 60 -t TCP_RR
> for i in `seq 128`; do
>         netperf -6 -t UDP_RR -c -C -l $runtime &
> done
>                             o_______________________o
>                             | Throughput | Variance |
>                             o-----------------------o
> NO_SHARED_RUNQ:             | 15699.32   | 377.01   |
> SHARED_RUNQ:                | 14966.42   | 714.13   |
>                             o-----------------------o
> 
> Takeaway: NO_SHARED_RUNQ beats SHARED_RUNQ by ~4.6%. This result
> makes
> sense -- the workload is very heavy on the runqueue, so enqueuing
> tasks
> in the shared runqueue in __enqueue_entity() would intuitively result
> in
> increased contention on the shard lock.
> 
> === Single-socket | 72-core | 6-CCX | AMD Milan Zen3 ===
> 
> CPU max MHz: 700.0000
> CPU min MHz: 700.0000
> 
> Command: make -j$(nproc) built-in.a
> 			    o____________o__________o
> 			    |    mean    | Variance |
> 			    o------------o----------o
> NO_SHARED_RUNQ:		    | 1568.55s   | 0.1568s  |
> SHARED_RUNQ:		    | 1568.26s   | 1.2168s  |
> 			    o------------o----------o
> 
> Takeaway: No statistically significant difference here. It might be
> worth experimenting with work stealing in a follow-on patch set.
> 
> Command: hackbench --loops 10000
>                             o____________o__________o
>                             |    mean    | Variance |
>                             o------------o----------o
> NO_SHARED_RUNQ:             | 5.2716s    | .00143s  |
> SHARED_RUNQ:		    | 5.1716s    | .00289s  |
>                             o------------o----------o
> 
> Takeaway: SHARED_RUNQ again wins, by about 2%.
> 
> Command: netperf -n $(nproc) -l 60 -t TCP_RR
> for i in `seq 128`; do
>         netperf -6 -t UDP_RR -c -C -l $runtime &
> done
>                             o_______________________o
>                             | Throughput | Variance |
>                             o-----------------------o
> NO_SHARED_RUNQ:             | 17482.03   | 4675.99  |
> SHARED_RUNQ:                | 16697.25   | 9812.23  |
>                             o-----------------------o
> 
> Takeaway: Similar to the Skylake runs, NO_SHARED_RUNQ still beats
> SHARED_RUNQ, this time by ~4.5%. It's worth noting that in v2, the
> NO_SHARED_RUNQ was only ~1.8% faster. The variance is very high here,
> so
> the results of this benchmark should be taken with a large grain of
> salt (noting that we do consistently see NO_SHARED_RUNQ on top due to
> not contending on the shard lock).
> 
> Finally, let's look at how sharding affects the following schbench
> incantation suggested by Chris in [4]:
> 
> schbench -L -m 52 -p 512 -r 10 -t 1
> 
> [4]: 
> https://lore.kernel.org/lkml/c8419d9b-2b31-2190-3058-3625bdbcb13d@meta.com/
> 
> The TL;DR is that sharding improves things a lot, but doesn't
> completely
> fix the problem. Here are the results from running the schbench
> command
> on the 18 core / 36 thread single CCX, single-socket Skylake:
> 
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> class name       con-bounces   contentions  waittime-min  waittime-max  waittime-total  waittime-avg  acq-bounces  acquisitions  holdtime-min  holdtime-max  holdtime-total  holdtime-avg
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 
> &shard->lock:       31510503      31510711          0.08         19.98    168932319.64          5.36     31700383      31843851          0.03         17.50     10273968.33          0.32
> ------------
> &shard->lock       15731657   [<0000000068c0fd75>] pick_next_task_fair+0x4dd/0x510
> &shard->lock       15756516   [<000000001faf84f9>] enqueue_task_fair+0x459/0x530
> &shard->lock          21766   [<00000000126ec6ab>] newidle_balance+0x45a/0x650
> &shard->lock            772   [<000000002886c365>] dequeue_task_fair+0x4c9/0x540
> ------------
> &shard->lock          23458   [<00000000126ec6ab>] newidle_balance+0x45a/0x650
> &shard->lock       16505108   [<000000001faf84f9>] enqueue_task_fair+0x459/0x530
> &shard->lock       14981310   [<0000000068c0fd75>] pick_next_task_fair+0x4dd/0x510
> &shard->lock            835   [<000000002886c365>] dequeue_task_fair+0x4c9/0x540
> 
> These results are when we create only 3 shards (16 logical cores per
> shard), so the contention may be a result of overly-coarse sharding.
> If we run the schbench incantation with no sharding whatsoever, we see
> significantly worse contention in the lock stats:
> 
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> class name       con-bounces   contentions  waittime-min  waittime-max  waittime-total  waittime-avg  acq-bounces  acquisitions  holdtime-min  holdtime-max  holdtime-total  holdtime-avg
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 
> &shard->lock:      117868635     118361486          0.09        393.01   1250954097.25         10.57    119345882     119780601          0.05        343.35     38313419.51          0.32
> ------------
> &shard->lock       59169196   [<0000000060507011>] __enqueue_entity+0xdc/0x110
> &shard->lock       59084239   [<00000000f1c67316>] __dequeue_entity+0x78/0xa0
> &shard->lock         108051   [<00000000084a6193>] newidle_balance+0x45a/0x650
> ------------
> &shard->lock       60028355   [<0000000060507011>] __enqueue_entity+0xdc/0x110
> &shard->lock         119882   [<00000000084a6193>] newidle_balance+0x45a/0x650
> &shard->lock       58213249   [<00000000f1c67316>] __dequeue_entity+0x78/0xa0
> 
> The contention is ~3-4x worse if we don't shard at all. This roughly
> matches the fact that we had 3 shards on the first workload run above.
> If we make the shards even smaller, the contention is comparatively
> much lower:
> 
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> class name       con-bounces   contentions  waittime-min  waittime-max  waittime-total  waittime-avg  acq-bounces  acquisitions  holdtime-min  holdtime-max  holdtime-total  holdtime-avg
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 
> &shard->lock:       13839849      13877596          0.08         13.23      5389564.95          0.39     46910241      48069307          0.06         16.40     16534469.35          0.34
> ------------
> &shard->lock           3559   [<00000000ea455dcc>] newidle_balance+0x45a/0x650
> &shard->lock        6992418   [<000000002266f400>] __dequeue_entity+0x78/0xa0
> &shard->lock        6881619   [<000000002a62f2e0>] __enqueue_entity+0xdc/0x110
> ------------
> &shard->lock        6640140   [<000000002266f400>] __dequeue_entity+0x78/0xa0
> &shard->lock           3523   [<00000000ea455dcc>] newidle_balance+0x45a/0x650
> &shard->lock        7233933   [<000000002a62f2e0>] __enqueue_entity+0xdc/0x110
> 
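> As a rough illustration of that tradeoff (a toy only -- this is not the
> shared_runq_shard_idx() mapping from the patch set), dividing the CPU
> id by the shard width shows how a wider shard funnels more CPUs onto a
> single lock:
> 
> #include <stdio.h>
> 
> /* Toy mapping of a CPU to a shard index for a given shard width. */
> static int shard_idx(int cpu, int width)
> {
> 	return cpu / width;
> }
> 
> int main(void)
> {
> 	const int llc_cpus = 36;		/* the 18C/36T Skylake LLC */
> 	const int widths[] = { 36, 16, 6 };	/* no sharding, 3 shards, 6 shards */
> 
> 	for (int i = 0; i < 3; i++) {
> 		int width = widths[i];
> 		int shards = (llc_cpus + width - 1) / width;
> 
> 		printf("shard width %2d -> %d shard(s); CPU 17 -> shard %d\n",
> 		       width, shards, shard_idx(17, width));
> 	}
> 	return 0;
> }
> 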
> Interestingly, SHARED_RUNQ performs worse than NO_SHARED_RUNQ on the
> schbench benchmark on Milan as well, but we contend more on the rq
> lock than the shard lock:
> 
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> class name       con-bounces   contentions  waittime-min  waittime-max  waittime-total  waittime-avg  acq-bounces  acquisitions  holdtime-min  holdtime-max  holdtime-total  holdtime-avg
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 
> &rq->__lock:         9617614       9656091          0.10         79.64     69665812.00          7.21     18092700      67652829          0.11         82.38    344524858.87          5.09
> -----------
> &rq->__lock        6301611   [<000000003e63bf26>] task_rq_lock+0x43/0xe0
> &rq->__lock        2530807   [<00000000516703f0>] __schedule+0x72/0xaa0
> &rq->__lock         109360   [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
> &rq->__lock         178218   [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170
> -----------
> &rq->__lock        3245506   [<00000000516703f0>] __schedule+0x72/0xaa0
> &rq->__lock        1294355   [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170
> &rq->__lock        2837804   [<000000003e63bf26>] task_rq_lock+0x43/0xe0
> &rq->__lock        1627866   [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
> 
> ....................................................................................................................................................................................
> 
> &shard->lock:        7338558       7343244          0.10         35.97      7173949.14          0.98     30200858      32679623          0.08         35.59     16270584.52          0.50
> ------------
> &shard->lock        2004142   [<00000000f8aa2c91>] __dequeue_entity+0x78/0xa0
> &shard->lock        2611264   [<00000000473978cc>] newidle_balance+0x45a/0x650
> &shard->lock        2727838   [<0000000028f55bb5>] __enqueue_entity+0xdc/0x110
> ------------
> &shard->lock        2737232   [<00000000473978cc>] newidle_balance+0x45a/0x650
> &shard->lock        1693341   [<00000000f8aa2c91>] __dequeue_entity+0x78/0xa0
> &shard->lock        2912671   [<0000000028f55bb5>] __enqueue_entity+0xdc/0x110
> 
> ....................................................................................................................................................................................
> 
> If we look at the lock stats with SHARED_RUNQ disabled, the rq lock
> still
> contends the most, but it's significantly less than with it enabled:
> 
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> class name       con-bounces   contentions  waittime-min  waittime-max  waittime-total  waittime-avg  acq-bounces  acquisitions  holdtime-min  holdtime-max  holdtime-total  holdtime-avg
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 
> &rq->__lock:          791277        791690          0.12        110.54      4889787.63          6.18      1575996      62390275          0.13        112.66    316262440.56          5.07
> -----------
> &rq->__lock         263343   [<00000000516703f0>] __schedule+0x72/0xaa0
> &rq->__lock          19394   [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
> &rq->__lock           4143   [<000000003b542e83>] __task_rq_lock+0x51/0xf0
> &rq->__lock          51094   [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170
> -----------
> &rq->__lock          23756   [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
> &rq->__lock         379048   [<00000000516703f0>] __schedule+0x72/0xaa0
> &rq->__lock            677   [<000000003b542e83>] __task_rq_lock+0x51/0xf0
> 
> Worth noting is that increasing the granularity of the shards in
> general improves very runqueue-heavy workloads such as netperf UDP_RR
> and this schbench command, but it doesn't necessarily make a big
> difference for every workload, or for sufficiently small CCXs such as
> the 7950X. It may make sense to eventually allow users to control this
> with a debugfs knob, but for now we'll elect to use a default that
> resulted in good performance for the benchmarks run for this patch
> series.
> 
> Conclusion
> ==========
> 
> SHARED_RUNQ in this form provides statistically significant wins for
> several types of workloads and various CPU topologies. The reason for
> this is roughly the same for all workloads: SHARED_RUNQ encourages
> work conservation inside of a CCX by having a CPU do an O(# per-LLC
> shards) iteration over the shared_runq shards in an LLC. We could
> similarly do an O(n) iteration over all of the runqueues in the
> current LLC when a core is going idle, but that's quite costly
> (especially for larger LLCs), and sharded SHARED_RUNQ seems to provide
> a performant middle ground between doing an O(n) walk, and doing an
> O(1) pull from a single per-LLC shared runq.
> 
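> The O(# per-LLC shards) iteration mentioned above amounts to something
> like the following toy loop (illustrative only; the helpers here are
> made up, and the real code must also honor is_cpu_allowed() and the
> rq / shard locking rules): start at the shard the current CPU maps to
> and walk the remaining shards in the LLC until a task is found.
> 
> #include <stdio.h>
> 
> #define NR_SHARDS 3
> 
> /* Toy shard: just a count of queued tasks. The real shard is a
>  * list_head protected by a raw spinlock. */
> static int shard_nr_queued[NR_SHARDS] = { 0, 2, 0 };
> 
> static int my_shard_idx(int cpu)
> {
> 	return cpu % NR_SHARDS;	/* stand-in for shared_runq_shard_idx() */
> }
> 
> /* Called when @cpu is about to go idle: O(NR_SHARDS) walk, starting
>  * with the local shard so the common case is a single lookup. */
> static int pull_task(int cpu)
> {
> 	int start = my_shard_idx(cpu);
> 
> 	for (int i = 0; i < NR_SHARDS; i++) {
> 		int idx = (start + i) % NR_SHARDS;
> 
> 		if (shard_nr_queued[idx] > 0) {
> 			shard_nr_queued[idx]--;
> 			return idx;	/* pulled a task from this shard */
> 		}
> 	}
> 	return -1;	/* nothing queued anywhere in the LLC; go idle */
> }
> 
> int main(void)
> {
> 	printf("CPU 0 pulled from shard %d\n", pull_task(0));
> 	printf("CPU 0 pulled from shard %d\n", pull_task(0));
> 	printf("CPU 0 pulled from shard %d\n", pull_task(0));
> 	return 0;
> }
> 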
> For the workloads above, kernel compile and hackbench were clear
> winners
> for SHARED_RUNQ (especially in __enqueue_entity()). The reason for
> the
> improvement in kernel compile is of course that we have a heavily
> CPU-bound workload where cache locality doesn't mean much; getting a
> CPU
> is the #1 goal. As mentioned above, while I didn't expect to see an
> improvement in hackbench, the results of the benchmark suggest that
> minimizing runqueue delays is preferable to optimizing for L1/L2
> locality.
> 
> Not all workloads benefit from SHARED_RUNQ, however. Workloads that
> hammer the runqueue hard, such as netperf UDP_RR, or schbench -L -m
> 52
> -p 512 -r 10 -t 1, tend to run into contention on the shard locks;
> especially when enqueuing tasks in __enqueue_entity(). This can be
> mitigated significantly by sharding the shared datastructures within
> a
> CCX, but it doesn't eliminate all contention, as described above.
> 
> Worth noting as well is that Gautham Shenoy ran some interesting
> experiments on a few more ideas in [5], such as walking the
> shared_runq
> on the pop path until a task is found that can be migrated to the
> calling CPU. I didn't run those experiments in this patch set, but it
> might be worth doing so.
> 
> [5]: 
> https://lore.kernel.org/lkml/ZJkqeXkPJMTl49GB@BLR-5CG11610CF.amd.com/
> 
> Gautham also ran some other benchmarks in [6], which we may want to
> again try on this v3, but with boost disabled.
> 
> [6]: 
> https://lore.kernel.org/lkml/ZLpMGVPDXqWEu+gm@BLR-5CG11610CF.amd.com/
> 
> Finally, while SHARED_RUNQ in this form encourages work conservation,
> it
> of course does not guarantee it given that we don't implement any
> kind
> of work stealing between shared_runq's. In the future, we could
> potentially push CPU utilization even higher by enabling work
> stealing
> between shared_runq's, likely between CCXs on the same NUMA node.
> 
> Originally-by: Roman Gushchin <roman.gushchin@linux.dev>
> Signed-off-by: David Vernet <void@manifault.com>
> 
> David Vernet (7):
>   sched: Expose move_queued_task() from core.c
>   sched: Move is_cpu_allowed() into sched.h
>   sched: Check cpu_active() earlier in newidle_balance()
>   sched: Enable sched_feat callbacks on enable/disable
>   sched/fair: Add SHARED_RUNQ sched feature and skeleton calls
>   sched: Implement shared runqueue in CFS
>   sched: Shard per-LLC shared runqueues
> 
>  include/linux/sched.h   |   2 +
>  kernel/sched/core.c     |  52 ++----
>  kernel/sched/debug.c    |  18 ++-
>  kernel/sched/fair.c     | 340 +++++++++++++++++++++++++++++++++++++++-
>  kernel/sched/features.h |   1 +
>  kernel/sched/sched.h    |  56 ++++++-
>  kernel/sched/topology.c |   4 +-
>  7 files changed, 420 insertions(+), 53 deletions(-)
> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 0/7] sched: Implement shared runqueue in CFS
  2023-11-27  8:28 ` [PATCH v3 0/7] sched: Implement shared runqueue in CFS Aboorva Devarajan
@ 2023-11-27 19:49   ` David Vernet
  2023-12-07  6:00     ` Aboorva Devarajan
  0 siblings, 1 reply; 52+ messages in thread
From: David Vernet @ 2023-11-27 19:49 UTC (permalink / raw)
  To: Aboorva Devarajan
  Cc: peterz, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot, vschneid, tj, roman.gushchin,
	gautham.shenoy, kprateek.nayak, aaron.lu, wuyun.abel,
	kernel-team, linux-kernel

On Mon, Nov 27, 2023 at 01:58:34PM +0530, Aboorva Devarajan wrote:
> On Wed, 2023-08-09 at 17:12 -0500, David Vernet wrote:
> 
> Hi David,
> 
> I have been benchmarking the patch-set on a POWER9 machine to
> understand its impact. However, I've run into recurring hard-lockups
> in newidle_balance, specifically when the SHARED_RUNQ feature is
> enabled. It doesn't happen all the time, but it's something worth
> noting. I wanted to inform you about this, and I can provide more
> details if needed.

Hello Aboorva,

Thank you for testing out this patch set and for the report. One issue
that v4 will correct is that the shared_runq list could become corrupted
if you enable and disable the feature, as a stale task could remain in
the list after the feature has been disabled. I'll be including a fix
for that in v4, which I'm currently benchmarking, but other work keeps
preempting it.
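
To illustrate the failure mode with a contrived userspace model (this
is not the kernel code; all of the names below are made up): if the
enqueue happens while the feature is enabled, but the later dequeue is
skipped because the feature was disabled in between, the node of a task
that then goes away stays reachable from the list head.

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

/* Minimal doubly-linked node standing in for the task's list node. */
struct node {
	struct node *next, *prev;
};

static struct node head = { &head, &head };
static bool feature_enabled = true;

static void enqueue(struct node *n)
{
	if (!feature_enabled)
		return;
	n->next = head.next;
	n->prev = &head;
	head.next->prev = n;
	head.next = n;
}

static void dequeue(struct node *n)
{
	/* Modeled bug: skipping the removal when the feature is off
	 * leaves a soon-to-be-freed node reachable from the head. */
	if (!feature_enabled)
		return;
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

int main(void)
{
	struct node *task = malloc(sizeof(*task));

	enqueue(task);			/* feature on: task goes on the list */
	feature_enabled = false;	/* feature toggled off ... */
	dequeue(task);			/* ... so removal is skipped */
	free(task);			/* task goes away */

	feature_enabled = true;		/* feature re-enabled later... */
	/* ...and the next list walk would start at freed memory: */
	printf("head.next still points at the freed task: %p\n",
	       (void *)head.next);
	return 0;
}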

By any chance, did you run into this when you were enabling / disabling
the feature? Or did you just enable it once and then hit this issue
after some time, which would indicate a different issue? I'm trying to
repro using ab, but haven't been successful thus far. If you're able to
repro consistently, it might be useful to run with CONFIG_DEBUG_LIST=y.

Thanks,
David

> -----------------------------------------
> 
> Some initial information regarding the hard-lockup:
> 
> Base Kernel:
> -----------
> 
> Base kernel is up to commit 88c56cfeaec4 ("sched/fair: Block nohz
> tick_stop when cfs bandwidth in use").
> 
> Patched Kernel:
> -------------
> 
> Base Kernel + v3 (shared runqueue patch-set)(
> https://lore.kernel.org/all/20230809221218.163894-1-void@manifault.com/
> )
> 
> The hard-lockup mostly occurs when running the Apache2 benchmarks with
> ab (Apache HTTP benchmarking tool) on the patched kernel. However, this
> problem is not exclusive to the mentioned benchmark and only occurs
> while the SHARED_RUNQ feature is enabled. Disabling the SHARED_RUNQ
> feature prevents the occurrence of the lockup.
> 
> ab (Apache HTTP benchmarking tool): 
> https://httpd.apache.org/docs/2.4/programs/ab.html
> 
> Hardlockup with Patched Kernel:
> ------------------------------
> 
> [ 3289.727912][  C123] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> [ 3289.727943][  C123] rcu: 	124-...0: (1 GPs behind) idle=f174/1/0x4000000000000000 softirq=12283/12289 fqs=732
> [ 3289.727976][  C123] rcu: 	(detected by 123, t=2103 jiffies, g=127061, q=5517 ncpus=128)
> [ 3289.728008][  C123] Sending NMI from CPU 123 to CPUs 124:
> [ 3295.182378][  C123] CPU 124 didn't respond to backtrace IPI, inspecting paca.
> [ 3295.182403][  C123] irq_soft_mask: 0x01 in_mce: 0 in_nmi: 0 current: 15 (ksoftirqd/124)
> [ 3295.182421][  C123] Back trace of paca->saved_r1 (0xc000000de13e79b0) (possibly stale):
> [ 3295.182437][  C123] Call Trace:
> [ 3295.182456][  C123] [c000000de13e79b0] [c000000de13e7a70] 0xc000000de13e7a70 (unreliable)
> [ 3295.182477][  C123] [c000000de13e7ac0] [0000000000000008] 0x8
> [ 3295.182500][  C123] [c000000de13e7b70] [c000000de13e7c98] 0xc000000de13e7c98
> [ 3295.182519][  C123] [c000000de13e7ba0] [c0000000001da8bc] move_queued_task+0x14c/0x280
> [ 3295.182557][  C123] [c000000de13e7c30] [c0000000001f22d8] newidle_balance+0x648/0x940
> [ 3295.182602][  C123] [c000000de13e7d30] [c0000000001f26ac] pick_next_task_fair+0x7c/0x680
> [ 3295.182647][  C123] [c000000de13e7dd0] [c0000000010f175c] __schedule+0x15c/0x1040
> [ 3295.182675][  C123] [c000000de13e7ec0] [c0000000010f26b4] schedule+0x74/0x140
> [ 3295.182694][  C123] [c000000de13e7f30] [c0000000001c4994] smpboot_thread_fn+0x244/0x250
> [ 3295.182731][  C123] [c000000de13e7f90] [c0000000001bc6e8] kthread+0x138/0x140
> [ 3295.182769][  C123] [c000000de13e7fe0] [c00000000000ded8] start_kernel_thread+0x14/0x18
> [ 3295.182806][  C123] rcu: rcu_sched kthread starved for 544 jiffies! g127061 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=66
> [ 3295.182845][  C123] rcu: 	Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
> [ 3295.182878][  C123] rcu: RCU grace-period kthread stack dump:
> 
> -----------------------------------------
> 
> [ 3943.438625][  C112] watchdog: CPU 112 self-detected hard LOCKUP @ _raw_spin_lock_irqsave+0x4c/0xc0
> [ 3943.438631][  C112] watchdog: CPU 112 TB:115060212303626, last heartbeat TB:115054309631589 (11528ms ago)
> [ 3943.438673][  C112] CPU: 112 PID: 2090 Comm: kworker/112:2 Tainted: G        W    L     6.5.0-rc2-00028-g7475adccd76b #51
> [ 3943.438676][  C112] Hardware name: 8335-GTW POWER9 (raw) 0x4e1203 opal:skiboot-v6.5.3-35-g1851b2a06 PowerNV
> [ 3943.438678][  C112] Workqueue:  0x0 (events)
> [ 3943.438682][  C112] NIP:  c0000000010ff01c LR: c0000000001d1064 CTR: c0000000001e8580
> [ 3943.438684][  C112] REGS: c000007fffb6bd60 TRAP: 0900   Tainted: G        W    L      (6.5.0-rc2-00028-g7475adccd76b)
> [ 3943.438686][  C112] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24082222  XER: 00000000
> [ 3943.438693][  C112] CFAR: 0000000000000000 IRQMASK: 1 
> [ 3943.438693][  C112] GPR00: c0000000001d1064 c000000e16d1fb20 c0000000014e8200 c000000e092fed3c 
> [ 3943.438693][  C112] GPR04: c000000e16d1fc58 c000000e092fe3c8 00000000000000e1 fffffffffffe0000 
> [ 3943.438693][  C112] GPR08: 0000000000000000 00000000000000e1 0000000000000000 c00000000299ccd8 
> [ 3943.438693][  C112] GPR12: 0000000024088222 c000007ffffb8300 c0000000001bc5b8 c000000deb46f740 
> [ 3943.438693][  C112] GPR16: 0000000000000008 c000000e092fe280 0000000000000001 c000007ffedd7b00 
> [ 3943.438693][  C112] GPR20: 0000000000000001 c0000000029a1280 0000000000000000 0000000000000001 
> [ 3943.438693][  C112] GPR24: 0000000000000000 c000000e092fed3c c000000e16d1fdf0 c00000000299ccd8 
> [ 3943.438693][  C112] GPR28: c000000e16d1fc58 c0000000021fbf00 c000007ffee6bf00 0000000000000001 
> [ 3943.438722][  C112] NIP [c0000000010ff01c] _raw_spin_lock_irqsave+0x4c/0xc0
> [ 3943.438725][  C112] LR [c0000000001d1064] task_rq_lock+0x64/0x1b0
> [ 3943.438727][  C112] Call Trace:
> [ 3943.438728][  C112] [c000000e16d1fb20] [c000000e16d1fb60] 0xc000000e16d1fb60 (unreliable)
> [ 3943.438731][  C112] [c000000e16d1fb50] [c000000e16d1fbf0] 0xc000000e16d1fbf0
> [ 3943.438733][  C112] [c000000e16d1fbf0] [c0000000001f214c] newidle_balance+0x4bc/0x940
> [ 3943.438737][  C112] [c000000e16d1fcf0] [c0000000001f26ac] pick_next_task_fair+0x7c/0x680
> [ 3943.438739][  C112] [c000000e16d1fd90] [c0000000010f175c] __schedule+0x15c/0x1040
> [ 3943.438743][  C112] [c000000e16d1fe80] [c0000000010f26b4] schedule+0x74/0x140
> [ 3943.438747][  C112] [c000000e16d1fef0] [c0000000001afd44] worker_thread+0x134/0x580
> [ 3943.438749][  C112] [c000000e16d1ff90] [c0000000001bc6e8] kthread+0x138/0x140
> [ 3943.438753][  C112] [c000000e16d1ffe0] [c00000000000ded8] start_kernel_thread+0x14/0x18
> [ 3943.438756][  C112] Code: 63e90001 992d0932 a12d0008 3ce0fffe 5529083c 61290001 7d001
> 
> -----------------------------------------
> 
> System configuration:
> --------------------
> 
> # lscpu
> Architecture:                    ppc64le
> Byte Order:                      Little Endian
> CPU(s):                          128
> On-line CPU(s) list:             0-127
> Thread(s) per core:              4
> Core(s) per socket:              16
> Socket(s):                       2
> NUMA node(s):                    8
> Model:                           2.3 (pvr 004e 1203)
> Model name:                      POWER9 (raw), altivec supported
> Frequency boost:                 enabled
> CPU max MHz:                     3800.0000
> CPU min MHz:                     2300.0000
> L1d cache:                       1 MiB
> L1i cache:                       1 MiB
> NUMA node0 CPU(s):               64-127
> NUMA node8 CPU(s):               0-63
> NUMA node250 CPU(s):             
> NUMA node251 CPU(s):             
> NUMA node252 CPU(s):             
> NUMA node253 CPU(s):             
> NUMA node254 CPU(s):             
> NUMA node255 CPU(s):             
> 
> # uname -r
> 6.5.0-rc2-00028-g7475adccd76b
> 
> # cat /sys/kernel/debug/sched/features
> GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY
> CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_HRTICK_DL NO_DOUBLE_TICK
> NONTASK_CAPACITY TTWU_QUEUE NO_SIS_PROP SIS_UTIL NO_WARN_DOUBLE_CLOCK
> RT_PUSH_IPI NO_RT_RUNTIME_SHARE NO_LB_MIN ATTACH_AGE_LOAD WA_IDLE
> WA_WEIGHT WA_BIAS UTIL_EST UTIL_EST_FASTUP NO_LATENCY_WARN ALT_PERIOD
> BASE_SLICE HZ_BW SHARED_RUNQ
> 
> -----------------------------------------
> 
> Please let me know if I've missed anything here. I'll continue
> investigating and share any additional information I find.
> 
> Thanks and Regards,
> Aboorva
> 
> 
> > Changes
> > -------
> > 
> > This is v3 of the shared runqueue patchset. This patch set is based
> > off
> > of commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs
> > bandwidth in use") on the sched/core branch of tip.git.
> > 
> > v1 (RFC): 
> > https://lore.kernel.org/lkml/20230613052004.2836135-1-void@manifault.com/
> > v2: 
> > https://lore.kernel.org/lkml/20230710200342.358255-1-void@manifault.com/
> > 
> > v2 -> v3 changes:
> > - Don't leave stale tasks in the lists when the SHARED_RUNQ feature
> > is
> >   disabled (Abel Wu)
> > 
> > - Use raw spin lock instead of spinlock_t (Peter)
> > 
> > - Fix return value from shared_runq_pick_next_task() to match the
> >   semantics expected by newidle_balance() (Gautham, Abel)
> > 
> > - Fold patch __enqueue_entity() / __dequeue_entity() into previous
> > patch
> >   (Peter)
> > 
> > - Skip <= LLC domains in newidle_balance() if SHARED_RUNQ is enabled
> >   (Peter)
> > 
> > - Properly support hotplug and recreating sched domains (Peter)
> > 
> > - Avoid unnecessary task_rq_unlock() + raw_spin_rq_lock() when src_rq
> > ==
> >   target_rq in shared_runq_pick_next_task() (Abel)
> > 
> > - Only issue list_del_init() in shared_runq_dequeue_task() if the
> > task
> >   is still in the list after acquiring the lock (Aaron Lu)
> > 
> > - Slightly change shared_runq_shard_idx() to make it more likely to
> > keep
> >   SMT siblings on the same bucket (Peter)
> > 
> > v1 -> v2 changes:
> > - Change name from swqueue to shared_runq (Peter)
> > 
> > - Shard per-LLC shared runqueues to avoid contention on scheduler-
> > heavy
> >   workloads (Peter)
> > 
> > - Pull tasks from the shared_runq in newidle_balance() rather than in
> >   pick_next_task_fair() (Peter and Vincent)
> > 
> > - Rename a few functions to reflect their actual purpose. For
> > example,
> >   shared_runq_dequeue_task() instead of swqueue_remove_task() (Peter)
> > 
> > - Expose move_queued_task() from core.c rather than migrate_task_to()
> >   (Peter)
> > 
> > - Properly check is_cpu_allowed() when pulling a task from a
> > shared_runq
> >   to ensure it can actually be migrated (Peter and Gautham)
> > 
> > - Dropped RFC tag
> > 
> > Overview
> > ========
> > 
> > The scheduler must constantly strike a balance between work
> > conservation, and avoiding costly migrations which harm performance
> > due
> > to e.g. decreased cache locality. The matter is further complicated
> > by
> > the topology of the system. Migrating a task between cores on the
> > same
> > LLC may be more optimal than keeping a task local to the CPU, whereas
> > migrating a task between LLCs or NUMA nodes may tip the balance in
> > the
> > other direction.
> > 
> > With that in mind, while CFS is by and large mostly a work conserving
> > scheduler, there are certain instances where the scheduler will
> > choose
> > to keep a task local to a CPU, when it would have been more optimal
> > to
> > migrate it to an idle core.
> > 
> > An example of such a workload is the HHVM / web workload at Meta.
> > HHVM
> > is a VM that JITs Hack and PHP code in service of web requests. Like
> > other JIT / compilation workloads, it tends to be heavily CPU bound,
> > and
> > exhibit generally poor cache locality. To try and address this, we
> > set
> > several debugfs (/sys/kernel/debug/sched) knobs on our HHVM
> > workloads:
> > 
> > - migration_cost_ns -> 0
> > - latency_ns -> 20000000
> > - min_granularity_ns -> 10000000
> > - wakeup_granularity_ns -> 12000000
> > 
> > These knobs are intended both to encourage the scheduler to be as
> > work
> > conserving as possible (migration_cost_ns -> 0), and also to keep
> > tasks
> > running for relatively long time slices so as to avoid the overhead
> > of
> > context switching (the other knobs). Collectively, these knobs
> > provide a
> > substantial performance win; resulting in roughly a 20% improvement
> > in
> > throughput. Worth noting, however, is that this improvement is _not_
> > at
> > full machine saturation.
> > 
> > That said, even with these knobs, we noticed that CPUs were still
> > going
> > idle even when the host was overcommitted. In response, we wrote the
> > "shared runqueue" (SHARED_RUNQ) feature proposed in this patch set.
> > The
> > idea behind SHARED_RUNQ is simple: it enables the scheduler to be
> > more
> > aggressively work conserving by placing a waking task into a sharded
> > per-LLC FIFO queue that can be pulled from by another core in the LLC
> > FIFO queue which can then be pulled from before it goes idle.
> > 
> > With this simple change, we were able to achieve a 1 - 1.6%
> > improvement
> > in throughput, as well as a small, consistent improvement in p95 and
> > p99
> > latencies, in HHVM. These performance improvements were in addition
> > to
> > the wins from the debugfs knobs mentioned above, and to other
> > benchmarks
> > outlined below in the Results section.
> > 
> > Design
> > ======
> > 
> > Note that the design described here reflects sharding, which is the
> > implementation added in the final patch of the series (following the
> > initial unsharded implementation added in patch 6/7). The design is
> > described that way in this commit summary as the benchmarks described
> > in
> > the results section below all reflect a sharded SHARED_RUNQ.
> > 
> > The design of SHARED_RUNQ is quite simple. A shared_runq is simply a
> > list of struct shared_runq_shard objects, which itself is simply a
> > struct list_head of tasks, and a spinlock:
> > 
> > struct shared_runq_shard {
> > 	struct list_head list;
> > 	raw_spinlock_t lock;
> > } ____cacheline_aligned;
> > 
> > struct shared_runq {
> > 	u32 num_shards;
> > 	struct shared_runq_shard shards[];
> > } ____cacheline_aligned;
> > 
> > We create a struct shared_runq per LLC, ensuring they're in their own
> > cachelines to avoid false sharing between CPUs on different LLCs, and
> > we
> > create a number of struct shared_runq_shard objects that are housed
> > there.
> > 
> > When a task first wakes up, it enqueues itself in the
> > shared_runq_shard
> > of its current LLC at the end of enqueue_task_fair(). Enqueues only
> > happen if the task was not manually migrated to the current core by
> > select_task_rq(), and is not pinned to a specific CPU.
> > 
> > A core will pull a task from the shards in its LLC's shared_runq at
> > the
> > beginning of newidle_balance().
> > 
> > Difference between SHARED_RUNQ and SIS_NODE
> > ===========================================
> > 
> > In [0] Peter proposed a patch that addresses Tejun's observations
> > that
> > when workqueues are targeted towards a specific LLC on his Zen2
> > machine
> > with small CCXs, that there would be significant idle time due to
> > select_idle_sibling() not considering anything outside of the current
> > LLC.
> > 
> > This patch (SIS_NODE) is essentially the complement to the proposal
> > here. SID_NODE causes waking tasks to look for idle cores in
> > neighboring
> > LLCs on the same die, whereas SHARED_RUNQ causes cores about to go
> > idle
> > to look for enqueued tasks. That said, in its current form, the two
> > features at are a different scope as SIS_NODE searches for idle cores
> > between LLCs, while SHARED_RUNQ enqueues tasks within a single LLC.
> > 
> > The patch was since removed in [1], and we compared the results to
> > SHARED_RUNQ (previously called "swqueue") in [2]. SIS_NODE did not
> > outperform SHARED_RUNQ on any of the benchmarks, so we elect to not
> > compare against it again for this v2 patch set.
> > 
> > [0]: 
> > https://lore.kernel.org/all/20230530113249.GA156198@hirez.programming.kicks-ass.net/
> > [1]: 
> > https://lore.kernel.org/all/20230605175636.GA4253@hirez.programming.kicks-ass.net/
> > [2]: 
> > https://lore.kernel.org/lkml/20230613052004.2836135-1-void@manifault.com/
> > 
> > Worth noting as well is that pointed out in [3] that the logic behind
> > including SIS_NODE in the first place should apply to SHARED_RUNQ
> > (meaning that e.g. very small Zen2 CPUs with only 3/4 cores per LLC
> > should benefit from having a single shared_runq stretch across
> > multiple
> > LLCs). I drafted a patch that implements this by having a minimum LLC
> > size for creating a shard, and stretches a shared_runq across
> > multiple
> > LLCs if they're smaller than that size, and sent it to Tejun to test
> > on
> > his Zen2. Tejun reported back that SIS_NODE did not seem to make a
> > difference:
> > 
> > [3]: 
> > https://lore.kernel.org/lkml/20230711114207.GK3062772@hirez.programming.kicks-ass.net/
> > 
> > 			    o____________o__________o
> > 			    |    mean    | Variance |
> > 			    o------------o----------o
> > Vanilla:		    | 108.84s    | 0.0057   |
> > NO_SHARED_RUNQ:		    | 108.82s    | 0.119s   |
> > SHARED_RUNQ:		    | 108.17s    | 0.038s   |
> > SHARED_RUNQ w/ SIS_NODE:    | 108.87s    | 0.111s   |
> > 			    o------------o----------o
> > 
> > I similarly tried running kcompile on SHARED_RUNQ with SIS_NODE on my
> > 7950X Zen3, but didn't see any gain relative to plain SHARED_RUNQ
> > (though
> > a gain was observed relative to NO_SHARED_RUNQ, as described below).
> > 
> > Results
> > =======
> > 
> > Note that the motivation for the shared runqueue feature was
> > originally
> > arrived at using experiments in the sched_ext framework that's
> > currently
> > being proposed upstream. The ~1 - 1.6% improvement in HHVM throughput
> > is similarly visible using work-conserving sched_ext schedulers (even
> > very simple ones like global FIFO).
> > 
> > In both single and multi socket / CCX hosts, this can measurably
> > improve
> > performance. In addition to the performance gains observed on our
> > internal web workloads, we also observed an improvement in common
> > workloads such as kernel compile and hackbench, when running shared
> > runqueue.
> > 
> > On the other hand, some workloads suffer from SHARED_RUNQ. Workloads
> > that hammer the runqueue hard, such as netperf UDP_RR, or schbench -L
> > -m 52 -p 512 -r 10 -t 1. This can be mitigated somewhat by sharding
> > the
> > shared datastructures within a CCX, but it doesn't seem to eliminate
> > all
> > contention in every scenario. On the positive side, it seems that
> > sharding does not materially harm the benchmarks run for this patch
> > series; and in fact seems to improve some workloads such as kernel
> > compile.
> > 
> > Note that for the kernel compile workloads below, the compilation was
> > done by running make -j$(nproc) built-in.a on several different types
> > of
> > hosts configured with make allyesconfig on commit a27648c74210 ("afs:
> > Fix setting of mtime when creating a file/dir/symlink") on Linus'
> > tree
> > (boost and turbo were disabled on all of these hosts when the
> > experiments were performed).
> > 
> > Finally, note that these results were from the patch set built off of
> > commit ebb83d84e49b ("sched/core: Avoid multiple calling
> > update_rq_clock() in __cfsb_csd_unthrottle()") on the sched/core
> > branch
> > of tip.git for easy comparison with the v2 patch set results. The
> > patches in their final form from this set were rebased onto commit
> > 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs bandwidth in
> > use") on the sched/core branch of tip.git.
> > 
> > === Single-socket | 16 core / 32 thread | 2-CCX | AMD 7950X Zen4 ===
> > 
> > CPU max MHz: 5879.8818
> > CPU min MHz: 3000.0000
> > 
> > Command: make -j$(nproc) built-in.a
> > 			    o____________o__________o
> > 			    |    mean    | Variance |
> > 			    o------------o----------o
> > NO_SHARED_RUNQ:		    | 581.95s    | 2.639s   |
> > SHARED_RUNQ:		    | 577.02s    | 0.084s   |
> > 			    o------------o----------o
> > 
> > Takeaway: SHARED_RUNQ results in a statistically significant ~.85%
> > improvement over NO_SHARED_RUNQ. This suggests that enqueuing tasks
> > in
> > the shared runqueue on every enqueue improves work conservation, and
> > thanks to sharding, does not result in contention.
> > 
> > Command: hackbench --loops 10000
> >                             o____________o__________o
> >                             |    mean    | Variance |
> >                             o------------o----------o
> > NO_SHARED_RUNQ:             | 2.2492s    | .00001s  |
> > SHARED_RUNQ:		    | 2.0217s    | .00065s  |
> >                             o------------o----------o
> > 
> > Takeaway: SHARED_RUNQ in both forms performs exceptionally well
> > compared
> > to NO_SHARED_RUNQ here, beating it by over 10%. This was a surprising
> > result given that it seems advantageous to err on the side of
> > avoiding
> > migration in hackbench given that tasks are short lived in sending
> > only
> > 10k bytes worth of messages, but the results of the benchmark would
> > seem
> > to suggest that minimizing runqueue delays is preferable.
> > 
> > Command:
> > for i in `seq 128`; do
> >     netperf -6 -t UDP_RR -c -C -l $runtime &
> > done
> >                             o_______________________o
> >                             | Throughput | Variance |
> >                             o-----------------------o
> > NO_SHARED_RUNQ:             | 25037.45   | 2243.44  |
> > SHARED_RUNQ:                | 24952.50   | 1268.06  |
> >                             o-----------------------o
> > 
> > Takeaway: No statistical significance, though it is worth noting that
> > there is no regression for shared runqueue on the 7950X, while there
> > is
> > a small regression on the Skylake and Milan hosts for SHARED_RUNQ as
> > described below.
> > 
> > === Single-socket | 18 core / 36 thread | 1-CCX | Intel Skylake ===
> > 
> > CPU max MHz: 1601.0000
> > CPU min MHz: 800.0000
> > 
> > Command: make -j$(nproc) built-in.a
> > 			    o____________o__________o
> > 			    |    mean    | Variance |
> > 			    o------------o----------o
> > NO_SHARED_RUNQ:		    | 1517.44s   | 2.8322s  |
> > SHARED_RUNQ:		    | 1516.51s   | 2.9450s  |
> > 			    o------------o----------o
> > 
> > Takeaway: There's on statistically significant gain here. I observed
> > what I claimed was a .23% win in v2, but it appears that this is not
> > actually statistically significant.
> > 
> > Command: hackbench --loops 10000
> >                             o____________o__________o
> >                             |    mean    | Variance |
> >                             o------------o----------o
> > NO_SHARED_RUNQ:             | 5.3370s    | .0012s   |
> > SHARED_RUNQ:		    | 5.2668s    | .0033s   |
> >                             o------------o----------o
> > 
> > Takeaway: SHARED_RUNQ results in a ~1.3% improvement over
> > NO_SHARED_RUNQ. Also statistically significant, but smaller than the
> > 10+% improvement observed on the 7950X.
> > 
> > Command: netperf -n $(nproc) -l 60 -t TCP_RR
> > for i in `seq 128`; do
> >         netperf -6 -t UDP_RR -c -C -l $runtime &
> > done
> >                             o_______________________o
> >                             | Throughput | Variance |
> >                             o-----------------------o
> > NO_SHARED_RUNQ:             | 15699.32   | 377.01   |
> > SHARED_RUNQ:                | 14966.42   | 714.13   |
> >                             o-----------------------o
> > 
> > Takeaway: NO_SHARED_RUNQ beats SHARED_RUNQ by ~4.6%. This result
> > makes
> > sense -- the workload is very heavy on the runqueue, so enqueuing
> > tasks
> > in the shared runqueue in __enqueue_entity() would intuitively result
> > in
> > increased contention on the shard lock.
> > 
> > === Single-socket | 72-core | 6-CCX | AMD Milan Zen3 ===
> > 
> > CPU max MHz: 700.0000
> > CPU min MHz: 700.0000
> > 
> > Command: make -j$(nproc) built-in.a
> > 			    o____________o__________o
> > 			    |    mean    | Variance |
> > 			    o------------o----------o
> > NO_SHARED_RUNQ:		    | 1568.55s   | 0.1568s  |
> > SHARED_RUNQ:		    | 1568.26s   | 1.2168s  |
> > 			    o------------o----------o
> > 
> > Takeaway: No statistically significant difference here. It might be
> > worth experimenting with work stealing in a follow-on patch set.
> > 
> > Command: hackbench --loops 10000
> >                             o____________o__________o
> >                             |    mean    | Variance |
> >                             o------------o----------o
> > NO_SHARED_RUNQ:             | 5.2716s    | .00143s  |
> > SHARED_RUNQ:		    | 5.1716s    | .00289s  |
> >                             o------------o----------o
> > 
> > Takeaway: SHARED_RUNQ again wins, by about 2%.
> > 
> > Command: netperf -n $(nproc) -l 60 -t TCP_RR
> > for i in `seq 128`; do
> >         netperf -6 -t UDP_RR -c -C -l $runtime &
> > done
> >                             o_______________________o
> >                             | Throughput | Variance |
> >                             o-----------------------o
> > NO_SHARED_RUNQ:             | 17482.03   | 4675.99  |
> > SHARED_RUNQ:                | 16697.25   | 9812.23  |
> >                             o-----------------------o
> > 
> > Takeaway: Similar to the Skylake runs, NO_SHARED_RUNQ still beats
> > SHARED_RUNQ, this time by ~4.5%. It's worth noting that in v2, the
> > NO_SHARED_RUNQ was only ~1.8% faster. The variance is very high here,
> > so
> > the results of this benchmark should be taken with a large grain of
> > salt (noting that we do consistently see NO_SHARED_RUNQ on top due to
> > not contending on the shard lock).
> > 
> > Finally, let's look at how sharding affects the following schbench
> > incantation suggested by Chris in [4]:
> > 
> > schbench -L -m 52 -p 512 -r 10 -t 1
> > 
> > [4]: 
> > https://lore.kernel.org/lkml/c8419d9b-2b31-2190-3058-3625bdbcb13d@meta.com/
> > 
> > The TL;DR is that sharding improves things a lot, but doesn't
> > completely
> > fix the problem. Here are the results from running the schbench
> > command
> > on the 18 core / 36 thread single CCX, single-socket Skylake:
> > 
> > -------------------------------------------------------------------
> > -------------------------------------------------------------------
> > -----------------------------------------------------------------
> > class name         con-bounces    contentions       waittime-
> > min   waittime-max waittime-total   waittime-avg    acq-
> > bounces   acquisitions   holdtime-min   holdtime-max holdtime-
> > total   holdtime-avg
> > -------------------------------------------------------------------
> > -------------------------------------------------------------------
> > -----------------------------------------------------------------
> > 
> > &shard-
> > >lock:      31510503       31510711           0.08          19.98    
> >     168932319.64     5.36            31700383      31843851       0.0
> > 3           17.50        10273968.33      0.32
> > ------------
> > &shard->lock       15731657          [<0000000068c0fd75>]
> > pick_next_task_fair+0x4dd/0x510
> > &shard->lock       15756516          [<000000001faf84f9>]
> > enqueue_task_fair+0x459/0x530
> > &shard->lock          21766          [<00000000126ec6ab>]
> > newidle_balance+0x45a/0x650
> > &shard->lock            772          [<000000002886c365>]
> > dequeue_task_fair+0x4c9/0x540
> > ------------
> > &shard->lock          23458          [<00000000126ec6ab>]
> > newidle_balance+0x45a/0x650
> > &shard->lock       16505108          [<000000001faf84f9>]
> > enqueue_task_fair+0x459/0x530
> > &shard->lock       14981310          [<0000000068c0fd75>]
> > pick_next_task_fair+0x4dd/0x510
> > &shard->lock            835          [<000000002886c365>]
> > dequeue_task_fair+0x4c9/0x540
> > 
> > These results are when we create only 3 shards (16 logical cores per
> > shard), so the contention may be a result of overly-coarse sharding.
> > If
> > we run the schbench incantation with no sharding whatsoever, we see
> > the
> > following significantly worse lock stats contention:
> > 
> > -------------------------------------------------------------------
> > -------------------------------------------------------------------
> > -------------------------------------------------------------------
> > ----
> > class name        con-bounces    contentions         waittime-
> > min   waittime-max waittime-total         waittime-avg    acq-
> > bounces   acquisitions   holdtime-min  holdtime-max holdtime-
> > total   holdtime-avg
> > -------------------------------------------------------------------
> > -------------------------------------------------------------------
> > -------------------------------------------------------------------
> > ----
> > 
> > &shard-
> > >lock:     117868635      118361486           0.09           393.01  
> >      1250954097.25          10.57           119345882     119780601  
> >     0.05          343.35       38313419.51      0.32
> > ------------
> > &shard->lock       59169196          [<0000000060507011>]
> > __enqueue_entity+0xdc/0x110
> > &shard->lock       59084239          [<00000000f1c67316>]
> > __dequeue_entity+0x78/0xa0
> > &shard->lock         108051          [<00000000084a6193>]
> > newidle_balance+0x45a/0x650
> > ------------
> > &shard->lock       60028355          [<0000000060507011>]
> > __enqueue_entity+0xdc/0x110
> > &shard->lock         119882          [<00000000084a6193>]
> > newidle_balance+0x45a/0x650
> > &shard->lock       58213249          [<00000000f1c67316>]
> > __dequeue_entity+0x78/0xa0
> > 
> > The contention is ~3-4x worse if we don't shard at all. This roughly
> > matches the fact that we had 3 shards on the first workload run
> > above.
> > If we make the shards even smaller, the contention is comparably much
> > lower:
> > 
> > -------------------------------------------------------------------
> > -------------------------------------------------------------------
> > ----------------------------------------------------------
> > class name         con-bounces    contentions   waittime-
> > min  waittime-max waittime-total   waittime-avg   acq-
> > bounces   acquisitions   holdtime-min  holdtime-max holdtime-
> > total   holdtime-avg
> > -------------------------------------------------------------------
> > -------------------------------------------------------------------
> > ----------------------------------------------------------
> > 
> > &shard-
> > >lock:      13839849       13877596      0.08          13.23        5
> > 389564.95       0.39           46910241      48069307       0.06     
> >      16.40        16534469.35      0.34
> > ------------
> > &shard->lock           3559          [<00000000ea455dcc>]
> > newidle_balance+0x45a/0x650
> > &shard->lock        6992418          [<000000002266f400>]
> > __dequeue_entity+0x78/0xa0
> > &shard->lock        6881619          [<000000002a62f2e0>]
> > __enqueue_entity+0xdc/0x110
> > ------------
> > &shard->lock        6640140          [<000000002266f400>]
> > __dequeue_entity+0x78/0xa0
> > &shard->lock           3523          [<00000000ea455dcc>]
> > newidle_balance+0x45a/0x650
> > &shard->lock        7233933          [<000000002a62f2e0>]
> > __enqueue_entity+0xdc/0x110
> > 
> > Interestingly, SHARED_RUNQ performs worse than NO_SHARED_RUNQ on the
> > schbench
> > benchmark on Milan as well, but we contend more on the rq lock than
> > the
> > shard lock:
> > 
> > -------------------------------------------------------------------
> > -------------------------------------------------------------------
> > -----------------------------------------------------------
> > class name         con-bounces    contentions   waittime-
> > min  waittime-max waittime-total   waittime-avg   acq-
> > bounces   acquisitions   holdtime-min   holdtime-max holdtime-
> > total   holdtime-avg
> > -------------------------------------------------------------------
> > -------------------------------------------------------------------
> > -----------------------------------------------------------
> > 
> > &rq-
> > >__lock:       9617614        9656091       0.10          79.64      
> >   69665812.00      7.21           18092700      67652829       0.11  
> >          82.38        344524858.87     5.09
> > -----------
> > &rq->__lock        6301611          [<000000003e63bf26>]
> > task_rq_lock+0x43/0xe0
> > &rq->__lock        2530807          [<00000000516703f0>]
> > __schedule+0x72/0xaa0
> > &rq->__lock         109360          [<0000000011be1562>]
> > raw_spin_rq_lock_nested+0xa/0x10
> > &rq->__lock         178218          [<00000000c38a30f9>]
> > sched_ttwu_pending+0x3d/0x170
> > -----------
> > &rq->__lock        3245506          [<00000000516703f0>]
> > __schedule+0x72/0xaa0
> > &rq->__lock        1294355          [<00000000c38a30f9>]
> > sched_ttwu_pending+0x3d/0x170
> > &rq->__lock        2837804          [<000000003e63bf26>]
> > task_rq_lock+0x43/0xe0
> > &rq->__lock        1627866          [<0000000011be1562>]
> > raw_spin_rq_lock_nested+0xa/0x10
> > 
> > .....................................................................
> > .....................................................................
> > ........................................................
> > 
> > &shard-
> > >lock:       7338558       7343244       0.10          35.97        7
> > 173949.14       0.98           30200858      32679623       0.08     
> >       35.59        16270584.52      0.50
> > ------------
> > &shard->lock        2004142          [<00000000f8aa2c91>]
> > __dequeue_entity+0x78/0xa0
> > &shard->lock        2611264          [<00000000473978cc>]
> > newidle_balance+0x45a/0x650
> > &shard->lock        2727838          [<0000000028f55bb5>]
> > __enqueue_entity+0xdc/0x110
> > ------------
> > &shard->lock        2737232          [<00000000473978cc>]
> > newidle_balance+0x45a/0x650
> > &shard->lock        1693341          [<00000000f8aa2c91>]
> > __dequeue_entity+0x78/0xa0
> > &shard->lock        2912671          [<0000000028f55bb5>]
> > __enqueue_entity+0xdc/0x110
> > 
> > .....................................................................
> > .....................................................................
> > .........................................................
> > 
> > If we look at the lock stats with SHARED_RUNQ disabled, the rq lock
> > still
> > contends the most, but it's significantly less than with it enabled:
> > 
> > -----------------------------------------------------------------------------------------------
> > class name       con-bounces  contentions  waittime-min  waittime-max  waittime-total  waittime-avg
> >                  acq-bounces  acquisitions holdtime-min  holdtime-max  holdtime-total  holdtime-avg
> > -----------------------------------------------------------------------------------------------
> > 
> > &rq->__lock:          791277       791690          0.12        110.54      4889787.63          6.18
> >                      1575996     62390275          0.13        112.66    316262440.56          5.07
> >              -----------
> >              &rq->__lock       263343   [<00000000516703f0>] __schedule+0x72/0xaa0
> >              &rq->__lock        19394   [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
> >              &rq->__lock         4143   [<000000003b542e83>] __task_rq_lock+0x51/0xf0
> >              &rq->__lock        51094   [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170
> >              -----------
> >              &rq->__lock        23756   [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
> >              &rq->__lock       379048   [<00000000516703f0>] __schedule+0x72/0xaa0
> >              &rq->__lock          677   [<000000003b542e83>] __task_rq_lock+0x51/0xf0
> > 
> > Worth noting is that increasing the granularity of the shards
> > generally improves performance for very runqueue-heavy workloads such
> > as netperf UDP_RR and this schbench command, but it doesn't
> > necessarily make a big difference for every workload, or for
> > sufficiently small CCXs such as those on the 7950X. It may make sense
> > to eventually let users control this with a debugfs knob, but for now
> > we've chosen a default that performed well across the benchmarks run
> > for this patch series.
> > 
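To make the granularity tradeoff concrete, here is a minimal sketch of what
such a knob could boil down to. This is purely illustrative: neither
shared_runq_shard_width nor shared_runq_num_shards() exists in the series,
which instead picks a fixed default.

static unsigned int shared_runq_shard_width = 6; /* hypothetical CPUs-per-shard knob */

static unsigned int shared_runq_num_shards(unsigned int llc_weight)
{
        /* Round up so every CPU in the LLC lands in some shard. */
        return max(1U, DIV_ROUND_UP(llc_weight, shared_runq_shard_width));
}

A smaller width means more shards and therefore less contention on each
shard lock, at the cost of a longer walk when an idle CPU scans its LLC.
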
> > Conclusion
> > ==========
> > 
> > SHARED_RUNQ in this form provides statistically significant wins for
> > several types of workloads, and various CPU topologies. The reason
> > for this is roughly the same for all workloads: SHARED_RUNQ
> > encourages work conservation inside of a CCX by having a CPU do an
> > O(# per-LLC shards) iteration over the shared_runq shards in an LLC.
> > We could similarly do an O(n) iteration over all of the runqueues in
> > the current LLC when a core is going idle, but that's quite costly
> > (especially for larger LLCs), and sharded SHARED_RUNQ seems to
> > provide a performant middle ground between doing an O(n) walk, and
> > doing an O(1) pull from a single per-LLC shared runq.
> > 
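For readers skimming the thread, a simplified sketch of that pull path (not
the actual patch code; the shared_runq_node field and the helper below are
assumptions, and rq locking plus the actual migration via move_queued_task()
are omitted):

struct shared_runq_shard {
        struct list_head list;
        raw_spinlock_t   lock;
};

/* Walk the O(# per-LLC shards) shards and pop the first usable task. */
static struct task_struct *pull_from_llc_shards(struct shared_runq_shard *shards,
                                                int nr_shards, int this_cpu)
{
        int i;

        for (i = 0; i < nr_shards; i++) {
                struct shared_runq_shard *shard = &shards[i];
                struct task_struct *p;

                raw_spin_lock(&shard->lock);
                p = list_first_entry_or_null(&shard->list, struct task_struct,
                                             shared_runq_node);
                if (p && is_cpu_allowed(p, this_cpu))
                        list_del_init(&p->shared_runq_node);
                else
                        p = NULL;
                raw_spin_unlock(&shard->lock);

                if (p)
                        return p;
        }

        return NULL;
}
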
> > For the workloads above, kernel compile and hackbench were clear
> > winners for SHARED_RUNQ (especially in __enqueue_entity()). The
> > reason for the improvement in kernel compile is of course that we
> > have a heavily CPU-bound workload where cache locality doesn't mean
> > much; getting a CPU is the #1 goal. As mentioned above, while I
> > didn't expect to see an improvement in hackbench, the results of the
> > benchmark suggest that minimizing runqueue delays is preferable to
> > optimizing for L1/L2 locality.
> > 
> > Not all workloads benefit from SHARED_RUNQ, however. Workloads that
> > hammer the runqueue hard, such as netperf UDP_RR, or schbench -L
> > -m 52 -p 512 -r 10 -t 1, tend to run into contention on the shard
> > locks; especially when enqueuing tasks in __enqueue_entity(). This
> > can be mitigated significantly by sharding the shared data
> > structures within a CCX, but it doesn't eliminate all contention, as
> > described above.
> > 
> > Worth noting as well is that Gautham Shenoy ran some interesting
> > experiments on a few more ideas in [5], such as walking the
> > shared_runq on the pop path until a task is found that can be
> > migrated to the calling CPU. I didn't run those experiments in this
> > patch set, but it might be worth doing so.
> > 
> > [5]: https://lore.kernel.org/lkml/ZJkqeXkPJMTl49GB@BLR-5CG11610CF.amd.com/
> > 
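A rough sketch of the variant tried in [5], reusing the illustrative shard
layout from the sketch above (again an assumption, not the code from [5]):
rather than only inspecting the head of the list, keep walking until a
migratable task is found.

static struct task_struct *
shard_pop_first_migratable(struct shared_runq_shard *shard, int this_cpu)
{
        struct task_struct *p, *ret = NULL;

        raw_spin_lock(&shard->lock);
        list_for_each_entry(p, &shard->list, shared_runq_node) {
                /* Skip tasks that can't run on the calling CPU. */
                if (!is_cpu_allowed(p, this_cpu))
                        continue;
                list_del_init(&p->shared_runq_node);
                ret = p;
                break;
        }
        raw_spin_unlock(&shard->lock);

        return ret;
}
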
> > Gautham also ran some other benchmarks in [6], which we may want to
> > try again on this v3, but with boost disabled.
> > 
> > [6]: https://lore.kernel.org/lkml/ZLpMGVPDXqWEu+gm@BLR-5CG11610CF.amd.com/
> > 
> > Finally, while SHARED_RUNQ in this form encourages work
> > conservation, it of course does not guarantee it given that we don't
> > implement any kind of work stealing between shared_runq's. In the
> > future, we could potentially push CPU utilization even higher by
> > enabling work stealing between shared_runq's, likely between CCXs on
> > the same NUMA node.
> > 
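Purely as an illustration of that future direction, and building on the
pull_from_llc_shards() sketch above, the idle path could fall back to the
other LLCs on the node when the local LLC has nothing to offer
(nr_node_llcs() and llc_shared_runq_shards() are made-up helpers):

static struct task_struct *steal_from_node(int this_cpu, int node, int local_llc)
{
        struct task_struct *p;
        int llc;

        for (llc = 0; llc < nr_node_llcs(node); llc++) {
                struct shared_runq_shard *shards;
                int nr_shards;

                if (llc == local_llc)
                        continue; /* the local LLC was already tried */

                shards = llc_shared_runq_shards(node, llc, &nr_shards);
                p = pull_from_llc_shards(shards, nr_shards, this_cpu);
                if (p)
                        return p;
        }

        return NULL;
}
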
> > Originally-by: Roman Gushchin <roman.gushchin@linux.dev>
> > Signed-off-by: David Vernet <void@manifault.com>
> > 
> > David Vernet (7):
> >   sched: Expose move_queued_task() from core.c
> >   sched: Move is_cpu_allowed() into sched.h
> >   sched: Check cpu_active() earlier in newidle_balance()
> >   sched: Enable sched_feat callbacks on enable/disable
> >   sched/fair: Add SHARED_RUNQ sched feature and skeleton calls
> >   sched: Implement shared runqueue in CFS
> >   sched: Shard per-LLC shared runqueues
> > 
> >  include/linux/sched.h   |   2 +
> >  kernel/sched/core.c     |  52 ++----
> >  kernel/sched/debug.c    |  18 ++-
> >  kernel/sched/fair.c     | 340 +++++++++++++++++++++++++++++++++++++++-
> >  kernel/sched/features.h |   1 +
> >  kernel/sched/sched.h    |  56 ++++++-
> >  kernel/sched/topology.c |   4 +-
> >  7 files changed, 420 insertions(+), 53 deletions(-)
> > 
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 0/7] sched: Implement shared runqueue in CFS
  2023-08-09 22:12 [PATCH v3 0/7] sched: Implement shared runqueue in CFS David Vernet
                   ` (8 preceding siblings ...)
  2023-11-27  8:28 ` [PATCH v3 0/7] sched: Implement shared runqueue in CFS Aboorva Devarajan
@ 2023-12-04 19:30 ` David Vernet
  9 siblings, 0 replies; 52+ messages in thread
From: David Vernet @ 2023-12-04 19:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot, vschneid, tj, roman.gushchin,
	gautham.shenoy, kprateek.nayak, aaron.lu, wuyun.abel,
	kernel-team, aboorvad, yu.c.chen

On Wed, Aug 09, 2023 at 05:12:11PM -0500, David Vernet wrote:
> Changes
> -------
> 
> This is v3 of the shared runqueue patchset. This patch set is based off
> of commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs
> bandwidth in use") on the sched/core branch of tip.git.
> 
> v1 (RFC): https://lore.kernel.org/lkml/20230613052004.2836135-1-void@manifault.com/
> v2: https://lore.kernel.org/lkml/20230710200342.358255-1-void@manifault.com/

Hello everyone,

I wanted to give an update on this, as I've been promising a v4 of the
patch set for quite some time. I ran more experiments over the last few
weeks, and EEVDF has changed the performance profile of the SHARED_RUNQ
feature quite a bit compared to what we were observing with CFS. We may
pick this back up again at a later point, but for now we're going to
take a step back and re-evaluate.

Thanks,
David

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 0/7] sched: Implement shared runqueue in CFS
  2023-11-27 19:49   ` David Vernet
@ 2023-12-07  6:00     ` Aboorva Devarajan
  0 siblings, 0 replies; 52+ messages in thread
From: Aboorva Devarajan @ 2023-12-07  6:00 UTC (permalink / raw)
  To: David Vernet
  Cc: peterz, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot, vschneid, tj, roman.gushchin,
	gautham.shenoy, kprateek.nayak, aaron.lu, wuyun.abel,
	kernel-team, linux-kernel

On Mon, 2023-11-27 at 13:49 -0600, David Vernet wrote:
> On Mon, Nov 27, 2023 at 01:58:34PM +0530, Aboorva Devarajan wrote:
> > On Wed, 2023-08-09 at 17:12 -0500, David Vernet wrote:
> > 
> > Hi David,
> > 
> > I have been benchmarking the patch set on a POWER9 machine to
> > understand its impact. However, I've run into recurring hard lockups
> > in newidle_balance, specifically when the SHARED_RUNQ feature is
> > enabled. It doesn't happen all the time, but it's something worth
> > noting. I wanted to inform you about this, and I can provide more
> > details if needed.
> 
> Hello Aboorva,
> 
> Thank you for testing out this patch set and for the report. One issue
> that v4 will correct is that the shared_runq list could become corrupted
> if you enable and disable the feature, as a stale task could remain in
> the list after the feature has been disabled. I'll be including a fix
> for that in v4, which I'm currently benchmarking, but other stuff keeps
> seeming to preempt it.

Hi David,

Thank you for your response. While testing, I did observe the
shared_runq list becoming corrupted when enabling and disabling the
feature. 

Please find the logs below with CONFIG_DEBUG_LIST enabled:
------------------------------------------

[ 4952.270819] list_add corruption. prev->next should be next (c0000003fae87a80), but was c0000000ba027ec8. (prev=c0000000ba027ec8).
[ 4952.270926] ------------[ cut here ]------------
[ 4952.270935] kernel BUG at lib/list_debug.c:30!
[ 4952.270947] Oops: Exception in kernel mode, sig: 5 [#1]
[ 4952.270956] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
[ 4952.271029] CPU: 10 PID: 31426 Comm: cc1 Kdump: loaded Not tainted 6.5.0-rc2+ #1
[ 4952.271042] Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_012) hv:phyp pSeries
[ 4952.271054] NIP:  c000000000872f88 LR: c000000000872f84 CTR: 00000000006d1a1c
[ 4952.271070] REGS: c00000006e1b34e0 TRAP: 0700   Not tainted  (6.5.0-rc2+)
[ 4952.271079] MSR:  8000000002029033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 28048222  XER: 00000006
[ 4952.271102] CFAR: c0000000001ffa24 IRQMASK: 1 
[ 4952.271102] GPR00: c000000000872f84 c00000006e1b3780 c0000000019a3b00 0000000000000075 
[ 4952.271102] GPR04: c0000003faff2c08 c0000003fb077e80 c00000006e1b35c8 00000003f8e70000 
[ 4952.271102] GPR08: 0000000000000027 c000000002185f30 00000003f8e70000 0000000000000001 
[ 4952.271102] GPR12: 0000000000000000 c0000003fffe2c80 c000000068ecb100 0000000000000000 
[ 4952.271102] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
[ 4952.271102] GPR20: 0000000000000000 0000000000000000 0000000000000041 c00000006e1b3bb0 
[ 4952.271102] GPR24: c000000002c72058 00000003f8e70000 0000000000000001 c00000000e919948 
[ 4952.271102] GPR28: c0000000ba027ec8 c0000003fae87a80 c000000080ce6c00 c00000000e919980 
[ 4952.271212] NIP [c000000000872f88] __list_add_valid+0xb8/0x100
[ 4952.271236] LR [c000000000872f84] __list_add_valid+0xb4/0x100
[ 4952.271248] Call Trace:
[ 4952.271254] [c00000006e1b3780] [c000000000872f84] __list_add_valid+0xb4/0x100 (unreliable)
[ 4952.271270] [c00000006e1b37e0] [c0000000001b8f50] __enqueue_entity+0x110/0x1c0
[ 4952.271288] [c00000006e1b3830] [c0000000001bec9c] enqueue_entity+0x16c/0x690
[ 4952.271301] [c00000006e1b38e0] [c0000000001bf280] enqueue_task_fair+0xc0/0x490
[ 4952.271315] [c00000006e1b3980] [c0000000001ada0c] ttwu_do_activate+0xac/0x410
[ 4952.271328] [c00000006e1b3a10] [c0000000001ae59c] try_to_wake_up+0x5fc/0x8b0
[ 4952.271341] [c00000006e1b3ae0] [c0000000001df6dc] autoremove_wake_function+0x2c/0xc0
[ 4952.271359] [c00000006e1b3b20] [c0000000001e1018] __wake_up_common+0xc8/0x240
[ 4952.271372] [c00000006e1b3b90] [c0000000001e123c] __wake_up_common_lock+0xac/0x120
[ 4952.271385] [c00000006e1b3c20] [c0000000005bd4a4] pipe_write+0xd4/0x980
[ 4952.271401] [c00000006e1b3d00] [c0000000005ad720] vfs_write+0x350/0x4b0
[ 4952.271420] [c00000006e1b3dc0] [c0000000005adc24] ksys_write+0xf4/0x140
[ 4952.271433] [c00000006e1b3e10] [c000000000031108] system_call_exception+0x128/0x340
[ 4952.271449] [c00000006e1b3e50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
[ 4952.271470] --- interrupt: 3000 at 0x7fff8df3aa34
[ 4952.271482] NIP:  00007fff8df3aa34 LR: 0000000000000000 CTR: 0000000000000000
[ 4952.271492] REGS: c00000006e1b3e80 TRAP: 3000   Not tainted  (6.5.0-rc2+)
[ 4952.271502] MSR:  800000000000f033 <SF,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 44002822  XER: 00000000
[ 4952.271526] IRQMASK: 0 
[ 4952.271526] GPR00: 0000000000000004 00007fffea094d00 0000000112467a00 0000000000000001 
[ 4952.271526] GPR04: 0000000132c6a810 0000000000002000 00000000000004e4 0000000000000036 
[ 4952.271526] GPR08: 0000000132c6c810 0000000000000000 0000000000000000 0000000000000000 
[ 4952.271526] GPR12: 0000000000000000 00007fff8e71cac0 0000000000000000 0000000000000000 
[ 4952.271526] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
[ 4952.271526] GPR20: 00007fffea09c76f 00000001123b6898 0000000000000003 0000000132c6c820 
[ 4952.271526] GPR24: 0000000112469d88 00000001124686b8 0000000132c6a810 0000000000002000 
[ 4952.271526] GPR28: 0000000000002000 00007fff8e0418e0 0000000132c6a810 0000000000002000 
[ 4952.271627] NIP [00007fff8df3aa34] 0x7fff8df3aa34
[ 4952.271635] LR [0000000000000000] 0x0
[ 4952.271642] --- interrupt: 3000
[ 4952.271648] Code: f8010070 4b98ca81 60000000 0fe00000 7c0802a6 3c62ffa6 7d064378 7d244b78 38637f68 f8010070 4b98ca5d 60000000 <0fe00000> 7c0802a6 3c62ffa6 7ca62b78 
[ 4952.271685] ---[ end trace 0000000000000000 ]---
[ 4952.282562] pstore: backend (nvram) writing error (-1)
------------------------------------------

> 
> By any chance, did you run into this when you were enabling / disabling
> the feature? Or did you just enable it once and then hit this issue
> after some time, which would indicate a different issue? I'm trying to
> repro using ab, but haven't been successful thus far. If you're able to
> repro consistently, it might be useful to run with CONFIG_DEBUG_LIST=y.
> 

Additionally, I noticed a sporadic issue that surfaced over time even
when the feature was only enabled once. However, it occurred only on one
particular system, and my attempts to recreate it were unsuccessful. I
will provide more details if I can reproduce the issue with debug
enabled. But it looks like the primary issue revolves around the
shared_runq list getting corrupted when toggling the feature on and off
repeatedly, as you pointed out.
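
Presumably the fix will have to drain the shards when the feature is
turned off, so nothing stale is left on a list that enqueue/dequeue stop
maintaining. A very rough sketch of that shape, assuming a per-shard list
protected by a raw spinlock (names are placeholders, not the actual v4
code):

static void shared_runq_drain_shard(struct shared_runq_shard *shard)
{
        struct task_struct *p, *tmp;

        raw_spin_lock(&shard->lock);
        list_for_each_entry_safe(p, tmp, &shard->list, shared_runq_node)
                list_del_init(&p->shared_runq_node);
        raw_spin_unlock(&shard->lock);
}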

I will keep an eye out for v4 and test if it's available later.

Thanks,
Aboorva


> Thanks,
> David
> 
> > -----------------------------------------
> > 
> > Some initial information regarding the hard-lockup:
> > 
> > Base Kernel:
> > -----------
> > 
> > Base kernel is upto commit 88c56cfeaec4 ("sched/fair: Block nohz
> > tick_stop when cfs bandwidth in use").
> > 
> > Patched Kernel:
> > -------------
> > 
> > Base Kernel + v3 (shared runqueue patch set):
> > https://lore.kernel.org/all/20230809221218.163894-1-void@manifault.com/
> > 
> > The hard lockup mostly occurs when running the Apache2 benchmarks
> > with ab (Apache HTTP benchmarking tool) on the patched kernel.
> > However, the problem is not exclusive to this benchmark; it only
> > occurs while the SHARED_RUNQ feature is enabled, and disabling the
> > SHARED_RUNQ feature prevents the lockup.
> > 
> > ab (Apache HTTP benchmarking tool): 
> > https://httpd.apache.org/docs/2.4/programs/ab.html
> > 
> > Hardlockup with Patched Kernel:
> > ------------------------------
> > 
> > [ 3289.727912][  C123] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> > [ 3289.727943][  C123] rcu: 	124-...0: (1 GPs behind) idle=f174/1/0x4000000000000000 softirq=12283/12289 fqs=732
> > [ 3289.727976][  C123] rcu: 	(detected by 123, t=2103 jiffies, g=127061, q=5517 ncpus=128)
> > [ 3289.728008][  C123] Sending NMI from CPU 123 to CPUs 124:
> > [ 3295.182378][  C123] CPU 124 didn't respond to backtrace IPI, inspecting paca.
> > [ 3295.182403][  C123] irq_soft_mask: 0x01 in_mce: 0 in_nmi: 0 current: 15 (ksoftirqd/124)
> > [ 3295.182421][  C123] Back trace of paca->saved_r1 (0xc000000de13e79b0) (possibly stale):
> > [ 3295.182437][  C123] Call Trace:
> > [ 3295.182456][  C123] [c000000de13e79b0] [c000000de13e7a70] 0xc000000de13e7a70 (unreliable)
> > [ 3295.182477][  C123] [c000000de13e7ac0] [0000000000000008] 0x8
> > [ 3295.182500][  C123] [c000000de13e7b70] [c000000de13e7c98] 0xc000000de13e7c98
> > [ 3295.182519][  C123] [c000000de13e7ba0] [c0000000001da8bc] move_queued_task+0x14c/0x280
> > [ 3295.182557][  C123] [c000000de13e7c30] [c0000000001f22d8] newidle_balance+0x648/0x940
> > [ 3295.182602][  C123] [c000000de13e7d30] [c0000000001f26ac] pick_next_task_fair+0x7c/0x680
> > [ 3295.182647][  C123] [c000000de13e7dd0] [c0000000010f175c] __schedule+0x15c/0x1040
> > [ 3295.182675][  C123] [c000000de13e7ec0] [c0000000010f26b4] schedule+0x74/0x140
> > [ 3295.182694][  C123] [c000000de13e7f30] [c0000000001c4994] smpboot_thread_fn+0x244/0x250
> > [ 3295.182731][  C123] [c000000de13e7f90] [c0000000001bc6e8] kthread+0x138/0x140
> > [ 3295.182769][  C123] [c000000de13e7fe0] [c00000000000ded8] start_kernel_thread+0x14/0x18
> > [ 3295.182806][  C123] rcu: rcu_sched kthread starved for 544 jiffies! g127061 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=66
> > [ 3295.182845][  C123] rcu: 	Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
> > [ 3295.182878][  C123] rcu: RCU grace-period kthread stack dump:
> > 
> > -----------------------------------------
> > 
> > [ 3943.438625][  C112] watchdog: CPU 112 self-detected hard LOCKUP @ _raw_spin_lock_irqsave+0x4c/0xc0
> > [ 3943.438631][  C112] watchdog: CPU 112 TB:115060212303626, last heartbeat TB:115054309631589 (11528ms ago)
> > [ 3943.438673][  C112] CPU: 112 PID: 2090 Comm: kworker/112:2 Tainted: G        W    L     6.5.0-rc2-00028-g7475adccd76b #51
> > [ 3943.438676][  C112] Hardware name: 8335-GTW POWER9 (raw) 0x4e1203 opal:skiboot-v6.5.3-35-g1851b2a06 PowerNV
> > [ 3943.438678][  C112] Workqueue:  0x0 (events)
> > [ 3943.438682][  C112] NIP:  c0000000010ff01c LR: c0000000001d1064 CTR: c0000000001e8580
> > [ 3943.438684][  C112] REGS: c000007fffb6bd60 TRAP: 0900   Tainted: G        W    L      (6.5.0-rc2-00028-g7475adccd76b)
> > [ 3943.438686][  C112] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24082222  XER: 00000000
> > [ 3943.438693][  C112] CFAR: 0000000000000000 IRQMASK: 1 
> > [ 3943.438693][  C112] GPR00: c0000000001d1064 c000000e16d1fb20 c0000000014e8200 c000000e092fed3c 
> > [ 3943.438693][  C112] GPR04: c000000e16d1fc58 c000000e092fe3c8 00000000000000e1 fffffffffffe0000 
> > [ 3943.438693][  C112] GPR08: 0000000000000000 00000000000000e1 0000000000000000 c00000000299ccd8 
> > [ 3943.438693][  C112] GPR12: 0000000024088222 c000007ffffb8300 c0000000001bc5b8 c000000deb46f740 
> > [ 3943.438693][  C112] GPR16: 0000000000000008 c000000e092fe280 0000000000000001 c000007ffedd7b00 
> > [ 3943.438693][  C112] GPR20: 0000000000000001 c0000000029a1280 0000000000000000 0000000000000001 
> > [ 3943.438693][  C112] GPR24: 0000000000000000 c000000e092fed3c c000000e16d1fdf0 c00000000299ccd8 
> > [ 3943.438693][  C112] GPR28: c000000e16d1fc58 c0000000021fbf00 c000007ffee6bf00 0000000000000001 
> > [ 3943.438722][  C112] NIP [c0000000010ff01c] _raw_spin_lock_irqsave+0x4c/0xc0
> > [ 3943.438725][  C112] LR [c0000000001d1064] task_rq_lock+0x64/0x1b0
> > [ 3943.438727][  C112] Call Trace:
> > [ 3943.438728][  C112] [c000000e16d1fb20] [c000000e16d1fb60] 0xc000000e16d1fb60 (unreliable)
> > [ 3943.438731][  C112] [c000000e16d1fb50] [c000000e16d1fbf0] 0xc000000e16d1fbf0
> > [ 3943.438733][  C112] [c000000e16d1fbf0] [c0000000001f214c] newidle_balance+0x4bc/0x940
> > [ 3943.438737][  C112] [c000000e16d1fcf0] [c0000000001f26ac] pick_next_task_fair+0x7c/0x680
> > [ 3943.438739][  C112] [c000000e16d1fd90] [c0000000010f175c] __schedule+0x15c/0x1040
> > [ 3943.438743][  C112] [c000000e16d1fe80] [c0000000010f26b4] schedule+0x74/0x140
> > [ 3943.438747][  C112] [c000000e16d1fef0] [c0000000001afd44] worker_thread+0x134/0x580
> > [ 3943.438749][  C112] [c000000e16d1ff90] [c0000000001bc6e8] kthread+0x138/0x140
> > [ 3943.438753][  C112] [c000000e16d1ffe0] [c00000000000ded8] start_kernel_thread+0x14/0x18
> > [ 3943.438756][  C112] Code: 63e90001 992d0932 a12d0008 3ce0fffe 5529083c 61290001 7d001
> > 
> > -----------------------------------------
> > 
> > System configuration:
> > --------------------
> > 
> > # lscpu
> > Architecture:                    ppc64le
> > Byte Order:                      Little Endian
> > CPU(s):                          128
> > On-line CPU(s) list:             0-127
> > Thread(s) per core:              4
> > Core(s) per socket:              16
> > Socket(s):                       2
> > NUMA node(s):                    8
> > Model:                           2.3 (pvr 004e 1203)
> > Model name:                      POWER9 (raw), altivec supported
> > Frequency boost:                 enabled
> > CPU max MHz:                     3800.0000
> > CPU min MHz:                     2300.0000
> > L1d cache:                       1 MiB
> > L1i cache:                       1 MiB
> > NUMA node0 CPU(s):               64-127
> > NUMA node8 CPU(s):               0-63
> > NUMA node250 CPU(s):             
> > NUMA node251 CPU(s):             
> > NUMA node252 CPU(s):             
> > NUMA node253 CPU(s):             
> > NUMA node254 CPU(s):             
> > NUMA node255 CPU(s):             
> > 
> > # uname -r
> > 6.5.0-rc2-00028-g7475adccd76b
> > 
> > # cat /sys/kernel/debug/sched/features
> > GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY
> > CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_HRTICK_DL NO_DOUBLE_TICK
> > NONTASK_CAPACITY TTWU_QUEUE NO_SIS_PROP SIS_UTIL NO_WARN_DOUBLE_CLOCK
> > RT_PUSH_IPI NO_RT_RUNTIME_SHARE NO_LB_MIN ATTACH_AGE_LOAD WA_IDLE
> > WA_WEIGHT WA_BIAS UTIL_EST UTIL_EST_FASTUP NO_LATENCY_WARN ALT_PERIOD
> > BASE_SLICE HZ_BW SHARED_RUNQ
> > 
> > -----------------------------------------
> > 
> > Please let me know if I've missed anything here. I'll continue
> > investigating and share any additional information I find.
> > 
> > Thanks and Regards,
> > Aboorva
> > 


^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2023-12-07  6:02 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-09 22:12 [PATCH v3 0/7] sched: Implement shared runqueue in CFS David Vernet
2023-08-09 22:12 ` [PATCH v3 1/7] sched: Expose move_queued_task() from core.c David Vernet
2023-08-09 22:12 ` [PATCH v3 2/7] sched: Move is_cpu_allowed() into sched.h David Vernet
2023-08-09 22:12 ` [PATCH v3 3/7] sched: Check cpu_active() earlier in newidle_balance() David Vernet
2023-08-09 22:12 ` [PATCH v3 4/7] sched: Enable sched_feat callbacks on enable/disable David Vernet
2023-08-09 22:12 ` [PATCH v3 5/7] sched/fair: Add SHARED_RUNQ sched feature and skeleton calls David Vernet
2023-08-09 22:12 ` [PATCH v3 6/7] sched: Implement shared runqueue in CFS David Vernet
2023-08-10  7:11   ` kernel test robot
2023-08-10  7:41   ` kernel test robot
2023-08-30  6:46   ` K Prateek Nayak
2023-08-31  1:34     ` David Vernet
2023-08-31  3:47       ` K Prateek Nayak
2023-08-09 22:12 ` [PATCH v3 7/7] sched: Shard per-LLC shared runqueues David Vernet
2023-08-09 23:46   ` kernel test robot
2023-08-10  0:12     ` David Vernet
2023-08-10  7:11   ` kernel test robot
2023-08-30  6:17   ` Chen Yu
2023-08-31  0:01     ` David Vernet
2023-08-31 10:45       ` Chen Yu
2023-08-31 19:14         ` David Vernet
2023-09-23  6:35           ` Chen Yu
2023-08-17  8:42 ` [PATCH v3 0/7] sched: Implement shared runqueue in CFS Gautham R. Shenoy
2023-08-18  5:03   ` David Vernet
2023-08-18  8:49     ` Gautham R. Shenoy
2023-08-24 11:14       ` Gautham R. Shenoy
2023-08-24 22:51         ` David Vernet
2023-08-30  9:56           ` K Prateek Nayak
2023-08-31  2:32             ` David Vernet
2023-08-31  4:21               ` K Prateek Nayak
2023-08-31 10:45             ` [RFC PATCH 0/3] DO NOT MERGE: Breaking down the experimantal diff K Prateek Nayak
2023-08-31 10:45               ` [RFC PATCH 1/3] sched/fair: Move SHARED_RUNQ related structs and definitions into sched.h K Prateek Nayak
2023-08-31 10:45               ` [RFC PATCH 2/3] sched/fair: Improve integration of SHARED_RUNQ feature within newidle_balance K Prateek Nayak
2023-08-31 18:45                 ` David Vernet
2023-08-31 19:47                   ` K Prateek Nayak
2023-08-31 10:45               ` [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag K Prateek Nayak
2023-08-31 19:11                 ` David Vernet
2023-08-31 20:23                   ` K Prateek Nayak
2023-09-29 17:01                     ` David Vernet
2023-10-04  4:21                       ` K Prateek Nayak
2023-10-04 17:20                         ` David Vernet
2023-10-05  3:50                           ` K Prateek Nayak
2023-09-27  4:23                   ` K Prateek Nayak
2023-09-27  6:59                     ` Chen Yu
2023-09-27  8:36                       ` K Prateek Nayak
2023-09-28  8:41                         ` Chen Yu
2023-10-03 21:05                       ` David Vernet
2023-10-07  2:10                         ` Chen Yu
2023-09-27 13:08                     ` David Vernet
2023-11-27  8:28 ` [PATCH v3 0/7] sched: Implement shared runqueue in CFS Aboorva Devarajan
2023-11-27 19:49   ` David Vernet
2023-12-07  6:00     ` Aboorva Devarajan
2023-12-04 19:30 ` David Vernet
