* [PATCH v4 0/4] sched/fair: Capacity aware wakeup rework
@ 2020-02-06 19:19 Valentin Schneider
  2020-02-06 19:19 ` [PATCH v4 1/4] sched/fair: Add asymmetric CPU capacity wakeup scan Valentin Schneider
                   ` (4 more replies)
  0 siblings, 5 replies; 22+ messages in thread
From: Valentin Schneider @ 2020-02-06 19:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, peterz, vincent.guittot, dietmar.eggemann,
	morten.rasmussen, qperret, adharmap, pkondeti

This series is about replacing the current wakeup logic for asymmetric CPU
capacity topologies, i.e. wake_cap().

Details are in patch 1; the TL;DR is that wake_cap() works fine for
"legacy" big.LITTLE systems (e.g. Juno), since the Last Level Cache (LLC)
domain of a CPU only spans CPUs of the same capacity, but is somewhat
broken for newer DynamIQ systems (e.g. Dragonboard 845C), since there the
LLC domain of a CPU can span all CPUs in the system. Both example boards
are supported in mainline.

A bit of history
================

The old Energy Model (EM) used until Android Common Kernel v4.14 grafted
itself onto the sched domain hierarchy; because of that, mobile topologies
have been represented with "phantom domains", IOW we'd make a DynamIQ
topology look like a big.LITTLE one:

actual hardware:

  +-------------------+
  |        L3         |
  +----+----+----+----+
  | L2 | L2 | L2 | L2 |
  +----+----+----+----+
  |CPU0|CPU1|CPU2|CPU3|
  +----+----+----+----+
     ^^^^^     ^^^^^
    LITTLEs    bigs

vanilla/mainline topology:

  MC [       ]
      0 1 2 3

phantom domains topology:

  DIE [        ]
  MC  [   ][   ]
       0 1  2 3

With the newer, mainline EM this is no longer required, and wake_cap() is
the last sticking point to getting rid of this legacy crud. More details
and examples are in patch 1.

Notes
=====

This series removes the use of SD_BALANCE_WAKE for asymmetric CPU
capacity topologies (which were the last mainline users of that flag). As
such, it shouldn't come as a surprise that this brings significant
improvements to wake-intensive workloads: wakeups no longer go through the
select_task_rq_fair() slow-path.

Testing
=======

I've picked sysbench --test=threads to mimic Peter's testing mentioned in

  commit 182a85f8a119 ("sched: Disable wakeup balancing")

Sysbench results are the number of events handled in a fixed amount of
time, so higher is better. Hackbench results are the usual completion
time, so lower is better.

Note: the 'X%' stats are the percentiles, so 50% is the 50th percentile.
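
For the curious, here is a minimal userspace sketch of how these stats can
be derived from the raw samples; it assumes linear-interpolation
percentiles, and the sample values are made up:

  #include <stdio.h>
  #include <stdlib.h>

  static int cmp(const void *a, const void *b)
  {
  	double x = *(const double *)a, y = *(const double *)b;
  	return (x > y) - (x < y);
  }

  /* Linearly interpolated percentile over n samples (sorted in place). */
  static double percentile(double *v, int n, double pct)
  {
  	double idx = (pct / 100.0) * (n - 1);
  	int lo = (int)idx;

  	qsort(v, n, sizeof(*v), cmp);
  	if (lo + 1 >= n)
  		return v[n - 1];
  	return v[lo] + (idx - lo) * (v[lo + 1] - v[lo]);
  }

  int main(void)
  {
  	double base[]  = { 0.59, 0.63, 0.62, 0.70, 0.64 }; /* made-up -PATCH runs */
  	double patch[] = { 0.61, 0.59, 0.63, 0.67, 0.62 }; /* made-up +PATCH runs */
  	double b = percentile(base, 5, 50.0);
  	double p = percentile(patch, 5, 50.0);

  	/* DELTA (%) is the relative change of the +PATCH stat vs -PATCH */
  	printf("50%%: %.3f -> %.3f (%+.3f%%)\n", b, p, 100.0 * (p - b) / b);
  	return 0;
  }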

Juno r0 ("legacy" big.LITTLE)
+++++++++++++++++++++++++++++

This is 2 bigs and 4 LITTLEs:

  +---------------+ +-------+
  |      L2       | |  L2   |
  +---+---+---+---+ +---+---+
  | L | L | L | L | | B | B |
  +---+---+---+---+ +---+---+


100 iterations of 'hackbench':

|      |   -PATCH |   +PATCH | DELTA (%) |
|------+----------+----------+-----------|
| mean | 0.631040 | 0.619610 |    -1.811 |
| std  | 0.025486 | 0.015798 |   -38.013 |
| min  | 0.582000 | 0.594000 |    +2.062 |
| 50%  | 0.628500 | 0.617500 |    -1.750 |
| 75%  | 0.645500 | 0.630000 |    -2.401 |
| 99%  | 0.697060 | 0.669030 |    -4.021 |
| max  | 0.703000 | 0.672000 |    -4.410 |

100 iterations of 'sysbench --max-time=5 --max-requests=-1 --test=threads --num-threads=6 run':

|      |       -PATCH |       +PATCH | DELTA (%) |
|------+--------------+--------------+-----------|
| mean | 10267.760000 | 15137.930000 |   +47.432 |
| std  |  3110.439815 |   412.275289 |   -86.745 |
| min  |  7186.000000 | 14061.000000 |   +95.672 |
| 50%  |  9019.500000 | 15255.500000 |   +69.139 |
| 75%  | 12711.000000 | 15472.500000 |   +21.725 |
| 99%  | 15749.290000 | 15683.470000 |    -0.418 |
| max  | 15877.000000 | 15730.000000 |    -0.926 |

Note: you'll notice the results aren't as good as with v3; from playing
around with v4 this seems to come from removing the (broken) capacity_orig
heuristic. 

Pixel3 (DynamIQ)
++++++++++++++++

Ideally I would have used a DB845C but had a few issues with mine, so I
went with a mainline-ish Pixel3 instead [1]. It's still the same SoC under
the hood (Snapdragon 845), which has 4 bigs and 4 LITTLEs:

  +-------------------------------+
  |               L3              |
  +---+---+---+---+---+---+---+---+
  | L2| L2| L2| L2| L2| L2| L2| L2|
  +---+---+---+---+---+---+---+---+
  | L | L | L | L | B | B | B | B |
  +---+---+---+---+---+---+---+---+

Default topology (single MC domain)
-----------------------------------

100 iterations of 'hackbench -l 200'

|      |   -PATCH |   +PATCH | DELTA (%) |
|------+----------+----------+-----------|
| mean | 1.131360 | 1.102560 |    -2.546 |
| std  | 0.116322 | 0.101999 |   -12.313 |
| min  | 0.935000 | 0.935000 |    +0.000 |
| 50%  | 1.099000 | 1.097500 |    -0.136 |
| 75%  | 1.211250 | 1.157750 |    -4.417 |
| 99%  | 1.401020 | 1.338210 |    -4.483 |
| max  | 1.502000 | 1.359000 |    -9.521 |

100 iterations of 'sysbench --max-time=5 --max-requests=-1 --test=threads --num-threads=8 run':

|      |      -PATCH |      +PATCH | DELTA (%) |
|------+-------------+-------------+-----------|
| mean | 7108.310000 | 8731.610000 |   +22.837 |
| std  |  199.431854 |  206.826912 |    +3.708 |
| min  | 6655.000000 | 8251.000000 |   +23.982 |
| 50%  | 7107.500000 | 8705.000000 |   +22.476 |
| 75%  | 7255.500000 | 8868.250000 |   +22.228 |
| 99%  | 7539.540000 | 9155.520000 |   +21.433 |
| max  | 7593.000000 | 9207.000000 |   +21.256 |

Phantom domains (MC + DIE)
--------------------------

This is mostly included for the sake of completeness.

100 iterations of 'sysbench --max-time=5 --max-requests=-1 --test=threads --num-threads=8 run':

|      |      -PATCH |      +PATCH | DELTA (%) |
|------+-------------+-------------+-----------|
| mean | 7317.940000 | 9328.470000 |   +27.474 |
| std  |  460.372682 |  181.528886 |   -60.569 |
| min  | 5888.000000 | 8832.000000 |   +50.000 |
| 50%  | 7271.000000 | 9348.000000 |   +28.566 |
| 75%  | 7497.500000 | 9477.250000 |   +26.405 |
| 99%  | 8464.390000 | 9634.160000 |   +13.820 |
| max  | 8602.000000 | 9650.000000 |   +12.183 |

Revisions
=========

v3 -> v4
--------
o Removed max capacity_orig heuristic (Dietmar)
o (new patch) Removed for_each_lower_domain() (Dietmar)
o Made select_idle_sibling() bail out after going through
  select_idle_capacity() (Pavan)
o Added use of sched_idle_cpu() in select_idle_capacity() (Pavan)
o Corrected the signoff order in patch 1

v2 -> v3
--------
o Added missing sync_entity_load_avg() (Quentin)
o Added fallback CPU selection (maximize capacity)
o Added special case for CPU hogs: task_fits_capacity() will always return 'false'
  for tasks that are simply too big, due to the margin.

v1 -> v2
--------
o Removed unrelated select_idle_core() change

[1]: https://git.linaro.org/people/amit.pundir/linux.git/log/?h=blueline-mainline-tracking

Morten Rasmussen (3):
  sched/fair: Add asymmetric CPU capacity wakeup scan
  sched/topology: Remove SD_BALANCE_WAKE on asymmetric capacity systems
  sched/fair: Kill wake_cap()

Valentin Schneider (1):
  sched: Remove for_each_lower_domain()

 kernel/sched/fair.c     | 86 +++++++++++++++++++++++++++--------------
 kernel/sched/sched.h    |  2 -
 kernel/sched/topology.c | 15 ++-----
 3 files changed, 60 insertions(+), 43 deletions(-)

--
2.24.0



* [PATCH v4 1/4] sched/fair: Add asymmetric CPU capacity wakeup scan
  2020-02-06 19:19 [PATCH v4 0/4] sched/fair: Capacity aware wakeup rework Valentin Schneider
@ 2020-02-06 19:19 ` Valentin Schneider
  2020-02-07  5:08   ` Pavan Kondeti
                     ` (3 more replies)
  2020-02-06 19:19 ` [PATCH v4 2/4] sched/topology: Remove SD_BALANCE_WAKE on asymmetric capacity systems Valentin Schneider
                   ` (3 subsequent siblings)
  4 siblings, 4 replies; 22+ messages in thread
From: Valentin Schneider @ 2020-02-06 19:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, peterz, vincent.guittot, dietmar.eggemann,
	morten.rasmussen, qperret, adharmap, pkondeti

From: Morten Rasmussen <morten.rasmussen@arm.com>

Issue
=====

On asymmetric CPU capacity topologies, we currently rely on wake_cap() to
drive select_task_rq_fair() towards either
- its slow-path (find_idlest_cpu()) if either the previous or
  current (waking) CPU has too little capacity for the waking task
- its fast-path (select_idle_sibling()) otherwise

Commit 3273163c6775 ("sched/fair: Let asymmetric CPU configurations balance
at wake-up") points out that this relies on the assumption that "[...]the
CPU capacities within an SD_SHARE_PKG_RESOURCES domain (sd_llc) are
homogeneous".

This assumption no longer holds on newer generations of big.LITTLE
systems (DynamIQ), which can accommodate CPUs of different compute capacity
within a single LLC domain. To hopefully paint a better picture, a regular
big.LITTLE topology would look like this:

  +---------+ +---------+
  |   L2    | |   L2    |
  +----+----+ +----+----+
  |CPU0|CPU1| |CPU2|CPU3|
  +----+----+ +----+----+
      ^^^         ^^^
    LITTLEs      bigs

which would result in the following scheduler topology:

  DIE [         ] <- sd_asym_cpucapacity
  MC  [   ] [   ] <- sd_llc
       0 1   2 3

Conversely, a DynamIQ topology could look like:

  +-------------------+
  |        L3         |
  +----+----+----+----+
  | L2 | L2 | L2 | L2 |
  +----+----+----+----+
  |CPU0|CPU1|CPU2|CPU3|
  +----+----+----+----+
     ^^^^^     ^^^^^
    LITTLEs    bigs

which would result in the following scheduler topology:

  MC [       ] <- sd_llc, sd_asym_cpucapacity
      0 1 2 3

What this means is that, on DynamIQ systems, we could pass the wake_cap()
test (IOW presume the waking task fits on the CPU capacities of some LLC
domain), thus go through select_idle_sibling().
This function operates on an LLC domain, which here spans both bigs and
LITTLEs, so it could very well pick a CPU of too small capacity for the
task, despite there being fitting idle CPUs - it very much depends on the
CPU iteration order, on which we have absolutely no guarantees
capacity-wise.

Implementation
==============

Introduce yet another select_idle_sibling() helper function that takes CPU
capacity into account. The policy is to pick the first idle CPU which is
big enough for the task (task_util * margin < cpu_capacity). If no
idle CPU is big enough, we pick the idle one with the highest capacity.
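
To make the margin concrete, here's a standalone sketch of the fit test,
modelled on the kernel's fits_capacity() helper of that era (a 1280/1024
~= 1.25 margin); the capacity values below are illustrative:

  #include <stdio.h>

  /* A task fits iff its utilization, inflated by ~1.25, is below capacity. */
  static int task_fits(unsigned long util, unsigned long capacity)
  {
  	return util * 1280 < capacity * 1024;
  }

  int main(void)
  {
  	/* Illustrative capacities: LITTLE ~= 446, big = 1024 */
  	printf("util=300, LITTLE: %d\n", task_fits(300, 446));  /* 1: fits */
  	printf("util=400, LITTLE: %d\n", task_fits(400, 446));  /* 0: too big */
  	printf("util=400, big:    %d\n", task_fits(400, 1024)); /* 1: fits */
  	return 0;
  }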

Unlike other select_idle_sibling() helpers, this one operates on the
sd_asym_cpucapacity sched_domain pointer, which is guaranteed to span all
known CPU capacities in the system. As such, this will work for both
"legacy" big.LITTLE (LITTLEs & bigs split at MC, joined at DIE) and for
newer DynamIQ systems (e.g. LITTLEs and bigs in the same MC domain).

Note that this limits the scope of select_idle_sibling() to
select_idle_capacity() for asymmetric CPU capacity systems - the LLC domain
will not be scanned, and no further heuristic will be applied.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Co-developed-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/fair.c | 56 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fe4e0d7753756..9a5a6e9d2375e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5894,6 +5894,40 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	return cpu;
 }
 
+/*
+ * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
+ * the task fits. If no CPU is big enough, but there are idle ones, try to
+ * maximize capacity.
+ */
+static int
+select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
+{
+	unsigned long best_cap = 0;
+	int cpu, best_cpu = -1;
+	struct cpumask *cpus;
+
+	sync_entity_load_avg(&p->se);
+
+	cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+
+	for_each_cpu_wrap(cpu, cpus, target) {
+		unsigned long cpu_cap = capacity_of(cpu);
+
+		if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
+			continue;
+		if (task_fits_capacity(p, cpu_cap))
+			return cpu;
+
+		if (cpu_cap > best_cap) {
+			best_cap = cpu_cap;
+			best_cpu = cpu;
+		}
+	}
+
+	return best_cpu;
+}
+
 /*
  * Try and locate an idle core/thread in the LLC cache domain.
  */
@@ -5902,6 +5936,28 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	struct sched_domain *sd;
 	int i, recent_used_cpu;
 
+	/*
+	 * For asymmetric CPU capacity systems, our domain of interest is
+	 * sd_asym_cpucapacity rather than sd_llc.
+	 */
+	if (static_branch_unlikely(&sched_asym_cpucapacity)) {
+		sd = rcu_dereference(per_cpu(sd_asym_cpucapacity, target));
+		/*
+		 * On an asymmetric CPU capacity system where an exclusive
+		 * cpuset defines a symmetric island (i.e. one unique
+		 * capacity_orig value through the cpuset), the key will be set
+		 * but the CPUs within that cpuset will not have a domain with
+		 * SD_ASYM_CPUCAPACITY. These should follow the usual symmetric
+		 * capacity path.
+		 */
+		if (!sd)
+			goto symmetric;
+
+		i = select_idle_capacity(p, sd, target);
+		return ((unsigned)i < nr_cpumask_bits) ? i : target;
+	}
+
+symmetric:
 	if (available_idle_cpu(target) || sched_idle_cpu(target))
 		return target;
 
-- 
2.24.0
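
An aside on the return statement in the last hunk above: casting i to
unsigned turns select_idle_capacity()'s -1 "nothing found" value into a
huge number, so a single bounds check covers both the failure case and any
out-of-range CPU. A standalone sketch (the nr_cpumask_bits stand-in value
is illustrative):

  #include <stdio.h>

  #define NR_CPUMASK_BITS 8U	/* illustrative stand-in for nr_cpumask_bits */

  static int pick(int i, int target)
  {
  	/* Same shape as: ((unsigned)i < nr_cpumask_bits) ? i : target */
  	return ((unsigned)i < NR_CPUMASK_BITS) ? i : target;
  }

  int main(void)
  {
  	printf("%d\n", pick(3, 0));	/* valid CPU: returns 3 */
  	printf("%d\n", pick(-1, 0));	/* scan failed: falls back to target, 0 */
  	return 0;
  }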



* [PATCH v4 2/4] sched/topology: Remove SD_BALANCE_WAKE on asymmetric capacity systems
  2020-02-06 19:19 [PATCH v4 0/4] sched/fair: Capacity aware wakeup rework Valentin Schneider
  2020-02-06 19:19 ` [PATCH v4 1/4] sched/fair: Add asymmetric CPU capacity wakeup scan Valentin Schneider
@ 2020-02-06 19:19 ` Valentin Schneider
  2020-02-07 11:03   ` Quentin Perret
                     ` (2 more replies)
  2020-02-06 19:19 ` [PATCH v4 3/4] sched: Remove for_each_lower_domain() Valentin Schneider
                   ` (2 subsequent siblings)
  4 siblings, 3 replies; 22+ messages in thread
From: Valentin Schneider @ 2020-02-06 19:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, peterz, vincent.guittot, dietmar.eggemann,
	morten.rasmussen, qperret, adharmap, pkondeti

From: Morten Rasmussen <morten.rasmussen@arm.com>

SD_BALANCE_WAKE was previously added to lower sched_domain levels on
asymmetric CPU capacity systems by commit 9ee1cda5ee25 ("sched/core: Enable
SD_BALANCE_WAKE for asymmetric capacity systems") to enable the use of
find_idlest_cpu() and friends to find an appropriate CPU for tasks.

That responsibility has now been shifted to select_idle_sibling() and
friends, and hence the flag can be removed. Note that this causes
asymmetric CPU capacity systems to no longer enter the slow wakeup path
(find_idlest_cpu()) on wakeups - only on execs and forks (which is aligned
with all other mainline topologies).

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
[Changelog tweaks]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/topology.c | 15 +++------------
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index dfb64c08a407a..00911884b7e7a 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1374,18 +1374,9 @@ sd_init(struct sched_domain_topology_level *tl,
 	 * Convert topological properties into behaviour.
 	 */
 
-	if (sd->flags & SD_ASYM_CPUCAPACITY) {
-		struct sched_domain *t = sd;
-
-		/*
-		 * Don't attempt to spread across CPUs of different capacities.
-		 */
-		if (sd->child)
-			sd->child->flags &= ~SD_PREFER_SIBLING;
-
-		for_each_lower_domain(t)
-			t->flags |= SD_BALANCE_WAKE;
-	}
+	/* Don't attempt to spread across CPUs of different capacities. */
+	if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
+		sd->child->flags &= ~SD_PREFER_SIBLING;
 
 	if (sd->flags & SD_SHARE_CPUCAPACITY) {
 		sd->imbalance_pct = 110;
-- 
2.24.0



* [PATCH v4 3/4] sched: Remove for_each_lower_domain()
  2020-02-06 19:19 [PATCH v4 0/4] sched/fair: Capacity aware wakeup rework Valentin Schneider
  2020-02-06 19:19 ` [PATCH v4 1/4] sched/fair: Add asymmetric CPU capacity wakeup scan Valentin Schneider
  2020-02-06 19:19 ` [PATCH v4 2/4] sched/topology: Remove SD_BALANCE_WAKE on asymmetric capacity systems Valentin Schneider
@ 2020-02-06 19:19 ` Valentin Schneider
  2020-02-07 11:04   ` Quentin Perret
                     ` (2 more replies)
  2020-02-06 19:19 ` [PATCH v4 4/4] sched/fair: Kill wake_cap() Valentin Schneider
  2020-02-07 10:42 ` [PATCH v4 0/4] sched/fair: Capacity aware wakeup rework Quentin Perret
  4 siblings, 3 replies; 22+ messages in thread
From: Valentin Schneider @ 2020-02-06 19:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, peterz, vincent.guittot, dietmar.eggemann,
	morten.rasmussen, qperret, adharmap, pkondeti

The last remaining user of this macro has just been removed, get rid of
it.

Suggested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/sched.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1a88dc8ad11b7..2f59e5ccd43ea 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1337,8 +1337,6 @@ extern void sched_ttwu_pending(void);
 	for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); \
 			__sd; __sd = __sd->parent)
 
-#define for_each_lower_domain(sd) for (; sd; sd = sd->child)
-
 /**
  * highest_flag_domain - Return highest sched_domain containing flag.
  * @cpu:	The CPU whose highest level of sched domain is to
-- 
2.24.0



* [PATCH v4 4/4] sched/fair: Kill wake_cap()
  2020-02-06 19:19 [PATCH v4 0/4] sched/fair: Capacity aware wakeup rework Valentin Schneider
                   ` (2 preceding siblings ...)
  2020-02-06 19:19 ` [PATCH v4 3/4] sched: Remove for_each_lower_domain() Valentin Schneider
@ 2020-02-06 19:19 ` Valentin Schneider
  2020-02-07 11:19   ` Quentin Perret
                     ` (2 more replies)
  2020-02-07 10:42 ` [PATCH v4 0/4] sched/fair: Capacity aware wakeup rework Quentin Perret
  4 siblings, 3 replies; 22+ messages in thread
From: Valentin Schneider @ 2020-02-06 19:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, peterz, vincent.guittot, dietmar.eggemann,
	morten.rasmussen, qperret, adharmap, pkondeti

From: Morten Rasmussen <morten.rasmussen@arm.com>

Capacity-awareness in the wake-up path previously involved disabling
wake_affine in certain scenarios. We have just made select_idle_sibling()
capacity-aware, so this isn't needed anymore.

Remove wake_cap() entirely.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
[Changelog tweaks]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/fair.c | 30 +-----------------------------
 1 file changed, 1 insertion(+), 29 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9a5a6e9d2375e..9eedf1d6de3a1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6143,33 +6143,6 @@ static unsigned long cpu_util_without(int cpu, struct task_struct *p)
 	return min_t(unsigned long, util, capacity_orig_of(cpu));
 }
 
-/*
- * Disable WAKE_AFFINE in the case where task @p doesn't fit in the
- * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
- *
- * In that case WAKE_AFFINE doesn't make sense and we'll let
- * BALANCE_WAKE sort things out.
- */
-static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
-{
-	long min_cap, max_cap;
-
-	if (!static_branch_unlikely(&sched_asym_cpucapacity))
-		return 0;
-
-	min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu));
-	max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;
-
-	/* Minimum capacity is close to max, no need to abort wake_affine */
-	if (max_cap - min_cap < max_cap >> 3)
-		return 0;
-
-	/* Bring task utilization in sync with prev_cpu */
-	sync_entity_load_avg(&p->se);
-
-	return !task_fits_capacity(p, min_cap);
-}
-
 /*
  * Predicts what cpu_util(@cpu) would return if @p was migrated (and enqueued)
  * to @dst_cpu.
@@ -6434,8 +6407,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 			new_cpu = prev_cpu;
 		}
 
-		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu) &&
-			      cpumask_test_cpu(cpu, p->cpus_ptr);
+		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr);
 	}
 
 	rcu_read_lock();
-- 
2.24.0
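
An aside on the heuristic being deleted above: max_cap >> 3 is max_cap/8,
i.e. wake_affine was kept whenever the smallest capacity of interest was
within 12.5% of the system maximum. A standalone sketch with illustrative
capacity values:

  #include <stdio.h>

  int main(void)
  {
  	long max_cap = 1024;	/* illustrative: biggest CPU in the root domain */
  	long min_cap = 910;	/* illustrative: min(capacity(prev_cpu), capacity(cpu)) */

  	/* Same test as the removed wake_cap(): 1024 - 910 = 114 < 1024/8 = 128 */
  	if (max_cap - min_cap < max_cap >> 3)
  		printf("close to max, no need to abort wake_affine\n");
  	else
  		printf("asymmetric enough, disable wake_affine\n");
  	return 0;
  }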



* Re: [PATCH v4 1/4] sched/fair: Add asymmetric CPU capacity wakeup scan
  2020-02-06 19:19 ` [PATCH v4 1/4] sched/fair: Add asymmetric CPU capacity wakeup scan Valentin Schneider
@ 2020-02-07  5:08   ` Pavan Kondeti
  2020-02-07 10:18     ` Valentin Schneider
  2020-02-07 11:01   ` Quentin Perret
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Pavan Kondeti @ 2020-02-07  5:08 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, mingo, peterz, vincent.guittot, dietmar.eggemann,
	morten.rasmussen, qperret, adharmap

On Thu, Feb 06, 2020 at 07:19:54PM +0000, Valentin Schneider wrote:
> From: Morten Rasmussen <morten.rasmussen@arm.com>
> 

<snip>

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fe4e0d7753756..9a5a6e9d2375e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5894,6 +5894,40 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>  	return cpu;
>  }
>  
> +/*
> + * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
> + * the task fits. If no CPU is big enough, but there are idle ones, try to
> + * maximize capacity.
> + */
> +static int
> +select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> +{
> +	unsigned long best_cap = 0;
> +	int cpu, best_cpu = -1;
> +	struct cpumask *cpus;
> +
> +	sync_entity_load_avg(&p->se);
> +
> +	cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> +	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> +
> +	for_each_cpu_wrap(cpu, cpus, target) {
> +		unsigned long cpu_cap = capacity_of(cpu);
> +
> +		if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
> +			continue;
> +		if (task_fits_capacity(p, cpu_cap))
> +			return cpu;
> +
> +		if (cpu_cap > best_cap) {
> +			best_cap = cpu_cap;
> +			best_cpu = cpu;
> +		}
> +	}
> +
> +	return best_cpu;
> +}
> +
>  /*
>   * Try and locate an idle core/thread in the LLC cache domain.
>   */
> @@ -5902,6 +5936,28 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>  	struct sched_domain *sd;
>  	int i, recent_used_cpu;
>  
> +	/*
> +	 * For asymmetric CPU capacity systems, our domain of interest is
> +	 * sd_asym_cpucapacity rather than sd_llc.
> +	 */
> +	if (static_branch_unlikely(&sched_asym_cpucapacity)) {
> +		sd = rcu_dereference(per_cpu(sd_asym_cpucapacity, target));
> +		/*
> +		 * On an asymmetric CPU capacity system where an exclusive
> +		 * cpuset defines a symmetric island (i.e. one unique
> +		 * capacity_orig value through the cpuset), the key will be set
> +		 * but the CPUs within that cpuset will not have a domain with
> +		 * SD_ASYM_CPUCAPACITY. These should follow the usual symmetric
> +		 * capacity path.
> +		 */
> +		if (!sd)
> +			goto symmetric;
> +
> +		i = select_idle_capacity(p, sd, target);
> +		return ((unsigned)i < nr_cpumask_bits) ? i : target;
> +	}
> +
> +symmetric:
>  	if (available_idle_cpu(target) || sched_idle_cpu(target))
>  		return target;
>  
> -- 
> 2.24.0
> 

Looks good to me.

Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.


* Re: [PATCH v4 1/4] sched/fair: Add asymmetric CPU capacity wakeup scan
  2020-02-07  5:08   ` Pavan Kondeti
@ 2020-02-07 10:18     ` Valentin Schneider
  0 siblings, 0 replies; 22+ messages in thread
From: Valentin Schneider @ 2020-02-07 10:18 UTC (permalink / raw)
  To: Pavan Kondeti
  Cc: linux-kernel, mingo, peterz, vincent.guittot, dietmar.eggemann,
	morten.rasmussen, qperret, adharmap

Hi Pavan,

On 07/02/2020 05:08, Pavan Kondeti wrote:
> 
> Looks good to me.
> 

Thanks for having a look!

> Thanks,
> Pavan
> 


* Re: [PATCH v4 0/4] sched/fair: Capacity aware wakeup rework
  2020-02-06 19:19 [PATCH v4 0/4] sched/fair: Capacity aware wakeup rework Valentin Schneider
                   ` (3 preceding siblings ...)
  2020-02-06 19:19 ` [PATCH v4 4/4] sched/fair: Kill wake_cap() Valentin Schneider
@ 2020-02-07 10:42 ` Quentin Perret
  2020-02-07 12:41   ` Valentin Schneider
  4 siblings, 1 reply; 22+ messages in thread
From: Quentin Perret @ 2020-02-07 10:42 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, mingo, peterz, vincent.guittot, dietmar.eggemann,
	morten.rasmussen, adharmap, pkondeti

On Thursday 06 Feb 2020 at 19:19:53 (+0000), Valentin Schneider wrote:
> Pixel3 (DynamIQ)
> ++++++++++++++++
> 
> Ideally I would have used a DB845C but had a few issues with mine, so I
> went with a mainline-ish Pixel3 instead [1]. It's still the same SoC under
> the hood (Snapdragon 845), which has 4 bigs and 4 LITTLEs:
> 
>   +-------------------------------+
>   |               L3              |
>   +---+---+---+---+---+---+---+---+
>   | L2| L2| L2| L2| L2| L2| L2| L2|
>   +---+---+---+---+---+---+---+---+
>   | L | L | L | L | B | B | B | B |
>   +---+---+---+---+---+---+---+---+
> 
> Default topology (single MC domain)
> -----------------------------------
> 
> 100 iterations of 'hackbench -l 200'
> 
> |      |   -PATCH |   +PATCH | DELTA (%) |
> |------+----------+----------+-----------|
> | mean | 1.131360 | 1.102560 |    -2.546 |
> | std  | 0.116322 | 0.101999 |   -12.313 |
> | min  | 0.935000 | 0.935000 |    +0.000 |
> | 50%  | 1.099000 | 1.097500 |    -0.136 |
> | 75%  | 1.211250 | 1.157750 |    -4.417 |
> | 99%  | 1.401020 | 1.338210 |    -4.483 |
> | max  | 1.502000 | 1.359000 |    -9.521 |
> 
> 100 iterations of 'sysbench --max-time=5 --max-requests=-1 --test=threads --num-threads=8 run':
> 
> |      |      -PATCH |      +PATCH | DELTA (%) |
> |------+-------------+-------------+-----------|
> | mean | 7108.310000 | 8731.610000 |   +22.837 |
> | std  |  199.431854 |  206.826912 |    +3.708 |
> | min  | 6655.000000 | 8251.000000 |   +23.982 |
> | 50%  | 7107.500000 | 8705.000000 |   +22.476 |
> | 75%  | 7255.500000 | 8868.250000 |   +22.228 |
> | 99%  | 7539.540000 | 9155.520000 |   +21.433 |
> | max  | 7593.000000 | 9207.000000 |   +21.256 |
> 
> Phantom domains (MC + DIE)
> --------------------------
> 
> This is mostly included for the sake of completeness.
> 
> 100 iterations of 'sysbench --max-time=5 --max-requests=-1 --test=threads --num-threads=8 run':
> 
> |      |      -PATCH |      +PATCH | DELTA (%) |
> |------+-------------+-------------+-----------|
> | mean | 7317.940000 | 9328.470000 |   +27.474 |
> | std  |  460.372682 |  181.528886 |   -60.569 |
> | min  | 5888.000000 | 8832.000000 |   +50.000 |
> | 50%  | 7271.000000 | 9348.000000 |   +28.566 |
> | 75%  | 7497.500000 | 9477.250000 |   +26.405 |
> | 99%  | 8464.390000 | 9634.160000 |   +13.820 |
> | max  | 8602.000000 | 9650.000000 |   +12.183 |


So, it feels like the most interesting test would be

 'baseline w/ phantom domains' vs 'this patch w/o phantom domains'

right? The 'baseline w/o phantom domains' case is arguably borked today,
so it isn't that interesting (even though it performs well for the
particular workload you chose here, as expected; I guess you might see
issues in others).

So, IIUC, based on your results above, that would be:

|      |     base+PD |  patch+noPD | DELTA (%) |
|------+-------------+-------------+-----------|
| mean | 7317.940000 | 8731.610000 |   +19.318 |
| std  |  460.372682 |  206.826912 |   -55.074 |
| min  | 5888.000000 | 8251.000000 |   +40.132 |
| 50%  | 7271.000000 | 8705.000000 |   +19.722 |
| 75%  | 7497.500000 | 8868.250000 |   +18.283 |
| 99%  | 8464.390000 | 9155.520000 |    +8.165 |
| max  | 8602.000000 | 9207.000000 |    +7.033 |

Is that correct?

If so, this patch series is still a very big win, and I'm all for
getting it merged. But I find it interesting that the results aren't as
good as having this patch _and_ phantom domains at the same time ...

Any idea why having phantom domains helps? select_idle_capacity()
should behave the same w/ or w/o phantom domains given that you use
sd_asym_cpucapacity directly. I'm guessing something else has an impact
here? LB / misfit behaving a bit differently perhaps?

Thanks,
Quentin


* Re: [PATCH v4 1/4] sched/fair: Add asymmetric CPU capacity wakeup scan
  2020-02-06 19:19 ` [PATCH v4 1/4] sched/fair: Add asymmetric CPU capacity wakeup scan Valentin Schneider
  2020-02-07  5:08   ` Pavan Kondeti
@ 2020-02-07 11:01   ` Quentin Perret
  2020-02-11 12:47   ` [tip: sched/core] " tip-bot2 for Morten Rasmussen
  2020-02-20 20:09   ` tip-bot2 for Morten Rasmussen
  3 siblings, 0 replies; 22+ messages in thread
From: Quentin Perret @ 2020-02-07 11:01 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, mingo, peterz, vincent.guittot, dietmar.eggemann,
	morten.rasmussen, adharmap, pkondeti

On Thursday 06 Feb 2020 at 19:19:54 (+0000), Valentin Schneider wrote:
> From: Morten Rasmussen <morten.rasmussen@arm.com>
> 
> Issue
> =====
> 
> On asymmetric CPU capacity topologies, we currently rely on wake_cap() to
> drive select_task_rq_fair() towards either
> - its slow-path (find_idlest_cpu()) if either the previous or
>   current (waking) CPU has too little capacity for the waking task
> - its fast-path (select_idle_sibling()) otherwise
> 
> Commit 3273163c6775 ("sched/fair: Let asymmetric CPU configurations balance
> at wake-up") points out that this relies on the assumption that "[...]the
> CPU capacities within an SD_SHARE_PKG_RESOURCES domain (sd_llc) are
> homogeneous".
> 
> This assumption no longer holds on newer generations of big.LITTLE
> systems (DynamIQ), which can accommodate CPUs of different compute capacity
> within a single LLC domain. To hopefully paint a better picture, a regular
> big.LITTLE topology would look like this:
> 
>   +---------+ +---------+
>   |   L2    | |   L2    |
>   +----+----+ +----+----+
>   |CPU0|CPU1| |CPU2|CPU3|
>   +----+----+ +----+----+
>       ^^^         ^^^
>     LITTLEs      bigs
> 
> which would result in the following scheduler topology:
> 
>   DIE [         ] <- sd_asym_cpucapacity
>   MC  [   ] [   ] <- sd_llc
>        0 1   2 3
> 
> Conversely, a DynamIQ topology could look like:
> 
>   +-------------------+
>   |        L3         |
>   +----+----+----+----+
>   | L2 | L2 | L2 | L2 |
>   +----+----+----+----+
>   |CPU0|CPU1|CPU2|CPU3|
>   +----+----+----+----+
>      ^^^^^     ^^^^^
>     LITTLEs    bigs
> 
> which would result in the following scheduler topology:
> 
>   MC [       ] <- sd_llc, sd_asym_cpucapacity
>       0 1 2 3
> 
> What this means is that, on DynamIQ systems, we could pass the wake_cap()
> test (IOW presume the waking task fits on the CPU capacities of some LLC
> domain), thus go through select_idle_sibling().
> This function operates on an LLC domain, which here spans both bigs and
> LITTLEs, so it could very well pick a CPU of too small capacity for the
> task, despite there being fitting idle CPUs - it very much depends on the
> CPU iteration order, on which we have absolutely no guarantees
> capacity-wise.
> 
> Implementation
> ==============
> 
> Introduce yet another select_idle_sibling() helper function that takes CPU
> capacity into account. The policy is to pick the first idle CPU which is
> big enough for the task (task_util * margin < cpu_capacity). If no
> idle CPU is big enough, we pick the idle one with the highest capacity.
> 
> Unlike other select_idle_sibling() helpers, this one operates on the
> sd_asym_cpucapacity sched_domain pointer, which is guaranteed to span all
> known CPU capacities in the system. As such, this will work for both
> "legacy" big.LITTLE (LITTLEs & bigs split at MC, joined at DIE) and for
> newer DynamIQ systems (e.g. LITTLEs and bigs in the same MC domain).
> 
> Note that this limits the scope of select_idle_sibling() to
> select_idle_capacity() for asymmetric CPU capacity systems - the LLC domain
> will not be scanned, and no further heuristic will be applied.
> 
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> Co-developed-by: Valentin Schneider <valentin.schneider@arm.com>
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>

Reviewed-by: Quentin Perret <qperret@google.com>

Thanks,
Quentin


* Re: [PATCH v4 2/4] sched/topology: Remove SD_BALANCE_WAKE on asymmetric capacity systems
  2020-02-06 19:19 ` [PATCH v4 2/4] sched/topology: Remove SD_BALANCE_WAKE on asymmetric capacity systems Valentin Schneider
@ 2020-02-07 11:03   ` Quentin Perret
  2020-02-11 12:47   ` [tip: sched/core] " tip-bot2 for Morten Rasmussen
  2020-02-20 20:09   ` tip-bot2 for Morten Rasmussen
  2 siblings, 0 replies; 22+ messages in thread
From: Quentin Perret @ 2020-02-07 11:03 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, mingo, peterz, vincent.guittot, dietmar.eggemann,
	morten.rasmussen, adharmap, pkondeti

On Thursday 06 Feb 2020 at 19:19:55 (+0000), Valentin Schneider wrote:
> From: Morten Rasmussen <morten.rasmussen@arm.com>
> 
> SD_BALANCE_WAKE was previously added to lower sched_domain levels on
> asymmetric CPU capacity systems by commit 9ee1cda5ee25 ("sched/core: Enable
> SD_BALANCE_WAKE for asymmetric capacity systems") to enable the use of
> find_idlest_cpu() and friends to find an appropriate CPU for tasks.
> 
> That responsibility has now been shifted to select_idle_sibling() and
> friends, and hence the flag can be removed. Note that this causes
> asymmetric CPU capacity systems to no longer enter the slow wakeup path
> (find_idlest_cpu()) on wakeups - only on execs and forks (which is aligned
> with all other mainline topologies).
> 
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> [Changelog tweaks]
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>

Reviewed-by: Quentin Perret <qperret@google.com>

Thanks,
Quentin


* Re: [PATCH v4 3/4] sched: Remove for_each_lower_domain()
  2020-02-06 19:19 ` [PATCH v4 3/4] sched: Remove for_each_lower_domain() Valentin Schneider
@ 2020-02-07 11:04   ` Quentin Perret
  2020-02-11 12:47   ` [tip: sched/core] sched/core: " tip-bot2 for Valentin Schneider
  2020-02-20 20:09   ` tip-bot2 for Valentin Schneider
  2 siblings, 0 replies; 22+ messages in thread
From: Quentin Perret @ 2020-02-07 11:04 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, mingo, peterz, vincent.guittot, dietmar.eggemann,
	morten.rasmussen, adharmap, pkondeti

On Thursday 06 Feb 2020 at 19:19:56 (+0000), Valentin Schneider wrote:
> The last remaining user of this macro has just been removed, get rid of
> it.
> 
> Suggested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>

Reviewed-by: Quentin Perret <qperret@google.com>

Thanks,
Quentin


* Re: [PATCH v4 4/4] sched/fair: Kill wake_cap()
  2020-02-06 19:19 ` [PATCH v4 4/4] sched/fair: Kill wake_cap() Valentin Schneider
@ 2020-02-07 11:19   ` Quentin Perret
  2020-02-07 12:48     ` Valentin Schneider
  2020-02-11 12:47   ` [tip: sched/core] sched/fair: Remove wake_cap() tip-bot2 for Morten Rasmussen
  2020-02-20 20:09   ` tip-bot2 for Morten Rasmussen
  2 siblings, 1 reply; 22+ messages in thread
From: Quentin Perret @ 2020-02-07 11:19 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, mingo, peterz, vincent.guittot, dietmar.eggemann,
	morten.rasmussen, adharmap, pkondeti

On Thursday 06 Feb 2020 at 19:19:57 (+0000), Valentin Schneider wrote:
> From: Morten Rasmussen <morten.rasmussen@arm.com>
> 
> Capacity-awareness in the wake-up path previously involved disabling
> wake_affine in certain scenarios. We have just made select_idle_sibling()
> capacity-aware, so this isn't needed anymore.
> 
> Remove wake_cap() entirely.
> 
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> [Changelog tweaks]
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>

Reviewed-by: Quentin Perret <qperret@google.com>

I wanted to suggest removing the CA code from update_sg_wakeup_stats()
which is now called only on fork/exec, but I suppose we still want it
for util_clamp, so n/m.

Thanks for the series,
Quentin


* Re: [PATCH v4 0/4] sched/fair: Capacity aware wakeup rework
  2020-02-07 10:42 ` [PATCH v4 0/4] sched/fair: Capacity aware wakeup rework Quentin Perret
@ 2020-02-07 12:41   ` Valentin Schneider
  0 siblings, 0 replies; 22+ messages in thread
From: Valentin Schneider @ 2020-02-07 12:41 UTC (permalink / raw)
  To: Quentin Perret
  Cc: linux-kernel, mingo, peterz, vincent.guittot, dietmar.eggemann,
	morten.rasmussen, adharmap, pkondeti

On 07/02/2020 10:42, Quentin Perret wrote:
>> Phantom domains (MC + DIE)
>> --------------------------
>>
>> This is mostly included for the sake of completeness.
>>
>> 100 iterations of 'sysbench --max-time=5 --max-requests=-1 --test=threads --num-threads=8 run':
>>
>> |      |      -PATCH |      +PATCH | DELTA (%) |
>> |------+-------------+-------------+-----------|
>> | mean | 7317.940000 | 9328.470000 |   +27.474 |
>> | std  |  460.372682 |  181.528886 |   -60.569 |
>> | min  | 5888.000000 | 8832.000000 |   +50.000 |
>> | 50%  | 7271.000000 | 9348.000000 |   +28.566 |
>> | 75%  | 7497.500000 | 9477.250000 |   +26.405 |
>> | 99%  | 8464.390000 | 9634.160000 |   +13.820 |
>> | max  | 8602.000000 | 9650.000000 |   +12.183 |
> 
> 
> So, it feels like the most interesting test would be
> 
>  'baseline w/ phantom domains' vs 'this patch w/o phantom domains'
> 
> right ? The 'baseline w/o phantom domains' case is arguably borked today,
> so it isn't that interesting (even though it performs well for the
> particular workload you choose here, as expected, but I guess you might
> see issues in others).
> 
> So, IIUC, based on your results above, that would be:
> 
> |      |     base+PD |  patch+noPD | DELTA (%) |
> |------+-------------+-------------+-----------|
> | mean | 7317.940000 | 8731.610000 |   +19.318 |
> | std  |  460.372682 |  206.826912 |   -55.074 |
> | min  | 5888.000000 | 8251.000000 |   +40.132 |
> | 50%  | 7271.000000 | 8705.000000 |   +19.722 |
> | 75%  | 7497.500000 | 8868.250000 |   +18.283 |
> | 99%  | 8464.390000 | 9155.520000 |    +8.165 |
> | max  | 8602.000000 | 9207.000000 |    +7.033 |
> 
> Is that correct ?
> 

That does look like it!

> If so, this patch series is still a very big win, and I'm all for
> getting it merged. But I find it interesting that the results aren't as
> good as having this patch _and_ phantom domains at the same time ...
> 
> Any idea why having phantom domains helps ? select_idle_capacity()
> should behave the same w/ or w/o phantom domains given that you use
> sd_asym_cpucapacity directly.

My thoughts as well.

> I'm guessing something else has an impact
> here ? LB / misfit behaving a bit differently perhaps ?
> 

Mm so phantom domains further restrict which CPUs can pull during LB (see
should_we_balance()). On a "flat" topology with just an MC domain, any
CPU is free to pull from any other CPU at any time. With phantom domains,
LB isn't restricted at MC (i.e. within a group of equal capacities), but
is at DIE, so we'll bail out more often there.

So even though we go through load balance more often with phantom domains
(here it would be every 4 jiffies for MC, 8 for DIE), we may end up bailing
out more often than without phantom domains. Then again, there barely is
more than one task per CPU in the sysbench case, so we don't have a very
expensive task scan.

Also, I don't think it plays a role here, but phantom domains are more
conservative in terms of up/downmigration.
Since you balance first at MC level, if you have an imbalance you will try
to solve it within CPUs of same capacity value. Without phantom domains,
you just have a single domain that spans everything, so you could move a
small task to a big CPU, despite there being spare LITTLEs.

Investigating this some more is dangling somewhere on my todo-list.

> Thanks,
> Quentin
> 


* Re: [PATCH v4 4/4] sched/fair: Kill wake_cap()
  2020-02-07 11:19   ` Quentin Perret
@ 2020-02-07 12:48     ` Valentin Schneider
  0 siblings, 0 replies; 22+ messages in thread
From: Valentin Schneider @ 2020-02-07 12:48 UTC (permalink / raw)
  To: Quentin Perret
  Cc: linux-kernel, mingo, peterz, vincent.guittot, dietmar.eggemann,
	morten.rasmussen, adharmap, pkondeti



On 07/02/2020 11:19, Quentin Perret wrote:
> On Thursday 06 Feb 2020 at 19:19:57 (+0000), Valentin Schneider wrote:
>> From: Morten Rasmussen <morten.rasmussen@arm.com>
>>
>> Capacity-awareness in the wake-up path previously involved disabling
>> wake_affine in certain scenarios. We have just made select_idle_sibling()
>> capacity-aware, so this isn't needed anymore.
>>
>> Remove wake_cap() entirely.
>>
>> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
>> [Changelog tweaks]
>> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> 
> Reviewed-by: Quentin Perret <qperret@google.com>
> 
> I wanted to suggest removing the CA code from update_sg_wakeup_stats()
> which is now called only on fork/exec, but I suppose we still want it
> for util_clamp, so n/m.
> 

Good point, I hadn't thought about this. As you say we probably want to
keep it since we can fork/exec into a cgroup that has uclamp values already
set up, and that would drive task_fits_capacity().

> Thanks for the series,

Thanks for the review!

> Quentin
> 


* [tip: sched/core] sched/core: Remove for_each_lower_domain()
  2020-02-06 19:19 ` [PATCH v4 3/4] sched: Remove for_each_lower_domain() Valentin Schneider
  2020-02-07 11:04   ` Quentin Perret
@ 2020-02-11 12:47   ` tip-bot2 for Valentin Schneider
  2020-02-20 20:09   ` tip-bot2 for Valentin Schneider
  2 siblings, 0 replies; 22+ messages in thread
From: tip-bot2 for Valentin Schneider @ 2020-02-11 12:47 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Dietmar Eggemann, Valentin Schneider, Peter Zijlstra (Intel),
	Ingo Molnar, Quentin Perret, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     afbcf99785a580e2cc089646e971deceb31a18b1
Gitweb:        https://git.kernel.org/tip/afbcf99785a580e2cc089646e971deceb31a18b1
Author:        Valentin Schneider <valentin.schneider@arm.com>
AuthorDate:    Thu, 06 Feb 2020 19:19:56 
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 11 Feb 2020 13:03:35 +01:00

sched/core: Remove for_each_lower_domain()

The last remaining user of this macro has just been removed, get rid of it.

Suggested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Link: https://lkml.kernel.org/r/20200206191957.12325-4-valentin.schneider@arm.com
---
 kernel/sched/sched.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0844e81..878910e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1337,8 +1337,6 @@ extern void sched_ttwu_pending(void);
 	for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); \
 			__sd; __sd = __sd->parent)
 
-#define for_each_lower_domain(sd) for (; sd; sd = sd->child)
-
 /**
  * highest_flag_domain - Return highest sched_domain containing flag.
  * @cpu:	The CPU whose highest level of sched domain is to


* [tip: sched/core] sched/fair: Remove wake_cap()
  2020-02-06 19:19 ` [PATCH v4 4/4] sched/fair: Kill wake_cap() Valentin Schneider
  2020-02-07 11:19   ` Quentin Perret
@ 2020-02-11 12:47   ` tip-bot2 for Morten Rasmussen
  2020-02-20 20:09   ` tip-bot2 for Morten Rasmussen
  2 siblings, 0 replies; 22+ messages in thread
From: tip-bot2 for Morten Rasmussen @ 2020-02-11 12:47 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Morten Rasmussen, Valentin Schneider, Peter Zijlstra (Intel),
	Ingo Molnar, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     25ac227a25ac946e0356772012398cd1710a8bab
Gitweb:        https://git.kernel.org/tip/25ac227a25ac946e0356772012398cd1710a8bab
Author:        Morten Rasmussen <morten.rasmussen@arm.com>
AuthorDate:    Thu, 06 Feb 2020 19:19:57 
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 11 Feb 2020 13:04:16 +01:00

sched/fair: Remove wake_cap()

Capacity-awareness in the wake-up path previously involved disabling
wake_affine in certain scenarios. We have just made select_idle_sibling()
capacity-aware, so this isn't needed anymore.

Remove wake_cap() entirely.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
[Changelog tweaks]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[Changelog tweaks]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200206191957.12325-5-valentin.schneider@arm.com
---
 kernel/sched/fair.c | 30 +-----------------------------
 1 file changed, 1 insertion(+), 29 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6fb47a2..a7e11b1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6146,33 +6146,6 @@ static unsigned long cpu_util_without(int cpu, struct task_struct *p)
 }
 
 /*
- * Disable WAKE_AFFINE in the case where task @p doesn't fit in the
- * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
- *
- * In that case WAKE_AFFINE doesn't make sense and we'll let
- * BALANCE_WAKE sort things out.
- */
-static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
-{
-	long min_cap, max_cap;
-
-	if (!static_branch_unlikely(&sched_asym_cpucapacity))
-		return 0;
-
-	min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu));
-	max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;
-
-	/* Minimum capacity is close to max, no need to abort wake_affine */
-	if (max_cap - min_cap < max_cap >> 3)
-		return 0;
-
-	/* Bring task utilization in sync with prev_cpu */
-	sync_entity_load_avg(&p->se);
-
-	return !task_fits_capacity(p, min_cap);
-}
-
-/*
  * Predicts what cpu_util(@cpu) would return if @p was migrated (and enqueued)
  * to @dst_cpu.
  */
@@ -6436,8 +6409,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 			new_cpu = prev_cpu;
 		}
 
-		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu) &&
-			      cpumask_test_cpu(cpu, p->cpus_ptr);
+		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr);
 	}
 
 	rcu_read_lock();


* [tip: sched/core] sched/fair: Add asymmetric CPU capacity wakeup scan
  2020-02-06 19:19 ` [PATCH v4 1/4] sched/fair: Add asymmetric CPU capacity wakeup scan Valentin Schneider
  2020-02-07  5:08   ` Pavan Kondeti
  2020-02-07 11:01   ` Quentin Perret
@ 2020-02-11 12:47   ` tip-bot2 for Morten Rasmussen
  2020-02-20 20:09   ` tip-bot2 for Morten Rasmussen
  3 siblings, 0 replies; 22+ messages in thread
From: tip-bot2 for Morten Rasmussen @ 2020-02-11 12:47 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Morten Rasmussen, Valentin Schneider, Peter Zijlstra (Intel),
	Ingo Molnar, Quentin Perret, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     913c310c8e8abb6a9eb8b3c8bfc33bd1dddded04
Gitweb:        https://git.kernel.org/tip/913c310c8e8abb6a9eb8b3c8bfc33bd1dddded04
Author:        Morten Rasmussen <morten.rasmussen@arm.com>
AuthorDate:    Thu, 06 Feb 2020 19:19:54 
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 11 Feb 2020 13:01:43 +01:00

sched/fair: Add asymmetric CPU capacity wakeup scan

Issue
=====

On asymmetric CPU capacity topologies, we currently rely on wake_cap() to
drive select_task_rq_fair() towards either:

- its slow-path (find_idlest_cpu()) if either the previous or
  current (waking) CPU has too little capacity for the waking task
- its fast-path (select_idle_sibling()) otherwise

Commit:

  3273163c6775 ("sched/fair: Let asymmetric CPU configurations balance at wake-up")

points out that this relies on the assumption that "[...]the CPU capacities
within an SD_SHARE_PKG_RESOURCES domain (sd_llc) are homogeneous".

This assumption no longer holds on newer generations of big.LITTLE
systems (DynamIQ), which can accommodate CPUs of different compute capacity
within a single LLC domain. To hopefully paint a better picture, a regular
big.LITTLE topology would look like this:

  +---------+ +---------+
  |   L2    | |   L2    |
  +----+----+ +----+----+
  |CPU0|CPU1| |CPU2|CPU3|
  +----+----+ +----+----+
      ^^^         ^^^
    LITTLEs      bigs

which would result in the following scheduler topology:

  DIE [         ] <- sd_asym_cpucapacity
  MC  [   ] [   ] <- sd_llc
       0 1   2 3

Conversely, a DynamIQ topology could look like:

  +-------------------+
  |        L3         |
  +----+----+----+----+
  | L2 | L2 | L2 | L2 |
  +----+----+----+----+
  |CPU0|CPU1|CPU2|CPU3|
  +----+----+----+----+
     ^^^^^     ^^^^^
    LITTLEs    bigs

which would result in the following scheduler topology:

  MC [       ] <- sd_llc, sd_asym_cpucapacity
      0 1 2 3

What this means is that, on DynamIQ systems, we could pass the wake_cap()
test (IOW presume the waking task fits on the CPU capacities of some LLC
domain), thus go through select_idle_sibling().
This function operates on an LLC domain, which here spans both bigs and
LITTLEs, so it could very well pick a CPU of too small capacity for the
task, despite there being fitting idle CPUs - it very much depends on the
CPU iteration order, on which we have absolutely no guarantees
capacity-wise.

Implementation
==============

Introduce yet another select_idle_sibling() helper function that takes CPU
capacity into account. The policy is to pick the first idle CPU which is
big enough for the task (task_util * margin < cpu_capacity). If no
idle CPU is big enough, we pick the idle one with the highest capacity.

Unlike other select_idle_sibling() helpers, this one operates on the
sd_asym_cpucapacity sched_domain pointer, which is guaranteed to span all
known CPU capacities in the system. As such, this will work for both
"legacy" big.LITTLE (LITTLEs & bigs split at MC, joined at DIE) and for
newer DynamIQ systems (e.g. LITTLEs and bigs in the same MC domain).

Note that this limits the scope of select_idle_sibling() to
select_idle_capacity() for asymmetric CPU capacity systems - the LLC domain
will not be scanned, and no further heuristic will be applied.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Link: https://lkml.kernel.org/r/20200206191957.12325-2-valentin.schneider@arm.com
---
 kernel/sched/fair.c | 56 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 56 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1a0ce83..6fb47a2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5897,6 +5897,40 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 }
 
 /*
+ * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
+ * the task fits. If no CPU is big enough, but there are idle ones, try to
+ * maximize capacity.
+ */
+static int
+select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
+{
+	unsigned long best_cap = 0;
+	int cpu, best_cpu = -1;
+	struct cpumask *cpus;
+
+	sync_entity_load_avg(&p->se);
+
+	cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+
+	for_each_cpu_wrap(cpu, cpus, target) {
+		unsigned long cpu_cap = capacity_of(cpu);
+
+		if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
+			continue;
+		if (task_fits_capacity(p, cpu_cap))
+			return cpu;
+
+		if (cpu_cap > best_cap) {
+			best_cap = cpu_cap;
+			best_cpu = cpu;
+		}
+	}
+
+	return best_cpu;
+}
+
+/*
  * Try and locate an idle core/thread in the LLC cache domain.
  */
 static int select_idle_sibling(struct task_struct *p, int prev, int target)
@@ -5904,6 +5938,28 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	struct sched_domain *sd;
 	int i, recent_used_cpu;
 
+	/*
+	 * For asymmetric CPU capacity systems, our domain of interest is
+	 * sd_asym_cpucapacity rather than sd_llc.
+	 */
+	if (static_branch_unlikely(&sched_asym_cpucapacity)) {
+		sd = rcu_dereference(per_cpu(sd_asym_cpucapacity, target));
+		/*
+		 * On an asymmetric CPU capacity system where an exclusive
+		 * cpuset defines a symmetric island (i.e. one unique
+		 * capacity_orig value through the cpuset), the key will be set
+		 * but the CPUs within that cpuset will not have a domain with
+		 * SD_ASYM_CPUCAPACITY. These should follow the usual symmetric
+		 * capacity path.
+		 */
+		if (!sd)
+			goto symmetric;
+
+		i = select_idle_capacity(p, sd, target);
+		return ((unsigned)i < nr_cpumask_bits) ? i : target;
+	}
+
+symmetric:
 	if (available_idle_cpu(target) || sched_idle_cpu(target))
 		return target;
 


* [tip: sched/core] sched/topology: Remove SD_BALANCE_WAKE on asymmetric capacity systems
  2020-02-06 19:19 ` [PATCH v4 2/4] sched/topology: Remove SD_BALANCE_WAKE on asymmetric capacity systems Valentin Schneider
  2020-02-07 11:03   ` Quentin Perret
@ 2020-02-11 12:47   ` tip-bot2 for Morten Rasmussen
  2020-02-20 20:09   ` tip-bot2 for Morten Rasmussen
  2 siblings, 0 replies; 22+ messages in thread
From: tip-bot2 for Morten Rasmussen @ 2020-02-11 12:47 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Morten Rasmussen, Valentin Schneider, Peter Zijlstra (Intel),
	Ingo Molnar, Quentin Perret, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     38c6e4963b509c47c33539534b84195558c0376d
Gitweb:        https://git.kernel.org/tip/38c6e4963b509c47c33539534b84195558c0376d
Author:        Morten Rasmussen <morten.rasmussen@arm.com>
AuthorDate:    Thu, 06 Feb 2020 19:19:55 
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 11 Feb 2020 13:02:50 +01:00

sched/topology: Remove SD_BALANCE_WAKE on asymmetric capacity systems

SD_BALANCE_WAKE was previously added to lower sched_domain levels on
asymmetric CPU capacity systems by commit:

  9ee1cda5ee25 ("sched/core: Enable SD_BALANCE_WAKE for asymmetric capacity systems")

to enable the use of find_idlest_cpu() and friends to find an appropriate
CPU for tasks.

That responsibility has now been shifted to select_idle_sibling() and
friends, and hence the flag can be removed. Note that this causes
asymmetric CPU capacity systems to no longer enter the slow wakeup path
(find_idlest_cpu()) on wakeups - only on execs and forks (which is aligned
with all other mainline topologies).

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
[Changelog tweaks]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Link: https://lkml.kernel.org/r/20200206191957.12325-3-valentin.schneider@arm.com
---
 kernel/sched/topology.c | 15 +++------------
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index dfb64c0..0091188 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1374,18 +1374,9 @@ sd_init(struct sched_domain_topology_level *tl,
 	 * Convert topological properties into behaviour.
 	 */
 
-	if (sd->flags & SD_ASYM_CPUCAPACITY) {
-		struct sched_domain *t = sd;
-
-		/*
-		 * Don't attempt to spread across CPUs of different capacities.
-		 */
-		if (sd->child)
-			sd->child->flags &= ~SD_PREFER_SIBLING;
-
-		for_each_lower_domain(t)
-			t->flags |= SD_BALANCE_WAKE;
-	}
+	/* Don't attempt to spread across CPUs of different capacities. */
+	if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
+		sd->child->flags &= ~SD_PREFER_SIBLING;
 
 	if (sd->flags & SD_SHARE_CPUCAPACITY) {
 		sd->imbalance_pct = 110;
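
For reference, a condensed (non-verbatim) sketch of the wakeup dispatch
in select_task_rq_fair() as of this series - the slow path is only
reachable when some domain carries the balance flag matching the wakeup
type, so with SD_BALANCE_WAKE gone from every level, sd stays NULL on
wakeups and the fast path runs:

	for_each_domain(cpu, tmp) {
		/* Remember the highest domain advertising this balance type */
		if (tmp->flags & sd_flag)
			sd = tmp;
		else if (!want_affine)
			break;
	}

	if (unlikely(sd)) {
		/* Slow path - no longer taken for sd_flag == SD_BALANCE_WAKE */
		new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
	} else if (sd_flag & SD_BALANCE_WAKE) {
		/* Fast path */
		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
	}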


* [tip: sched/core] sched/fair: Remove wake_cap()
  2020-02-06 19:19 ` [PATCH v4 4/4] sched/fair: Kill wake_cap() Valentin Schneider
  2020-02-07 11:19   ` Quentin Perret
  2020-02-11 12:47   ` [tip: sched/core] sched/fair: Remove wake_cap() tip-bot2 for Morten Rasmussen
@ 2020-02-20 20:09   ` tip-bot2 for Morten Rasmussen
  2 siblings, 0 replies; 22+ messages in thread
From: tip-bot2 for Morten Rasmussen @ 2020-02-20 20:09 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Morten Rasmussen, Valentin Schneider, Peter Zijlstra (Intel),
	Ingo Molnar, Thomas Gleixner, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     000619680c3714020ce9db17eef6a4a7ce2dc28b
Gitweb:        https://git.kernel.org/tip/000619680c3714020ce9db17eef6a4a7ce2dc28b
Author:        Morten Rasmussen <morten.rasmussen@arm.com>
AuthorDate:    Thu, 06 Feb 2020 19:19:57 
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 20 Feb 2020 21:03:15 +01:00

sched/fair: Remove wake_cap()

Capacity-awareness in the wake-up path previously involved disabling
wake_affine in certain scenarios. We have just made select_idle_sibling()
capacity-aware, so this isn't needed anymore.

Remove wake_cap() entirely.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
[Changelog tweaks]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[Changelog tweaks]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200206191957.12325-5-valentin.schneider@arm.com

---
 kernel/sched/fair.c | 30 +-----------------------------
 1 file changed, 1 insertion(+), 29 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6fb47a2..a7e11b1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6146,33 +6146,6 @@ static unsigned long cpu_util_without(int cpu, struct task_struct *p)
 }
 
 /*
- * Disable WAKE_AFFINE in the case where task @p doesn't fit in the
- * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
- *
- * In that case WAKE_AFFINE doesn't make sense and we'll let
- * BALANCE_WAKE sort things out.
- */
-static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
-{
-	long min_cap, max_cap;
-
-	if (!static_branch_unlikely(&sched_asym_cpucapacity))
-		return 0;
-
-	min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu));
-	max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;
-
-	/* Minimum capacity is close to max, no need to abort wake_affine */
-	if (max_cap - min_cap < max_cap >> 3)
-		return 0;
-
-	/* Bring task utilization in sync with prev_cpu */
-	sync_entity_load_avg(&p->se);
-
-	return !task_fits_capacity(p, min_cap);
-}
-
-/*
  * Predicts what cpu_util(@cpu) would return if @p was migrated (and enqueued)
  * to @dst_cpu.
  */
@@ -6436,8 +6409,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 			new_cpu = prev_cpu;
 		}
 
-		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu) &&
-			      cpumask_test_cpu(cpu, p->cpus_ptr);
+		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr);
 	}
 
 	rcu_read_lock();
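
For illustration, with assumed example values for the threshold removed
above: with max_cap = 1024 (a big CPU), max_cap >> 3 = 128, so any
min_cap above 896 counted as "close to max" and wake_affine stayed
enabled; a LITTLE CPU with capacity_orig = 446 gives 1024 - 446 = 578 >=
128, in which case the task_fits_capacity() check decided whether
wake_affine was abandoned.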


* [tip: sched/core] sched/topology: Remove SD_BALANCE_WAKE on asymmetric capacity systems
  2020-02-06 19:19 ` [PATCH v4 2/4] sched/topology: Remove SD_BALANCE_WAKE on asymmetric capacity systems Valentin Schneider
  2020-02-07 11:03   ` Quentin Perret
  2020-02-11 12:47   ` [tip: sched/core] " tip-bot2 for Morten Rasmussen
@ 2020-02-20 20:09   ` tip-bot2 for Morten Rasmussen
  2 siblings, 0 replies; 22+ messages in thread
From: tip-bot2 for Morten Rasmussen @ 2020-02-20 20:09 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Morten Rasmussen, Valentin Schneider, Peter Zijlstra (Intel),
	Ingo Molnar, Thomas Gleixner, Quentin Perret, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     a526d466798d65cff120ee00ef92931075bf3769
Gitweb:        https://git.kernel.org/tip/a526d466798d65cff120ee00ef92931075bf3769
Author:        Morten Rasmussen <morten.rasmussen@arm.com>
AuthorDate:    Thu, 06 Feb 2020 19:19:55 
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 20 Feb 2020 21:03:14 +01:00

sched/topology: Remove SD_BALANCE_WAKE on asymmetric capacity systems

SD_BALANCE_WAKE was previously added to lower sched_domain levels on
asymmetric CPU capacity systems by commit:

  9ee1cda5ee25 ("sched/core: Enable SD_BALANCE_WAKE for asymmetric capacity systems")

to enable the use of find_idlest_cpu() and friends to find an appropriate
CPU for tasks.

That responsibility has now been shifted to select_idle_sibling() and
friends, and hence the flag can be removed. Note that, as a result,
asymmetric CPU capacity systems no longer enter the slow wakeup path
(find_idlest_cpu()) on wakeups - only on execs and forks - which aligns
them with all other mainline topologies.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
[Changelog tweaks]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Quentin Perret <qperret@google.com>
Link: https://lkml.kernel.org/r/20200206191957.12325-3-valentin.schneider@arm.com

---
 kernel/sched/topology.c | 15 +++------------
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index dfb64c0..0091188 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1374,18 +1374,9 @@ sd_init(struct sched_domain_topology_level *tl,
 	 * Convert topological properties into behaviour.
 	 */
 
-	if (sd->flags & SD_ASYM_CPUCAPACITY) {
-		struct sched_domain *t = sd;
-
-		/*
-		 * Don't attempt to spread across CPUs of different capacities.
-		 */
-		if (sd->child)
-			sd->child->flags &= ~SD_PREFER_SIBLING;
-
-		for_each_lower_domain(t)
-			t->flags |= SD_BALANCE_WAKE;
-	}
+	/* Don't attempt to spread across CPUs of different capacities. */
+	if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
+		sd->child->flags &= ~SD_PREFER_SIBLING;
 
 	if (sd->flags & SD_SHARE_CPUCAPACITY) {
 		sd->imbalance_pct = 110;


* [tip: sched/core] sched/core: Remove for_each_lower_domain()
  2020-02-06 19:19 ` [PATCH v4 3/4] sched: Remove for_each_lower_domain() Valentin Schneider
  2020-02-07 11:04   ` Quentin Perret
  2020-02-11 12:47   ` [tip: sched/core] sched/core: " tip-bot2 for Valentin Schneider
@ 2020-02-20 20:09   ` tip-bot2 for Valentin Schneider
  2 siblings, 0 replies; 22+ messages in thread
From: tip-bot2 for Valentin Schneider @ 2020-02-20 20:09 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Dietmar Eggemann, Valentin Schneider, Peter Zijlstra (Intel),
	Ingo Molnar, Thomas Gleixner, Quentin Perret, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     f8459197e75b045d8d1d87b9856486b39e375721
Gitweb:        https://git.kernel.org/tip/f8459197e75b045d8d1d87b9856486b39e375721
Author:        Valentin Schneider <valentin.schneider@arm.com>
AuthorDate:    Thu, 06 Feb 2020 19:19:56 
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 20 Feb 2020 21:03:15 +01:00

sched/core: Remove for_each_lower_domain()

The last remaining user of this macro has just been removed; get rid of it.

Suggested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Quentin Perret <qperret@google.com>
Link: https://lkml.kernel.org/r/20200206191957.12325-4-valentin.schneider@arm.com

---
 kernel/sched/sched.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0844e81..878910e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1337,8 +1337,6 @@ extern void sched_ttwu_pending(void);
 	for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); \
 			__sd; __sd = __sd->parent)
 
-#define for_each_lower_domain(sd) for (; sd; sd = sd->child)
-
 /**
  * highest_flag_domain - Return highest sched_domain containing flag.
  * @cpu:	The CPU whose highest level of sched domain is to


* [tip: sched/core] sched/fair: Add asymmetric CPU capacity wakeup scan
  2020-02-06 19:19 ` [PATCH v4 1/4] sched/fair: Add asymmetric CPU capacity wakeup scan Valentin Schneider
                     ` (2 preceding siblings ...)
  2020-02-11 12:47   ` [tip: sched/core] " tip-bot2 for Morten Rasmussen
@ 2020-02-20 20:09   ` tip-bot2 for Morten Rasmussen
  3 siblings, 0 replies; 22+ messages in thread
From: tip-bot2 for Morten Rasmussen @ 2020-02-20 20:09 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Morten Rasmussen, Valentin Schneider, Peter Zijlstra (Intel),
	Ingo Molnar, Thomas Gleixner, Quentin Perret, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     b7a331615d254191e7f5f0e35aec9adcd6acdc54
Gitweb:        https://git.kernel.org/tip/b7a331615d254191e7f5f0e35aec9adcd6acdc54
Author:        Morten Rasmussen <morten.rasmussen@arm.com>
AuthorDate:    Thu, 06 Feb 2020 19:19:54 
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 20 Feb 2020 21:03:14 +01:00

sched/fair: Add asymmetric CPU capacity wakeup scan

Issue
=====

On asymmetric CPU capacity topologies, we currently rely on wake_cap() to
drive select_task_rq_fair() towards either:

- its slow-path (find_idlest_cpu()) if either the previous or
  current (waking) CPU has too little capacity for the waking task
- its fast-path (select_idle_sibling()) otherwise

Commit:

  3273163c6775 ("sched/fair: Let asymmetric CPU configurations balance at wake-up")

points out that this relies on the assumption that "[...]the CPU capacities
within an SD_SHARE_PKG_RESOURCES domain (sd_llc) are homogeneous".

This assumption no longer holds on newer generations of big.LITTLE
systems (DynamIQ), which can accommodate CPUs of different compute capacity
within a single LLC domain. To hopefully paint a better picture, a regular
big.LITTLE topology would look like this:

  +---------+ +---------+
  |   L2    | |   L2    |
  +----+----+ +----+----+
  |CPU0|CPU1| |CPU2|CPU3|
  +----+----+ +----+----+
      ^^^         ^^^
    LITTLEs      bigs

which would result in the following scheduler topology:

  DIE [         ] <- sd_asym_cpucapacity
  MC  [   ] [   ] <- sd_llc
       0 1   2 3

Conversely, a DynamIQ topology could look like:

  +-------------------+
  |        L3         |
  +----+----+----+----+
  | L2 | L2 | L2 | L2 |
  +----+----+----+----+
  |CPU0|CPU1|CPU2|CPU3|
  +----+----+----+----+
     ^^^^^     ^^^^^
    LITTLEs    bigs

which would result in the following scheduler topology:

  MC [       ] <- sd_llc, sd_asym_cpucapacity
      0 1 2 3

What this means is that, on DynamIQ systems, we could pass the wake_cap()
test (IOW presume the waking task fits the CPU capacities of some LLC
domain) and thus go through select_idle_sibling().
That function operates on an LLC domain, which here spans both bigs and
LITTLEs, so it could very well pick a CPU with too little capacity for the
task despite fitting idle CPUs being available - it all depends on the CPU
iteration order, for which we have no capacity-related guarantees.
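
For a concrete (hypothetical) instance: take capacities {CPU0,CPU1} = 256
and {CPU2,CPU3} = 1024 in a single LLC, and a waking task of utilization
~600. If the wrapping scan reaches an idle CPU1 before an idle CPU2, the
task lands on a CPU it cannot possibly fit, even though a fitting idle
CPU exists in the very same domain.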

Implementation
==============

Introduce yet another select_idle_sibling() helper function that takes CPU
capacity into account. The policy is to pick the first idle CPU which is
big enough for the task (task_util * margin < cpu_capacity). If no
idle CPU is big enough, we pick the idle one with the highest capacity.
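
As a rough illustration of that margin - a sketch, assuming the v5.6-era
fits_capacity() helper that backs task_fits_capacity(), plus illustrative
numbers:

	/* ~25% headroom: utilization must stay below ~80% of capacity */
	#define fits_capacity(cap, max)	((cap) * 1280 < (max) * 1024)

	/*
	 * E.g. util_est = 400 fits capacity 512:
	 *   400 * 1280 = 512000 < 512 * 1024 = 524288
	 * while util_est = 420 does not (537600 >= 524288).
	 */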

Unlike other select_idle_sibling() helpers, this one operates on the
sd_asym_cpucapacity sched_domain pointer, which is guaranteed to span all
known CPU capacities in the system. As such, this will work for both
"legacy" big.LITTLE (LITTLEs & bigs split at MC, joined at DIE) and for
newer DynamIQ systems (e.g. LITTLEs and bigs in the same MC domain).

Note that this limits the scope of select_idle_sibling() to
select_idle_capacity() for asymmetric CPU capacity systems - the LLC domain
will not be scanned, and no further heuristic will be applied.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Quentin Perret <qperret@google.com>
Link: https://lkml.kernel.org/r/20200206191957.12325-2-valentin.schneider@arm.com

---
 kernel/sched/fair.c | 56 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 56 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1a0ce83..6fb47a2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5897,6 +5897,40 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 }
 
 /*
+ * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
+ * the task fits. If no CPU is big enough, but there are idle ones, try to
+ * maximize capacity.
+ */
+static int
+select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
+{
+	unsigned long best_cap = 0;
+	int cpu, best_cpu = -1;
+	struct cpumask *cpus;
+
+	sync_entity_load_avg(&p->se);
+
+	cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+
+	for_each_cpu_wrap(cpu, cpus, target) {
+		unsigned long cpu_cap = capacity_of(cpu);
+
+		if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
+			continue;
+		if (task_fits_capacity(p, cpu_cap))
+			return cpu;
+
+		if (cpu_cap > best_cap) {
+			best_cap = cpu_cap;
+			best_cpu = cpu;
+		}
+	}
+
+	return best_cpu;
+}
+
+/*
  * Try and locate an idle core/thread in the LLC cache domain.
  */
 static int select_idle_sibling(struct task_struct *p, int prev, int target)
@@ -5904,6 +5938,28 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	struct sched_domain *sd;
 	int i, recent_used_cpu;
 
+	/*
+	 * For asymmetric CPU capacity systems, our domain of interest is
+	 * sd_asym_cpucapacity rather than sd_llc.
+	 */
+	if (static_branch_unlikely(&sched_asym_cpucapacity)) {
+		sd = rcu_dereference(per_cpu(sd_asym_cpucapacity, target));
+		/*
+		 * On an asymmetric CPU capacity system where an exclusive
+		 * cpuset defines a symmetric island (i.e. one unique
+		 * capacity_orig value through the cpuset), the key will be set
+		 * but the CPUs within that cpuset will not have a domain with
+		 * SD_ASYM_CPUCAPACITY. These should follow the usual symmetric
+		 * capacity path.
+		 */
+		if (!sd)
+			goto symmetric;
+
+		i = select_idle_capacity(p, sd, target);
+		return ((unsigned)i < nr_cpumask_bits) ? i : target;
+	}
+
+symmetric:
 	if (available_idle_cpu(target) || sched_idle_cpu(target))
 		return target;
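
Note the bound check on the helper's return value above:
select_idle_capacity() returns -1 when it finds no idle CPU at all, and
(unsigned)-1 wraps to a value >= nr_cpumask_bits, so the comparison fails
and the wakeup falls back to target; any valid CPU number passes the
check and is returned as-is.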
 


end of thread

Thread overview: 22+ messages
2020-02-06 19:19 [PATCH v4 0/4] sched/fair: Capacity aware wakeup rework Valentin Schneider
2020-02-06 19:19 ` [PATCH v4 1/4] sched/fair: Add asymmetric CPU capacity wakeup scan Valentin Schneider
2020-02-07  5:08   ` Pavan Kondeti
2020-02-07 10:18     ` Valentin Schneider
2020-02-07 11:01   ` Quentin Perret
2020-02-11 12:47   ` [tip: sched/core] " tip-bot2 for Morten Rasmussen
2020-02-20 20:09   ` tip-bot2 for Morten Rasmussen
2020-02-06 19:19 ` [PATCH v4 2/4] sched/topology: Remove SD_BALANCE_WAKE on asymmetric capacity systems Valentin Schneider
2020-02-07 11:03   ` Quentin Perret
2020-02-11 12:47   ` [tip: sched/core] " tip-bot2 for Morten Rasmussen
2020-02-20 20:09   ` tip-bot2 for Morten Rasmussen
2020-02-06 19:19 ` [PATCH v4 3/4] sched: Remove for_each_lower_domain() Valentin Schneider
2020-02-07 11:04   ` Quentin Perret
2020-02-11 12:47   ` [tip: sched/core] sched/core: " tip-bot2 for Valentin Schneider
2020-02-20 20:09   ` tip-bot2 for Valentin Schneider
2020-02-06 19:19 ` [PATCH v4 4/4] sched/fair: Kill wake_cap() Valentin Schneider
2020-02-07 11:19   ` Quentin Perret
2020-02-07 12:48     ` Valentin Schneider
2020-02-11 12:47   ` [tip: sched/core] sched/fair: Remove wake_cap() tip-bot2 for Morten Rasmussen
2020-02-20 20:09   ` tip-bot2 for Morten Rasmussen
2020-02-07 10:42 ` [PATCH v4 0/4] sched/fair: Capacity aware wakeup rework Quentin Perret
2020-02-07 12:41   ` Valentin Schneider
