* [RFC PATCH 0/7] sched: Streamline select_task_rq() & select_task_rq_fair()
From: Valentin Schneider @ 2019-12-11 16:43 UTC
  To: linux-kernel
  Cc: mingo, peterz, vincent.guittot, dietmar.eggemann, Juri Lelli,
	Steven Rostedt

I've been staring at select_task_rq_fair() for some time now and have
come to "hate" the usage of the sd_flag parameter. It is used both as an
indicator of the wakeup type (ttwu/fork/exec) and as a domain walk search
target. CFS is the only class that does this; the other classes only need
the wakeup type but get passed a domain flag instead.
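
To illustrate the shape of the change (just a sketch of where the series
ends up; patches 1-2 do the actual surgery):

  /* Before: classes get a domain flag even when all they need is the wakeup type */
  int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);

  /* After: the wakeup type is encoded in the wake flags themselves */
  int (*select_task_rq)(struct task_struct *p, int task_cpu, int wake_flags);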

This series gets rid of select_task_rq()'s sd_flag parameter and also
tries to optimize select_task_rq_fair().

There are a few ugly things in there that could be improved with some
extra lawnmowing, hence my sending this as an RFC.

This is based on v5.5-rc1.

Patches
=======

Patches 1-2 involve getting rid of the sd_flag parameter for select_task_rq().

Patches 3-4 are extra cleanups in the select_task_rq_fair() region.

Patches 5-7 split the want_affine vs !want_affine paths of
select_task_rq_fair(), which unearths a domain walk that can be cached.
Patches 6 & 7 should be considered as a single patch, they are only split
to hopefully ease reviewing.

Discussion points
=================

The use of SD_LOAD_BALANCE forced me to do a few ugly things.

Patch 5 is only required because the domain walk in select_task_rq_fair()
is capped by the presence of SD_LOAD_BALANCE. The thing is (I also ramble
about this in the changelog of patch 7), AFAIK this flag is set
unconditionally in sd_init() and the only way to clear it is via the
sched_domain sysctl, which is a SCHED_DEBUG interface. I haven't found
anything in cpusets that would clear it; AFAICT cpusets can end up
attaching the NULL domain to some CPUs (with sched_load_balance=0), but
that's as far as it goes:

  $ git grep SD_LOAD_BALANCE
  include/linux/sched/topology.h:#define SD_LOAD_BALANCE          0x0001  /* Do load balancing on this domain. */
  kernel/sched/fair.c:            if (!(tmp->flags & SD_LOAD_BALANCE))
  kernel/sched/fair.c:            if ((sd->flags & SD_LOAD_BALANCE) &&
  kernel/sched/fair.c:            if (!(sd->flags & SD_LOAD_BALANCE))
  kernel/sched/fair.c:            if (!(sd->flags & SD_LOAD_BALANCE))
  kernel/sched/topology.c:        if (!(sd->flags & SD_LOAD_BALANCE)) {
  kernel/sched/topology.c:                        printk(KERN_ERR "ERROR: !SD_LOAD_BALANCE domain has parent");
  kernel/sched/topology.c:        if (sd->flags & (SD_LOAD_BALANCE |
  kernel/sched/topology.c:                pflags &= ~(SD_LOAD_BALANCE |
  kernel/sched/topology.c:                .flags                  = 1*SD_LOAD_BALANCE

On top of this, the sysctl interface can break the SD pointers cached in
update_top_cache_domain() (again, see patch 7).

In short, we should:
- Fix the stale sched_domain cached pointer issue by either making the
  sysctl sd->flags entry read-only, or having a write trigger a cached
  pointer update

- Remove SD_LOAD_BALANCE (or at the very least its usage sites); see the
  sketch below. IMO, not doing load balance translates to not having a
  sched_domain.
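
As an (untested) sketch against patch 6's version of the want_affine walk,
removing the flag would let the wakeup path drop its SD_LOAD_BALANCE
ceiling entirely:

	for_each_domain(cpu, tmp) {
		/*
		 * No SD_LOAD_BALANCE test anymore: a CPU excluded from
		 * load balancing would simply have no sched_domain
		 * attached.
		 */
		if ((tmp->flags & SD_WAKE_AFFINE) &&
		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
			if (cpu != prev_cpu)
				new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync);

			sd = NULL; /* Prefer wake_affine over SD lookup */
			break;
		}
	}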

I'll be happy to do either of these, but I wanted to gather opinions before
that.

Testing
=======

Performance has been rather annoying to measure. I used to see some
improvements on Juno (between 1% and 2%) and just noise on Xeon with my
previous base - some tip/sched/core around v5.4-rc7. This has somehow
changed with v5.5-rc1: the baseline also got about 2% worse (on Juno).
I'll go and try to bisect this, but I've been sitting on this for a while
so I wanted to toss it out.
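
For reference, the numbers below were gathered with something along these
lines (a sketch of the harness - the exact invocation and the offline
post-processing of times.log into the percentile tables are assumptions):

  for i in $(seq 500); do
          hackbench | awk '/^Time:/ { print $2 }'
  done > times.log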

Juno r0
-------

SD levels are {MC, DIE}.
500 iterations of $(hackbench):

  |       |   -PATCH |   +PATCH | DELTA (%) |
  |-------+----------+----------+-----------|
  | count |      500 |      500 |           |
  | mean  | 0.630625 | 0.628632 |    -0.316 |
  | std   | 0.022276 | 0.024253 |    +8.875 |
  | min   | 0.584000 | 0.580000 |    -0.685 |
  | 50%   | 0.629000 | 0.625000 |    -0.636 |
  | 95%   | 0.666000 | 0.674000 |    +1.201 |
  | 99%   | 0.703000 | 0.699000 |    -0.569 |
  | max   | 0.748000 | 0.722000 |    -3.476 |

I also tried pinning that to a single type of CPU to reduce the 'noise'
of asymmetric scheduling; this is 500 iterations of
$(taskset -c 0,3-5 hackbench) (IOW LITTLEs only):

  |       |   -PATCH |   +PATCH | DELTA (%) |
  |-------+----------+----------+-----------|
  | count |      500 |      500 |           |
  | mean  | 1.092194 | 1.084328 |    -0.720 |
  | std   | 0.046710 | 0.049637 |    +6.266 |
  | min   | 1.017000 | 1.004000 |    -1.278 |
  | 50%   | 1.079000 | 1.072000 |    -0.649 |
  | 95%   | 1.177100 | 1.173300 |    -0.323 |
  | 99%   | 1.273060 | 1.261130 |    -0.937 |
  | max   | 1.342000 | 1.305000 |    -2.757 |

Dual-socket Xeon E5
-------------------

That is 10 cores w/ SMT2 per socket. SD levels are {SMT, MC, NUMA}.
500 iterations of $(hackbench -l 1000):

  |       |   -PATCH |   +PATCH | DELTA (%) |
  |-------+----------+----------+-----------|
  | count |      500 |      500 |           |
  | mean  | 0.949278 | 0.946502 |    -0.292 |
  | std   | 0.006402 | 0.006784 |    +5.967 |
  | min   | 0.924000 | 0.921000 |    -0.325 |
  | 50%   | 0.950000 | 0.947000 |    -0.316 |
  | 95%   | 0.959000 | 0.957000 |    -0.209 |
  | 99%   | 0.963000 | 0.960010 |    -0.310 |
  | max   | 0.966000 | 0.963000 |    -0.311 |

Notes
=====

One thing of note is that we are considering getting rid of SD_BALANCE_WAKE
for asymmetric systems, which are the last remaining (upstream) users of
that flag. This would lower the usefulness of this series; nevertheless I
think the cleanup (or at least part of it) is worth it.


Valentin Schneider (7):
  sched: Add WF_TTWU, WF_EXEC wakeup flags
  sched: Kill select_task_rq()'s sd_flag parameter
  sched/fair: find_idlest_group(): Remove unused sd_flag parameter
  sched/fair: Dissociate wakeup decisions from SD flag value
  sched/topology: Make {lowest/highest}_flag_domain() work with > 1
    flags
  sched/fair: Split select_task_rq_fair want_affine logic
  sched/topology: Define and use shortcut pointers for wakeup sd_flag
    scan

 kernel/sched/core.c      | 10 +++----
 kernel/sched/deadline.c  |  4 +--
 kernel/sched/fair.c      | 61 +++++++++++++++++++++++++---------------
 kernel/sched/idle.c      |  2 +-
 kernel/sched/rt.c        |  4 +--
 kernel/sched/sched.h     | 36 ++++++++++++++++--------
 kernel/sched/stop_task.c |  2 +-
 kernel/sched/topology.c  | 20 ++++++++++---
 8 files changed, 91 insertions(+), 48 deletions(-)

--
2.24.0



* [RFC PATCH 1/7] sched: Add WF_TTWU, WF_EXEC wakeup flags
From: Valentin Schneider @ 2019-12-11 16:43 UTC
  To: linux-kernel
  Cc: mingo, peterz, vincent.guittot, dietmar.eggemann, Juri Lelli,
	Steven Rostedt

To remove the sd_flag parameter of select_task_rq(), we need another way
of encoding wakeup kinds. There already is a WF_FORK flag; add the missing
ones.
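
For instance, a wakeup site can then encode the wakeup kind directly in
the wake flags rather than in a separate sd_flag argument - a preview of
the conversions done in patch 2:

	/* try_to_wake_up(): a regular wakeup */
	cpu = select_task_rq(p, p->wake_cpu, wake_flags | WF_TTWU);

	/* wake_up_new_task(): a fork wakeup */
	__set_task_cpu(p, select_task_rq(p, task_cpu(p), WF_FORK));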

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/sched.h | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 280a3c735935..1936ab2faff0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1639,11 +1639,13 @@ static inline int task_on_rq_migrating(struct task_struct *p)
 }
 
 /*
- * wake flags
+ * Wake flags
  */
 #define WF_SYNC			0x01		/* Waker goes to sleep after wakeup */
-#define WF_FORK			0x02		/* Child wakeup after fork */
-#define WF_MIGRATED		0x4		/* Internal use, task got migrated */
+#define WF_TTWU                 0x02            /* Regular task wakeup */
+#define WF_FORK			0x04		/* Child wakeup after fork */
+#define WF_EXEC			0x08		/* "Fake" wakeup at exec */
+#define WF_MIGRATED		0x10		/* Internal use, task got migrated */
 
 /*
  * To aid in avoiding the subversion of "niceness" due to uneven distribution
-- 
2.24.0



* [RFC PATCH 2/7] sched: Kill select_task_rq()'s sd_flag parameter
From: Valentin Schneider @ 2019-12-11 16:43 UTC
  To: linux-kernel
  Cc: mingo, peterz, vincent.guittot, dietmar.eggemann, Juri Lelli,
	Steven Rostedt

Only select_task_rq_fair() uses that parameter to do an actual domain
search; other classes only care about what kind of wakeup is happening
(fork, exec, or "regular") and thus just translate the flag into a wakeup
type.

WF_TTWU and WF_EXEC have just been added, use these along with WF_FORK to
encode the wakeup types we care about. This cleans up the API a
bit, but adds an extra conversion layer within select_task_rq_fair().

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/core.c      | 10 +++++-----
 kernel/sched/deadline.c  |  4 ++--
 kernel/sched/fair.c      | 18 +++++++++++++++---
 kernel/sched/idle.c      |  2 +-
 kernel/sched/rt.c        |  4 ++--
 kernel/sched/sched.h     |  2 +-
 kernel/sched/stop_task.c |  2 +-
 7 files changed, 27 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 90e4b00ace89..38b434846d5e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2101,12 +2101,12 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
  * The caller (fork, wakeup) owns p->pi_lock, ->cpus_ptr is stable.
  */
 static inline
-int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
+int select_task_rq(struct task_struct *p, int cpu, int wake_flags)
 {
 	lockdep_assert_held(&p->pi_lock);
 
 	if (p->nr_cpus_allowed > 1)
-		cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
+		cpu = p->sched_class->select_task_rq(p, cpu, wake_flags);
 	else
 		cpu = cpumask_any(p->cpus_ptr);
 
@@ -2625,7 +2625,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		atomic_dec(&task_rq(p)->nr_iowait);
 	}
 
-	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
+	cpu = select_task_rq(p, p->wake_cpu, wake_flags | WF_TTWU);
 	if (task_cpu(p) != cpu) {
 		wake_flags |= WF_MIGRATED;
 		psi_ttwu_dequeue(p);
@@ -2958,7 +2958,7 @@ void wake_up_new_task(struct task_struct *p)
 	 * as we're not fully set-up yet.
 	 */
 	p->recent_used_cpu = task_cpu(p);
-	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
+	__set_task_cpu(p, select_task_rq(p, task_cpu(p), WF_FORK));
 #endif
 	rq = __task_rq_lock(p, &rf);
 	update_rq_clock(rq);
@@ -3499,7 +3499,7 @@ void sched_exec(void)
 	int dest_cpu;
 
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
-	dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0);
+	dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), WF_EXEC);
 	if (dest_cpu == smp_processor_id())
 		goto unlock;
 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 43323f875cb9..53216fffb871 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1599,12 +1599,12 @@ static void yield_task_dl(struct rq *rq)
 static int find_later_rq(struct task_struct *task);
 
 static int
-select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
+select_task_rq_dl(struct task_struct *p, int cpu, int flags)
 {
 	struct task_struct *curr;
 	struct rq *rq;
 
-	if (sd_flag != SD_BALANCE_WAKE)
+	if (!(flags & WF_TTWU))
 		goto out;
 
 	rq = cpu_rq(cpu);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 08a233e97a01..a614e4794024 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6325,7 +6325,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
- * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
+ * that have the relevant SD flag set. In practice, this is SD_BALANCE_WAKE,
  * SD_BALANCE_FORK, or SD_BALANCE_EXEC.
  *
  * Balances load by selecting the idlest CPU in the idlest group, or under
@@ -6336,13 +6336,25 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
  * preempt must be disabled.
  */
 static int
-select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
+select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 {
+	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
 	struct sched_domain *tmp, *sd = NULL;
 	int cpu = smp_processor_id();
 	int new_cpu = prev_cpu;
 	int want_affine = 0;
-	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
+	int sd_flag;
+
+	switch (wake_flags & (WF_TTWU | WF_FORK | WF_EXEC)) {
+	case WF_TTWU:
+		sd_flag = SD_BALANCE_WAKE;
+		break;
+	case WF_FORK:
+		sd_flag = SD_BALANCE_FORK;
+		break;
+	default:
+		sd_flag = SD_BALANCE_EXEC;
+	}
 
 	if (sd_flag & SD_BALANCE_WAKE) {
 		record_wakee(p);
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index ffa959e91227..416185886e7f 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -367,7 +367,7 @@ void cpu_startup_entry(enum cpuhp_state state)
 
 #ifdef CONFIG_SMP
 static int
-select_task_rq_idle(struct task_struct *p, int cpu, int sd_flag, int flags)
+select_task_rq_idle(struct task_struct *p, int cpu, int flags)
 {
 	return task_cpu(p); /* IDLE tasks as never migrated */
 }
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index e591d40fd645..ae579d79ddc0 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1387,13 +1387,13 @@ static void yield_task_rt(struct rq *rq)
 static int find_lowest_rq(struct task_struct *task);
 
 static int
-select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
+select_task_rq_rt(struct task_struct *p, int cpu, int flags)
 {
 	struct task_struct *curr;
 	struct rq *rq;
 
 	/* For anything but wake ups, just return the task_cpu */
-	if (sd_flag != SD_BALANCE_WAKE && sd_flag != SD_BALANCE_FORK)
+	if (!(flags & (WF_TTWU | WF_FORK)))
 		goto out;
 
 	rq = cpu_rq(cpu);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1936ab2faff0..62efc0443cac 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1722,7 +1722,7 @@ struct sched_class {
 
 #ifdef CONFIG_SMP
 	int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
-	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
+	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int flags);
 	void (*migrate_task_rq)(struct task_struct *p, int new_cpu);
 
 	void (*task_woken)(struct rq *this_rq, struct task_struct *task);
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 4c9e9975684f..4f061ddf8470 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -11,7 +11,7 @@
 
 #ifdef CONFIG_SMP
 static int
-select_task_rq_stop(struct task_struct *p, int cpu, int sd_flag, int flags)
+select_task_rq_stop(struct task_struct *p, int cpu, int flags)
 {
 	return task_cpu(p); /* stop tasks as never migrate */
 }
-- 
2.24.0



* [RFC PATCH 3/7] sched/fair: find_idlest_group(): Remove unused sd_flag parameter
From: Valentin Schneider @ 2019-12-11 16:43 UTC
  To: linux-kernel; +Cc: mingo, peterz, vincent.guittot, dietmar.eggemann

The last use of that parameter was removed by commit

  57abff067a08 ("sched/fair: Rework find_idlest_group()")

Get rid of the parameter.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/fair.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a614e4794024..f0b2b403bebb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5575,8 +5575,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 }
 
 static struct sched_group *
-find_idlest_group(struct sched_domain *sd, struct task_struct *p,
-		  int this_cpu, int sd_flag);
+find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu);
 
 /*
  * find_idlest_group_cpu - find the idlest CPU among the CPUs in the group.
@@ -5665,7 +5664,7 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
 			continue;
 		}
 
-		group = find_idlest_group(sd, p, cpu, sd_flag);
+		group = find_idlest_group(sd, p, cpu);
 		if (!group) {
 			sd = sd->child;
 			continue;
@@ -8382,8 +8381,7 @@ static bool update_pick_idlest(struct sched_group *idlest,
  * Assumes p is allowed on at least one CPU in sd.
  */
 static struct sched_group *
-find_idlest_group(struct sched_domain *sd, struct task_struct *p,
-		  int this_cpu, int sd_flag)
+find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 {
 	struct sched_group *idlest = NULL, *local = NULL, *group = sd->groups;
 	struct sg_lb_stats local_sgs, tmp_sgs;
-- 
2.24.0



* [RFC PATCH 4/7] sched/fair: Dissociate wakeup decisions from SD flag value
From: Valentin Schneider @ 2019-12-11 16:43 UTC
  To: linux-kernel; +Cc: mingo, peterz, vincent.guittot, dietmar.eggemann

The CFS wakeup code will only ever go through EAS / its fast path on
"regular" wakeups (i.e. not on forks or execs). These are currently gated
by a check against 'sd_flag', which would be SD_BALANCE_WAKE at wakeup.

However, we now have a flag that explicitly tells us whether a wakeup is a
"regular" one, so hinge those conditions on that flag instead.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/fair.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f0b2b403bebb..30e8d357a24f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6355,7 +6355,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 		sd_flag = SD_BALANCE_EXEC;
 	}
 
-	if (sd_flag & SD_BALANCE_WAKE) {
+	if (wake_flags & WF_TTWU) {
 		record_wakee(p);
 
 		if (sched_energy_enabled()) {
@@ -6396,9 +6396,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	if (unlikely(sd)) {
 		/* Slow path */
 		new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
-	} else if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */
+	} else if (wake_flags & WF_TTWU) { /* XXX always ? */
 		/* Fast path */
-
 		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
 
 		if (want_affine)
-- 
2.24.0



* [RFC PATCH 5/7] sched/topology: Make {lowest/highest}_flag_domain() work with > 1 flags
From: Valentin Schneider @ 2019-12-11 16:43 UTC
  To: linux-kernel; +Cc: mingo, peterz, vincent.guittot, dietmar.eggemann

In some cases, select_task_rq_fair() ends up looking for the highest domain
with both SD_LOAD_BALANCE and one of SD_BALANCE_{WAKE, FORK, EXEC}.
This is pretty much a highest_flag_domain() call, but the latter can only
cope with a single flag.

Make the existing helpers cope with being passed more than one flag. While
at it, rename them to make their newfound powers explicit.
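
For instance, the !want_affine lookup for a regular wakeup (see patches 6
and 7) becomes:

	sd = highest_flags_domain(cpu, SD_BALANCE_WAKE | SD_LOAD_BALANCE);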

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/sched.h    | 23 ++++++++++++++++-------
 kernel/sched/topology.c |  8 ++++----
 2 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 62efc0443cac..233b3d41e347 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1340,20 +1340,20 @@ extern void sched_ttwu_pending(void);
 #define for_each_lower_domain(sd) for (; sd; sd = sd->child)
 
 /**
- * highest_flag_domain - Return highest sched_domain containing flag.
+ * highest_flags_domain - Return highest sched_domain containing flags.
  * @cpu:	The CPU whose highest level of sched domain is to
  *		be returned.
- * @flag:	The flag to check for the highest sched_domain
+ * @flags:	The flags to check for the highest sched_domain
  *		for the given CPU.
  *
- * Returns the highest sched_domain of a CPU which contains the given flag.
+ * Returns the highest sched_domain of a CPU which contains all given flags.
  */
-static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
+static inline struct sched_domain *highest_flags_domain(int cpu, int flags)
 {
 	struct sched_domain *sd, *hsd = NULL;
 
 	for_each_domain(cpu, sd) {
-		if (!(sd->flags & flag))
+		if (!((sd->flags & flags) == flags))
 			break;
 		hsd = sd;
 	}
@@ -1361,12 +1361,21 @@ static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
 	return hsd;
 }
 
-static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
+/**
+ * lowest_flags_domain - Return lowest sched_domain containing flags.
+ * @cpu:	The CPU whose lowest level of sched domain is to
+ *		be returned.
+ * @flags:	The flags to check for the lowest sched_domain
+ *		for the given CPU.
+ *
+ * Returns the lowest sched_domain of a CPU which contains all given flags.
+ */
+static inline struct sched_domain *lowest_flags_domain(int cpu, int flags)
 {
 	struct sched_domain *sd;
 
 	for_each_domain(cpu, sd) {
-		if (sd->flags & flag)
+		if ((sd->flags & flags) == flags)
 			break;
 	}
 
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 6ec1e595b1d4..f0d2a15fd10b 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -631,7 +631,7 @@ static void update_top_cache_domain(int cpu)
 	int id = cpu;
 	int size = 1;
 
-	sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
+	sd = highest_flags_domain(cpu, SD_SHARE_PKG_RESOURCES);
 	if (sd) {
 		id = cpumask_first(sched_domain_span(sd));
 		size = cpumask_weight(sched_domain_span(sd));
@@ -643,13 +643,13 @@ static void update_top_cache_domain(int cpu)
 	per_cpu(sd_llc_id, cpu) = id;
 	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
 
-	sd = lowest_flag_domain(cpu, SD_NUMA);
+	sd = lowest_flags_domain(cpu, SD_NUMA);
 	rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
 
-	sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
+	sd = highest_flags_domain(cpu, SD_ASYM_PACKING);
 	rcu_assign_pointer(per_cpu(sd_asym_packing, cpu), sd);
 
-	sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY);
+	sd = lowest_flags_domain(cpu, SD_ASYM_CPUCAPACITY);
 	rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
 }
 
-- 
2.24.0



* [RFC PATCH 6/7] sched/fair: Split select_task_rq_fair want_affine logic
From: Valentin Schneider @ 2019-12-11 16:44 UTC
  To: linux-kernel; +Cc: mingo, peterz, vincent.guittot, dietmar.eggemann

The domain loop within select_task_rq_fair() depends on a few bits of
input, namely the SD flag we're looking for and whether we want_affine.

For !want_affine, the domain loop will walk up the hierarchy to reach the
highest domain with both SD_LOAD_BALANCE and the requested sd_flag
(SD_BALANCE_{WAKE, FORK, EXEC}) set.
In other words, that's a call to highest_flags_domain() for these two
flags. Note that this is a static information wrt a given SD hierarchy,
so we can cache that - but that comes in a later patch to ease reviewing.

For want_affine, we'll walk up the hierarchy to reach the first domain
with SD_LOAD_BALANCE and SD_WAKE_AFFINE set that spans the task's prev_cpu.
We still save a pointer to the last visited domain that had the requested
sd_flag set (and SD_LOAD_BALANCE), which means that if we fail to go
through the affine condition (e.g. no domain had SD_WAKE_AFFINE) we'll use
the same SD as we would have found if we had !want_affine.

Split the domain loop into !want_affine and want_affine paths. As it is,
this leads to two domain walks instead of a single one, but stay tuned for
the next patch.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/fair.c | 29 ++++++++++++++++++-----------
 1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 30e8d357a24f..ea875c7c82d7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6370,29 +6370,36 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	}
 
 	rcu_read_lock();
+
+	sd = highest_flags_domain(cpu, sd_flag | SD_LOAD_BALANCE);
+
+	/*
+	 * If !want_affine, we just look for the highest domain where
+	 * sd_flag is set.
+	 */
+	if (!want_affine)
+		goto scan;
+
+	/*
+	 * Otherwise we look for the lowest domain with SD_WAKE_AFFINE and that
+	 * spans both 'cpu' and 'prev_cpu'.
+	 */
 	for_each_domain(cpu, tmp) {
 		if (!(tmp->flags & SD_LOAD_BALANCE))
 			break;
 
-		/*
-		 * If both 'cpu' and 'prev_cpu' are part of this domain,
-		 * cpu is a valid SD_WAKE_AFFINE target.
-		 */
-		if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
+		if ((tmp->flags & SD_WAKE_AFFINE) &&
 		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
 			if (cpu != prev_cpu)
 				new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync);
 
-			sd = NULL; /* Prefer wake_affine over balance flags */
+			/* Prefer wake_affine over SD lookup */
+			sd = NULL;
 			break;
 		}
-
-		if (tmp->flags & sd_flag)
-			sd = tmp;
-		else if (!want_affine)
-			break;
 	}
 
+scan:
 	if (unlikely(sd)) {
 		/* Slow path */
 		new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
-- 
2.24.0



* [RFC PATCH 7/7] sched/topology: Define and use shortcut pointers for wakeup sd_flag scan
From: Valentin Schneider @ 2019-12-11 16:44 UTC
  To: linux-kernel; +Cc: mingo, peterz, vincent.guittot, dietmar.eggemann

Reworking select_task_rq_fair()'s domain walk exposed that !want_affine
wakeups only look for the highest sched_domain with the required sd_flag
set. This is something we can cache at sched domain build time to slightly
optimize select_task_rq_fair(). Note that this isn't a "free" optimization:
it costs us 3 pointers per CPU.

Add cached per-CPU pointers for the highest domains with SD_BALANCE_WAKE,
SD_BALANCE_EXEC and SD_BALANCE_FORK. Do note that these are exclusively
used in select_task_rq_fair(), which also requires SD_LOAD_BALANCE, so the
cached pointers are also gated by SD_LOAD_BALANCE (which is somewhat icky).

There is another nasty thing that comes out of this: with the
sched_domain/* sysctl knobs, one could e.g. clear SD_LOAD_BALANCE for some
CPU(s) (AFAIK the only way to clear that flag ATM). Before this patch,
that would disable entering find_idlest_cpu() from said CPU(s). With this
patch, writing to the SD flags sysctl wouldn't lead to any cached SD
pointer update, so we'd keep going through the slow-path.

This isn't a new problem (all cached domain pointers are affected by this),
but it is a change in behaviour. It could make sense to either turn this
interface read-only, or add a callback in the flags update to properly
update our cached domain data.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/sched/fair.c     | 25 +++++++++++++------------
 kernel/sched/sched.h    |  3 +++
 kernel/sched/topology.c | 12 ++++++++++++
 3 files changed, 28 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ea875c7c82d7..5b72597d6540 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6344,17 +6344,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	int want_affine = 0;
 	int sd_flag;
 
-	switch (wake_flags & (WF_TTWU | WF_FORK | WF_EXEC)) {
-	case WF_TTWU:
-		sd_flag = SD_BALANCE_WAKE;
-		break;
-	case WF_FORK:
-		sd_flag = SD_BALANCE_FORK;
-		break;
-	default:
-		sd_flag = SD_BALANCE_EXEC;
-	}
-
 	if (wake_flags & WF_TTWU) {
 		record_wakee(p);
 
@@ -6371,7 +6360,19 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 
 	rcu_read_lock();
 
-	sd = highest_flags_domain(cpu, sd_flag | SD_LOAD_BALANCE);
+	switch (wake_flags & (WF_TTWU | WF_FORK | WF_EXEC)) {
+	case WF_TTWU:
+		sd_flag = SD_BALANCE_WAKE;
+		sd = rcu_dereference(per_cpu(sd_balance_wake, cpu));
+		break;
+	case WF_FORK:
+		sd_flag = SD_BALANCE_FORK;
+		sd = rcu_dereference(per_cpu(sd_balance_fork, cpu));
+		break;
+	default:
+		sd_flag = SD_BALANCE_EXEC;
+		sd = rcu_dereference(per_cpu(sd_balance_exec, cpu));
+	}
 
 	/*
 	 * If !want_affine, we just look for the highest domain where
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 233b3d41e347..ba9d55085c7b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1386,6 +1386,9 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_size);
 DECLARE_PER_CPU(int, sd_llc_id);
 DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
+DECLARE_PER_CPU(struct sched_domain __rcu *, sd_balance_wake);
+DECLARE_PER_CPU(struct sched_domain __rcu *, sd_balance_exec);
+DECLARE_PER_CPU(struct sched_domain __rcu *, sd_balance_fork);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index f0d2a15fd10b..f76f40b64279 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -619,6 +619,9 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_size);
 DEFINE_PER_CPU(int, sd_llc_id);
 DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
+DEFINE_PER_CPU(struct sched_domain __rcu *, sd_balance_wake);
+DEFINE_PER_CPU(struct sched_domain __rcu *, sd_balance_exec);
+DEFINE_PER_CPU(struct sched_domain __rcu *, sd_balance_fork);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
@@ -643,6 +646,15 @@ static void update_top_cache_domain(int cpu)
 	per_cpu(sd_llc_id, cpu) = id;
 	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
 
+	sd = highest_flags_domain(cpu, SD_BALANCE_WAKE | SD_LOAD_BALANCE);
+	rcu_assign_pointer(per_cpu(sd_balance_wake, cpu), sd);
+
+	sd = highest_flags_domain(cpu, SD_BALANCE_FORK | SD_LOAD_BALANCE);
+	rcu_assign_pointer(per_cpu(sd_balance_fork, cpu), sd);
+
+	sd = highest_flags_domain(cpu, SD_BALANCE_EXEC | SD_LOAD_BALANCE);
+	rcu_assign_pointer(per_cpu(sd_balance_exec, cpu), sd);
+
 	sd = lowest_flags_domain(cpu, SD_NUMA);
 	rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
 
-- 
2.24.0



* Re: [RFC PATCH 4/7] sched/fair: Dissociate wakeup decisions from SD flag value
From: Valentin Schneider @ 2019-12-11 16:53 UTC
  To: linux-kernel; +Cc: mingo, peterz, vincent.guittot, dietmar.eggemann

On 11/12/2019 16:43, Valentin Schneider wrote:
> @@ -6396,9 +6396,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>  	if (unlikely(sd)) {
>  		/* Slow path */
>  		new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
> -	} else if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */
> +	} else if (wake_flags & WF_TTWU) { /* XXX always ? */

While I'm at it, Dietmar pointed out to me that this is only really
relevant to forkees and execees when a NULL domain is attached to the CPU
(since sd_init() unconditionally sets SD_BALANCE_{FORK, EXEC}). So this
only makes a difference when the SD hierarchy hasn't been built / is being
rebuilt, or when cpusets are involved.

So perhaps we could make that an unconditional else (sketch below), or
make forkees/execees bail out earlier.

>  		/* Fast path */
> -
>  		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
>  
>  		if (want_affine)
> 
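
i.e. something like this (untested sketch of the "unconditional else"
option):

	if (unlikely(sd)) {
		/* Slow path */
		new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
	} else {
		/* Fast path */
		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);

		if (want_affine)
			current->recent_used_cpu = cpu;
	}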


* Re: [RFC PATCH 0/7] sched: Streamline select_task_rq() & select_task_rq_fair()
From: Peter Zijlstra @ 2020-01-06  8:10 UTC
  To: Valentin Schneider
  Cc: linux-kernel, mingo, vincent.guittot, dietmar.eggemann,
	Juri Lelli, Steven Rostedt

On Wed, Dec 11, 2019 at 04:43:54PM +0000, Valentin Schneider wrote:
> Discussion points
> =================
> 
> The use of SD_LOAD_BALANCE forced me to do a few ugly things.
> 
> Patch 5 is only required because the domain walk in select_task_rq_fair()
> is ceiled by the presence of SD_LOAD_BALANCE. Thing is (I also ramble about
> this in the changelog of patch 7) AFAIK this flag is set unconditionally in
> sd_init() and the only way to clear it is via the sched_domain sysctl,
> which is a SCHED_DEBUG interface. I haven't found anything with cpusets
> that would clear it; AFAICT cpusets can end up attaching the NULL domain to
> some CPUs (with sched_load_balance=0) but that's as far as this goes:

I can't find it in a hurry, but cpusets should really be able to build
stuff without LOAD_BALANCE on, otherwise it is broken.

/me digs a little into the code and finds that... no, you're right. What
cpusets does is create non-overlapping domain trees for each partition,
which avoids the need to clear that flag.

Hmmm.. if we double check all that, I don't suppose there is anything
against simply removing that flag. Less is more etc..

