linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/8] sched: task load tracking optimization and cleanup
@ 2022-07-09 15:13 Chengming Zhou
  2022-07-09 15:13 ` [PATCH 1/8] sched/fair: combine detach into dequeue when migrating task Chengming Zhou
                   ` (7 more replies)
  0 siblings, 8 replies; 15+ messages in thread
From: Chengming Zhou @ 2022-07-09 15:13 UTC (permalink / raw)
  To: mingo, peterz, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, vschneid
  Cc: linux-kernel, Chengming Zhou

Hi all,

This patch series contains optimizations and cleanups for task load tracking
when a task migrates across CPUs/cgroups or goes through switched_from/to_fair().

There are three cases of detach/attach_entity_load_avg (apart from the fork
and exit cases) for a fair task:
1. task migrates CPU (on_rq migrate or wakeup migrate)
2. task migrates cgroup (detach then attach)
3. task switched_from/to_fair (detach, then later attach)

patch 1 optimizes the on_rq CPU migration case by combining the detach into
dequeue, so we don't need to do detach_entity_cfs_rq() in
migrate_task_rq_fair() any more.

patch 3 cleans up the cgroup migration case by removing
cpu_cgrp_subsys->fork(), since we already do the same thing in
sched_cgroup_fork().

patches 1-3 have been reviewed earlier, but conflict with the current tip
tree, so they are included here as part of this series. Sorry for the
inconvenience.

patches 5-6 fix load tracking for newly forked !fair tasks and for tasks
going through switched_from_fair().

After these changes, the task's sched_avg last_update_time is reset to 0
when it migrates between CPUs/cgroups or goes through switched_from_fair(),
to save the updated sched_avg for the next attach.
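
For reference, the convention this relies on is that last_update_time == 0
means the sched_avg is detached; the existing DO_ATTACH check in
update_load_avg() looks roughly like this (a sketch, not the exact code):

  if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
          /* attach the saved sched_avg instead of decaying it */
          attach_entity_load_avg(cfs_rq, se);
          update_tg_load_avg(cfs_rq);
  }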

patches 7-8 delete the superfluous set_task_rq_fair() and SKIP_AGE_LOAD.

Chengming Zhou (8):
  sched/fair: combine detach into dequeue when migrating task
  sched/fair: update comments in enqueue/dequeue_entity()
  sched/fair: remove redundant cpu_cgrp_subsys->fork()
  sched/fair: reset sched_avg last_update_time before set_task_rq()
  sched/fair: fix load tracking for new forked !fair task
  sched/fair: stop load tracking when task switched_from_fair()
  sched/fair: delete superfluous set_task_rq_fair()
  sched/fair: delete superfluous SKIP_AGE_LOAD

 kernel/sched/core.c     |  27 ++------
 kernel/sched/fair.c     | 139 ++++++++++------------------------------
 kernel/sched/features.h |   1 -
 kernel/sched/sched.h    |  14 +---
 4 files changed, 41 insertions(+), 140 deletions(-)

-- 
2.36.1


* [PATCH 1/8] sched/fair: combine detach into dequeue when migrating task
  2022-07-09 15:13 [PATCH 0/8] sched: task load tracking optimization and cleanup Chengming Zhou
@ 2022-07-09 15:13 ` Chengming Zhou
  2022-07-11 13:07   ` Chengming Zhou
  2022-07-09 15:13 ` [PATCH 2/8] sched/fair: update comments in enqueue/dequeue_entity() Chengming Zhou
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 15+ messages in thread
From: Chengming Zhou @ 2022-07-09 15:13 UTC (permalink / raw)
  To: mingo, peterz, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, vschneid
  Cc: linux-kernel, Chengming Zhou

When we are migrating a task off the CPU, we can combine the detach and
propagation into dequeue_entity(), saving the detach_entity_cfs_rq() call
in migrate_task_rq_fair().

This optimization mirrors the DO_ATTACH handling in enqueue_entity() when
migrating a task to the CPU.

So we don't have to traverse the cfs_rq tree an extra time for the
detach_entity_cfs_rq() -> propagate_entity_cfs_rq() call, which is no
longer made with this patch's change:

detach_task()
  deactivate_task()
    dequeue_task_fair()
      for_each_sched_entity(se)
        dequeue_entity()
          update_load_avg() /* (1) */
            detach_entity_load_avg()

  set_task_cpu()
    migrate_task_rq_fair()
      detach_entity_cfs_rq() /* (2) */
        update_load_avg();
        detach_entity_load_avg();
        propagate_entity_cfs_rq();
          for_each_sched_entity()
            update_load_avg()

This patch saves the detach_entity_cfs_rq() call in (2) by doing the
detach_entity_load_avg() for a CPU-migrating task inside (1), the task
being the first se in the loop.
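
With this patch, the flow becomes roughly (a sketch, not the exact code):

detach_task()
  deactivate_task()
    dequeue_task_fair()
      for_each_sched_entity(se)
        dequeue_entity()
          update_load_avg(cfs_rq, se, UPDATE_TG | DO_DETACH)
            detach_entity_load_avg()  /* detach done here for the task se */
            update_tg_load_avg()

  set_task_cpu()
    migrate_task_rq_fair()            /* no detach_entity_cfs_rq() anymore */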

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 30 +++++++++++++++++-------------
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a78d2e3b9d49..0689b94ed70b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4003,6 +4003,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 #define UPDATE_TG	0x1
 #define SKIP_AGE_LOAD	0x2
 #define DO_ATTACH	0x4
+#define DO_DETACH	0x8
 
 /* Update task and its cfs_rq load average */
 static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
@@ -4020,7 +4021,14 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	decayed  = update_cfs_rq_load_avg(now, cfs_rq);
 	decayed |= propagate_entity_load_avg(se);
 
-	if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
+	if (flags & DO_DETACH) {
+		/*
+		 * DO_DETACH means we're here from dequeue_entity()
+		 * and we are migrating task out of the CPU.
+		 */
+		detach_entity_load_avg(cfs_rq, se);
+		update_tg_load_avg(cfs_rq);
+	} else if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
 
 		/*
 		 * DO_ATTACH means we're here from enqueue_entity().
@@ -4292,6 +4300,7 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 #define UPDATE_TG	0x0
 #define SKIP_AGE_LOAD	0x0
 #define DO_ATTACH	0x0
+#define DO_DETACH	0x0
 
 static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
 {
@@ -4511,6 +4520,11 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
 static void
 dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
+	int action = UPDATE_TG;
+
+	if (entity_is_task(se) && task_on_rq_migrating(task_of(se)))
+		action |= DO_DETACH;
+
 	/*
 	 * Update run-time statistics of the 'current'.
 	 */
@@ -4524,7 +4538,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 *   - For group entity, update its weight to reflect the new share
 	 *     of its group cfs_rq.
 	 */
-	update_load_avg(cfs_rq, se, UPDATE_TG);
+	update_load_avg(cfs_rq, se, action);
 	se_update_runnable(se);
 
 	update_stats_dequeue_fair(cfs_rq, se, flags);
@@ -7076,8 +7090,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	return new_cpu;
 }
 
-static void detach_entity_cfs_rq(struct sched_entity *se);
-
 /*
  * Called immediately before a task is migrated to a new CPU; task_cpu(p) and
  * cfs_rq_of(p) references at time of call are still valid and identify the
@@ -7099,15 +7111,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 		se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);
 	}
 
-	if (p->on_rq == TASK_ON_RQ_MIGRATING) {
-		/*
-		 * In case of TASK_ON_RQ_MIGRATING we in fact hold the 'old'
-		 * rq->lock and can modify state directly.
-		 */
-		lockdep_assert_rq_held(task_rq(p));
-		detach_entity_cfs_rq(se);
-
-	} else {
+	if (!task_on_rq_migrating(p)) {
 		remove_entity_load_avg(se);
 
 		/*
-- 
2.36.1


* [PATCH 2/8] sched/fair: update comments in enqueue/dequeue_entity()
  2022-07-09 15:13 [PATCH 0/8] sched: task load tracking optimization and cleanup Chengming Zhou
  2022-07-09 15:13 ` [PATCH 1/8] sched/fair: combine detach into dequeue when migrating task Chengming Zhou
@ 2022-07-09 15:13 ` Chengming Zhou
  2022-07-09 15:13 ` [PATCH 3/8] sched/fair: remove redundant cpu_cgrp_subsys->fork() Chengming Zhou
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Chengming Zhou @ 2022-07-09 15:13 UTC (permalink / raw)
  To: mingo, peterz, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, vschneid
  Cc: linux-kernel, Chengming Zhou

When reading the sched_avg related code, I found that the comments in
enqueue/dequeue_entity() are out of date with respect to the current code.

We don't add/subtract the entity's runnable_avg to/from cfs_rq->runnable_avg
during enqueue/dequeue_entity(); that is done only on attach/detach.

This patch updates the comments to reflect how the current code works.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0689b94ed70b..2a3e12ead144 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4443,7 +4443,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	/*
 	 * When enqueuing a sched_entity, we must:
 	 *   - Update loads to have both entity and cfs_rq synced with now.
-	 *   - Add its load to cfs_rq->runnable_avg
+	 *   - For group_entity, update its runnable_weight to reflect the new
+	 *     h_nr_running of its group cfs_rq.
 	 *   - For group_entity, update its weight to reflect the new share of
 	 *     its group cfs_rq
 	 *   - Add its new weight to cfs_rq->load.weight
@@ -4533,7 +4534,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	/*
 	 * When dequeuing a sched_entity, we must:
 	 *   - Update loads to have both entity and cfs_rq synced with now.
-	 *   - Subtract its load from the cfs_rq->runnable_avg.
+	 *   - For group_entity, update its runnable_weight to reflect the new
+	 *     h_nr_running of its group cfs_rq.
 	 *   - Subtract its previous weight from cfs_rq->load.weight.
 	 *   - For group entity, update its weight to reflect the new share
 	 *     of its group cfs_rq.
-- 
2.36.1


* [PATCH 3/8] sched/fair: remove redundant cpu_cgrp_subsys->fork()
  2022-07-09 15:13 [PATCH 0/8] sched: task load tracking optimization and cleanup Chengming Zhou
  2022-07-09 15:13 ` [PATCH 1/8] sched/fair: combine detach into dequeue when migrating task Chengming Zhou
  2022-07-09 15:13 ` [PATCH 2/8] sched/fair: update comments in enqueue/dequeue_entity() Chengming Zhou
@ 2022-07-09 15:13 ` Chengming Zhou
  2022-07-11  7:35   ` Peter Zijlstra
  2022-07-09 15:13 ` [PATCH 4/8] sched/fair: reset sched_avg last_update_time before set_task_rq() Chengming Zhou
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 15+ messages in thread
From: Chengming Zhou @ 2022-07-09 15:13 UTC (permalink / raw)
  To: mingo, peterz, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, vschneid
  Cc: linux-kernel, Chengming Zhou

We use cpu_cgrp_subsys->fork() to set the task group for a newly forked
fair task in cgroup_post_fork().

Since commit b1e8206582f9 ("sched: Fix yet more sched_fork() races") already
sets the task group for the new task in sched_cgroup_fork(),
cpu_cgrp_subsys->fork() can be removed.

  cgroup_can_fork()	--> pin parent's sched_task_group
  sched_cgroup_fork()
    __set_task_cpu	--> set task group
  cgroup_post_fork()
    ss->fork() := cpu_cgroup_fork()	--> set again

After this change, task_change_group_fair() only needs to care about task
cgroup migration, which makes the code much simpler.

This patch also moves the task se depth setting into set_task_rq(), which
sets the correct depth for the new task's se in sched_cgroup_fork().

The se depth setting in attach_entity_cfs_rq() is removed, since
set_task_rq() is a better place to do this when a task moves across
CPUs/groups.
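
For illustration, with se->depth = se->parent ? se->parent->depth + 1 : 0
the depths work out as follows (an example hierarchy, not code from this
patch):

  group se of /A   (parent == NULL)     -> depth 0,  task in /A   -> depth 1
  group se of /A/B (parent == /A's se)  -> depth 1,  task in /A/B -> depth 2
  task in the root task group           -> depth 0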

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/core.c  | 27 ++++-----------------------
 kernel/sched/fair.c  | 31 +------------------------------
 kernel/sched/sched.h |  6 ++----
 3 files changed, 7 insertions(+), 57 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c215b5adc707..d85fdea51e3a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -481,8 +481,7 @@ sched_core_dequeue(struct rq *rq, struct task_struct *p, int flags) { }
  *				p->se.load, p->rt_priority,
  *				p->dl.dl_{runtime, deadline, period, flags, bw, density}
  *  - sched_setnuma():		p->numa_preferred_nid
- *  - sched_move_task()/
- *    cpu_cgroup_fork():	p->sched_task_group
+ *  - sched_move_task():	p->sched_task_group
  *  - uclamp_update_active()	p->uclamp*
  *
  * p->state <- TASK_*:
@@ -10127,7 +10126,7 @@ void sched_release_group(struct task_group *tg)
 	spin_unlock_irqrestore(&task_group_lock, flags);
 }
 
-static void sched_change_group(struct task_struct *tsk, int type)
+static void sched_change_group(struct task_struct *tsk)
 {
 	struct task_group *tg;
 
@@ -10143,7 +10142,7 @@ static void sched_change_group(struct task_struct *tsk, int type)
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	if (tsk->sched_class->task_change_group)
-		tsk->sched_class->task_change_group(tsk, type);
+		tsk->sched_class->task_change_group(tsk);
 	else
 #endif
 		set_task_rq(tsk, task_cpu(tsk));
@@ -10174,7 +10173,7 @@ void sched_move_task(struct task_struct *tsk)
 	if (running)
 		put_prev_task(rq, tsk);
 
-	sched_change_group(tsk, TASK_MOVE_GROUP);
+	sched_change_group(tsk);
 
 	if (queued)
 		enqueue_task(rq, tsk, queue_flags);
@@ -10252,23 +10251,6 @@ static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
 	sched_unregister_group(tg);
 }
 
-/*
- * This is called before wake_up_new_task(), therefore we really only
- * have to set its group bits, all the other stuff does not apply.
- */
-static void cpu_cgroup_fork(struct task_struct *task)
-{
-	struct rq_flags rf;
-	struct rq *rq;
-
-	rq = task_rq_lock(task, &rf);
-
-	update_rq_clock(rq);
-	sched_change_group(task, TASK_SET_GROUP);
-
-	task_rq_unlock(rq, task, &rf);
-}
-
 static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
 {
 	struct task_struct *task;
@@ -11134,7 +11116,6 @@ struct cgroup_subsys cpu_cgrp_subsys = {
 	.css_released	= cpu_cgroup_css_released,
 	.css_free	= cpu_cgroup_css_free,
 	.css_extra_stat_show = cpu_extra_stat_show,
-	.fork		= cpu_cgroup_fork,
 	.can_attach	= cpu_cgroup_can_attach,
 	.attach		= cpu_cgroup_attach,
 	.legacy_cftypes	= cpu_legacy_files,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2a3e12ead144..8992ce5e73d2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11539,14 +11539,6 @@ static void attach_entity_cfs_rq(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
-#ifdef CONFIG_FAIR_GROUP_SCHED
-	/*
-	 * Since the real-depth could have been changed (only FAIR
-	 * class maintain depth value), reset depth properly.
-	 */
-	se->depth = se->parent ? se->parent->depth + 1 : 0;
-#endif
-
 	/* Synchronize entity with its cfs_rq */
 	update_load_avg(cfs_rq, se, sched_feat(ATTACH_AGE_LOAD) ? 0 : SKIP_AGE_LOAD);
 	attach_entity_load_avg(cfs_rq, se);
@@ -11642,15 +11634,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-static void task_set_group_fair(struct task_struct *p)
-{
-	struct sched_entity *se = &p->se;
-
-	set_task_rq(p, task_cpu(p));
-	se->depth = se->parent ? se->parent->depth + 1 : 0;
-}
-
-static void task_move_group_fair(struct task_struct *p)
+static void task_change_group_fair(struct task_struct *p)
 {
 	detach_task_cfs_rq(p);
 	set_task_rq(p, task_cpu(p));
@@ -11662,19 +11646,6 @@ static void task_move_group_fair(struct task_struct *p)
 	attach_task_cfs_rq(p);
 }
 
-static void task_change_group_fair(struct task_struct *p, int type)
-{
-	switch (type) {
-	case TASK_SET_GROUP:
-		task_set_group_fair(p);
-		break;
-
-	case TASK_MOVE_GROUP:
-		task_move_group_fair(p);
-		break;
-	}
-}
-
 void free_fair_sched_group(struct task_group *tg)
 {
 	int i;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index aad7f5ee9666..19e0076e4245 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1940,6 +1940,7 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
 	set_task_rq_fair(&p->se, p->se.cfs_rq, tg->cfs_rq[cpu]);
 	p->se.cfs_rq = tg->cfs_rq[cpu];
 	p->se.parent = tg->se[cpu];
+	p->se.depth = tg->se[cpu] ? tg->se[cpu]->depth + 1 : 0;
 #endif
 
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -2202,11 +2203,8 @@ struct sched_class {
 
 	void (*update_curr)(struct rq *rq);
 
-#define TASK_SET_GROUP		0
-#define TASK_MOVE_GROUP		1
-
 #ifdef CONFIG_FAIR_GROUP_SCHED
-	void (*task_change_group)(struct task_struct *p, int type);
+	void (*task_change_group)(struct task_struct *p);
 #endif
 };
 
-- 
2.36.1


* [PATCH 4/8] sched/fair: reset sched_avg last_update_time before set_task_rq()
  2022-07-09 15:13 [PATCH 0/8] sched: task load tracking optimization and cleanup Chengming Zhou
                   ` (2 preceding siblings ...)
  2022-07-09 15:13 ` [PATCH 3/8] sched/fair: remove redundant cpu_cgrp_subsys->fork() Chengming Zhou
@ 2022-07-09 15:13 ` Chengming Zhou
  2022-07-09 15:13 ` [PATCH 5/8] sched/fair: fix load tracking for new forked !fair task Chengming Zhou
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Chengming Zhou @ 2022-07-09 15:13 UTC (permalink / raw)
  To: mingo, peterz, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, vschneid
  Cc: linux-kernel, Chengming Zhou

set_task_rq() -> set_task_rq_fair() will try to synchronize the blocked
task's sched_avg on migration, which is not needed for an already detached
task.

task_change_group_fair() detaches the task's sched_avg from the prev cfs_rq
first, so reset the sched_avg last_update_time before set_task_rq() to avoid
that redundant sync.
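
The resulting order is roughly (a sketch of task_change_group_fair() after
this patch, not the exact code):

  task_change_group_fair(p)
    detach_task_cfs_rq(p)             /* sched_avg detached from prev cfs_rq */
    p->se.avg.last_update_time = 0    /* now done before set_task_rq() */
    set_task_rq(p, task_cpu(p))       /* set_task_rq_fair() sees 0 and skips the sync */
    attach_task_cfs_rq(p)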

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8992ce5e73d2..171bc22bc142 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11637,12 +11637,12 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
 static void task_change_group_fair(struct task_struct *p)
 {
 	detach_task_cfs_rq(p);
-	set_task_rq(p, task_cpu(p));
 
 #ifdef CONFIG_SMP
 	/* Tell se's cfs_rq has been changed -- migrated */
 	p->se.avg.last_update_time = 0;
 #endif
+	set_task_rq(p, task_cpu(p));
 	attach_task_cfs_rq(p);
 }
 
-- 
2.36.1


* [PATCH 5/8] sched/fair: fix load tracking for new forked !fair task
  2022-07-09 15:13 [PATCH 0/8] sched: task load tracking optimization and cleanup Chengming Zhou
                   ` (3 preceding siblings ...)
  2022-07-09 15:13 ` [PATCH 4/8] sched/fair: reset sched_avg last_update_time before set_task_rq() Chengming Zhou
@ 2022-07-09 15:13 ` Chengming Zhou
  2022-07-09 15:13 ` [PATCH 6/8] sched/fair: stop load tracking when task switched_from_fair() Chengming Zhou
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Chengming Zhou @ 2022-07-09 15:13 UTC (permalink / raw)
  To: mingo, peterz, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, vschneid
  Cc: linux-kernel, Chengming Zhou

A newly forked !fair task currently sets its sched_avg last_update_time to
the PELT clock of the cfs_rq; then, some time later, in switched_to_fair():

switched_to_fair
  attach_task_cfs_rq
    attach_entity_cfs_rq
      update_load_avg
        __update_load_avg_se(now, cfs_rq, se)

the delta (now - sa->last_update_time) will contribute to/decay the
sched_avg, depending on the task's running/runnable status at that time.

This patch doesn't set the sched_avg last_update_time of the newly forked
!fair task, leaving it at 0. So later, in update_load_avg(), we don't
contribute/decay the wrong delta (now - sa->last_update_time).
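
For reference, the aging in update_load_avg() looks roughly like this (a
sketch, not the exact code):

  if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
          __update_load_avg_se(now, cfs_rq, se);

With last_update_time left at 0, this aging is simply skipped, so the
post-init util/runnable values are attached unchanged by the later
attach_entity_load_avg().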

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 kernel/sched/fair.c | 18 ++----------------
 1 file changed, 2 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 171bc22bc142..153a2c6c1069 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -849,22 +849,8 @@ void post_init_entity_util_avg(struct task_struct *p)
 
 	sa->runnable_avg = sa->util_avg;
 
-	if (p->sched_class != &fair_sched_class) {
-		/*
-		 * For !fair tasks do:
-		 *
-		update_cfs_rq_load_avg(now, cfs_rq);
-		attach_entity_load_avg(cfs_rq, se);
-		switched_from_fair(rq, p);
-		 *
-		 * such that the next switched_to_fair() has the
-		 * expected state.
-		 */
-		se->avg.last_update_time = cfs_rq_clock_pelt(cfs_rq);
-		return;
-	}
-
-	attach_entity_cfs_rq(se);
+	if (p->sched_class == &fair_sched_class)
+		attach_entity_cfs_rq(se);
 }
 
 #else /* !CONFIG_SMP */
-- 
2.36.1


* [PATCH 6/8] sched/fair: stop load tracking when task switched_from_fair()
  2022-07-09 15:13 [PATCH 0/8] sched: task load tracking optimization and cleanup Chengming Zhou
                   ` (4 preceding siblings ...)
  2022-07-09 15:13 ` [PATCH 5/8] sched/fair: fix load tracking for new forked !fair task Chengming Zhou
@ 2022-07-09 15:13 ` Chengming Zhou
  2022-07-09 15:13 ` [PATCH 7/8] sched/fair: delete superfluous set_task_rq_fair() Chengming Zhou
  2022-07-09 15:13 ` [PATCH 8/8] sched/fair: delete superfluous SKIP_AGE_LOAD Chengming Zhou
  7 siblings, 0 replies; 15+ messages in thread
From: Chengming Zhou @ 2022-07-09 15:13 UTC (permalink / raw)
  To: mingo, peterz, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, vschneid
  Cc: linux-kernel, Chengming Zhou

For the same reason as the previous commit: if we don't reset the sched_avg
last_update_time to 0, then some time later, in switched_to_fair():

switched_to_fair
  attach_task_cfs_rq
    attach_entity_cfs_rq
      update_load_avg
        __update_load_avg_se(now, cfs_rq, se)

the delta (now - sa->last_update_time) will wrongly contribute to/decay the
sched_avg, depending on the task's running/runnable status at that time.

This patch resets the task's sched_avg last_update_time to 0 to stop load
tracking for the !fair task; later, in switched_to_fair() ->
update_load_avg(), we can use its saved sched_avg.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 kernel/sched/fair.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 153a2c6c1069..ca714eedeec5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11563,6 +11563,11 @@ static void attach_task_cfs_rq(struct task_struct *p)
 static void switched_from_fair(struct rq *rq, struct task_struct *p)
 {
 	detach_task_cfs_rq(p);
+
+#ifdef CONFIG_SMP
+	/* Stop load tracking for !fair task */
+	p->se.avg.last_update_time = 0;
+#endif
 }
 
 static void switched_to_fair(struct rq *rq, struct task_struct *p)
-- 
2.36.1


* [PATCH 7/8] sched/fair: delete superfluous set_task_rq_fair()
  2022-07-09 15:13 [PATCH 0/8] sched: task load tracking optimization and cleanup Chengming Zhou
                   ` (5 preceding siblings ...)
  2022-07-09 15:13 ` [PATCH 6/8] sched/fair: stop load tracking when task switched_from_fair() Chengming Zhou
@ 2022-07-09 15:13 ` Chengming Zhou
  2022-07-09 15:13 ` [PATCH 8/8] sched/fair: delete superfluous SKIP_AGE_LOAD Chengming Zhou
  7 siblings, 0 replies; 15+ messages in thread
From: Chengming Zhou @ 2022-07-09 15:13 UTC (permalink / raw)
  To: mingo, peterz, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, vschneid
  Cc: linux-kernel, Chengming Zhou

set_task_rq() is used when a task moves across CPUs/groups to change its
cfs_rq and parent entity, and it calls set_task_rq_fair() to sync the
blocked task's load_avg just before changing its cfs_rq.

1. task migrates CPU: it is detached/removed from the prev cfs_rq and its
   sched_avg last_update_time is reset to 0, so there is no need to sync again.

2. task migrates cgroup: it is detached from the prev cfs_rq and its
   sched_avg last_update_time is reset to 0, so there is no need to sync either.

3. !fair task migrates CPU/cgroup: we stop load tracking for !fair tasks and
   reset the sched_avg last_update_time to 0 in switched_from_fair(), so there
   is no need to sync here either.

So set_task_rq_fair() is not needed anymore; this patch deletes it.
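
To summarize where last_update_time is reset to 0 for each case after this
series (a rough sketch based on the surrounding code):

  1. CPU migrate:    migrate_task_rq_fair()    ->  se->avg.last_update_time = 0
  2. cgroup migrate: task_change_group_fair()  ->  p->se.avg.last_update_time = 0
  3. !fair task:     switched_from_fair()      ->  p->se.avg.last_update_time = 0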

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 kernel/sched/fair.c  | 31 -------------------------------
 kernel/sched/sched.h |  8 --------
 2 files changed, 39 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ca714eedeec5..b0bde895ba96 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3430,37 +3430,6 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 	}
 }
 
-/*
- * Called within set_task_rq() right before setting a task's CPU. The
- * caller only guarantees p->pi_lock is held; no other assumptions,
- * including the state of rq->lock, should be made.
- */
-void set_task_rq_fair(struct sched_entity *se,
-		      struct cfs_rq *prev, struct cfs_rq *next)
-{
-	u64 p_last_update_time;
-	u64 n_last_update_time;
-
-	if (!sched_feat(ATTACH_AGE_LOAD))
-		return;
-
-	/*
-	 * We are supposed to update the task to "current" time, then its up to
-	 * date and ready to go to new CPU/cfs_rq. But we have difficulty in
-	 * getting what current time is, so simply throw away the out-of-date
-	 * time. This will result in the wakee task is less decayed, but giving
-	 * the wakee more load sounds not bad.
-	 */
-	if (!(se->avg.last_update_time && prev))
-		return;
-
-	p_last_update_time = cfs_rq_last_update_time(prev);
-	n_last_update_time = cfs_rq_last_update_time(next);
-
-	__update_load_avg_blocked_se(p_last_update_time, se);
-	se->avg.last_update_time = n_last_update_time;
-}
-
 /*
  * When on migration a sched_entity joins/leaves the PELT hierarchy, we need to
  * propagate its contribution. The key to this propagation is the invariant
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 19e0076e4245..a8ec7af4bd51 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -505,13 +505,6 @@ extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
 
 extern int sched_group_set_idle(struct task_group *tg, long idle);
 
-#ifdef CONFIG_SMP
-extern void set_task_rq_fair(struct sched_entity *se,
-			     struct cfs_rq *prev, struct cfs_rq *next);
-#else /* !CONFIG_SMP */
-static inline void set_task_rq_fair(struct sched_entity *se,
-			     struct cfs_rq *prev, struct cfs_rq *next) { }
-#endif /* CONFIG_SMP */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 #else /* CONFIG_CGROUP_SCHED */
@@ -1937,7 +1930,6 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
 #endif
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-	set_task_rq_fair(&p->se, p->se.cfs_rq, tg->cfs_rq[cpu]);
 	p->se.cfs_rq = tg->cfs_rq[cpu];
 	p->se.parent = tg->se[cpu];
 	p->se.depth = tg->se[cpu] ? tg->se[cpu]->depth + 1 : 0;
-- 
2.36.1


* [PATCH 8/8] sched/fair: delete superfluous SKIP_AGE_LOAD
  2022-07-09 15:13 [PATCH 0/8] sched: task load tracking optimization and cleanup Chengming Zhou
                   ` (6 preceding siblings ...)
  2022-07-09 15:13 ` [PATCH 7/8] sched/fair: delete superfluous set_task_rq_fair() Chengming Zhou
@ 2022-07-09 15:13 ` Chengming Zhou
  2022-07-11 16:54   ` Chengming Zhou
  7 siblings, 1 reply; 15+ messages in thread
From: Chengming Zhou @ 2022-07-09 15:13 UTC (permalink / raw)
  To: mingo, peterz, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, vschneid
  Cc: linux-kernel, Chengming Zhou

All three attach_entity_cfs_rq() cases:

1. task migrates CPU
2. task migrates cgroup
3. task switched to fair

have their sched_avg last_update_time already reset to 0 by the time
attach_entity_cfs_rq() -> update_load_avg() runs, so it makes no difference
whether SKIP_AGE_LOAD is set or not.

This patch deletes the superfluous SKIP_AGE_LOAD, together with the now
unused ATTACH_AGE_LOAD feature.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 kernel/sched/fair.c     | 18 ++++++------------
 kernel/sched/features.h |  1 -
 2 files changed, 6 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b0bde895ba96..b91643a2143e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3956,9 +3956,8 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
  * Optional action to be done while updating the load average
  */
 #define UPDATE_TG	0x1
-#define SKIP_AGE_LOAD	0x2
-#define DO_ATTACH	0x4
-#define DO_DETACH	0x8
+#define DO_ATTACH	0x2
+#define DO_DETACH	0x4
 
 /* Update task and its cfs_rq load average */
 static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
@@ -3970,7 +3969,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	 * Track task load average for carrying it to new CPU after migrated, and
 	 * track group sched_entity load average for task_h_load calc in migration
 	 */
-	if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
+	if (se->avg.last_update_time)
 		__update_load_avg_se(now, cfs_rq, se);
 
 	decayed  = update_cfs_rq_load_avg(now, cfs_rq);
@@ -4253,7 +4252,6 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 }
 
 #define UPDATE_TG	0x0
-#define SKIP_AGE_LOAD	0x0
 #define DO_ATTACH	0x0
 #define DO_DETACH	0x0
 
@@ -11484,9 +11482,7 @@ static void detach_entity_cfs_rq(struct sched_entity *se)
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
 	/* Catch up with the cfs_rq and remove our load when we leave */
-	update_load_avg(cfs_rq, se, 0);
-	detach_entity_load_avg(cfs_rq, se);
-	update_tg_load_avg(cfs_rq);
+	update_load_avg(cfs_rq, se, UPDATE_TG | DO_DETACH);
 	propagate_entity_cfs_rq(se);
 }
 
@@ -11494,10 +11490,8 @@ static void attach_entity_cfs_rq(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
-	/* Synchronize entity with its cfs_rq */
-	update_load_avg(cfs_rq, se, sched_feat(ATTACH_AGE_LOAD) ? 0 : SKIP_AGE_LOAD);
-	attach_entity_load_avg(cfs_rq, se);
-	update_tg_load_avg(cfs_rq);
+	/* Synchronize entity with its cfs_rq and attach our load */
+	update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
 	propagate_entity_cfs_rq(se);
 }
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..fb92431d496f 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -85,7 +85,6 @@ SCHED_FEAT(RT_PUSH_IPI, true)
 
 SCHED_FEAT(RT_RUNTIME_SHARE, false)
 SCHED_FEAT(LB_MIN, false)
-SCHED_FEAT(ATTACH_AGE_LOAD, true)
 
 SCHED_FEAT(WA_IDLE, true)
 SCHED_FEAT(WA_WEIGHT, true)
-- 
2.36.1


* Re: [PATCH 3/8] sched/fair: remove redundant cpu_cgrp_subsys->fork()
  2022-07-09 15:13 ` [PATCH 3/8] sched/fair: remove redundant cpu_cgrp_subsys->fork() Chengming Zhou
@ 2022-07-11  7:35   ` Peter Zijlstra
  2022-07-11 13:02     ` [External] " Chengming Zhou
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2022-07-11  7:35 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: mingo, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	vschneid, linux-kernel

On Sat, Jul 09, 2022 at 11:13:48PM +0800, Chengming Zhou wrote:
> We use cpu_cgrp_subsys->fork() to set task group for the new fair task
> in cgroup_post_fork().
> 
> Since commit b1e8206582f9 ("sched: Fix yet more sched_fork() races")
> has already set task group for the new fair task in sched_cgroup_fork(),
> so cpu_cgrp_subsys->fork() can be removed.
> 
>   cgroup_can_fork()	--> pin parent's sched_task_group
>   sched_cgroup_fork()
>     __set_task_cpu	--> set task group
>   cgroup_post_fork()
>     ss->fork() := cpu_cgroup_fork()	--> set again
> 
> After this patch's change, task_change_group_fair() only need to
> care about task cgroup migration, make the code much simplier.

This:

> This patch also move the task se depth setting to set_task_rq(), which
> will set correct depth for the new task se in sched_cgroup_fork().
> 
> The se depth setting in attach_entity_cfs_rq() is removed since
> set_task_rq() is a better place to do this when task moves across
> CPUs/groups.

really should have been its own patch. And this actually scares me. Did
you test with priority inheritance bumping the task to FIFO while things
change?

This has nothing to do with fork().

* Re: [External] Re: [PATCH 3/8] sched/fair: remove redundant cpu_cgrp_subsys->fork()
  2022-07-11  7:35   ` Peter Zijlstra
@ 2022-07-11 13:02     ` Chengming Zhou
  2022-07-11 13:53       ` Peter Zijlstra
  0 siblings, 1 reply; 15+ messages in thread
From: Chengming Zhou @ 2022-07-11 13:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	vschneid, linux-kernel

On 2022/7/11 15:35, Peter Zijlstra wrote:
> On Sat, Jul 09, 2022 at 11:13:48PM +0800, Chengming Zhou wrote:
>> We use cpu_cgrp_subsys->fork() to set task group for the new fair task
>> in cgroup_post_fork().
>>
>> Since commit b1e8206582f9 ("sched: Fix yet more sched_fork() races")
>> has already set task group for the new fair task in sched_cgroup_fork(),
>> so cpu_cgrp_subsys->fork() can be removed.
>>
>>   cgroup_can_fork()	--> pin parent's sched_task_group
>>   sched_cgroup_fork()
>>     __set_task_cpu	--> set task group
>>   cgroup_post_fork()
>>     ss->fork() := cpu_cgroup_fork()	--> set again
>>
>> After this patch's change, task_change_group_fair() only need to
>> care about task cgroup migration, make the code much simplier.
> 
> This:
> 
>> This patch also move the task se depth setting to set_task_rq(), which
>> will set correct depth for the new task se in sched_cgroup_fork().
>>
>> The se depth setting in attach_entity_cfs_rq() is removed since
>> set_task_rq() is a better place to do this when task moves across
>> CPUs/groups.
> 
> really should have been it's own patch. And this actually scares me. Did
> you test with priority inheritance bumping the task to FIFO while things
> change?
> 
> This has nothing to do with fork().

Ok, I will put this in another patch, so this patch still needs this line:

  p->se.depth = tg->se[cpu] ? tg->se[cpu]->depth + 1 : 0;

in set_task_rq() to set the depth for the newly forked task.


I didn't test the "priority inheritance bumping the task to FIFO" case;
do you mean rt_mutex_setprio() bumping a fair task to FIFO?

Sorry, I don't get how removing the depth setting in attach_entity_cfs_rq()
affects that. Could you explain more so I can test it?

Thanks.


* Re: [PATCH 1/8] sched/fair: combine detach into dequeue when migrating task
  2022-07-09 15:13 ` [PATCH 1/8] sched/fair: combine detach into dequeue when migrating task Chengming Zhou
@ 2022-07-11 13:07   ` Chengming Zhou
  0 siblings, 0 replies; 15+ messages in thread
From: Chengming Zhou @ 2022-07-11 13:07 UTC (permalink / raw)
  To: Chengming Zhou, mingo, peterz, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, vschneid
  Cc: linux-kernel

On 2022/7/9 23:13, Chengming Zhou wrote:
> When we are migrating task out of the CPU, we can combine detach and
> propagation into dequeue_entity() to save the detach_entity_cfs_rq()
> in migrate_task_rq_fair().
> 
> This optimization is like combining DO_ATTACH in the enqueue_entity()
> when migrating task to the CPU.
> 
> So we don't have to traverse the CFS tree extra time to do the
> detach_entity_cfs_rq() -> propagate_entity_cfs_rq() call, which
> wouldn't be called anymore with this patch's change.
> 
> detach_task()
>   deactivate_task()
>     dequeue_task_fair()
>       for_each_sched_entity(se)
>         dequeue_entity()
>           update_load_avg() /* (1) */
>             detach_entity_load_avg()
> 
>   set_task_cpu()
>     migrate_task_rq_fair()
>       detach_entity_cfs_rq() /* (2) */
>         update_load_avg();
>         detach_entity_load_avg();
>         propagate_entity_cfs_rq();
>           for_each_sched_entity()
>             update_load_avg()
> 
> This patch save the detach_entity_cfs_rq() called in (2) by doing
> the detach_entity_load_avg() for a CPU migrating task inside (1)
> (the task being the first se in the loop)
> 
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 30 +++++++++++++++++-------------
>  1 file changed, 17 insertions(+), 13 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a78d2e3b9d49..0689b94ed70b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4003,6 +4003,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>  #define UPDATE_TG	0x1
>  #define SKIP_AGE_LOAD	0x2
>  #define DO_ATTACH	0x4
> +#define DO_DETACH	0x8
>  
>  /* Update task and its cfs_rq load average */
>  static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> @@ -4020,7 +4021,14 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>  	decayed  = update_cfs_rq_load_avg(now, cfs_rq);
>  	decayed |= propagate_entity_load_avg(se);
>  
> -	if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
> +	if (flags & DO_DETACH) {
> +		/*
> +		 * DO_DETACH means we're here from dequeue_entity()
> +		 * and we are migrating task out of the CPU.
> +		 */
> +		detach_entity_load_avg(cfs_rq, se);
> +		update_tg_load_avg(cfs_rq);
> +	} else if (!se->avg.last_update_time && (flags & DO_ATTACH)) {

Should I put the "DO_DETACH" case after the "DO_ATTACH" case? The "DO_ATTACH" case is probably more likely, right?

Thanks.

>  
>  		/*
>  		 * DO_ATTACH means we're here from enqueue_entity().
> @@ -4292,6 +4300,7 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>  #define UPDATE_TG	0x0
>  #define SKIP_AGE_LOAD	0x0
>  #define DO_ATTACH	0x0
> +#define DO_DETACH	0x0
>  
>  static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
>  {
> @@ -4511,6 +4520,11 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
>  static void
>  dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  {
> +	int action = UPDATE_TG;
> +
> +	if (entity_is_task(se) && task_on_rq_migrating(task_of(se)))
> +		action |= DO_DETACH;
> +
>  	/*
>  	 * Update run-time statistics of the 'current'.
>  	 */
> @@ -4524,7 +4538,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  	 *   - For group entity, update its weight to reflect the new share
>  	 *     of its group cfs_rq.
>  	 */
> -	update_load_avg(cfs_rq, se, UPDATE_TG);
> +	update_load_avg(cfs_rq, se, action);
>  	se_update_runnable(se);
>  
>  	update_stats_dequeue_fair(cfs_rq, se, flags);
> @@ -7076,8 +7090,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>  	return new_cpu;
>  }
>  
> -static void detach_entity_cfs_rq(struct sched_entity *se);
> -
>  /*
>   * Called immediately before a task is migrated to a new CPU; task_cpu(p) and
>   * cfs_rq_of(p) references at time of call are still valid and identify the
> @@ -7099,15 +7111,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
>  		se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);
>  	}
>  
> -	if (p->on_rq == TASK_ON_RQ_MIGRATING) {
> -		/*
> -		 * In case of TASK_ON_RQ_MIGRATING we in fact hold the 'old'
> -		 * rq->lock and can modify state directly.
> -		 */
> -		lockdep_assert_rq_held(task_rq(p));
> -		detach_entity_cfs_rq(se);
> -
> -	} else {
> +	if (!task_on_rq_migrating(p)) {
>  		remove_entity_load_avg(se);
>  
>  		/*

* Re: [External] Re: [PATCH 3/8] sched/fair: remove redundant cpu_cgrp_subsys->fork()
  2022-07-11 13:02     ` [External] " Chengming Zhou
@ 2022-07-11 13:53       ` Peter Zijlstra
  2022-07-11 16:25         ` Chengming Zhou
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2022-07-11 13:53 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: mingo, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	vschneid, linux-kernel

On Mon, Jul 11, 2022 at 09:02:07PM +0800, Chengming Zhou wrote:
> On 2022/7/11 15:35, Peter Zijlstra wrote:
> > On Sat, Jul 09, 2022 at 11:13:48PM +0800, Chengming Zhou wrote:
> >> We use cpu_cgrp_subsys->fork() to set task group for the new fair task
> >> in cgroup_post_fork().
> >>
> >> Since commit b1e8206582f9 ("sched: Fix yet more sched_fork() races")
> >> has already set task group for the new fair task in sched_cgroup_fork(),
> >> so cpu_cgrp_subsys->fork() can be removed.
> >>
> >>   cgroup_can_fork()	--> pin parent's sched_task_group
> >>   sched_cgroup_fork()
> >>     __set_task_cpu	--> set task group
> >>   cgroup_post_fork()
> >>     ss->fork() := cpu_cgroup_fork()	--> set again
> >>
> >> After this patch's change, task_change_group_fair() only need to
> >> care about task cgroup migration, make the code much simplier.
> > 
> > This:
> > 
> >> This patch also move the task se depth setting to set_task_rq(), which
> >> will set correct depth for the new task se in sched_cgroup_fork().
> >>
> >> The se depth setting in attach_entity_cfs_rq() is removed since
> >> set_task_rq() is a better place to do this when task moves across
> >> CPUs/groups.
> > 
> > really should have been it's own patch. And this actually scares me. Did
> > you test with priority inheritance bumping the task to FIFO while things
> > change?
> > 
> > This has nothing to do with fork().
> 
> Ok, will put this in another patch, so this patch still need this line:
> 
>   p->se.depth = tg->se[cpu] ? tg->se[cpu]->depth + 1 : 0;
> 
> in set_task_rq() to set depth for new forked task.

That would suggest you ordered your patches wrong.

> I didn't test with "priority inheritance bumping the task to FIFO" case,
> do you mean the rt_mutex_setprio() bump a fair task to FIFO?
> 
> Sorry, I don't get how removing depth setting in attach_entity_cfs_rq()
> affect that. Could you explain more so I can test it?

Well, if you look at the commit that introduced that code:

  eb7a59b2c888 ("sched/fair: Reset se-depth when task switched to FAIR")

then it's clear that the original problem was the task temporarily not
being in the fair class. The most common way for that to be so is
through PI.


* Re: [External] Re: [PATCH 3/8] sched/fair: remove redundant cpu_cgrp_subsys->fork()
  2022-07-11 13:53       ` Peter Zijlstra
@ 2022-07-11 16:25         ` Chengming Zhou
  0 siblings, 0 replies; 15+ messages in thread
From: Chengming Zhou @ 2022-07-11 16:25 UTC (permalink / raw)
  To: Peter Zijlstra, Chengming Zhou
  Cc: mingo, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	vschneid, linux-kernel

On 2022/7/11 21:53, Peter Zijlstra wrote:
> On Mon, Jul 11, 2022 at 09:02:07PM +0800, Chengming Zhou wrote:
>> On 2022/7/11 15:35, Peter Zijlstra wrote:
>>> On Sat, Jul 09, 2022 at 11:13:48PM +0800, Chengming Zhou wrote:
>>>> We use cpu_cgrp_subsys->fork() to set task group for the new fair task
>>>> in cgroup_post_fork().
>>>>
>>>> Since commit b1e8206582f9 ("sched: Fix yet more sched_fork() races")
>>>> has already set task group for the new fair task in sched_cgroup_fork(),
>>>> so cpu_cgrp_subsys->fork() can be removed.
>>>>
>>>>   cgroup_can_fork()	--> pin parent's sched_task_group
>>>>   sched_cgroup_fork()
>>>>     __set_task_cpu	--> set task group
>>>>   cgroup_post_fork()
>>>>     ss->fork() := cpu_cgroup_fork()	--> set again
>>>>
>>>> After this patch's change, task_change_group_fair() only need to
>>>> care about task cgroup migration, make the code much simplier.
>>>
>>> This:
>>>
>>>> This patch also move the task se depth setting to set_task_rq(), which
>>>> will set correct depth for the new task se in sched_cgroup_fork().
>>>>
>>>> The se depth setting in attach_entity_cfs_rq() is removed since
>>>> set_task_rq() is a better place to do this when task moves across
>>>> CPUs/groups.
>>>
>>> really should have been it's own patch. And this actually scares me. Did
>>> you test with priority inheritance bumping the task to FIFO while things
>>> change?
>>>
>>> This has nothing to do with fork().
>>
>> Ok, will put this in another patch, so this patch still need this line:
>>
>>   p->se.depth = tg->se[cpu] ? tg->se[cpu]->depth + 1 : 0;
>>
>> in set_task_rq() to set depth for new forked task.
> 
> That would suggest you ordered your patches wrong.

Yeah, I understand now, I should have put this se depth change in an earlier patch.

> 
>> I didn't test with "priority inheritance bumping the task to FIFO" case,
>> do you mean the rt_mutex_setprio() bump a fair task to FIFO?
>>
>> Sorry, I don't get how removing depth setting in attach_entity_cfs_rq()
>> affect that. Could you explain more so I can test it?
> 
> Well, if you look at the commit that introduced that code:
> 
>   eb7a59b2c888 ("sched/fair: Reset se-depth when task switched to FAIR")
> 
> then it's clear that the original problem was the task temporarily not
> being in the fair class. The most common way for that to be so is
> through PI.

Thanks for the hint, I read that commit; there should be no problem if we
set the se depth in set_task_rq().

The problem there was that we only maintained the task se depth in
task_move_group_fair(), so if a !fair task moved task groups its se depth
was not updated; that commit fixed the problem by updating the se depth in
switched_to_fair().

In this patch we maintain the se depth in set_task_rq(), which is called
whenever the CPU/cgroup changes, so the depth will always be correct. So it's
ok to remove the task se depth setting in attach_entity_cfs_rq().

Thanks.


> 

* Re: [PATCH 8/8] sched/fair: delete superfluous SKIP_AGE_LOAD
  2022-07-09 15:13 ` [PATCH 8/8] sched/fair: delete superfluous SKIP_AGE_LOAD Chengming Zhou
@ 2022-07-11 16:54   ` Chengming Zhou
  0 siblings, 0 replies; 15+ messages in thread
From: Chengming Zhou @ 2022-07-11 16:54 UTC (permalink / raw)
  To: mingo, peterz, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, vschneid
  Cc: linux-kernel

On 2022/7/9 23:13, Chengming Zhou wrote:
> All three attach_entity_cfs_rq() types:
> 
> 1. task migrate CPU
> 2. task migrate cgroup
> 3. task switched to fair
> 
> have its sched_avg last_update_time reset to 0 when
> attach_entity_cfs_rq() -> update_load_avg(), so it makes
> no difference whether SKIP_AGE_LOAD is set or not.
> 
> This patch delete the superfluous SKIP_AGE_LOAD, and the unused
> feature ATTACH_AGE_LOAD together.
> 
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> ---
>  kernel/sched/fair.c     | 18 ++++++------------
>  kernel/sched/features.h |  1 -
>  2 files changed, 6 insertions(+), 13 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b0bde895ba96..b91643a2143e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3956,9 +3956,8 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>   * Optional action to be done while updating the load average
>   */
>  #define UPDATE_TG	0x1
> -#define SKIP_AGE_LOAD	0x2
> -#define DO_ATTACH	0x4
> -#define DO_DETACH	0x8
> +#define DO_ATTACH	0x2
> +#define DO_DETACH	0x4
>  
>  /* Update task and its cfs_rq load average */
>  static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> @@ -3970,7 +3969,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>  	 * Track task load average for carrying it to new CPU after migrated, and
>  	 * track group sched_entity load average for task_h_load calc in migration
>  	 */
> -	if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
> +	if (se->avg.last_update_time)
>  		__update_load_avg_se(now, cfs_rq, se);
>  
>  	decayed  = update_cfs_rq_load_avg(now, cfs_rq);
> @@ -4253,7 +4252,6 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>  }
>  
>  #define UPDATE_TG	0x0
> -#define SKIP_AGE_LOAD	0x0
>  #define DO_ATTACH	0x0
>  #define DO_DETACH	0x0
>  
> @@ -11484,9 +11482,7 @@ static void detach_entity_cfs_rq(struct sched_entity *se)
>  	struct cfs_rq *cfs_rq = cfs_rq_of(se);
>  
>  	/* Catch up with the cfs_rq and remove our load when we leave */
> -	update_load_avg(cfs_rq, se, 0);
> -	detach_entity_load_avg(cfs_rq, se);
> -	update_tg_load_avg(cfs_rq);
> +	update_load_avg(cfs_rq, se, UPDATE_TG | DO_DETACH);
>  	propagate_entity_cfs_rq(se);
>  }
>  
> @@ -11494,10 +11490,8 @@ static void attach_entity_cfs_rq(struct sched_entity *se)
>  {
>  	struct cfs_rq *cfs_rq = cfs_rq_of(se);
>  
> -	/* Synchronize entity with its cfs_rq */
> -	update_load_avg(cfs_rq, se, sched_feat(ATTACH_AGE_LOAD) ? 0 : SKIP_AGE_LOAD);
> -	attach_entity_load_avg(cfs_rq, se);
> -	update_tg_load_avg(cfs_rq);
> +	/* Synchronize entity with its cfs_rq and attach our load */
> +	update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
>  	propagate_entity_cfs_rq(se);
>  }

Looks like I ordered this change wrong; it should be put before patches 5-6,

because this change checks last_update_time before attach_entity_load_avg()
in update_load_avg(), instead of the unconditional attach_entity_load_avg() here.


Problem case: switch to fair class

p->sched_class = fair_class;  --> p->se.avg.last_update_time = 0
if (queued)
  enqueue_task(p);
    ...
      enqueue_entity()
        update_load_avg(UPDATE_TG | DO_ATTACH)
          if (!se->avg.last_update_time && (flags & DO_ATTACH))  --> true
            attach_entity_load_avg()  --> attached, will set last_update_time
check_class_changed()
  switched_from() (!fair)
  switched_to()   (fair)
    switched_to_fair()
      attach_entity_load_avg()  --> unconditional attach again!


If we used the unconditional attach_entity_load_avg() in switched_to_fair(),
the double-attach problem above would happen; see details in commit 7dc603c9028e.

With this patch, we effectively use update_load_avg(UPDATE_TG | DO_ATTACH) in
switched_to_fair() as well, so no double-attach problem will happen, since we
check last_update_time in update_load_avg().

Thanks.

>  
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index ee7f23c76bd3..fb92431d496f 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -85,7 +85,6 @@ SCHED_FEAT(RT_PUSH_IPI, true)
>  
>  SCHED_FEAT(RT_RUNTIME_SHARE, false)
>  SCHED_FEAT(LB_MIN, false)
> -SCHED_FEAT(ATTACH_AGE_LOAD, true)
>  
>  SCHED_FEAT(WA_IDLE, true)
>  SCHED_FEAT(WA_WEIGHT, true)
