linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 00/10] sched: task load tracking optimization and cleanup
@ 2022-07-13  4:04 Chengming Zhou
  2022-07-13  4:04 ` [PATCH v2 01/10] sched/fair: combine detach into dequeue when migrating task Chengming Zhou
                   ` (9 more replies)
  0 siblings, 10 replies; 37+ messages in thread
From: Chengming Zhou @ 2022-07-13  4:04 UTC (permalink / raw)
  To: mingo, peterz, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, vschneid
  Cc: linux-kernel, Chengming Zhou

Hi all,

This patch series contains optimizations and cleanups for task load
tracking when a task migrates between CPUs/cgroups or is switched
from/to the fair class (switched_from/to_fair()).

There are three cases of detach/attach_entity_load_avg() (excluding the
fork and exit cases) for a fair task:
1. task migrates to another CPU (on_rq migration or wakeup migration)
2. task migrates to another cgroup (detach then attach)
3. task switched_from/to_fair() (detach, later attach)

patch1 optimizes the on_rq CPU migration case by combining the detach
into dequeue, so we don't need to do detach_entity_cfs_rq() in
migrate_task_rq_fair() any more.

patch3-4 clean up the cgroup migration case by removing
cpu_cgrp_subsys->fork(), since we already do the same thing in
sched_cgroup_fork().

patch1-4 have been reviewed earlier, but they conflict with the current
tip tree, so they are included here as part of this patchset. Sorry for
the inconvenience.

patch6-7 use update_load_avg() to do the attach/detach after checking
the sched_avg last_update_time; they are preparation for the following
patches.

patch8-9 fix load tracking for newly forked !fair tasks and for tasks
switched_from_fair().

After these changes, the task's sched_avg last_update_time is reset to 0
when it migrates away from a CPU/cgroup or switched_from_fair(),
preserving the updated sched_avg for the next attach.
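
As background for readers new to this convention, here is a minimal
sketch (not the verbatim kernel code; example_detach/example_attach are
made-up names, only the helpers they call are real) of how
last_update_time == 0 acts as a "detached, sched_avg saved" marker that
the attach side checks:

  static void example_detach(struct cfs_rq *cfs_rq, struct sched_entity *se)
  {
          detach_entity_load_avg(cfs_rq, se);     /* drop contribution from old cfs_rq */
          se->avg.last_update_time = 0;           /* mark: detached, sched_avg saved */
  }

  static void example_attach(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 now)
  {
          if (se->avg.last_update_time)
                  __update_load_avg_se(now, cfs_rq, se);  /* still attached: age it */
          else
                  attach_entity_load_avg(cfs_rq, se);     /* detached: reuse saved sched_avg */
  }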

Thanks.

Changes in v2:
 - split task se depth maintenance into a separate patch3, as suggested
   by Peter.
 - reorder patch6-7 before patch8-9, since we need update_load_avg()
   to do the conditional attach/detach to avoid corner cases like the
   twice attach problem.

Chengming Zhou (10):
  sched/fair: combine detach into dequeue when migrating task
  sched/fair: update comments in enqueue/dequeue_entity()
  sched/fair: maintain task se depth in set_task_rq()
  sched/fair: remove redundant cpu_cgrp_subsys->fork()
  sched/fair: reset sched_avg last_update_time before set_task_rq()
  sched/fair: delete superfluous SKIP_AGE_LOAD
  sched/fair: use update_load_avg() to attach/detach entity load_avg
  sched/fair: fix load tracking for new forked !fair task
  sched/fair: stop load tracking when task switched_from_fair()
  sched/fair: delete superfluous set_task_rq_fair()

 kernel/sched/core.c     |  27 ++------
 kernel/sched/fair.c     | 144 ++++++++++------------------------------
 kernel/sched/features.h |   1 -
 kernel/sched/sched.h    |  14 +---
 4 files changed, 41 insertions(+), 145 deletions(-)

-- 
2.36.1



* [PATCH v2 01/10] sched/fair: combine detach into dequeue when migrating task
  2022-07-13  4:04 [PATCH v2 00/10] sched: task load tracking optimization and cleanup Chengming Zhou
@ 2022-07-13  4:04 ` Chengming Zhou
  2022-07-13  4:04 ` [PATCH v2 02/10] sched/fair: update comments in enqueue/dequeue_entity() Chengming Zhou
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 37+ messages in thread
From: Chengming Zhou @ 2022-07-13  4:04 UTC (permalink / raw)
  To: mingo, peterz, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, vschneid
  Cc: linux-kernel, Chengming Zhou

When we are migrating a task off a CPU, we can combine the detach and
propagation into dequeue_entity(), saving the detach_entity_cfs_rq()
call in migrate_task_rq_fair().

This optimization mirrors the DO_ATTACH handling in enqueue_entity()
when migrating a task to a CPU.

So we don't have to traverse the CFS tree an extra time for the
detach_entity_cfs_rq() -> propagate_entity_cfs_rq() call, which is no
longer made with this patch's change.

detach_task()
  deactivate_task()
    dequeue_task_fair()
      for_each_sched_entity(se)
        dequeue_entity()
          update_load_avg() /* (1) */
            detach_entity_load_avg()

  set_task_cpu()
    migrate_task_rq_fair()
      detach_entity_cfs_rq() /* (2) */
        update_load_avg();
        detach_entity_load_avg();
        propagate_entity_cfs_rq();
          for_each_sched_entity()
            update_load_avg()

This patch saves the detach_entity_cfs_rq() call in (2) by doing the
detach_entity_load_avg() for a CPU-migrating task inside (1)
(the task being the first se in the loop).
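
For reference, the migration check used in dequeue_entity() below is the
existing helper from kernel/sched/sched.h (reproduced here from memory
as a reader aid; please double-check the exact form in the tree):

  static inline int task_on_rq_migrating(struct task_struct *p)
  {
          /* true while the task is being moved between runqueues */
          return READ_ONCE(p->on_rq) == TASK_ON_RQ_MIGRATING;
  }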

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 30 +++++++++++++++++-------------
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a78d2e3b9d49..0689b94ed70b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4003,6 +4003,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 #define UPDATE_TG	0x1
 #define SKIP_AGE_LOAD	0x2
 #define DO_ATTACH	0x4
+#define DO_DETACH	0x8
 
 /* Update task and its cfs_rq load average */
 static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
@@ -4020,7 +4021,14 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	decayed  = update_cfs_rq_load_avg(now, cfs_rq);
 	decayed |= propagate_entity_load_avg(se);
 
-	if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
+	if (flags & DO_DETACH) {
+		/*
+		 * DO_DETACH means we're here from dequeue_entity()
+		 * and we are migrating task out of the CPU.
+		 */
+		detach_entity_load_avg(cfs_rq, se);
+		update_tg_load_avg(cfs_rq);
+	} else if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
 
 		/*
 		 * DO_ATTACH means we're here from enqueue_entity().
@@ -4292,6 +4300,7 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 #define UPDATE_TG	0x0
 #define SKIP_AGE_LOAD	0x0
 #define DO_ATTACH	0x0
+#define DO_DETACH	0x0
 
 static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
 {
@@ -4511,6 +4520,11 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
 static void
 dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
+	int action = UPDATE_TG;
+
+	if (entity_is_task(se) && task_on_rq_migrating(task_of(se)))
+		action |= DO_DETACH;
+
 	/*
 	 * Update run-time statistics of the 'current'.
 	 */
@@ -4524,7 +4538,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 *   - For group entity, update its weight to reflect the new share
 	 *     of its group cfs_rq.
 	 */
-	update_load_avg(cfs_rq, se, UPDATE_TG);
+	update_load_avg(cfs_rq, se, action);
 	se_update_runnable(se);
 
 	update_stats_dequeue_fair(cfs_rq, se, flags);
@@ -7076,8 +7090,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	return new_cpu;
 }
 
-static void detach_entity_cfs_rq(struct sched_entity *se);
-
 /*
  * Called immediately before a task is migrated to a new CPU; task_cpu(p) and
  * cfs_rq_of(p) references at time of call are still valid and identify the
@@ -7099,15 +7111,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 		se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);
 	}
 
-	if (p->on_rq == TASK_ON_RQ_MIGRATING) {
-		/*
-		 * In case of TASK_ON_RQ_MIGRATING we in fact hold the 'old'
-		 * rq->lock and can modify state directly.
-		 */
-		lockdep_assert_rq_held(task_rq(p));
-		detach_entity_cfs_rq(se);
-
-	} else {
+	if (!task_on_rq_migrating(p)) {
 		remove_entity_load_avg(se);
 
 		/*
-- 
2.36.1



* [PATCH v2 02/10] sched/fair: update comments in enqueue/dequeue_entity()
  2022-07-13  4:04 [PATCH v2 00/10] sched: task load tracking optimization and cleanup Chengming Zhou
  2022-07-13  4:04 ` [PATCH v2 01/10] sched/fair: combine detach into dequeue when migrating task Chengming Zhou
@ 2022-07-13  4:04 ` Chengming Zhou
  2022-07-13  4:04 ` [PATCH v2 03/10] sched/fair: maintain task se depth in set_task_rq() Chengming Zhou
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 37+ messages in thread
From: Chengming Zhou @ 2022-07-13  4:04 UTC (permalink / raw)
  To: mingo, peterz, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, vschneid
  Cc: linux-kernel, Chengming Zhou

When reading the sched_avg related code, I found that the comments in
enqueue/dequeue_entity() are out of date with respect to the current
code.

We don't add/subtract the entity's runnable_avg to/from
cfs_rq->runnable_avg during enqueue/dequeue_entity(); that is done only
on attach/detach.

This patch updates the comments to reflect how the current code works.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0689b94ed70b..2a3e12ead144 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4443,7 +4443,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	/*
 	 * When enqueuing a sched_entity, we must:
 	 *   - Update loads to have both entity and cfs_rq synced with now.
-	 *   - Add its load to cfs_rq->runnable_avg
+	 *   - For group_entity, update its runnable_weight to reflect the new
+	 *     h_nr_running of its group cfs_rq.
 	 *   - For group_entity, update its weight to reflect the new share of
 	 *     its group cfs_rq
 	 *   - Add its new weight to cfs_rq->load.weight
@@ -4533,7 +4534,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	/*
 	 * When dequeuing a sched_entity, we must:
 	 *   - Update loads to have both entity and cfs_rq synced with now.
-	 *   - Subtract its load from the cfs_rq->runnable_avg.
+	 *   - For group_entity, update its runnable_weight to reflect the new
+	 *     h_nr_running of its group cfs_rq.
 	 *   - Subtract its previous weight from cfs_rq->load.weight.
 	 *   - For group entity, update its weight to reflect the new share
 	 *     of its group cfs_rq.
-- 
2.36.1



* [PATCH v2 03/10] sched/fair: maintain task se depth in set_task_rq()
  2022-07-13  4:04 [PATCH v2 00/10] sched: task load tracking optimization and cleanup Chengming Zhou
  2022-07-13  4:04 ` [PATCH v2 01/10] sched/fair: combine detach into dequeue when migrating task Chengming Zhou
  2022-07-13  4:04 ` [PATCH v2 02/10] sched/fair: update comments in enqueue/dequeue_entity() Chengming Zhou
@ 2022-07-13  4:04 ` Chengming Zhou
  2022-07-14 12:30   ` Dietmar Eggemann
  2022-07-18  7:16   ` Vincent Guittot
  2022-07-13  4:04 ` [PATCH v2 04/10] sched/fair: remove redundant cpu_cgrp_subsys->fork() Chengming Zhou
                   ` (6 subsequent siblings)
  9 siblings, 2 replies; 37+ messages in thread
From: Chengming Zhou @ 2022-07-13  4:04 UTC (permalink / raw)
  To: mingo, peterz, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, vschneid
  Cc: linux-kernel, Chengming Zhou

Previously we only maintained the task se depth in
task_move_group_fair(); if a !fair task changed task group, its se depth
would not be updated, so commit eb7a59b2c888 ("sched/fair: Reset
se-depth when task switched to FAIR") fixed the problem by updating the
se depth in switched_to_fair() too.

This patch moves the task se depth maintenance to set_task_rq(), which
is called on every CPU/cgroup change, so the depth will always be
correct.

This patch is preparation for the next patch.
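
For context, se->depth is consumed when walking two entities up to a
common level of the hierarchy, e.g. in find_matching_se(). A simplified
sketch of that use (not the verbatim kernel code) looks like:

  static void find_matching_se_sketch(struct sched_entity **se,
                                      struct sched_entity **pse)
  {
          int se_depth = (*se)->depth;
          int pse_depth = (*pse)->depth;

          /* First bring both entities to the same depth ... */
          while (se_depth > pse_depth) {
                  se_depth--;
                  *se = parent_entity(*se);
          }
          while (pse_depth > se_depth) {
                  pse_depth--;
                  *pse = parent_entity(*pse);
          }
          /* ... then walk up together until they share a cfs_rq. */
          while (!is_same_group(*se, *pse)) {
                  *se = parent_entity(*se);
                  *pse = parent_entity(*pse);
          }
  }

A stale depth would make this walk start from the wrong level, which is
why keeping it up to date directly in set_task_rq() helps.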

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 kernel/sched/fair.c  | 8 --------
 kernel/sched/sched.h | 1 +
 2 files changed, 1 insertion(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2a3e12ead144..bf595b622656 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11539,14 +11539,6 @@ static void attach_entity_cfs_rq(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
-#ifdef CONFIG_FAIR_GROUP_SCHED
-	/*
-	 * Since the real-depth could have been changed (only FAIR
-	 * class maintain depth value), reset depth properly.
-	 */
-	se->depth = se->parent ? se->parent->depth + 1 : 0;
-#endif
-
 	/* Synchronize entity with its cfs_rq */
 	update_load_avg(cfs_rq, se, sched_feat(ATTACH_AGE_LOAD) ? 0 : SKIP_AGE_LOAD);
 	attach_entity_load_avg(cfs_rq, se);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index aad7f5ee9666..8cc3eb7b86cd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1940,6 +1940,7 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
 	set_task_rq_fair(&p->se, p->se.cfs_rq, tg->cfs_rq[cpu]);
 	p->se.cfs_rq = tg->cfs_rq[cpu];
 	p->se.parent = tg->se[cpu];
+	p->se.depth = tg->se[cpu] ? tg->se[cpu]->depth + 1 : 0;
 #endif
 
 #ifdef CONFIG_RT_GROUP_SCHED
-- 
2.36.1



* [PATCH v2 04/10] sched/fair: remove redundant cpu_cgrp_subsys->fork()
  2022-07-13  4:04 [PATCH v2 00/10] sched: task load tracking optimization and cleanup Chengming Zhou
                   ` (2 preceding siblings ...)
  2022-07-13  4:04 ` [PATCH v2 03/10] sched/fair: maintain task se depth in set_task_rq() Chengming Zhou
@ 2022-07-13  4:04 ` Chengming Zhou
  2022-07-14 12:31   ` Dietmar Eggemann
  2022-07-13  4:04 ` [PATCH v2 05/10] sched/fair: reset sched_avg last_update_time before set_task_rq() Chengming Zhou
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 37+ messages in thread
From: Chengming Zhou @ 2022-07-13  4:04 UTC (permalink / raw)
  To: mingo, peterz, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, vschneid
  Cc: linux-kernel, Chengming Zhou

We use cpu_cgrp_subsys->fork() to set the task group for the new fair
task in cgroup_post_fork().

Commit b1e8206582f9 ("sched: Fix yet more sched_fork() races") already
sets the task group for the new fair task in sched_cgroup_fork(), so
cpu_cgrp_subsys->fork() can be removed.

  cgroup_can_fork()	--> pin parent's sched_task_group
  sched_cgroup_fork()
    __set_task_cpu	--> set task group
  cgroup_post_fork()
    ss->fork() := cpu_cgroup_fork()	--> set again

After this patch's change, task_change_group_fair() only needs to care
about task cgroup migration, making the code much simpler.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/core.c  | 27 ++++-----------------------
 kernel/sched/fair.c  | 23 +----------------------
 kernel/sched/sched.h |  5 +----
 3 files changed, 6 insertions(+), 49 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c215b5adc707..d85fdea51e3a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -481,8 +481,7 @@ sched_core_dequeue(struct rq *rq, struct task_struct *p, int flags) { }
  *				p->se.load, p->rt_priority,
  *				p->dl.dl_{runtime, deadline, period, flags, bw, density}
  *  - sched_setnuma():		p->numa_preferred_nid
- *  - sched_move_task()/
- *    cpu_cgroup_fork():	p->sched_task_group
+ *  - sched_move_task():	p->sched_task_group
  *  - uclamp_update_active()	p->uclamp*
  *
  * p->state <- TASK_*:
@@ -10127,7 +10126,7 @@ void sched_release_group(struct task_group *tg)
 	spin_unlock_irqrestore(&task_group_lock, flags);
 }
 
-static void sched_change_group(struct task_struct *tsk, int type)
+static void sched_change_group(struct task_struct *tsk)
 {
 	struct task_group *tg;
 
@@ -10143,7 +10142,7 @@ static void sched_change_group(struct task_struct *tsk, int type)
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	if (tsk->sched_class->task_change_group)
-		tsk->sched_class->task_change_group(tsk, type);
+		tsk->sched_class->task_change_group(tsk);
 	else
 #endif
 		set_task_rq(tsk, task_cpu(tsk));
@@ -10174,7 +10173,7 @@ void sched_move_task(struct task_struct *tsk)
 	if (running)
 		put_prev_task(rq, tsk);
 
-	sched_change_group(tsk, TASK_MOVE_GROUP);
+	sched_change_group(tsk);
 
 	if (queued)
 		enqueue_task(rq, tsk, queue_flags);
@@ -10252,23 +10251,6 @@ static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
 	sched_unregister_group(tg);
 }
 
-/*
- * This is called before wake_up_new_task(), therefore we really only
- * have to set its group bits, all the other stuff does not apply.
- */
-static void cpu_cgroup_fork(struct task_struct *task)
-{
-	struct rq_flags rf;
-	struct rq *rq;
-
-	rq = task_rq_lock(task, &rf);
-
-	update_rq_clock(rq);
-	sched_change_group(task, TASK_SET_GROUP);
-
-	task_rq_unlock(rq, task, &rf);
-}
-
 static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
 {
 	struct task_struct *task;
@@ -11134,7 +11116,6 @@ struct cgroup_subsys cpu_cgrp_subsys = {
 	.css_released	= cpu_cgroup_css_released,
 	.css_free	= cpu_cgroup_css_free,
 	.css_extra_stat_show = cpu_extra_stat_show,
-	.fork		= cpu_cgroup_fork,
 	.can_attach	= cpu_cgroup_can_attach,
 	.attach		= cpu_cgroup_attach,
 	.legacy_cftypes	= cpu_legacy_files,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bf595b622656..8992ce5e73d2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11634,15 +11634,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-static void task_set_group_fair(struct task_struct *p)
-{
-	struct sched_entity *se = &p->se;
-
-	set_task_rq(p, task_cpu(p));
-	se->depth = se->parent ? se->parent->depth + 1 : 0;
-}
-
-static void task_move_group_fair(struct task_struct *p)
+static void task_change_group_fair(struct task_struct *p)
 {
 	detach_task_cfs_rq(p);
 	set_task_rq(p, task_cpu(p));
@@ -11654,19 +11646,6 @@ static void task_move_group_fair(struct task_struct *p)
 	attach_task_cfs_rq(p);
 }
 
-static void task_change_group_fair(struct task_struct *p, int type)
-{
-	switch (type) {
-	case TASK_SET_GROUP:
-		task_set_group_fair(p);
-		break;
-
-	case TASK_MOVE_GROUP:
-		task_move_group_fair(p);
-		break;
-	}
-}
-
 void free_fair_sched_group(struct task_group *tg)
 {
 	int i;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8cc3eb7b86cd..19e0076e4245 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2203,11 +2203,8 @@ struct sched_class {
 
 	void (*update_curr)(struct rq *rq);
 
-#define TASK_SET_GROUP		0
-#define TASK_MOVE_GROUP		1
-
 #ifdef CONFIG_FAIR_GROUP_SCHED
-	void (*task_change_group)(struct task_struct *p, int type);
+	void (*task_change_group)(struct task_struct *p);
 #endif
 };
 
-- 
2.36.1



* [PATCH v2 05/10] sched/fair: reset sched_avg last_update_time before set_task_rq()
  2022-07-13  4:04 [PATCH v2 00/10] sched: task load tracking optimization and cleanup Chengming Zhou
                   ` (3 preceding siblings ...)
  2022-07-13  4:04 ` [PATCH v2 04/10] sched/fair: remove redundant cpu_cgrp_subsys->fork() Chengming Zhou
@ 2022-07-13  4:04 ` Chengming Zhou
  2022-07-14 12:31   ` Dietmar Eggemann
  2022-07-19  8:49   ` Vincent Guittot
  2022-07-13  4:04 ` [PATCH v2 06/10] sched/fair: delete superfluous SKIP_AGE_LOAD Chengming Zhou
                   ` (4 subsequent siblings)
  9 siblings, 2 replies; 37+ messages in thread
From: Chengming Zhou @ 2022-07-13  4:04 UTC (permalink / raw)
  To: mingo, peterz, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, vschneid
  Cc: linux-kernel, Chengming Zhou

set_task_rq() -> set_task_rq_fair() will try to synchronize the blocked
task's sched_avg on migration, which is not needed for an already
detached task.

task_change_group_fair() detaches the task's sched_avg from the prev
cfs_rq first, so reset the sched_avg last_update_time before
set_task_rq() to avoid that redundant sync.
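
To see why the ordering matters, the relevant guard in
set_task_rq_fair() (excerpted and trimmed here for illustration; the
full function is visible in patch 10, where it gets removed) bails out
when last_update_time is 0:

  void set_task_rq_fair(struct sched_entity *se,
                        struct cfs_rq *prev, struct cfs_rq *next)
  {
          ...
          /* Resetting last_update_time to 0 before set_task_rq() makes
           * this early return trigger, skipping the sync of a sched_avg
           * that task_change_group_fair() has already detached. */
          if (!(se->avg.last_update_time && prev))
                  return;
          ...
  }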

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8992ce5e73d2..171bc22bc142 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11637,12 +11637,12 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
 static void task_change_group_fair(struct task_struct *p)
 {
 	detach_task_cfs_rq(p);
-	set_task_rq(p, task_cpu(p));
 
 #ifdef CONFIG_SMP
 	/* Tell se's cfs_rq has been changed -- migrated */
 	p->se.avg.last_update_time = 0;
 #endif
+	set_task_rq(p, task_cpu(p));
 	attach_task_cfs_rq(p);
 }
 
-- 
2.36.1



* [PATCH v2 06/10] sched/fair: delete superfluous SKIP_AGE_LOAD
  2022-07-13  4:04 [PATCH v2 00/10] sched: task load tracking optimization and cleanup Chengming Zhou
                   ` (4 preceding siblings ...)
  2022-07-13  4:04 ` [PATCH v2 05/10] sched/fair: reset sched_avg last_update_time before set_task_rq() Chengming Zhou
@ 2022-07-13  4:04 ` Chengming Zhou
  2022-07-14 12:33   ` Dietmar Eggemann
  2022-07-13  4:04 ` [PATCH v2 07/10] sched/fair: use update_load_avg() to attach/detach entity load_avg Chengming Zhou
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 37+ messages in thread
From: Chengming Zhou @ 2022-07-13  4:04 UTC (permalink / raw)
  To: mingo, peterz, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, vschneid
  Cc: linux-kernel, Chengming Zhou

There are three cases of attach_entity_cfs_rq():

1. task migrates to a CPU
2. task moves to a cgroup
3. task is switched to fair from !fair

Case 1 and case 2 already have the sched_avg last_update_time reset to 0
by the time attach_entity_cfs_rq() runs.

The following patches will make case 3 also have last_update_time reset
to 0 at attach_entity_cfs_rq() time.

So it makes no difference whether SKIP_AGE_LOAD is set or not.
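
Concretely, a simplified before/after view of the aging condition in
update_load_avg() (matching the hunk below, nothing new added):

  /* before: */
  if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
          __update_load_avg_se(now, cfs_rq, se);

  /* after: with last_update_time guaranteed to be 0 at attach time,
   * the SKIP_AGE_LOAD test never changes the outcome and can go. */
  if (se->avg.last_update_time)
          __update_load_avg_se(now, cfs_rq, se);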

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 kernel/sched/fair.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 171bc22bc142..29811869c1fe 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4001,9 +4001,8 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
  * Optional action to be done while updating the load average
  */
 #define UPDATE_TG	0x1
-#define SKIP_AGE_LOAD	0x2
-#define DO_ATTACH	0x4
-#define DO_DETACH	0x8
+#define DO_ATTACH	0x2
+#define DO_DETACH	0x4
 
 /* Update task and its cfs_rq load average */
 static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
@@ -4015,7 +4014,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	 * Track task load average for carrying it to new CPU after migrated, and
 	 * track group sched_entity load average for task_h_load calc in migration
 	 */
-	if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
+	if (se->avg.last_update_time)
 		__update_load_avg_se(now, cfs_rq, se);
 
 	decayed  = update_cfs_rq_load_avg(now, cfs_rq);
@@ -4298,7 +4297,6 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 }
 
 #define UPDATE_TG	0x0
-#define SKIP_AGE_LOAD	0x0
 #define DO_ATTACH	0x0
 #define DO_DETACH	0x0
 
@@ -11540,7 +11538,7 @@ static void attach_entity_cfs_rq(struct sched_entity *se)
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
 	/* Synchronize entity with its cfs_rq */
-	update_load_avg(cfs_rq, se, sched_feat(ATTACH_AGE_LOAD) ? 0 : SKIP_AGE_LOAD);
+	update_load_avg(cfs_rq, se, 0);
 	attach_entity_load_avg(cfs_rq, se);
 	update_tg_load_avg(cfs_rq);
 	propagate_entity_cfs_rq(se);
-- 
2.36.1



* [PATCH v2 07/10] sched/fair: use update_load_avg() to attach/detach entity load_avg
  2022-07-13  4:04 [PATCH v2 00/10] sched: task load tracking optimization and cleanup Chengming Zhou
                   ` (5 preceding siblings ...)
  2022-07-13  4:04 ` [PATCH v2 06/10] sched/fair: delete superfluous SKIP_AGE_LOAD Chengming Zhou
@ 2022-07-13  4:04 ` Chengming Zhou
  2022-07-15 11:18   ` Dietmar Eggemann
  2022-07-13  4:04 ` [PATCH v2 08/10] sched/fair: fix load tracking for new forked !fair task Chengming Zhou
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 37+ messages in thread
From: Chengming Zhou @ 2022-07-13  4:04 UTC (permalink / raw)
  To: mingo, peterz, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, vschneid
  Cc: linux-kernel, Chengming Zhou

Since update_load_avg() supports DO_ATTACH and DO_DETACH now, we can use
it to implement the attach/detach of the entity load_avg.

Another advantage of using update_load_avg() is that it checks
last_update_time before attaching or detaching, instead of the
unconditional attach/detach in the current code.

This avoids some problematic corner cases of load tracking, like the
twice attach problem and the detach of an unattached NEW task.

1. switch to fair class (twice attach problem)

p->sched_class = fair_class;  --> p.se->avg.last_update_time = 0
if (queued)
  enqueue_task(p);
    ...
      enqueue_entity()
        update_load_avg(UPDATE_TG | DO_ATTACH)
          if (!se->avg.last_update_time && (flags & DO_ATTACH))  --> true
            attach_entity_load_avg()  --> attached, will set last_update_time
check_class_changed()
  switched_from() (!fair)
  switched_to()   (fair)
    switched_to_fair()
      attach_entity_load_avg()  --> unconditional attach again!

2. change cgroup of NEW task (detach unattached task problem)

sched_move_group(p)
  if (queued)
    dequeue_task()
  task_move_group_fair()
    detach_task_cfs_rq()
      detach_entity_load_avg()  --> detach unattached NEW task
    set_task_rq()
    attach_task_cfs_rq()
      attach_entity_load_avg()
  if (queued)
    enqueue_task()

These problems were fixed by commit 7dc603c9028e
("sched/fair: Fix PELT integrity for new tasks"), which also brings its
own problems.

First, it adds a new task state TASK_NEW and an unnecessary limitation
that changing the cgroup of a TASK_NEW task will fail.

Second, it attaches the entity load_avg in post_init_entity_util_avg(),
where we only set the sched_avg last_update_time for !fair tasks, which
causes a PELT integrity problem on switched_to_fair().

This patch makes update_load_avg() the only location of attach/detach,
so it can handle these corner cases, like changing the cgroup of a NEW
task, by checking last_update_time before attach/detach.
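
With the conditional form, the second attach in scenario 1 above turns
into a no-op. A simplified sketch of the effect (reusing the flags from
patch 1, not new code):

  /* enqueue_task() -> enqueue_entity(): first (and only) attach */
  update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
          /* last_update_time == 0, so attach_entity_load_avg() runs
           * and sets last_update_time */

  /* check_class_changed() -> switched_to_fair() -> attach_entity_cfs_rq() */
  update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
          /* last_update_time != 0 now, so the DO_ATTACH branch is
           * skipped: no twice attach */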

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 kernel/sched/fair.c | 15 +++------------
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 29811869c1fe..51fc20c161a3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4307,11 +4307,6 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 
 static inline void remove_entity_load_avg(struct sched_entity *se) {}
 
-static inline void
-attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
-static inline void
-detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
-
 static inline int newidle_balance(struct rq *rq, struct rq_flags *rf)
 {
 	return 0;
@@ -11527,9 +11522,7 @@ static void detach_entity_cfs_rq(struct sched_entity *se)
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
 	/* Catch up with the cfs_rq and remove our load when we leave */
-	update_load_avg(cfs_rq, se, 0);
-	detach_entity_load_avg(cfs_rq, se);
-	update_tg_load_avg(cfs_rq);
+	update_load_avg(cfs_rq, se, UPDATE_TG | DO_DETACH);
 	propagate_entity_cfs_rq(se);
 }
 
@@ -11537,10 +11530,8 @@ static void attach_entity_cfs_rq(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
-	/* Synchronize entity with its cfs_rq */
-	update_load_avg(cfs_rq, se, 0);
-	attach_entity_load_avg(cfs_rq, se);
-	update_tg_load_avg(cfs_rq);
+	/* Synchronize entity with its cfs_rq and attach our load */
+	update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
 	propagate_entity_cfs_rq(se);
 }
 
-- 
2.36.1



* [PATCH v2 08/10] sched/fair: fix load tracking for new forked !fair task
  2022-07-13  4:04 [PATCH v2 00/10] sched: task load tracking optimization and cleanup Chengming Zhou
                   ` (6 preceding siblings ...)
  2022-07-13  4:04 ` [PATCH v2 07/10] sched/fair: use update_load_avg() to attach/detach entity load_avg Chengming Zhou
@ 2022-07-13  4:04 ` Chengming Zhou
  2022-07-19 12:35   ` Vincent Guittot
  2022-07-13  4:04 ` [PATCH v2 09/10] sched/fair: stop load tracking when task switched_from_fair() Chengming Zhou
  2022-07-13  4:04 ` [PATCH v2 10/10] sched/fair: delete superfluous set_task_rq_fair() Chengming Zhou
  9 siblings, 1 reply; 37+ messages in thread
From: Chengming Zhou @ 2022-07-13  4:04 UTC (permalink / raw)
  To: mingo, peterz, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, vschneid
  Cc: linux-kernel, Chengming Zhou

A newly forked !fair task has its sched_avg last_update_time set to the
pelt clock of its cfs_rq; then, after a while, in switched_to_fair():

switched_to_fair
  attach_task_cfs_rq
    attach_entity_cfs_rq
      update_load_avg
        __update_load_avg_se(now, cfs_rq, se)

the delta (now - sa->last_update_time) will contribute to/decay the
sched_avg depending on the task's running/runnable status during that
time.

This patch doesn't set the sched_avg last_update_time of a newly forked
!fair task, leaving it at 0. So later, in update_load_avg(), we don't
need to contribute/decay the wrong delta (now - sa->last_update_time).

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 kernel/sched/fair.c | 18 ++----------------
 1 file changed, 2 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 51fc20c161a3..50f65a2ede32 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -849,22 +849,8 @@ void post_init_entity_util_avg(struct task_struct *p)
 
 	sa->runnable_avg = sa->util_avg;
 
-	if (p->sched_class != &fair_sched_class) {
-		/*
-		 * For !fair tasks do:
-		 *
-		update_cfs_rq_load_avg(now, cfs_rq);
-		attach_entity_load_avg(cfs_rq, se);
-		switched_from_fair(rq, p);
-		 *
-		 * such that the next switched_to_fair() has the
-		 * expected state.
-		 */
-		se->avg.last_update_time = cfs_rq_clock_pelt(cfs_rq);
-		return;
-	}
-
-	attach_entity_cfs_rq(se);
+	if (p->sched_class == &fair_sched_class)
+		attach_entity_cfs_rq(se);
 }
 
 #else /* !CONFIG_SMP */
-- 
2.36.1



* [PATCH v2 09/10] sched/fair: stop load tracking when task switched_from_fair()
  2022-07-13  4:04 [PATCH v2 00/10] sched: task load tracking optimization and cleanup Chengming Zhou
                   ` (7 preceding siblings ...)
  2022-07-13  4:04 ` [PATCH v2 08/10] sched/fair: fix load tracking for new forked !fair task Chengming Zhou
@ 2022-07-13  4:04 ` Chengming Zhou
  2022-07-14 12:33   ` Dietmar Eggemann
  2022-07-19 13:20   ` Vincent Guittot
  2022-07-13  4:04 ` [PATCH v2 10/10] sched/fair: delete superfluous set_task_rq_fair() Chengming Zhou
  9 siblings, 2 replies; 37+ messages in thread
From: Chengming Zhou @ 2022-07-13  4:04 UTC (permalink / raw)
  To: mingo, peterz, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, vschneid
  Cc: linux-kernel, Chengming Zhou

For the same reason as the previous commit: if we don't reset the
sched_avg last_update_time to 0, then after a while, in
switched_to_fair():

switched_to_fair
  attach_task_cfs_rq
    attach_entity_cfs_rq
      update_load_avg
        __update_load_avg_se(now, cfs_rq, se)

The delta (now - sa->last_update_time) will wrongly contribute to/decay
the sched_avg depending on the task's running/runnable status during
that time.

This patch resets its sched_avg last_update_time to 0 and stops load
tracking for the !fair task, so that later, in switched_to_fair() ->
update_load_avg(), we can use its saved sched_avg.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 kernel/sched/fair.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 50f65a2ede32..576028f5a09e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11552,6 +11552,11 @@ static void attach_task_cfs_rq(struct task_struct *p)
 static void switched_from_fair(struct rq *rq, struct task_struct *p)
 {
 	detach_task_cfs_rq(p);
+
+#ifdef CONFIG_SMP
+	/* Stop load tracking for !fair task */
+	p->se.avg.last_update_time = 0;
+#endif
 }
 
 static void switched_to_fair(struct rq *rq, struct task_struct *p)
-- 
2.36.1



* [PATCH v2 10/10] sched/fair: delete superfluous set_task_rq_fair()
  2022-07-13  4:04 [PATCH v2 00/10] sched: task load tracking optimization and cleanup Chengming Zhou
                   ` (8 preceding siblings ...)
  2022-07-13  4:04 ` [PATCH v2 09/10] sched/fair: stop load tracking when task switched_from_fair() Chengming Zhou
@ 2022-07-13  4:04 ` Chengming Zhou
  9 siblings, 0 replies; 37+ messages in thread
From: Chengming Zhou @ 2022-07-13  4:04 UTC (permalink / raw)
  To: mingo, peterz, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, vschneid
  Cc: linux-kernel, Chengming Zhou

set_task_rq() is used when a task moves across CPUs/groups to change its
cfs_rq and parent entity, and it calls set_task_rq_fair() to sync the
blocked task's load_avg just before changing its cfs_rq.

1. task migrates to another CPU: it is detached/removed from the prev
   cfs_rq and its sched_avg last_update_time is reset to 0, so there is
   no need to sync again.

2. task migrates to another cgroup: it is detached from the prev cfs_rq
   and its sched_avg last_update_time is reset to 0, so there is no need
   to sync either.

3. !fair task migrates between CPUs/cgroups: we stop load tracking for
   the !fair task and reset its sched_avg last_update_time to 0 in
   switched_from_fair(), so no sync is needed there either.

So set_task_rq_fair() is not needed anymore; this patch deletes it,
together with the now unused ATTACH_AGE_LOAD feature.
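
After this patch (a condensed view assembled from the hunks in patch 3
and this one, not verbatim kernel code), the FAIR_GROUP_SCHED part of
set_task_rq() reduces to plain pointer/depth updates:

  #ifdef CONFIG_FAIR_GROUP_SCHED
          p->se.cfs_rq = tg->cfs_rq[cpu];
          p->se.parent = tg->se[cpu];
          p->se.depth  = tg->se[cpu] ? tg->se[cpu]->depth + 1 : 0;
  #endif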

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 kernel/sched/fair.c     | 31 -------------------------------
 kernel/sched/features.h |  1 -
 kernel/sched/sched.h    |  8 --------
 3 files changed, 40 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 576028f5a09e..b435eda88468 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3430,37 +3430,6 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 	}
 }
 
-/*
- * Called within set_task_rq() right before setting a task's CPU. The
- * caller only guarantees p->pi_lock is held; no other assumptions,
- * including the state of rq->lock, should be made.
- */
-void set_task_rq_fair(struct sched_entity *se,
-		      struct cfs_rq *prev, struct cfs_rq *next)
-{
-	u64 p_last_update_time;
-	u64 n_last_update_time;
-
-	if (!sched_feat(ATTACH_AGE_LOAD))
-		return;
-
-	/*
-	 * We are supposed to update the task to "current" time, then its up to
-	 * date and ready to go to new CPU/cfs_rq. But we have difficulty in
-	 * getting what current time is, so simply throw away the out-of-date
-	 * time. This will result in the wakee task is less decayed, but giving
-	 * the wakee more load sounds not bad.
-	 */
-	if (!(se->avg.last_update_time && prev))
-		return;
-
-	p_last_update_time = cfs_rq_last_update_time(prev);
-	n_last_update_time = cfs_rq_last_update_time(next);
-
-	__update_load_avg_blocked_se(p_last_update_time, se);
-	se->avg.last_update_time = n_last_update_time;
-}
-
 /*
  * When on migration a sched_entity joins/leaves the PELT hierarchy, we need to
  * propagate its contribution. The key to this propagation is the invariant
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..fb92431d496f 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -85,7 +85,6 @@ SCHED_FEAT(RT_PUSH_IPI, true)
 
 SCHED_FEAT(RT_RUNTIME_SHARE, false)
 SCHED_FEAT(LB_MIN, false)
-SCHED_FEAT(ATTACH_AGE_LOAD, true)
 
 SCHED_FEAT(WA_IDLE, true)
 SCHED_FEAT(WA_WEIGHT, true)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 19e0076e4245..a8ec7af4bd51 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -505,13 +505,6 @@ extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
 
 extern int sched_group_set_idle(struct task_group *tg, long idle);
 
-#ifdef CONFIG_SMP
-extern void set_task_rq_fair(struct sched_entity *se,
-			     struct cfs_rq *prev, struct cfs_rq *next);
-#else /* !CONFIG_SMP */
-static inline void set_task_rq_fair(struct sched_entity *se,
-			     struct cfs_rq *prev, struct cfs_rq *next) { }
-#endif /* CONFIG_SMP */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 #else /* CONFIG_CGROUP_SCHED */
@@ -1937,7 +1930,6 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
 #endif
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-	set_task_rq_fair(&p->se, p->se.cfs_rq, tg->cfs_rq[cpu]);
 	p->se.cfs_rq = tg->cfs_rq[cpu];
 	p->se.parent = tg->se[cpu];
 	p->se.depth = tg->se[cpu] ? tg->se[cpu]->depth + 1 : 0;
-- 
2.36.1



* Re: [PATCH v2 03/10] sched/fair: maintain task se depth in set_task_rq()
  2022-07-13  4:04 ` [PATCH v2 03/10] sched/fair: maintain task se depth in set_task_rq() Chengming Zhou
@ 2022-07-14 12:30   ` Dietmar Eggemann
  2022-07-14 13:03     ` [External] " Chengming Zhou
  2022-07-18  7:16   ` Vincent Guittot
  1 sibling, 1 reply; 37+ messages in thread
From: Dietmar Eggemann @ 2022-07-14 12:30 UTC (permalink / raw)
  To: Chengming Zhou, mingo, peterz, vincent.guittot, rostedt, bsegall,
	vschneid
  Cc: linux-kernel

On 13/07/2022 06:04, Chengming Zhou wrote:
> Previously we only maintain task se depth in task_move_group_fair(),
> if a !fair task change task group, its se depth will not be updated,
> so commit eb7a59b2c888 ("sched/fair: Reset se-depth when task switched to FAIR")
> fix the problem by updating se depth in switched_to_fair() too.

Maybe it's worth mentioning how the se.depth setting from
task_move_group_fair() and switched_to_fair() went into
attach_task_cfs_rq() with commit daa59407b558 ("sched/fair: Unify
switched_{from,to}_fair() and task_move_group_fair()"}  and further into
attach_entity_cfs_rq() with commit df217913e72e ("sched/fair: Factorize
attach/detach entity").

> This patch move task se depth maintainence to set_task_rq(), which will be
> called when CPU/cgroup change, so its depth will always be correct.
> 
> This patch is preparation for the next patch.
> 
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

> ---
>  kernel/sched/fair.c  | 8 --------
>  kernel/sched/sched.h | 1 +
>  2 files changed, 1 insertion(+), 8 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2a3e12ead144..bf595b622656 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11539,14 +11539,6 @@ static void attach_entity_cfs_rq(struct sched_entity *se)
>  {
>  	struct cfs_rq *cfs_rq = cfs_rq_of(se);
>  
> -#ifdef CONFIG_FAIR_GROUP_SCHED
> -	/*
> -	 * Since the real-depth could have been changed (only FAIR
> -	 * class maintain depth value), reset depth properly.
> -	 */
> -	se->depth = se->parent ? se->parent->depth + 1 : 0;
> -#endif
> -
>  	/* Synchronize entity with its cfs_rq */
>  	update_load_avg(cfs_rq, se, sched_feat(ATTACH_AGE_LOAD) ? 0 : SKIP_AGE_LOAD);
>  	attach_entity_load_avg(cfs_rq, se);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index aad7f5ee9666..8cc3eb7b86cd 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1940,6 +1940,7 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
>  	set_task_rq_fair(&p->se, p->se.cfs_rq, tg->cfs_rq[cpu]);
>  	p->se.cfs_rq = tg->cfs_rq[cpu];
>  	p->se.parent = tg->se[cpu];
> +	p->se.depth = tg->se[cpu] ? tg->se[cpu]->depth + 1 : 0;
>  #endif
>  
>  #ifdef CONFIG_RT_GROUP_SCHED



* Re: [PATCH v2 04/10] sched/fair: remove redundant cpu_cgrp_subsys->fork()
  2022-07-13  4:04 ` [PATCH v2 04/10] sched/fair: remove redundant cpu_cgrp_subsys->fork() Chengming Zhou
@ 2022-07-14 12:31   ` Dietmar Eggemann
  2022-07-14 13:06     ` [External] " Chengming Zhou
  0 siblings, 1 reply; 37+ messages in thread
From: Dietmar Eggemann @ 2022-07-14 12:31 UTC (permalink / raw)
  To: Chengming Zhou, mingo, peterz, vincent.guittot, rostedt, bsegall,
	vschneid
  Cc: linux-kernel

On 13/07/2022 06:04, Chengming Zhou wrote:
> We use cpu_cgrp_subsys->fork() to set task group for the new fair task
> in cgroup_post_fork().
> 
> Since commit b1e8206582f9 ("sched: Fix yet more sched_fork() races")
> has already set task group for the new fair task in sched_cgroup_fork(),
> so cpu_cgrp_subsys->fork() can be removed.
> 
>   cgroup_can_fork()	--> pin parent's sched_task_group
>   sched_cgroup_fork()

Don't we set the task group (`p->sched_task_group = tg`) now directly in
sched_cgroup_fork() and not in __set_task_cpu()?

>     __set_task_cpu	--> set task group
>   cgroup_post_fork()
>     ss->fork() := cpu_cgroup_fork()	--> set again

nit pick:

cpu_cgroup_fork() -> sched_change_group(..., TASK_SET_GROUP)

> After this patch's change, task_change_group_fair() only need to
> care about task cgroup migration, make the code much simplier.
> 
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

LGTM too.

Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

[...]


* Re: [PATCH v2 05/10] sched/fair: reset sched_avg last_update_time before set_task_rq()
  2022-07-13  4:04 ` [PATCH v2 05/10] sched/fair: reset sched_avg last_update_time before set_task_rq() Chengming Zhou
@ 2022-07-14 12:31   ` Dietmar Eggemann
  2022-07-19  8:49   ` Vincent Guittot
  1 sibling, 0 replies; 37+ messages in thread
From: Dietmar Eggemann @ 2022-07-14 12:31 UTC (permalink / raw)
  To: Chengming Zhou, mingo, peterz, vincent.guittot, rostedt, bsegall,
	vschneid
  Cc: linux-kernel

On 13/07/2022 06:04, Chengming Zhou wrote:
> set_task_rq() -> set_task_rq_fair() will try to synchronize the blocked
> task's sched_avg when migrate, which is not needed for already detached
> task.
> 
> task_change_group_fair() will detached the task sched_avg from prev cfs_rq
> first, so reset sched_avg last_update_time before set_task_rq() to avoid that.
> 
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8992ce5e73d2..171bc22bc142 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11637,12 +11637,12 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
>  static void task_change_group_fair(struct task_struct *p)
>  {
>  	detach_task_cfs_rq(p);
> -	set_task_rq(p, task_cpu(p));
>  
>  #ifdef CONFIG_SMP
>  	/* Tell se's cfs_rq has been changed -- migrated */
>  	p->se.avg.last_update_time = 0;
>  #endif
> +	set_task_rq(p, task_cpu(p));
>  	attach_task_cfs_rq(p);
>  }
>  



* Re: [PATCH v2 06/10] sched/fair: delete superfluous SKIP_AGE_LOAD
  2022-07-13  4:04 ` [PATCH v2 06/10] sched/fair: delete superfluous SKIP_AGE_LOAD Chengming Zhou
@ 2022-07-14 12:33   ` Dietmar Eggemann
  2022-07-14 13:24     ` [External] " Chengming Zhou
  0 siblings, 1 reply; 37+ messages in thread
From: Dietmar Eggemann @ 2022-07-14 12:33 UTC (permalink / raw)
  To: Chengming Zhou, mingo, peterz, vincent.guittot, rostedt, bsegall,
	vschneid
  Cc: linux-kernel

On 13/07/2022 06:04, Chengming Zhou wrote:
> There are three types of attach_entity_cfs_rq():
> 
> 1. task migrate to CPU
> 2. task move to cgroup
> 3. task switched to fair from !fair
> 
> The case1 and case2 already have sched_avg last_update_time
> reset to 0 when attach_entity_cfs_rq().
> 
> We will make case3 also have last_update_time reset to 0 when
> attach_entity_cfs_rq() in the following patches.
> 
> So it makes no difference whether SKIP_AGE_LOAD is set or not.

Wouldn't this patch make more sense after 09/10?

[...]


* Re: [PATCH v2 09/10] sched/fair: stop load tracking when task switched_from_fair()
  2022-07-13  4:04 ` [PATCH v2 09/10] sched/fair: stop load tracking when task switched_from_fair() Chengming Zhou
@ 2022-07-14 12:33   ` Dietmar Eggemann
  2022-07-14 13:43     ` [External] " Chengming Zhou
  2022-07-19 13:20   ` Vincent Guittot
  1 sibling, 1 reply; 37+ messages in thread
From: Dietmar Eggemann @ 2022-07-14 12:33 UTC (permalink / raw)
  To: Chengming Zhou, mingo, peterz, vincent.guittot, rostedt, bsegall,
	vschneid
  Cc: linux-kernel

On 13/07/2022 06:04, Chengming Zhou wrote:
> The same reason as the previous commit, if we don't reset the
> sched_avg last_update_time to 0, after a while in switched_to_fair():
> 
> switched_to_fair
>   attach_task_cfs_rq
>     attach_entity_cfs_rq
>       update_load_avg
>         __update_load_avg_se(now, cfs_rq, se)
> 
> The delta (now - sa->last_update_time) will wrongly contribute/decay
> sched_avg depends on the task running/runnable status at that time.

IMHO, a queued !fair task when switching back to fair will already be
enqueued (attached) as a fair task in __sched_setscheduler() prior to
the check_class_changed() call.

I can't see how this will work with your proposed change in using
last_update_time=0 for fair->!fair->fair class changes?

> This patch reset it's sched_avg last_update_time to 0, stop load
> tracking for !fair task, later in switched_to_fair() ->
> update_load_avg(), we can use its saved sched_avg.
> 
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> ---
>  kernel/sched/fair.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 50f65a2ede32..576028f5a09e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11552,6 +11552,11 @@ static void attach_task_cfs_rq(struct task_struct *p)
>  static void switched_from_fair(struct rq *rq, struct task_struct *p)
>  {
>  	detach_task_cfs_rq(p);
> +
> +#ifdef CONFIG_SMP
> +	/* Stop load tracking for !fair task */
> +	p->se.avg.last_update_time = 0;
> +#endif
>  }
>  
>  static void switched_to_fair(struct rq *rq, struct task_struct *p)



* Re: [External] Re: [PATCH v2 03/10] sched/fair: maintain task se depth in set_task_rq()
  2022-07-14 12:30   ` Dietmar Eggemann
@ 2022-07-14 13:03     ` Chengming Zhou
  0 siblings, 0 replies; 37+ messages in thread
From: Chengming Zhou @ 2022-07-14 13:03 UTC (permalink / raw)
  To: Dietmar Eggemann, mingo, peterz, vincent.guittot, rostedt,
	bsegall, vschneid
  Cc: linux-kernel

On 2022/7/14 20:30, Dietmar Eggemann wrote:
> On 13/07/2022 06:04, Chengming Zhou wrote:
>> Previously we only maintain task se depth in task_move_group_fair(),
>> if a !fair task change task group, its se depth will not be updated,
>> so commit eb7a59b2c888 ("sched/fair: Reset se-depth when task switched to FAIR")
>> fix the problem by updating se depth in switched_to_fair() too.
> 
> Maybe it's worth mentioning how the se.depth setting from
> task_move_group_fair() and switched_to_fair() went into
> attach_task_cfs_rq() with commit daa59407b558 ("sched/fair: Unify
> switched_{from,to}_fair() and task_move_group_fair()"}  and further into
> attach_entity_cfs_rq() with commit df217913e72e ("sched/fair: Factorize
> attach/detach entity").
>

Good point, I will add this part in the next version.

Thanks for your review!

>> This patch move task se depth maintainence to set_task_rq(), which will be
>> called when CPU/cgroup change, so its depth will always be correct.
>>
>> This patch is preparation for the next patch.
>>
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> 
> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> 
>> ---
>>  kernel/sched/fair.c  | 8 --------
>>  kernel/sched/sched.h | 1 +
>>  2 files changed, 1 insertion(+), 8 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 2a3e12ead144..bf595b622656 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -11539,14 +11539,6 @@ static void attach_entity_cfs_rq(struct sched_entity *se)
>>  {
>>  	struct cfs_rq *cfs_rq = cfs_rq_of(se);
>>  
>> -#ifdef CONFIG_FAIR_GROUP_SCHED
>> -	/*
>> -	 * Since the real-depth could have been changed (only FAIR
>> -	 * class maintain depth value), reset depth properly.
>> -	 */
>> -	se->depth = se->parent ? se->parent->depth + 1 : 0;
>> -#endif
>> -
>>  	/* Synchronize entity with its cfs_rq */
>>  	update_load_avg(cfs_rq, se, sched_feat(ATTACH_AGE_LOAD) ? 0 : SKIP_AGE_LOAD);
>>  	attach_entity_load_avg(cfs_rq, se);
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index aad7f5ee9666..8cc3eb7b86cd 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -1940,6 +1940,7 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
>>  	set_task_rq_fair(&p->se, p->se.cfs_rq, tg->cfs_rq[cpu]);
>>  	p->se.cfs_rq = tg->cfs_rq[cpu];
>>  	p->se.parent = tg->se[cpu];
>> +	p->se.depth = tg->se[cpu] ? tg->se[cpu]->depth + 1 : 0;
>>  #endif
>>  
>>  #ifdef CONFIG_RT_GROUP_SCHED
> 


* Re: [External] Re: [PATCH v2 04/10] sched/fair: remove redundant cpu_cgrp_subsys->fork()
  2022-07-14 12:31   ` Dietmar Eggemann
@ 2022-07-14 13:06     ` Chengming Zhou
  0 siblings, 0 replies; 37+ messages in thread
From: Chengming Zhou @ 2022-07-14 13:06 UTC (permalink / raw)
  To: Dietmar Eggemann, mingo, peterz, vincent.guittot, rostedt,
	bsegall, vschneid
  Cc: linux-kernel

On 2022/7/14 20:31, Dietmar Eggemann wrote:
> On 13/07/2022 06:04, Chengming Zhou wrote:
>> We use cpu_cgrp_subsys->fork() to set task group for the new fair task
>> in cgroup_post_fork().
>>
>> Since commit b1e8206582f9 ("sched: Fix yet more sched_fork() races")
>> has already set task group for the new fair task in sched_cgroup_fork(),
>> so cpu_cgrp_subsys->fork() can be removed.
>>
>>   cgroup_can_fork()	--> pin parent's sched_task_group
>>   sched_cgroup_fork()
> 
> Don't we set the task group (`p->sched_task_group = tg`) now directly in
> sched_cgroup_fork() and not in __set_task_cpu()?

Correct, it's my fault; what I wanted to say is "__set_task_cpu() --> set_task_rq()".
I will fix it in the next version.

> 
>>     __set_task_cpu	--> set task group
>>   cgroup_post_fork()
>>     ss->fork() := cpu_cgroup_fork()	--> set again
> 
> nit pick:
> 
> cpu_cgroup_fork() -> sched_change_group(..., TASK_SET_GROUP)

Ok, will add this.

Thanks.

> 
>> After this patch's change, task_change_group_fair() only need to
>> care about task cgroup migration, make the code much simplier.
>>
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
> 
> LGTM too.
> 
> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> 
> [...]


* Re: [External] Re: [PATCH v2 06/10] sched/fair: delete superfluous SKIP_AGE_LOAD
  2022-07-14 12:33   ` Dietmar Eggemann
@ 2022-07-14 13:24     ` Chengming Zhou
  0 siblings, 0 replies; 37+ messages in thread
From: Chengming Zhou @ 2022-07-14 13:24 UTC (permalink / raw)
  To: Dietmar Eggemann, mingo, peterz, vincent.guittot, rostedt,
	bsegall, vschneid
  Cc: linux-kernel

On 2022/7/14 20:33, Dietmar Eggemann wrote:
> On 13/07/2022 06:04, Chengming Zhou wrote:
>> There are three types of attach_entity_cfs_rq():
>>
>> 1. task migrate to CPU
>> 2. task move to cgroup
>> 3. task switched to fair from !fair
>>
>> The case1 and case2 already have sched_avg last_update_time
>> reset to 0 when attach_entity_cfs_rq().
>>
>> We will make case3 also have last_update_time reset to 0 when
>> attach_entity_cfs_rq() in the following patches.
>>
>> So it makes no difference whether SKIP_AGE_LOAD is set or not.
> 
> Wouldn't this patch make more sense after 09/10?
> 

Yes, this patch was put last in v1.

In this v2, the following patch 07/10 makes attach_entity_cfs_rq() use
update_load_avg(DO_ATTACH) to do a conditional attach.

So I want to get rid of "sched_feat(ATTACH_AGE_LOAD) ? 0 : SKIP_AGE_LOAD"
earlier, in this patch; otherwise we would have to write it like this:


  int flags = UPDATE_TG | DO_ATTACH;

  update_load_avg(cfs_rq, se, sched_feat(ATTACH_AGE_LOAD) ? flags : SKIP_AGE_LOAD | flags);


Thanks.

> [...]


* Re: [External] Re: [PATCH v2 09/10] sched/fair: stop load tracking when task switched_from_fair()
  2022-07-14 12:33   ` Dietmar Eggemann
@ 2022-07-14 13:43     ` Chengming Zhou
  2022-07-15 11:15       ` Dietmar Eggemann
  0 siblings, 1 reply; 37+ messages in thread
From: Chengming Zhou @ 2022-07-14 13:43 UTC (permalink / raw)
  To: Dietmar Eggemann, mingo, peterz, vincent.guittot, rostedt,
	bsegall, vschneid
  Cc: linux-kernel

On 2022/7/14 20:33, Dietmar Eggemann wrote:
> On 13/07/2022 06:04, Chengming Zhou wrote:
>> The same reason as the previous commit, if we don't reset the
>> sched_avg last_update_time to 0, after a while in switched_to_fair():
>>
>> switched_to_fair
>>   attach_task_cfs_rq
>>     attach_entity_cfs_rq
>>       update_load_avg
>>         __update_load_avg_se(now, cfs_rq, se)
>>
>> The delta (now - sa->last_update_time) will wrongly contribute/decay
>> sched_avg depends on the task running/runnable status at that time.
> 
> IMHO, a queued !fair task when switching back to fair will already be
> enqueued (attached) as a fair task in __sched_setscheduler() prior to
> the check_class_changed() call.

Right, this is true for a queued !fair task; it will be enqueued (attached)
before check_class_changed().

enqueue_task_fair()
  enqueue_entity()
    update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH)
      if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))  --> true
        __update_load_avg_se(now, cfs_rq, se)  --> (1)
check_class_changed()
  switched_to_fair()
    attach_task_cfs_rq()
      attach_entity_cfs_rq()
        update_load_avg(cfs_rq, se, sched_feat(ATTACH_AGE_LOAD) ? 0 : SKIP_AGE_LOAD)
          if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))  --> true
            __update_load_avg_se(now, cfs_rq, se)  --> (2)


1. for queued !fair: (1) delta = (now - last_update_time), last_update_time is the time
   when switched_from_fair().

2. for !queued !fair: (2) delta = (now - last_update_time), last_update_time is the time
   when switched_from_fair().

The scenario in the commit message only covers the !queued !fair case; I forgot the
queued !fair case, but the problem is the same.


> 
> I can't see how this will work with your proposed change in using
> last_update_time=0 for fair->!fair->fair class changes?
If we reset last_update_time=0 for !fair task, then:

1. for queued !fair: will not do (1) since the if condition is false.

2. for !queued !fair: will not do (2) since the if condition is false.

Thanks.

> 
>> This patch reset it's sched_avg last_update_time to 0, stop load
>> tracking for !fair task, later in switched_to_fair() ->
>> update_load_avg(), we can use its saved sched_avg.
>>
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>> ---
>>  kernel/sched/fair.c | 5 +++++
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 50f65a2ede32..576028f5a09e 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -11552,6 +11552,11 @@ static void attach_task_cfs_rq(struct task_struct *p)
>>  static void switched_from_fair(struct rq *rq, struct task_struct *p)
>>  {
>>  	detach_task_cfs_rq(p);
>> +
>> +#ifdef CONFIG_SMP
>> +	/* Stop load tracking for !fair task */
>> +	p->se.avg.last_update_time = 0;
>> +#endif
>>  }
>>  
>>  static void switched_to_fair(struct rq *rq, struct task_struct *p)
> 


* Re: [External] Re: [PATCH v2 09/10] sched/fair: stop load tracking when task switched_from_fair()
  2022-07-14 13:43     ` [External] " Chengming Zhou
@ 2022-07-15 11:15       ` Dietmar Eggemann
  2022-07-15 16:35         ` Chengming Zhou
  0 siblings, 1 reply; 37+ messages in thread
From: Dietmar Eggemann @ 2022-07-15 11:15 UTC (permalink / raw)
  To: Chengming Zhou, mingo, peterz, vincent.guittot, rostedt, bsegall,
	vschneid
  Cc: linux-kernel

On 14/07/2022 15:43, Chengming Zhou wrote:
> On 2022/7/14 20:33, Dietmar Eggemann wrote:
>> On 13/07/2022 06:04, Chengming Zhou wrote:

[...]

>> IMHO, a queued !fair task when switching back to fair will already be
>> enqueued (attached) as a fair task in __sched_setscheduler() prior to
>> the check_class_changed() call.
> 
> Right, this is true for a queued !fair task, it will enqueued (attached) before
> check_class_changed().
> 
> enqueue_task_fair()
>   enqueue_entity()
>     update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH)
>       if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))  --> true
>         __update_load_avg_se(now, cfs_rq, se)  --> (1)
> check_class_changed()
>   switched_to_fair()
>     attach_task_cfs_rq()
>       attach_entity_cfs_rq()
>         update_load_avg(cfs_rq, se, sched_feat(ATTACH_AGE_LOAD) ? 0 : SKIP_AGE_LOAD)
>           if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))  --> true
>             __update_load_avg_se(now, cfs_rq, se)  --> (2)
> 
> 
> 1. for queued !fair: (1) delta = (now - last_update_time), last_update_time is the time
>    when switched_from_fair().
> 
> 2. for !queued !fair: (2) delta = (now - last_update_time), last_update_time is the time
>    when switched_from_fair().
> 
> The scenario in the commit message only cover !queued !fair case, I forget the queued !fair
> case, their problem is the same.

OK, that makes sense to me then.

>> I can't see how this will work with your proposed change in using
>> last_update_time=0 for fair->!fair->fair class changes?
> If we reset last_update_time=0 for !fair task, then:
> 
> 1. for queued !fair: will not do (1) since the if condition is false.
> 
> 2. for !queued !fair: will not do (2) since the if condition is false.
OK.

[...]

>>> This patch reset it's sched_avg last_update_time to 0, stop load
>>> tracking for !fair task, later in switched_to_fair() ->
>>> update_load_avg(), we can use its saved sched_avg.
>>>
>>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>>> ---
>>>  kernel/sched/fair.c | 5 +++++
>>>  1 file changed, 5 insertions(+)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 50f65a2ede32..576028f5a09e 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -11552,6 +11552,11 @@ static void attach_task_cfs_rq(struct task_struct *p)
>>>  static void switched_from_fair(struct rq *rq, struct task_struct *p)
>>>  {
>>>  	detach_task_cfs_rq(p);
>>> +
>>> +#ifdef CONFIG_SMP
>>> +	/* Stop load tracking for !fair task */

You're not really stopping p->se load tracking by doing this. We don't
do this outside fair anyway. IMHO, you freeze p->se's PELT _avg/_sum
values to be used as initial values when re-entering fair.

>>> +	p->se.avg.last_update_time = 0;
>>> +#endif
>>>  }
>>>  
>>>  static void switched_to_fair(struct rq *rq, struct task_struct *p)


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 07/10] sched/fair: use update_load_avg() to attach/detach entity load_avg
  2022-07-13  4:04 ` [PATCH v2 07/10] sched/fair: use update_load_avg() to attach/detach entity load_avg Chengming Zhou
@ 2022-07-15 11:18   ` Dietmar Eggemann
  2022-07-15 16:21     ` [External] " Chengming Zhou
  0 siblings, 1 reply; 37+ messages in thread
From: Dietmar Eggemann @ 2022-07-15 11:18 UTC (permalink / raw)
  To: Chengming Zhou, mingo, peterz, vincent.guittot, rostedt, bsegall,
	vschneid
  Cc: linux-kernel

On 13/07/2022 06:04, Chengming Zhou wrote:
> Since update_load_avg() support DO_ATTACH and DO_DETACH now, we can
> use update_load_avg() to implement attach/detach entity load_avg.
> 
> Another advantage of using update_load_avg() is that it will check
> last_update_time before attach or detach, instead of unconditional
> attach/detach in the current code.
> 
> This way can avoid some corner problematic cases of load tracking,
> like twice attach problem, detach unattached NEW task problem.

This explanation is somewhat hard to follow for me. Since both issues
have been fixed already (you mention this further below), are you saying
that with your change you don't reintroduce them?

> 1. switch to fair class (twice attach problem)
> 
> p->sched_class = fair_class;  --> p.se->avg.last_update_time = 0
> if (queued)
>   enqueue_task(p);
>     ...
>       enqueue_entity()
>         update_load_avg(UPDATE_TG | DO_ATTACH)
>           if (!se->avg.last_update_time && (flags & DO_ATTACH))  --> true
>             attach_entity_load_avg()  --> attached, will set last_update_time
> check_class_changed()
>   switched_from() (!fair)
>   switched_to()   (fair)
>     switched_to_fair()
>       attach_entity_load_avg()  --> unconditional attach again!
> 
> 2. change cgroup of NEW task (detach unattached task problem)
> 
> sched_move_group(p)
>   if (queued)
>     dequeue_task()
>   task_move_group_fair()
>     detach_task_cfs_rq()
>       detach_entity_load_avg()  --> detach unattached NEW task
>     set_task_rq()
>     attach_task_cfs_rq()
>       attach_entity_load_avg()
>   if (queued)
>     enqueue_task()
> 
> These problems have been fixed in commit 7dc603c9028e
> ("sched/fair: Fix PELT integrity for new tasks"), which also
> bring its own problems.
> 
> First, it add a new task state TASK_NEW and an unnessary limitation
> that we would fail when change the cgroup of TASK_NEW tasks.
> 
> Second, it attach entity load_avg in post_init_entity_util_avg(),
> in which we only set sched_avg last_update_time for !fair tasks,
> will cause PELT integrity problem when switched_to_fair().

I guess those PELT integrity problems are less severe since we have the
enqueue_task_fair() before the switched_to_fair() for enqueued tasks. So
we always decay the time the task spent outside fair.

It looks to me like you want to replace this with your `freeze PELT when
outside fair` model.

> This patch make update_load_avg() the only location of attach/detach,
> and can handle these corner cases like change cgroup of NEW tasks,
> by checking last_update_time before attach/detach.

[...]

> @@ -11527,9 +11522,7 @@ static void detach_entity_cfs_rq(struct sched_entity *se)
>  	struct cfs_rq *cfs_rq = cfs_rq_of(se);
>  
>  	/* Catch up with the cfs_rq and remove our load when we leave */
> -	update_load_avg(cfs_rq, se, 0);
> -	detach_entity_load_avg(cfs_rq, se);
> -	update_tg_load_avg(cfs_rq);
> +	update_load_avg(cfs_rq, se, UPDATE_TG | DO_DETACH);

IMHO, the DO_[DE|AT]TACH comments in update_load_avg() would have to be
updated in this case.

[...]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [External] Re: [PATCH v2 07/10] sched/fair: use update_load_avg() to attach/detach entity load_avg
  2022-07-15 11:18   ` Dietmar Eggemann
@ 2022-07-15 16:21     ` Chengming Zhou
  2022-07-19 10:29       ` Vincent Guittot
  2022-07-19 15:02       ` Dietmar Eggemann
  0 siblings, 2 replies; 37+ messages in thread
From: Chengming Zhou @ 2022-07-15 16:21 UTC (permalink / raw)
  To: Dietmar Eggemann, mingo, peterz, vincent.guittot, rostedt,
	bsegall, vschneid
  Cc: linux-kernel

On 2022/7/15 19:18, Dietmar Eggemann wrote:
> On 13/07/2022 06:04, Chengming Zhou wrote:
>> Since update_load_avg() support DO_ATTACH and DO_DETACH now, we can
>> use update_load_avg() to implement attach/detach entity load_avg.
>>
>> Another advantage of using update_load_avg() is that it will check
>> last_update_time before attach or detach, instead of unconditional
>> attach/detach in the current code.
>>
>> This way can avoid some corner problematic cases of load tracking,
>> like twice attach problem, detach unattached NEW task problem.
> 
> This explanation is somewhat hard to follow for me. Since both issues
> have been fixed already (you mention this further below) you're saying
> that with you change you don't reintroduce them?

Sorry for the unclear explanation.

Yes, both issues have been fixed already; what I want to say is that the bugfix
brings its own problems and limitations, mentioned below.

So I want to use another way to solve these problems better.

> 
>> 1. switch to fair class (twice attach problem)
>>
>> p->sched_class = fair_class;  --> p.se->avg.last_update_time = 0
>> if (queued)
>>   enqueue_task(p);
>>     ...
>>       enqueue_entity()
>>         update_load_avg(UPDATE_TG | DO_ATTACH)
>>           if (!se->avg.last_update_time && (flags & DO_ATTACH))  --> true
>>             attach_entity_load_avg()  --> attached, will set last_update_time
>> check_class_changed()
>>   switched_from() (!fair)
>>   switched_to()   (fair)
>>     switched_to_fair()
>>       attach_entity_load_avg()  --> unconditional attach again!
>>
>> 2. change cgroup of NEW task (detach unattached task problem)
>>
>> sched_move_group(p)
>>   if (queued)
>>     dequeue_task()
>>   task_move_group_fair()
>>     detach_task_cfs_rq()
>>       detach_entity_load_avg()  --> detach unattached NEW task
>>     set_task_rq()
>>     attach_task_cfs_rq()
>>       attach_entity_load_avg()
>>   if (queued)
>>     enqueue_task()
>>
>> These problems have been fixed in commit 7dc603c9028e
>> ("sched/fair: Fix PELT integrity for new tasks"), which also
>> bring its own problems.
>>
>> First, it add a new task state TASK_NEW and an unnessary limitation
>> that we would fail when change the cgroup of TASK_NEW tasks.

This is the limitation that the bugfix has brought.

We can't change the cgroup of a task with last_update_time=0, or switch it to
fair, if we don't have conditional detach/attach.

So we have to do two things (the second is sketched below):

1. !fair tasks also need to set last_update_time.
2. cpu_cgroup_can_attach() has to wait for TASK_NEW tasks to be fully attached.
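
For reference, the TASK_NEW check that point 2 refers to (added by commit
7dc603c9028e) lives in cpu_cgroup_can_attach() and looks roughly like this;
simplified from memory, so the exact code may differ:

	cgroup_taskset_for_each(task, css, tset) {
		...
		raw_spin_lock_irq(&task->pi_lock);
		/*
		 * Refuse to move a task that wake_up_new_task() has not
		 * attached yet, otherwise the move would try to detach
		 * something that was never attached.
		 */
		if (READ_ONCE(task->__state) == TASK_NEW)
			ret = -EINVAL;
		raw_spin_unlock_irq(&task->pi_lock);

		if (ret)
			break;
	}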

>>
>> Second, it attach entity load_avg in post_init_entity_util_avg(),
>> in which we only set sched_avg last_update_time for !fair tasks,
>> will cause PELT integrity problem when switched_to_fair().
> 
> I guess those PELT integrity problems are less severe since we have the
> enqueue_task_fair() before the switched_to_fair() for enqueued tasks. So
> we always decay the time the task spend outside fair.

enqueue_task_fair()
  enqueue_entity()
    update_load_avg()
      if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))  --> true
        __update_load_avg_se(now, cfs_rq, se);  --> (1)

We can see above that, for a queued !fair task, (1) will decay the delta time
(now - se.avg.last_update_time) even for a NEW !fair task.

> 
> Looks to me that you want to replace this by your `freeze PELT when
> outside fair` model.

Yes, I want to freeze PELT for two !fair cases:

1. a !fair task that has never been fair before: it will still have its init
   load_avg when switching to fair.

2. a !fair task that has been through switched_from_fair(): it will still keep
   its latest load_avg when switching to fair.

> 
>> This patch make update_load_avg() the only location of attach/detach,
>> and can handle these corner cases like change cgroup of NEW tasks,
>> by checking last_update_time before attach/detach.
> 
> [...]
> 
>> @@ -11527,9 +11522,7 @@ static void detach_entity_cfs_rq(struct sched_entity *se)
>>  	struct cfs_rq *cfs_rq = cfs_rq_of(se);
>>  
>>  	/* Catch up with the cfs_rq and remove our load when we leave */
>> -	update_load_avg(cfs_rq, se, 0);
>> -	detach_entity_load_avg(cfs_rq, se);
>> -	update_tg_load_avg(cfs_rq);
>> +	update_load_avg(cfs_rq, se, UPDATE_TG | DO_DETACH);
> 
> IMHO, the DO_[DE|AT]TACH comments in update_load_avg() would have to be
> updated in this case.

Correct, will do.

Thanks.

> 
> [...]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [External] Re: [PATCH v2 09/10] sched/fair: stop load tracking when task switched_from_fair()
  2022-07-15 11:15       ` Dietmar Eggemann
@ 2022-07-15 16:35         ` Chengming Zhou
  0 siblings, 0 replies; 37+ messages in thread
From: Chengming Zhou @ 2022-07-15 16:35 UTC (permalink / raw)
  To: Dietmar Eggemann, mingo, peterz, vincent.guittot, rostedt,
	bsegall, vschneid
  Cc: linux-kernel

On 2022/7/15 19:15, Dietmar Eggemann wrote:
> On 14/07/2022 15:43, Chengming Zhou wrote:
>> On 2022/7/14 20:33, Dietmar Eggemann wrote:
>>> On 13/07/2022 06:04, Chengming Zhou wrote:
> 
> [...]
> 
>>> IMHO, a queued !fair task when switching back to fair will already be
>>> enqueued (attached) as a fair task in __sched_setscheduler() prior to
>>> the check_class_changed() call.
>>
>> Right, this is true for a queued !fair task, it will enqueued (attached) before
>> check_class_changed().
>>
>> enqueue_task_fair()
>>   enqueue_entity()
>>     update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH)
>>       if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))  --> true
>>         __update_load_avg_se(now, cfs_rq, se)  --> (1)
>> check_class_changed()
>>   switched_to_fair()
>>     attach_task_cfs_rq()
>>       attach_entity_cfs_rq()
>>         update_load_avg(cfs_rq, se, sched_feat(ATTACH_AGE_LOAD) ? 0 : SKIP_AGE_LOAD)
>>           if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))  --> true
>>             __update_load_avg_se(now, cfs_rq, se)  --> (2)
>>
>>
>> 1. for queued !fair: (1) delta = (now - last_update_time), last_update_time is the time
>>    when switched_from_fair().
>>
>> 2. for !queued !fair: (2) delta = (now - last_update_time), last_update_time is the time
>>    when switched_from_fair().
>>
>> The scenario in the commit message only cover !queued !fair case, I forget the queued !fair
>> case, their problem is the same.
> 
> OK, that makes sense to me then.
> 
>>> I can't see how this will work with your proposed change in using
>>> last_update_time=0 for fair->!fair->fair class changes?
>> If we reset last_update_time=0 for !fair task, then:
>>
>> 1. for queued !fair: will not do (1) since the if condition is false.
>>
>> 2. for !queued !fair: will not do (2) since the if condition is false.
> OK.
> 
> [...]
> 
>>>> This patch reset it's sched_avg last_update_time to 0, stop load
>>>> tracking for !fair task, later in switched_to_fair() ->
>>>> update_load_avg(), we can use its saved sched_avg.
>>>>
>>>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>>>> ---
>>>>  kernel/sched/fair.c | 5 +++++
>>>>  1 file changed, 5 insertions(+)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 50f65a2ede32..576028f5a09e 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -11552,6 +11552,11 @@ static void attach_task_cfs_rq(struct task_struct *p)
>>>>  static void switched_from_fair(struct rq *rq, struct task_struct *p)
>>>>  {
>>>>  	detach_task_cfs_rq(p);
>>>> +
>>>> +#ifdef CONFIG_SMP
>>>> +	/* Stop load tracking for !fair task */
> 
> You're not really stopping p->se load tracking by doing this. We don't
> do this outside fair anyway. IMHO, you freeze p->se's PELT _avg/_sum
> values to be used as initial values when re-entering fair.
> 

Yes, you are right, this comment is misleading and wrong; I will delete it.

Thanks very much for your review!


>>>> +	p->se.avg.last_update_time = 0;
>>>> +#endif
>>>>  }
>>>>  
>>>>  static void switched_to_fair(struct rq *rq, struct task_struct *p)
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 03/10] sched/fair: maintain task se depth in set_task_rq()
  2022-07-13  4:04 ` [PATCH v2 03/10] sched/fair: maintain task se depth in set_task_rq() Chengming Zhou
  2022-07-14 12:30   ` Dietmar Eggemann
@ 2022-07-18  7:16   ` Vincent Guittot
  1 sibling, 0 replies; 37+ messages in thread
From: Vincent Guittot @ 2022-07-18  7:16 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: mingo, peterz, dietmar.eggemann, rostedt, bsegall, vschneid,
	linux-kernel

On Wed, 13 Jul 2022 at 06:04, Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> Previously we only maintain task se depth in task_move_group_fair(),
> if a !fair task change task group, its se depth will not be updated,
> so commit eb7a59b2c888 ("sched/fair: Reset se-depth when task switched to FAIR")
> fix the problem by updating se depth in switched_to_fair() too.
>
> This patch move task se depth maintainence to set_task_rq(), which will be
> called when CPU/cgroup change, so its depth will always be correct.
>
> This patch is preparation for the next patch.
>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

> ---
>  kernel/sched/fair.c  | 8 --------
>  kernel/sched/sched.h | 1 +
>  2 files changed, 1 insertion(+), 8 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2a3e12ead144..bf595b622656 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11539,14 +11539,6 @@ static void attach_entity_cfs_rq(struct sched_entity *se)
>  {
>         struct cfs_rq *cfs_rq = cfs_rq_of(se);
>
> -#ifdef CONFIG_FAIR_GROUP_SCHED
> -       /*
> -        * Since the real-depth could have been changed (only FAIR
> -        * class maintain depth value), reset depth properly.
> -        */
> -       se->depth = se->parent ? se->parent->depth + 1 : 0;
> -#endif
> -
>         /* Synchronize entity with its cfs_rq */
>         update_load_avg(cfs_rq, se, sched_feat(ATTACH_AGE_LOAD) ? 0 : SKIP_AGE_LOAD);
>         attach_entity_load_avg(cfs_rq, se);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index aad7f5ee9666..8cc3eb7b86cd 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1940,6 +1940,7 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
>         set_task_rq_fair(&p->se, p->se.cfs_rq, tg->cfs_rq[cpu]);
>         p->se.cfs_rq = tg->cfs_rq[cpu];
>         p->se.parent = tg->se[cpu];
> +       p->se.depth = tg->se[cpu] ? tg->se[cpu]->depth + 1 : 0;
>  #endif
>
>  #ifdef CONFIG_RT_GROUP_SCHED
> --
> 2.36.1
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 05/10] sched/fair: reset sched_avg last_update_time before set_task_rq()
  2022-07-13  4:04 ` [PATCH v2 05/10] sched/fair: reset sched_avg last_update_time before set_task_rq() Chengming Zhou
  2022-07-14 12:31   ` Dietmar Eggemann
@ 2022-07-19  8:49   ` Vincent Guittot
  1 sibling, 0 replies; 37+ messages in thread
From: Vincent Guittot @ 2022-07-19  8:49 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: mingo, peterz, dietmar.eggemann, rostedt, bsegall, vschneid,
	linux-kernel

On Wed, 13 Jul 2022 at 06:05, Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> set_task_rq() -> set_task_rq_fair() will try to synchronize the blocked
> task's sched_avg when migrate, which is not needed for already detached
> task.
>
> task_change_group_fair() will detached the task sched_avg from prev cfs_rq
> first, so reset sched_avg last_update_time before set_task_rq() to avoid that.
>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8992ce5e73d2..171bc22bc142 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11637,12 +11637,12 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
>  static void task_change_group_fair(struct task_struct *p)
>  {
>         detach_task_cfs_rq(p);
> -       set_task_rq(p, task_cpu(p));
>
>  #ifdef CONFIG_SMP
>         /* Tell se's cfs_rq has been changed -- migrated */
>         p->se.avg.last_update_time = 0;
>  #endif
> +       set_task_rq(p, task_cpu(p));
>         attach_task_cfs_rq(p);
>  }
>
> --
> 2.36.1
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [External] Re: [PATCH v2 07/10] sched/fair: use update_load_avg() to attach/detach entity load_avg
  2022-07-15 16:21     ` [External] " Chengming Zhou
@ 2022-07-19 10:29       ` Vincent Guittot
  2022-07-20 13:40         ` Chengming Zhou
  2022-07-19 15:02       ` Dietmar Eggemann
  1 sibling, 1 reply; 37+ messages in thread
From: Vincent Guittot @ 2022-07-19 10:29 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Dietmar Eggemann, mingo, peterz, rostedt, bsegall, vschneid,
	linux-kernel

On Fri, 15 Jul 2022 at 18:21, Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> On 2022/7/15 19:18, Dietmar Eggemann wrote:
> > On 13/07/2022 06:04, Chengming Zhou wrote:
> >> Since update_load_avg() support DO_ATTACH and DO_DETACH now, we can
> >> use update_load_avg() to implement attach/detach entity load_avg.
> >>
> >> Another advantage of using update_load_avg() is that it will check
> >> last_update_time before attach or detach, instead of unconditional
> >> attach/detach in the current code.
> >>
> >> This way can avoid some corner problematic cases of load tracking,
> >> like twice attach problem, detach unattached NEW task problem.
> >
> > This explanation is somewhat hard to follow for me. Since both issues
> > have been fixed already (you mention this further below) you're saying
> > that with you change you don't reintroduce them?
>
> Sorry for this not very clear explanation.
>
> Yes, both issues have been fixed already, what I want to say is that bugfix
> brings its own problem and limitation mentioned below.

As Dietmar said, the commit message is misleading because someone could
think you are fixing these bugs, whereas that's not the case.

>
> So I want to use another way to solve these problems better.
>
> >
> >> 1. switch to fair class (twice attach problem)
> >>
> >> p->sched_class = fair_class;  --> p.se->avg.last_update_time = 0
> >> if (queued)
> >>   enqueue_task(p);
> >>     ...
> >>       enqueue_entity()
> >>         update_load_avg(UPDATE_TG | DO_ATTACH)
> >>           if (!se->avg.last_update_time && (flags & DO_ATTACH))  --> true
> >>             attach_entity_load_avg()  --> attached, will set last_update_time
> >> check_class_changed()
> >>   switched_from() (!fair)
> >>   switched_to()   (fair)
> >>     switched_to_fair()
> >>       attach_entity_load_avg()  --> unconditional attach again!
> >>
> >> 2. change cgroup of NEW task (detach unattached task problem)
> >>
> >> sched_move_group(p)
> >>   if (queued)
> >>     dequeue_task()
> >>   task_move_group_fair()
> >>     detach_task_cfs_rq()
> >>       detach_entity_load_avg()  --> detach unattached NEW task
> >>     set_task_rq()
> >>     attach_task_cfs_rq()
> >>       attach_entity_load_avg()
> >>   if (queued)
> >>     enqueue_task()
> >>
> >> These problems have been fixed in commit 7dc603c9028e
> >> ("sched/fair: Fix PELT integrity for new tasks"), which also
> >> bring its own problems.
> >>
> >> First, it add a new task state TASK_NEW and an unnessary limitation
> >> that we would fail when change the cgroup of TASK_NEW tasks.
>
> This is the limitation that bugfix has brought.
>
> We can't change cgroup or switch to fair for task with last_update_time=0
> if we don't have conditional detach/attach.
>
> So we have to:
>
> 1. !fair task also need to set last_update_time.
> 2. cpu_cgroup_can_attach() have to wait for TASK_NEW to fully attached.
>
> >>
> >> Second, it attach entity load_avg in post_init_entity_util_avg(),
> >> in which we only set sched_avg last_update_time for !fair tasks,
> >> will cause PELT integrity problem when switched_to_fair().
> >
> > I guess those PELT integrity problems are less severe since we have the
> > enqueue_task_fair() before the switched_to_fair() for enqueued tasks. So
> > we always decay the time the task spend outside fair.
>
> enqueue_task_fair()
>   enqueue_entity()
>     update_load_avg()
>       if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))  --> true
>         __update_load_avg_se(now, cfs_rq, se);  --> (1)
>
> We can see above, for queued !fair task, (1) will deay the delta time
> (now - se.avg.last_update_time) even for a NEW !fair task.
>
> >
> > Looks to me that you want to replace this by your `freeze PELT when
> > outside fair` model.
>
> Yes, want to freeze PELT for two !fair cases:
>
> 1. !fair task hasn't been fair before: will still have its init load_avg
>    when switch to fair.

But I'm not sure it makes sense to keep these init values. As an
example, the util_avg is set according to the cpu utilization at the
time of the task creation. I would tend to decay them as these init
values become less and less relevant.

so we should return early in post_init_entity_util_avg() and not set
util_avg if the sched class is not cfs.
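
Something along these lines, just a sketch of the idea rather than a tested
patch (names as in the current code, body heavily abbreviated):

	void post_init_entity_util_avg(struct task_struct *p)
	{
		struct sched_entity *se = &p->se;

		/*
		 * !fair tasks keep a zeroed sched_avg and (with the proposed
		 * model) last_update_time == 0, so there is nothing to set
		 * up or attach here.
		 */
		if (p->sched_class != &fair_sched_class)
			return;

		/* ... existing util_avg/runnable_avg setup ... */
		attach_entity_cfs_rq(se);
	}
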

>
> 2. !fair task has been switched_from_fair(): will still keep its lastest
>    load_avg when switch to fair.
>
> >
> >> This patch make update_load_avg() the only location of attach/detach,
> >> and can handle these corner cases like change cgroup of NEW tasks,
> >> by checking last_update_time before attach/detach.
> >
> > [...]
> >
> >> @@ -11527,9 +11522,7 @@ static void detach_entity_cfs_rq(struct sched_entity *se)
> >>      struct cfs_rq *cfs_rq = cfs_rq_of(se);
> >>
> >>      /* Catch up with the cfs_rq and remove our load when we leave */
> >> -    update_load_avg(cfs_rq, se, 0);
> >> -    detach_entity_load_avg(cfs_rq, se);
> >> -    update_tg_load_avg(cfs_rq);
> >> +    update_load_avg(cfs_rq, se, UPDATE_TG | DO_DETACH);
> >
> > IMHO, the DO_[DE|AT]TACH comments in update_load_avg() would have to be
> > updated in this case.
>
> Correct, will do.
>
> Thanks.
>
> >
> > [...]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 08/10] sched/fair: fix load tracking for new forked !fair task
  2022-07-13  4:04 ` [PATCH v2 08/10] sched/fair: fix load tracking for new forked !fair task Chengming Zhou
@ 2022-07-19 12:35   ` Vincent Guittot
  2022-07-20 13:48     ` [External] " Chengming Zhou
  0 siblings, 1 reply; 37+ messages in thread
From: Vincent Guittot @ 2022-07-19 12:35 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: mingo, peterz, dietmar.eggemann, rostedt, bsegall, vschneid,
	linux-kernel

On Wed, 13 Jul 2022 at 06:05, Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> New forked !fair task will set its sched_avg last_update_time to
> the pelt_clock of cfs_rq, after a while in switched_to_fair():
>
> switched_to_fair
>   attach_task_cfs_rq
>     attach_entity_cfs_rq
>       update_load_avg
>         __update_load_avg_se(now, cfs_rq, se)
>
> the delta (now - sa->last_update_time) will contribute/decay sched_avg
> depends on the task running/runnable status at that time.
>
> This patch don't set sched_avg last_update_time of new forked !fair
> task, leave it to 0. So later in update_load_avg(), we don't need to
> contribute/decay the wrong delta (now - sa->last_update_time).

As mentioned in patch 7, I think it's wrong not to decay the init
values of a !fair task, because they become obsolete if not used quickly
(so we are decaying them).

It would be better to not set them at all in the case of a !fair task.

>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> ---
>  kernel/sched/fair.c | 18 ++----------------
>  1 file changed, 2 insertions(+), 16 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 51fc20c161a3..50f65a2ede32 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -849,22 +849,8 @@ void post_init_entity_util_avg(struct task_struct *p)
>
>         sa->runnable_avg = sa->util_avg;
>
> -       if (p->sched_class != &fair_sched_class) {
> -               /*
> -                * For !fair tasks do:
> -                *
> -               update_cfs_rq_load_avg(now, cfs_rq);
> -               attach_entity_load_avg(cfs_rq, se);
> -               switched_from_fair(rq, p);
> -                *
> -                * such that the next switched_to_fair() has the
> -                * expected state.
> -                */
> -               se->avg.last_update_time = cfs_rq_clock_pelt(cfs_rq);
> -               return;
> -       }
> -
> -       attach_entity_cfs_rq(se);
> +       if (p->sched_class == &fair_sched_class)
> +               attach_entity_cfs_rq(se);
>  }
>
>  #else /* !CONFIG_SMP */
> --
> 2.36.1
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 09/10] sched/fair: stop load tracking when task switched_from_fair()
  2022-07-13  4:04 ` [PATCH v2 09/10] sched/fair: stop load tracking when task switched_from_fair() Chengming Zhou
  2022-07-14 12:33   ` Dietmar Eggemann
@ 2022-07-19 13:20   ` Vincent Guittot
  2022-07-27 10:55     ` Chengming Zhou
  1 sibling, 1 reply; 37+ messages in thread
From: Vincent Guittot @ 2022-07-19 13:20 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: mingo, peterz, dietmar.eggemann, rostedt, bsegall, vschneid,
	linux-kernel

On Wed, 13 Jul 2022 at 06:05, Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> The same reason as the previous commit, if we don't reset the
> sched_avg last_update_time to 0, after a while in switched_to_fair():
>
> switched_to_fair
>   attach_task_cfs_rq
>     attach_entity_cfs_rq
>       update_load_avg
>         __update_load_avg_se(now, cfs_rq, se)
>
> The delta (now - sa->last_update_time) will wrongly contribute/decay
> sched_avg depends on the task running/runnable status at that time.

I think that we always decay the time spent in the !fair class and never contribute.

This one is a bit more complex than the new-task one. Generally
speaking, I would say that we should decay the load while running in
the !fair case, because the value becomes less and less relevant over time.
What's the meaning of PELT values when the task has run for seconds as a
FIFO task? The only case that would need particular attention is the PI
boost case, but there is no way to know how long it runs and sleeps when
!fair.

>
> This patch reset it's sched_avg last_update_time to 0, stop load
> tracking for !fair task, later in switched_to_fair() ->
> update_load_avg(), we can use its saved sched_avg.
>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> ---
>  kernel/sched/fair.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 50f65a2ede32..576028f5a09e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11552,6 +11552,11 @@ static void attach_task_cfs_rq(struct task_struct *p)
>  static void switched_from_fair(struct rq *rq, struct task_struct *p)
>  {
>         detach_task_cfs_rq(p);
> +
> +#ifdef CONFIG_SMP
> +       /* Stop load tracking for !fair task */
> +       p->se.avg.last_update_time = 0;
> +#endif
>  }
>
>  static void switched_to_fair(struct rq *rq, struct task_struct *p)
> --
> 2.36.1
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [External] Re: [PATCH v2 07/10] sched/fair: use update_load_avg() to attach/detach entity load_avg
  2022-07-15 16:21     ` [External] " Chengming Zhou
  2022-07-19 10:29       ` Vincent Guittot
@ 2022-07-19 15:02       ` Dietmar Eggemann
  2022-07-20 13:43         ` Chengming Zhou
  1 sibling, 1 reply; 37+ messages in thread
From: Dietmar Eggemann @ 2022-07-19 15:02 UTC (permalink / raw)
  To: Chengming Zhou, mingo, peterz, vincent.guittot, rostedt, bsegall,
	vschneid
  Cc: linux-kernel

On 15/07/2022 18:21, Chengming Zhou wrote:
> On 2022/7/15 19:18, Dietmar Eggemann wrote:
>> On 13/07/2022 06:04, Chengming Zhou wrote:
>>> Since update_load_avg() support DO_ATTACH and DO_DETACH now, we can
>>> use update_load_avg() to implement attach/detach entity load_avg.
>>>
>>> Another advantage of using update_load_avg() is that it will check
>>> last_update_time before attach or detach, instead of unconditional
>>> attach/detach in the current code.
>>>
>>> This way can avoid some corner problematic cases of load tracking,
>>> like twice attach problem, detach unattached NEW task problem.
>>
>> This explanation is somewhat hard to follow for me. Since both issues
>> have been fixed already (you mention this further below) you're saying
>> that with you change you don't reintroduce them?
> 
> Sorry for this not very clear explanation.
> 
> Yes, both issues have been fixed already, what I want to say is that bugfix
> brings its own problem and limitation mentioned below.
> 
> So I want to use another way to solve these problems better.

[...]

>>> These problems have been fixed in commit 7dc603c9028e
>>> ("sched/fair: Fix PELT integrity for new tasks"), which also
>>> bring its own problems.
>>>
>>> First, it add a new task state TASK_NEW and an unnessary limitation
>>> that we would fail when change the cgroup of TASK_NEW tasks.
> 
> This is the limitation that bugfix has brought.
> 
> We can't change cgroup or switch to fair for task with last_update_time=0
> if we don't have conditional detach/attach.
> 
> So we have to:
> 
> 1. !fair task also need to set last_update_time.
> 2. cpu_cgroup_can_attach() have to wait for TASK_NEW to fully attached.

I see.

`cgroup_migrate_execute() -> cpu_cgroup_[can|]_attach()` has to wait for
`wake_up_new_task() -> WRITE_ONCE(p->__state, TASK_RUNNING)`.

Just to understand this change better: IMHO, this is still the case for
fair tasks, right?

`wake_up_new_task() -> post_init_entity_util_avg() ->
attach_entity_cfs_rq()` has to happen before the fair task can move
between taskgroups in `cgroup_migrate_execute() -> cpu_cgroup_attach()
-> sched_move_task() -> sched_change_group() -> task_change_group_fair()`.

[...]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [External] Re: [PATCH v2 07/10] sched/fair: use update_load_avg() to attach/detach entity load_avg
  2022-07-19 10:29       ` Vincent Guittot
@ 2022-07-20 13:40         ` Chengming Zhou
  2022-07-20 15:34           ` Vincent Guittot
  0 siblings, 1 reply; 37+ messages in thread
From: Chengming Zhou @ 2022-07-20 13:40 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Dietmar Eggemann, mingo, peterz, rostedt, bsegall, vschneid,
	linux-kernel

On 2022/7/19 18:29, Vincent Guittot wrote:
> On Fri, 15 Jul 2022 at 18:21, Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
>>
>> On 2022/7/15 19:18, Dietmar Eggemann wrote:
>>> On 13/07/2022 06:04, Chengming Zhou wrote:
>>>> Since update_load_avg() support DO_ATTACH and DO_DETACH now, we can
>>>> use update_load_avg() to implement attach/detach entity load_avg.
>>>>
>>>> Another advantage of using update_load_avg() is that it will check
>>>> last_update_time before attach or detach, instead of unconditional
>>>> attach/detach in the current code.
>>>>
>>>> This way can avoid some corner problematic cases of load tracking,
>>>> like twice attach problem, detach unattached NEW task problem.
>>>
>>> This explanation is somewhat hard to follow for me. Since both issues
>>> have been fixed already (you mention this further below) you're saying
>>> that with you change you don't reintroduce them?
>>
>> Sorry for this not very clear explanation.
>>
>> Yes, both issues have been fixed already, what I want to say is that bugfix
>> brings its own problem and limitation mentioned below.
> 
> As Dietmar said, the commit message is misleading because someone can
> think you fix these bugs whereas it's not the case

Hi Vincent, thanks for your review! I will rewrite the commit message to avoid
this confusion; sorry for my unclear English.

> 
>>
>> So I want to use another way to solve these problems better.
>>
>>>
>>>> 1. switch to fair class (twice attach problem)
>>>>
>>>> p->sched_class = fair_class;  --> p.se->avg.last_update_time = 0
>>>> if (queued)
>>>>   enqueue_task(p);
>>>>     ...
>>>>       enqueue_entity()
>>>>         update_load_avg(UPDATE_TG | DO_ATTACH)
>>>>           if (!se->avg.last_update_time && (flags & DO_ATTACH))  --> true
>>>>             attach_entity_load_avg()  --> attached, will set last_update_time
>>>> check_class_changed()
>>>>   switched_from() (!fair)
>>>>   switched_to()   (fair)
>>>>     switched_to_fair()
>>>>       attach_entity_load_avg()  --> unconditional attach again!
>>>>
>>>> 2. change cgroup of NEW task (detach unattached task problem)
>>>>
>>>> sched_move_group(p)
>>>>   if (queued)
>>>>     dequeue_task()
>>>>   task_move_group_fair()
>>>>     detach_task_cfs_rq()
>>>>       detach_entity_load_avg()  --> detach unattached NEW task
>>>>     set_task_rq()
>>>>     attach_task_cfs_rq()
>>>>       attach_entity_load_avg()
>>>>   if (queued)
>>>>     enqueue_task()
>>>>
>>>> These problems have been fixed in commit 7dc603c9028e
>>>> ("sched/fair: Fix PELT integrity for new tasks"), which also
>>>> bring its own problems.
>>>>
>>>> First, it add a new task state TASK_NEW and an unnessary limitation
>>>> that we would fail when change the cgroup of TASK_NEW tasks.
>>
>> This is the limitation that bugfix has brought.
>>
>> We can't change cgroup or switch to fair for task with last_update_time=0
>> if we don't have conditional detach/attach.
>>
>> So we have to:
>>
>> 1. !fair task also need to set last_update_time.
>> 2. cpu_cgroup_can_attach() have to wait for TASK_NEW to fully attached.
>>
>>>>
>>>> Second, it attach entity load_avg in post_init_entity_util_avg(),
>>>> in which we only set sched_avg last_update_time for !fair tasks,
>>>> will cause PELT integrity problem when switched_to_fair().
>>>
>>> I guess those PELT integrity problems are less severe since we have the
>>> enqueue_task_fair() before the switched_to_fair() for enqueued tasks. So
>>> we always decay the time the task spend outside fair.
>>
>> enqueue_task_fair()
>>   enqueue_entity()
>>     update_load_avg()
>>       if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))  --> true
>>         __update_load_avg_se(now, cfs_rq, se);  --> (1)
>>
>> We can see above, for queued !fair task, (1) will deay the delta time
>> (now - se.avg.last_update_time) even for a NEW !fair task.
>>
>>>
>>> Looks to me that you want to replace this by your `freeze PELT when
>>> outside fair` model.
>>
>> Yes, want to freeze PELT for two !fair cases:
>>
>> 1. !fair task hasn't been fair before: will still have its init load_avg
>>    when switch to fair.
> 
> But I'm not sure it makes sense to keep these init values. As an
> example, the util_avg is set according to the cpu utilization at the
> time of the task creation. I would tend to decay them as these init
> values become less and less relevant.
> 
> so we should return early in post_init_entity_util_avg() and don't set
> util_avg if sched class is not cfs

Yes, this indeed is a problem if we attach the init sched_avg of a !fair task.
I'm also not sure whether it makes sense to keep them at 0? Will it cause an
unfairness problem between cfs_rqs?

> 
>>
>> 2. !fair task has been switched_from_fair(): will still keep its lastest
>>    load_avg when switch to fair.
>>
>>>
>>>> This patch make update_load_avg() the only location of attach/detach,
>>>> and can handle these corner cases like change cgroup of NEW tasks,
>>>> by checking last_update_time before attach/detach.
>>>
>>> [...]
>>>
>>>> @@ -11527,9 +11522,7 @@ static void detach_entity_cfs_rq(struct sched_entity *se)
>>>>      struct cfs_rq *cfs_rq = cfs_rq_of(se);
>>>>
>>>>      /* Catch up with the cfs_rq and remove our load when we leave */
>>>> -    update_load_avg(cfs_rq, se, 0);
>>>> -    detach_entity_load_avg(cfs_rq, se);
>>>> -    update_tg_load_avg(cfs_rq);
>>>> +    update_load_avg(cfs_rq, se, UPDATE_TG | DO_DETACH);
>>>
>>> IMHO, the DO_[DE|AT]TACH comments in update_load_avg() would have to be
>>> updated in this case.
>>
>> Correct, will do.
>>
>> Thanks.
>>
>>>
>>> [...]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [External] Re: [PATCH v2 07/10] sched/fair: use update_load_avg() to attach/detach entity load_avg
  2022-07-19 15:02       ` Dietmar Eggemann
@ 2022-07-20 13:43         ` Chengming Zhou
  0 siblings, 0 replies; 37+ messages in thread
From: Chengming Zhou @ 2022-07-20 13:43 UTC (permalink / raw)
  To: Dietmar Eggemann, mingo, peterz, vincent.guittot, rostedt,
	bsegall, vschneid
  Cc: linux-kernel

On 2022/7/19 23:02, Dietmar Eggemann wrote:
> On 15/07/2022 18:21, Chengming Zhou wrote:
>> On 2022/7/15 19:18, Dietmar Eggemann wrote:
>>> On 13/07/2022 06:04, Chengming Zhou wrote:
>>>> Since update_load_avg() support DO_ATTACH and DO_DETACH now, we can
>>>> use update_load_avg() to implement attach/detach entity load_avg.
>>>>
>>>> Another advantage of using update_load_avg() is that it will check
>>>> last_update_time before attach or detach, instead of unconditional
>>>> attach/detach in the current code.
>>>>
>>>> This way can avoid some corner problematic cases of load tracking,
>>>> like twice attach problem, detach unattached NEW task problem.
>>>
>>> This explanation is somewhat hard to follow for me. Since both issues
>>> have been fixed already (you mention this further below) you're saying
>>> that with you change you don't reintroduce them?
>>
>> Sorry for this not very clear explanation.
>>
>> Yes, both issues have been fixed already, what I want to say is that bugfix
>> brings its own problem and limitation mentioned below.
>>
>> So I want to use another way to solve these problems better.
> 
> [...]
> 
>>>> These problems have been fixed in commit 7dc603c9028e
>>>> ("sched/fair: Fix PELT integrity for new tasks"), which also
>>>> bring its own problems.
>>>>
>>>> First, it add a new task state TASK_NEW and an unnessary limitation
>>>> that we would fail when change the cgroup of TASK_NEW tasks.
>>
>> This is the limitation that bugfix has brought.
>>
>> We can't change cgroup or switch to fair for task with last_update_time=0
>> if we don't have conditional detach/attach.
>>
>> So we have to:
>>
>> 1. !fair task also need to set last_update_time.
>> 2. cpu_cgroup_can_attach() have to wait for TASK_NEW to fully attached.
> 
> I see.
> 
> `cgroup_migrate_execute() -> cpu_cgroup_[can|]_attach()` has to wait for
> `wake_up_new_task() -> WRITE_ONCE(p->__state, TASK_RUNNING)`.
> 
> Just to understand this change better: IMHO, this is still the case for
> fair tasks, right?

Yes, I think so. We could delete this limitation when we have conditional
detach/attach, since nothing will be detached when last_update_time==0.
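
To spell out what "conditional" means here, the attach/detach part of
update_load_avg() would end up looking roughly like this (a sketch of the
intent, not the exact patch):

	/* inside update_load_avg(), after ageing/propagation: */
	if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
		/* never attached (NEW task, or reset to 0): attach now */
		attach_entity_load_avg(cfs_rq, se);
		update_tg_load_avg(cfs_rq);
	} else if (se->avg.last_update_time && (flags & DO_DETACH)) {
		/* only detach what has actually been attached */
		detach_entity_load_avg(cfs_rq, se);
		update_tg_load_avg(cfs_rq);
	}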

Thanks!

> 
> `wake_up_new_task() -> post_init_entity_util_avg() ->
> attach_entity_cfs_rq()` has to happen before the fair task can move
> between taskgroups in `cgroup_migrate_execute() -> cpu_cgroup_attach()
> -> sched_move_task() -> sched_change_group() -> task_change_group_fair()`.
> 
> [...]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [External] Re: [PATCH v2 08/10] sched/fair: fix load tracking for new forked !fair task
  2022-07-19 12:35   ` Vincent Guittot
@ 2022-07-20 13:48     ` Chengming Zhou
  0 siblings, 0 replies; 37+ messages in thread
From: Chengming Zhou @ 2022-07-20 13:48 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, dietmar.eggemann, rostedt, bsegall, vschneid,
	linux-kernel

On 2022/7/19 20:35, Vincent Guittot wrote:
> On Wed, 13 Jul 2022 at 06:05, Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
>>
>> New forked !fair task will set its sched_avg last_update_time to
>> the pelt_clock of cfs_rq, after a while in switched_to_fair():
>>
>> switched_to_fair
>>   attach_task_cfs_rq
>>     attach_entity_cfs_rq
>>       update_load_avg
>>         __update_load_avg_se(now, cfs_rq, se)
>>
>> the delta (now - sa->last_update_time) will contribute/decay sched_avg
>> depends on the task running/runnable status at that time.
>>
>> This patch don't set sched_avg last_update_time of new forked !fair
>> task, leave it to 0. So later in update_load_avg(), we don't need to
>> contribute/decay the wrong delta (now - sa->last_update_time).
> 
> As mentioned in patch 7, I think it's wrong to not decay the init
> value of !fair task because they become obsolete if not used quickly
> (so we are decaying them)
> 
> It would be better to not set them at all in the case of !fair task

So just leave the !fair task's sched_avg at 0 and last_update_time == 0, right?


Thanks.

> 
>>
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>> ---
>>  kernel/sched/fair.c | 18 ++----------------
>>  1 file changed, 2 insertions(+), 16 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 51fc20c161a3..50f65a2ede32 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -849,22 +849,8 @@ void post_init_entity_util_avg(struct task_struct *p)
>>
>>         sa->runnable_avg = sa->util_avg;
>>
>> -       if (p->sched_class != &fair_sched_class) {
>> -               /*
>> -                * For !fair tasks do:
>> -                *
>> -               update_cfs_rq_load_avg(now, cfs_rq);
>> -               attach_entity_load_avg(cfs_rq, se);
>> -               switched_from_fair(rq, p);
>> -                *
>> -                * such that the next switched_to_fair() has the
>> -                * expected state.
>> -                */
>> -               se->avg.last_update_time = cfs_rq_clock_pelt(cfs_rq);
>> -               return;
>> -       }
>> -
>> -       attach_entity_cfs_rq(se);
>> +       if (p->sched_class == &fair_sched_class)
>> +               attach_entity_cfs_rq(se);
>>  }
>>
>>  #else /* !CONFIG_SMP */
>> --
>> 2.36.1
>>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [External] Re: [PATCH v2 07/10] sched/fair: use update_load_avg() to attach/detach entity load_avg
  2022-07-20 13:40         ` Chengming Zhou
@ 2022-07-20 15:34           ` Vincent Guittot
  2022-07-21 13:56             ` Chengming Zhou
  0 siblings, 1 reply; 37+ messages in thread
From: Vincent Guittot @ 2022-07-20 15:34 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Dietmar Eggemann, mingo, peterz, rostedt, bsegall, vschneid,
	linux-kernel

On Wed, 20 Jul 2022 at 15:41, Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> On 2022/7/19 18:29, Vincent Guittot wrote:
> > On Fri, 15 Jul 2022 at 18:21, Chengming Zhou
> > <zhouchengming@bytedance.com> wrote:
> >>

...

> >>
> >>>
> >>> Looks to me that you want to replace this by your `freeze PELT when
> >>> outside fair` model.
> >>
> >> Yes, want to freeze PELT for two !fair cases:
> >>
> >> 1. !fair task hasn't been fair before: will still have its init load_avg
> >>    when switch to fair.
> >
> > But I'm not sure it makes sense to keep these init values. As an
> > example, the util_avg is set according to the cpu utilization at the
> > time of the task creation. I would tend to decay them as these init
> > values become less and less relevant.
> >
> > so we should return early in post_init_entity_util_avg() and don't set
> > util_avg if sched class is not cfs
>
> Yes, this indeed is a problem if we attach this init sched_avg of !fair task.
> I'm also not sure whether it make sense to keep them to 0 ? Will it cause
> unfairness problem between cfs_rqs?

Why should it cause an unfairness problem? !fair tasks are not
accounted, and their PELT values will be decayed down to 0 after 320ms
anyway (with the current implementation). So it's just as if you had
started directly after those 320ms.
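
(For reference, the 320ms figure follows from the 32ms PELT half-life; a quick
userspace check, assuming that half-life, shows how fast a full-scale value
drops. Save as e.g. decay.c and build with "cc decay.c -lm":)

	#include <math.h>
	#include <stdio.h>

	int main(void)
	{
		double y = pow(0.5, 1.0 / 32.0);	/* halves every 32ms */
		double v = 1024.0;			/* a "full" contribution */
		int ms;

		for (ms = 64; ms <= 320; ms += 64)
			printf("after %3d ms: %6.1f\n", ms, v * pow(y, ms));

		/* after 320ms: 1024 * 0.5^10 = 1, i.e. effectively 0 */
		return 0;
	}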

>
> >
> >>
> >> 2. !fair task has been switched_from_fair(): will still keep its lastest
> >>    load_avg when switch to fair.
> >>
> >>>
> >>>> This patch make update_load_avg() the only location of attach/detach,
> >>>> and can handle these corner cases like change cgroup of NEW tasks,
> >>>> by checking last_update_time before attach/detach.
> >>>
> >>> [...]
> >>>
> >>>> @@ -11527,9 +11522,7 @@ static void detach_entity_cfs_rq(struct sched_entity *se)
> >>>>      struct cfs_rq *cfs_rq = cfs_rq_of(se);
> >>>>
> >>>>      /* Catch up with the cfs_rq and remove our load when we leave */
> >>>> -    update_load_avg(cfs_rq, se, 0);
> >>>> -    detach_entity_load_avg(cfs_rq, se);
> >>>> -    update_tg_load_avg(cfs_rq);
> >>>> +    update_load_avg(cfs_rq, se, UPDATE_TG | DO_DETACH);
> >>>
> >>> IMHO, the DO_[DE|AT]TACH comments in update_load_avg() would have to be
> >>> updated in this case.
> >>
> >> Correct, will do.
> >>
> >> Thanks.
> >>
> >>>
> >>> [...]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [External] Re: [PATCH v2 07/10] sched/fair: use update_load_avg() to attach/detach entity load_avg
  2022-07-20 15:34           ` Vincent Guittot
@ 2022-07-21 13:56             ` Chengming Zhou
  2022-07-21 14:13               ` Vincent Guittot
  0 siblings, 1 reply; 37+ messages in thread
From: Chengming Zhou @ 2022-07-21 13:56 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Dietmar Eggemann, mingo, peterz, rostedt, bsegall, vschneid,
	linux-kernel

On 2022/7/20 23:34, Vincent Guittot wrote:
> On Wed, 20 Jul 2022 at 15:41, Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
>>
>> On 2022/7/19 18:29, Vincent Guittot wrote:
>>> On Fri, 15 Jul 2022 at 18:21, Chengming Zhou
>>> <zhouchengming@bytedance.com> wrote:
>>>>
> 
> ...
> 
>>>>
>>>>>
>>>>> Looks to me that you want to replace this by your `freeze PELT when
>>>>> outside fair` model.
>>>>
>>>> Yes, want to freeze PELT for two !fair cases:
>>>>
>>>> 1. !fair task hasn't been fair before: will still have its init load_avg
>>>>    when switch to fair.
>>>
>>> But I'm not sure it makes sense to keep these init values. As an
>>> example, the util_avg is set according to the cpu utilization at the
>>> time of the task creation. I would tend to decay them as these init
>>> values become less and less relevant.
>>>
>>> so we should return early in post_init_entity_util_avg() and don't set
>>> util_avg if sched class is not cfs
>>
>> Yes, this indeed is a problem if we attach this init sched_avg of !fair task.
>> I'm also not sure whether it make sense to keep them to 0 ? Will it cause
>> unfairness problem between cfs_rqs?
> 
> Why should it cause an unfairness problem ? !fair tasks are not
> accounted and their pelt values will be decayed down to 0 after 320ms
> anyway (with the current implementation). So it's just like if you
> started directly after those 320ms

Thanks for your patient explanation. IMHO, I am wondering: if we have an init
sched_avg for a new fair task (A), but 0 for a new task switched from !fair (B),
then what's the point of the init sched_avg for the fair task?

Task B will need some time to reach its stable load value, so during that time
its cfs_rq may not get enough shares? Imagine the scenario below: if we have
fair task A and a task B switched from !fair at the same time, could that cause
unfairness between cfs0 and cfs1?

CPU0   tg   CPU1
  |  /    \  |
  | /      \ |
cfs0        cfs1
 (A)         (B)

If runnable_avg and util_avg are 0 when switching from !fair, do we then need more
time for load balancing or CPU frequency adjustment? I think that's the reason why
we have an init sched_avg for a new fair task. Should we care about these cases, or
will it be no problem?

I'm not sure, I must have missed something :-)

Thanks!

> 
>>
>>>
>>>>
>>>> 2. !fair task has been switched_from_fair(): will still keep its lastest
>>>>    load_avg when switch to fair.
>>>>
>>>>>
>>>>>> This patch make update_load_avg() the only location of attach/detach,
>>>>>> and can handle these corner cases like change cgroup of NEW tasks,
>>>>>> by checking last_update_time before attach/detach.
>>>>>
>>>>> [...]
>>>>>
>>>>>> @@ -11527,9 +11522,7 @@ static void detach_entity_cfs_rq(struct sched_entity *se)
>>>>>>      struct cfs_rq *cfs_rq = cfs_rq_of(se);
>>>>>>
>>>>>>      /* Catch up with the cfs_rq and remove our load when we leave */
>>>>>> -    update_load_avg(cfs_rq, se, 0);
>>>>>> -    detach_entity_load_avg(cfs_rq, se);
>>>>>> -    update_tg_load_avg(cfs_rq);
>>>>>> +    update_load_avg(cfs_rq, se, UPDATE_TG | DO_DETACH);
>>>>>
>>>>> IMHO, the DO_[DE|AT]TACH comments in update_load_avg() would have to be
>>>>> updated in this case.
>>>>
>>>> Correct, will do.
>>>>
>>>> Thanks.
>>>>
>>>>>
>>>>> [...]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [External] Re: [PATCH v2 07/10] sched/fair: use update_load_avg() to attach/detach entity load_avg
  2022-07-21 13:56             ` Chengming Zhou
@ 2022-07-21 14:13               ` Vincent Guittot
  0 siblings, 0 replies; 37+ messages in thread
From: Vincent Guittot @ 2022-07-21 14:13 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Dietmar Eggemann, mingo, peterz, rostedt, bsegall, vschneid,
	linux-kernel

On Thu, 21 Jul 2022 at 15:56, Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> On 2022/7/20 23:34, Vincent Guittot wrote:
> > On Wed, 20 Jul 2022 at 15:41, Chengming Zhou
> > <zhouchengming@bytedance.com> wrote:
> >>
> >> On 2022/7/19 18:29, Vincent Guittot wrote:
> >>> On Fri, 15 Jul 2022 at 18:21, Chengming Zhou
> >>> <zhouchengming@bytedance.com> wrote:
> >>>>
> >
> > ...
> >
> >>>>
> >>>>>
> >>>>> Looks to me that you want to replace this by your `freeze PELT when
> >>>>> outside fair` model.
> >>>>
> >>>> Yes, want to freeze PELT for two !fair cases:
> >>>>
> >>>> 1. !fair task hasn't been fair before: will still have its init load_avg
> >>>>    when switch to fair.
> >>>
> >>> But I'm not sure it makes sense to keep these init values. As an
> >>> example, the util_avg is set according to the cpu utilization at the
> >>> time of the task creation. I would tend to decay them as these init
> >>> values become less and less relevant.
> >>>
> >>> so we should return early in post_init_entity_util_avg() and don't set
> >>> util_avg if sched class is not cfs
> >>
> >> Yes, this indeed is a problem if we attach this init sched_avg of !fair task.
> >> I'm also not sure whether it make sense to keep them to 0 ? Will it cause
> >> unfairness problem between cfs_rqs?
> >
> > Why should it cause an unfairness problem ? !fair tasks are not
> > accounted and their pelt values will be decayed down to 0 after 320ms
> > anyway (with the current implementation). So it's just like if you
> > started directly after those 320ms
>
> Thanks for your patient explain. IMHO, I am thinking if we have init sched_avg
> for new fair task (A), but have 0 for new task switched from !fair (B). Then
> what's the point of init sched_avg for the fair task?

For a fair task, the util_avg is set according to the cpu's util_avg at fork
time, as an estimate of its coming utilization. The load_avg is also set to
max to ensure a minimum share. But these init values only have a real impact
on the first 320ms; after that they have simply disappeared from the *_avg.
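
(Roughly what that fork-time setup looks like, paraphrased from
init_entity_runnable_average() / post_init_entity_util_avg() from memory, so
the details may be slightly off:)

	/* at fork: assume the task is "heavy" until it stabilises */
	sa->load_avg = scale_load_down(se->load.weight);

	/* at wake_up_new_task(): guess util from the cpu's current util_avg */
	sa->util_avg  = cfs_rq->avg.util_avg * se->load.weight;
	sa->util_avg /= (cfs_rq->avg.load_avg + 1);
	/* ... and cap it at half of the cpu's spare capacity */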

For a !fair task, these init values will be decayed in
switched_to_fair(). This means that if a !fair task stays more than
320ms in a !fair class (running, runnable or sleeping) before switching
into the fair class, those init PELT values will be decayed to 0.
If we keep these init values, the initial util_avg, which was set
according to the cpu's util_avg at fork, will stay at a value which is
no longer relevant.

>
> The B task will need some time to reach its stable load value, so in this process
> its cfs_rq may can't get enough shares? Imaging below scenario, if we have fair
> task A and switched from !fair task B at the same time, could cause unfairness
> between cfs0 and cfs1 ?
>
> CPU0   tg   CPU1
>   |  /    \  |
>   | /      \ |
> cfs0        cfs1
>  (A)         (B)
>
> If runnable_avg and util_avg are 0 when switched from !fair, so we need more time
> to do load balance or CPU frequency adjust? I think it's the reason why we have
> init sched_avg for new fair task. Should we care about these, or it will be no problem?

But the util_avg which was set at fork time was set according to a
cpu util_avg which may be hundreds of seconds old.

>
> I'm not sure, I must have missed something :-)
>
> Thanks!
>
> >
> >>
> >>>
> >>>>
> >>>> 2. !fair task has been switched_from_fair(): will still keep its lastest
> >>>>    load_avg when switch to fair.
> >>>>
> >>>>>
> >>>>>> This patch make update_load_avg() the only location of attach/detach,
> >>>>>> and can handle these corner cases like change cgroup of NEW tasks,
> >>>>>> by checking last_update_time before attach/detach.
> >>>>>
> >>>>> [...]
> >>>>>
> >>>>>> @@ -11527,9 +11522,7 @@ static void detach_entity_cfs_rq(struct sched_entity *se)
> >>>>>>      struct cfs_rq *cfs_rq = cfs_rq_of(se);
> >>>>>>
> >>>>>>      /* Catch up with the cfs_rq and remove our load when we leave */
> >>>>>> -    update_load_avg(cfs_rq, se, 0);
> >>>>>> -    detach_entity_load_avg(cfs_rq, se);
> >>>>>> -    update_tg_load_avg(cfs_rq);
> >>>>>> +    update_load_avg(cfs_rq, se, UPDATE_TG | DO_DETACH);
> >>>>>
> >>>>> IMHO, the DO_[DE|AT]TACH comments in update_load_avg() would have to be
> >>>>> updated in this case.
> >>>>
> >>>> Correct, will do.
> >>>>
> >>>> Thanks.
> >>>>
> >>>>>
> >>>>> [...]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 09/10] sched/fair: stop load tracking when task switched_from_fair()
  2022-07-19 13:20   ` Vincent Guittot
@ 2022-07-27 10:55     ` Chengming Zhou
  0 siblings, 0 replies; 37+ messages in thread
From: Chengming Zhou @ 2022-07-27 10:55 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, dietmar.eggemann, rostedt, bsegall, vschneid,
	linux-kernel

On 2022/7/19 21:20, Vincent Guittot wrote:
> On Wed, 13 Jul 2022 at 06:05, Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
>>
>> The same reason as the previous commit, if we don't reset the
>> sched_avg last_update_time to 0, after a while in switched_to_fair():
>>
>> switched_to_fair
>>   attach_task_cfs_rq
>>     attach_entity_cfs_rq
>>       update_load_avg
>>         __update_load_avg_se(now, cfs_rq, se)
>>
>> The delta (now - sa->last_update_time) will wrongly contribute/decay
>> sched_avg depends on the task running/runnable status at that time.
> 
> I think that we always decay the time spent in the !fair class and never contribute it.
> 
> This one is a bit more complex than the new-task one. Generally
> speaking, I would say that we should decay the load while running in
> the !fair case because the value becomes less and less relevant over time.
> What's the meaning of PELT values when the task has run for seconds as a
> FIFO task? The only case that would need particular attention is the PI
> boost case, but there is no way to know how long it runs and sleeps when
> !fair.

After thinking about it for a while, I agree with you that the PELT values
become meaningless after running as a FIFO task. So yes, it's best for now
to always decay them when switched_to_fair().

So in the next version:
1. don't set the init sched_avg for !fair tasks in post_init_entity_util_avg()
2. keep last_update_time when switched_from_fair() (see the sketch below)
3. should we still keep SKIP_AGE_LOAD? I don't understand what its usage
   is in switched_to_fair(). Or should we just delete it?
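
For (2), a minimal sketch of what I have in mind, assuming the reset added by
this patch is simply dropped (illustration only, not the actual next version):

static void switched_from_fair(struct rq *rq, struct task_struct *p)
{
	/*
	 * Only detach from the cfs_rq; keep p->se.avg.last_update_time
	 * so that the time spent in the !fair class is decayed on the
	 * next update in switched_to_fair().
	 */
	detach_task_cfs_rq(p);
}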

Thanks!

> 
>>
>> This patch resets its sched_avg last_update_time to 0, stopping load
>> tracking for the !fair task; later, in switched_to_fair() ->
>> update_load_avg(), we can use its saved sched_avg.
>>
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>> ---
>>  kernel/sched/fair.c | 5 +++++
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 50f65a2ede32..576028f5a09e 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -11552,6 +11552,11 @@ static void attach_task_cfs_rq(struct task_struct *p)
>>  static void switched_from_fair(struct rq *rq, struct task_struct *p)
>>  {
>>         detach_task_cfs_rq(p);
>> +
>> +#ifdef CONFIG_SMP
>> +       /* Stop load tracking for !fair task */
>> +       p->se.avg.last_update_time = 0;
>> +#endif
>>  }
>>
>>  static void switched_to_fair(struct rq *rq, struct task_struct *p)
>> --
>> 2.36.1
>>

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2022-07-27 10:55 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-13  4:04 [PATCH v2 00/10] sched: task load tracking optimization and cleanup Chengming Zhou
2022-07-13  4:04 ` [PATCH v2 01/10] sched/fair: combine detach into dequeue when migrating task Chengming Zhou
2022-07-13  4:04 ` [PATCH v2 02/10] sched/fair: update comments in enqueue/dequeue_entity() Chengming Zhou
2022-07-13  4:04 ` [PATCH v2 03/10] sched/fair: maintain task se depth in set_task_rq() Chengming Zhou
2022-07-14 12:30   ` Dietmar Eggemann
2022-07-14 13:03     ` [External] " Chengming Zhou
2022-07-18  7:16   ` Vincent Guittot
2022-07-13  4:04 ` [PATCH v2 04/10] sched/fair: remove redundant cpu_cgrp_subsys->fork() Chengming Zhou
2022-07-14 12:31   ` Dietmar Eggemann
2022-07-14 13:06     ` [External] " Chengming Zhou
2022-07-13  4:04 ` [PATCH v2 05/10] sched/fair: reset sched_avg last_update_time before set_task_rq() Chengming Zhou
2022-07-14 12:31   ` Dietmar Eggemann
2022-07-19  8:49   ` Vincent Guittot
2022-07-13  4:04 ` [PATCH v2 06/10] sched/fair: delete superfluous SKIP_AGE_LOAD Chengming Zhou
2022-07-14 12:33   ` Dietmar Eggemann
2022-07-14 13:24     ` [External] " Chengming Zhou
2022-07-13  4:04 ` [PATCH v2 07/10] sched/fair: use update_load_avg() to attach/detach entity load_avg Chengming Zhou
2022-07-15 11:18   ` Dietmar Eggemann
2022-07-15 16:21     ` [External] " Chengming Zhou
2022-07-19 10:29       ` Vincent Guittot
2022-07-20 13:40         ` Chengming Zhou
2022-07-20 15:34           ` Vincent Guittot
2022-07-21 13:56             ` Chengming Zhou
2022-07-21 14:13               ` Vincent Guittot
2022-07-19 15:02       ` Dietmar Eggemann
2022-07-20 13:43         ` Chengming Zhou
2022-07-13  4:04 ` [PATCH v2 08/10] sched/fair: fix load tracking for new forked !fair task Chengming Zhou
2022-07-19 12:35   ` Vincent Guittot
2022-07-20 13:48     ` [External] " Chengming Zhou
2022-07-13  4:04 ` [PATCH v2 09/10] sched/fair: stop load tracking when task switched_from_fair() Chengming Zhou
2022-07-14 12:33   ` Dietmar Eggemann
2022-07-14 13:43     ` [External] " Chengming Zhou
2022-07-15 11:15       ` Dietmar Eggemann
2022-07-15 16:35         ` Chengming Zhou
2022-07-19 13:20   ` Vincent Guittot
2022-07-27 10:55     ` Chengming Zhou
2022-07-13  4:04 ` [PATCH v2 10/10] sched/fair: delete superfluous set_task_rq_fair() Chengming Zhou
