* [PATCH v4 0/3] sched: account fair load avg consistently
@ 2015-10-23 16:16 byungchul.park
  2015-10-23 16:16 ` [PATCH v4 1/3] sched/fair: make it possible to " byungchul.park
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: byungchul.park @ 2015-10-23 16:16 UTC (permalink / raw)
  To: mingo, peterz; +Cc: linux-kernel, yuyang.du, pjt, efault, tglx, Byungchul Park

From: Byungchul Park <byungchul.park@lge.com>

* changes from v3 to v4
- optimize: force rmb() to be used only once when doing a migration
- optimize: do not add an additional variable to task_struct
- change the order of migrate_task_rq() and __set_task_cpu()

* changes from v2 to v3
- consider optimization in the case of migration
- split the series into 3 patches so it can be reviewed easily

* changes from v1 to v2
- make set_task_rq() take over that role from the migration callback
- make set_task_rq() take over that role from the move-group callback
- remove the dependency between last_update_time and the check for migration

Byungchul Park (3):
  sched/fair: make it possible to account fair load avg consistently
  sched/fair: split the remove_entity_load_avg() into two functions
  sched: optimize migration by forcing rmb() and updating to be called
    once

 kernel/sched/core.c  |   10 +++---
 kernel/sched/fair.c  |   97 +++++++++++++++++++++++++++++++++++++++-----------
 kernel/sched/sched.h |    4 +++
 3 files changed, 87 insertions(+), 24 deletions(-)

-- 
1.7.9.5



* [PATCH v4 1/3] sched/fair: make it possible to account fair load avg consistently
  2015-10-23 16:16 [PATCH v4 0/3] sched: account fair load avg consistently byungchul.park
@ 2015-10-23 16:16 ` byungchul.park
  2015-12-04 11:54   ` [tip:sched/core] sched/fair: Make " tip-bot for Byungchul Park
  2015-10-23 16:16 ` [PATCH v4 2/3] sched/fair: split the remove_entity_load_avg() into two functions byungchul.park
  2015-10-23 16:16 ` [PATCH v4 3/3] sched: optimize migration by forcing rmb() and updating to be called once byungchul.park
  2 siblings, 1 reply; 16+ messages in thread
From: byungchul.park @ 2015-10-23 16:16 UTC (permalink / raw)
  To: mingo, peterz; +Cc: linux-kernel, yuyang.du, pjt, efault, tglx, Byungchul Park

From: Byungchul Park <byungchul.park@lge.com>

The current code can account the fair class load average for the time a
task was absent from the fair class, thanks to ATTACH_AGE_LOAD. However,
it doesn't work when a migration or group change happens while the task
is in another sched class.

This patch introduces a more general way to handle fair load average
accounting, one that works consistently in every case, e.g. migration or
cgroup change while the task is in another sched class.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
---
 kernel/sched/core.c  |    1 +
 kernel/sched/fair.c  |   45 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |    4 ++++
 3 files changed, 50 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a91df61..0368054 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2075,6 +2075,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->se.prev_sum_exec_runtime	= 0;
 	p->se.nr_migrations		= 0;
 	p->se.vruntime			= 0;
+	p->se.cfs_rq			= NULL;
 	INIT_LIST_HEAD(&p->se.group_node);
 
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 077076f..e9c5668 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2808,6 +2808,51 @@ dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		max_t(s64,  cfs_rq->runnable_load_sum - se->avg.load_sum, 0);
 }
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+/*
+ * Called within set_task_rq() right before setting a task's cpu. The
+ * caller only guarantees p->pi_lock is held; no other assumptions,
+ * including the state of rq->lock, should be made.
+ */
+void set_task_rq_fair(struct sched_entity *se,
+			     struct cfs_rq *prev,
+			     struct cfs_rq *next)
+{
+	if (!sched_feat(ATTACH_AGE_LOAD)) return;
+
+	/*
+	 * We are supposed to update the task to "current" time, then its up to date
+	 * and ready to go to new CPU/cfs_rq. But we have difficulty in getting
+	 * what current time is, so simply throw away the out-of-date time. This
+	 * will result in the wakee task is less decayed, but giving the wakee more
+	 * load sounds not bad.
+	 */
+	if (se->avg.last_update_time && prev) {
+		u64 p_last_update_time;
+		u64 n_last_update_time;
+
+#ifndef CONFIG_64BIT
+		u64 p_last_update_time_copy;
+		u64 n_last_update_time_copy;
+
+		do {
+			p_last_update_time_copy = prev->load_last_update_time_copy;
+			n_last_update_time_copy = next->load_last_update_time_copy;
+			smp_rmb();
+			p_last_update_time = prev->avg.last_update_time;
+			n_last_update_time = next->avg.last_update_time;
+		} while (p_last_update_time != p_last_update_time_copy ||
+			 n_last_update_time != n_last_update_time_copy);
+#else
+		p_last_update_time = prev->avg.last_update_time;
+		n_last_update_time = next->avg.last_update_time;
+#endif
+		__update_load_avg(p_last_update_time, cpu_of(rq_of(prev)), &se->avg, 0, 0, NULL);
+		se->avg.last_update_time = n_last_update_time;
+	}
+}
+#endif
+
 /*
  * Task first catches up with cfs_rq, and then subtract
  * itself from the cfs_rq (task must be off the queue now).
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index af6f252..363cf9c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -335,6 +335,9 @@ extern void sched_move_task(struct task_struct *tsk);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
+extern void set_task_rq_fair(struct sched_entity *se,
+				    struct cfs_rq *prev,
+				    struct cfs_rq *next);
 #endif
 
 #else /* CONFIG_CGROUP_SCHED */
@@ -933,6 +936,7 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
 #endif
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
+	set_task_rq_fair(&p->se, p->se.cfs_rq, tg->cfs_rq[cpu]);
 	p->se.cfs_rq = tg->cfs_rq[cpu];
 	p->se.parent = tg->se[cpu];
 #endif
-- 
1.7.9.5
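
For context, the copy/smp_rmb()/retry loop added above is the read side of
a seqcount-like scheme for 32-bit kernels, where a 64-bit last_update_time
cannot be read atomically. The write side is expected to advance
avg.last_update_time first and publish load_last_update_time_copy only after
a write barrier; a sketch of that write side, based on this era's
update_cfs_rq_load_avg() (the surrounding call is an approximation and may
not match the exact source):

	decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
		scale_load_down(cfs_rq->load.weight), cfs_rq->curr != NULL, cfs_rq);

#ifndef CONFIG_64BIT
	/* publish the copy only after last_update_time is fully written */
	smp_wmb();
	cfs_rq->load_last_update_time_copy = sa->last_update_time;
#endif

A 32-bit reader racing with this writer sees copy != last_update_time and
retries, which is what the do/while loop in set_task_rq_fair() (and in
remove_entity_load_avg()) relies on.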



* [PATCH v4 2/3] sched/fair: split the remove_entity_load_avg() into two functions
  2015-10-23 16:16 [PATCH v4 0/3] sched: account fair load avg consistently byungchul.park
  2015-10-23 16:16 ` [PATCH v4 1/3] sched/fair: make it possible to " byungchul.park
@ 2015-10-23 16:16 ` byungchul.park
  2015-10-23 16:16 ` [PATCH v4 3/3] sched: optimize migration by forcing rmb() and updating to be called once byungchul.park
  2 siblings, 0 replies; 16+ messages in thread
From: byungchul.park @ 2015-10-23 16:16 UTC (permalink / raw)
  To: mingo, peterz; +Cc: linux-kernel, yuyang.du, pjt, efault, tglx, Byungchul Park

From: Byungchul Park <byungchul.park@lge.com>

remove_entity_load_avg() consists of two parts: the first part updates
the se's last_update_time and the second part removes the se's load
from the cfs_rq. For optimization purposes it can become necessary to
use only the first part or only the second part, so this patch splits
the function into two helper functions.

Additionally, remove_entity_load_avg() is mostly performed with an se
and the se's current cfs_rq. However, during migration it can become
necessary to perform it with an se and the previous cfs_rq, again for
optimization purposes. So this patch also adds another argument, a
cfs_rq, to remove_entity_load_avg().

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
---
 kernel/sched/fair.c |   31 +++++++++++++++++++++----------
 1 file changed, 21 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e9c5668..522aa07 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2853,13 +2853,9 @@ void set_task_rq_fair(struct sched_entity *se,
 }
 #endif
 
-/*
- * Task first catches up with cfs_rq, and then subtract
- * itself from the cfs_rq (task must be off the queue now).
- */
-void remove_entity_load_avg(struct sched_entity *se)
+/* This is useful when rq lock may not be held */
+static inline void __update_entity_load_avg(struct sched_entity *se, struct cfs_rq *cfs_rq)
 {
-	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	u64 last_update_time;
 
 #ifndef CONFIG_64BIT
@@ -2875,11 +2871,25 @@ void remove_entity_load_avg(struct sched_entity *se)
 #endif
 
 	__update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0, NULL);
+}
+
+static inline void __remove_entity_load_avg(struct sched_entity *se, struct cfs_rq *cfs_rq)
+{
 	atomic_long_add(se->avg.load_avg, &cfs_rq->removed_load_avg);
 	atomic_long_add(se->avg.util_avg, &cfs_rq->removed_util_avg);
 }
 
 /*
+ * Task first catches up with cfs_rq, and then subtract
+ * itself from the cfs_rq (task must be off the queue now).
+ */
+static inline void remove_entity_load_avg(struct sched_entity *se, struct cfs_rq *cfs_rq)
+{
+	__update_entity_load_avg(se, cfs_rq);
+	__remove_entity_load_avg(se, cfs_rq);
+}
+
+/*
  * Update the rq's load with the elapsed running time before entering
  * idle. if the last scheduled task is not a CFS task, idle_enter will
  * be the only way to update the runnable statistic.
@@ -2916,7 +2926,8 @@ static inline void
 enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
 static inline void
 dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
-static inline void remove_entity_load_avg(struct sched_entity *se) {}
+static inline void
+remove_entity_load_avg(struct sched_entity *se, struct cfs_rq *cfs_rq) {}
 
 static inline void
 attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
@@ -5063,7 +5074,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int next_cpu)
 	 * will result in the wakee task is less decayed, but giving the wakee more
 	 * load sounds not bad.
 	 */
-	remove_entity_load_avg(&p->se);
+	remove_entity_load_avg(&p->se, cfs_rq_of(&p->se));
 
 	/* Tell new CPU we are migrated */
 	p->se.avg.last_update_time = 0;
@@ -5074,7 +5085,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int next_cpu)
 
 static void task_dead_fair(struct task_struct *p)
 {
-	remove_entity_load_avg(&p->se);
+	remove_entity_load_avg(&p->se, cfs_rq_of(&p->se));
 }
 #endif /* CONFIG_SMP */
 
@@ -8144,7 +8155,7 @@ void free_fair_sched_group(struct task_group *tg)
 			kfree(tg->cfs_rq[i]);
 		if (tg->se) {
 			if (tg->se[i])
-				remove_entity_load_avg(tg->se[i]);
+				remove_entity_load_avg(tg->se[i], cfs_rq_of(tg->se[i]));
 			kfree(tg->se[i]);
 		}
 	}
-- 
1.7.9.5



* [PATCH v4 3/3] sched: optimize migration by forcing rmb() and updating to be called once
  2015-10-23 16:16 [PATCH v4 0/3] sched: account fair load avg consistently byungchul.park
  2015-10-23 16:16 ` [PATCH v4 1/3] sched/fair: make it possible to " byungchul.park
  2015-10-23 16:16 ` [PATCH v4 2/3] sched/fair: split the remove_entity_load_avg() into two functions byungchul.park
@ 2015-10-23 16:16 ` byungchul.park
  2015-11-09 13:29   ` Peter Zijlstra
  2 siblings, 1 reply; 16+ messages in thread
From: byungchul.park @ 2015-10-23 16:16 UTC (permalink / raw)
  To: mingo, peterz; +Cc: linux-kernel, yuyang.du, pjt, efault, tglx, Byungchul Park

From: Byungchul Park <byungchul.park@lge.com>

the "sched/fair: make it possible to account fair load avg consistently"
patch makes rmb() and updating last_update_time called twice when doing a
migration, which can be negative at performance. actually we can optimize
it by omiting the updating part of remove_entity_load_avg().

we can optimize it by changing the order of migrate_task_rq() and
and __set_task_cpu(), and removing the update part of
remove_entity_load_avg(). by this, we can ensure updating was already
done when the migrate_task_rq() is called.

this patch also changes the migrate_task_rq()'s second argument from
new_cpu to prev_cpu because the migrate_task_rq() is changed to be called
after setting rq. additionally, this patch changes comment properly.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
---
 kernel/sched/core.c |    9 +++++----
 kernel/sched/fair.c |   27 ++++++++++++++-------------
 2 files changed, 19 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0368054..99b05d4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1264,6 +1264,8 @@ EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
 
 void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 {
+	unsigned int prev_cpu = task_cpu(p);
+
 #ifdef CONFIG_SCHED_DEBUG
 	/*
 	 * We should never call set_task_cpu() on a blocked task,
@@ -1289,15 +1291,14 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 #endif
 
 	trace_sched_migrate_task(p, new_cpu);
+	__set_task_cpu(p, new_cpu);
 
-	if (task_cpu(p) != new_cpu) {
+	if (prev_cpu != new_cpu) {
 		if (p->sched_class->migrate_task_rq)
-			p->sched_class->migrate_task_rq(p, new_cpu);
+			p->sched_class->migrate_task_rq(p, prev_cpu);
 		p->se.nr_migrations++;
 		perf_event_task_migrate(p);
 	}
-
-	__set_task_cpu(p, new_cpu);
 }
 
 static void __migrate_swap_task(struct task_struct *p, int cpu)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 522aa07..c9caf83 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5060,21 +5060,22 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 }
 
 /*
- * Called immediately before a task is migrated to a new cpu; task_cpu(p) and
- * cfs_rq_of(p) references at time of call are still valid and identify the
- * previous cpu.  However, the caller only guarantees p->pi_lock is held; no
- * other assumptions, including the state of rq->lock, should be made.
+ * Called immediately after a task is migrated to a new cpu; task_cpu(p) and
+ * cfs_rq_of(p) references at time of call identify the next cpu. However,
+ * the caller only guarantees p->pi_lock is held; no other assumptions,
+ * including the state of rq->lock, should be made.
  */
-static void migrate_task_rq_fair(struct task_struct *p, int next_cpu)
+static void migrate_task_rq_fair(struct task_struct *p, int prev_cpu)
 {
-	/*
-	 * We are supposed to update the task to "current" time, then its up to date
-	 * and ready to go to new CPU/cfs_rq. But we have difficulty in getting
-	 * what current time is, so simply throw away the out-of-date time. This
-	 * will result in the wakee task is less decayed, but giving the wakee more
-	 * load sounds not bad.
-	 */
-	remove_entity_load_avg(&p->se, cfs_rq_of(&p->se));
+#ifdef CONFIG_CGROUP_SCHED
+	struct cfs_rq *cfs_rq = task_group(p)->cfs_rq[prev_cpu];
+#else
+	struct cfs_rq *cfs_rq = cpu_rq(prev_cpu)->cfs;
+#endif
+
+	if (!sched_feat(ATTACH_AGE_LOAD))
+		__update_entity_load_avg(&p->se, cfs_rq);
+	__remove_entity_load_avg(&p->se, cfs_rq);
 
 	/* Tell new CPU we are migrated */
 	p->se.avg.last_update_time = 0;
-- 
1.7.9.5



* Re: [PATCH v4 3/3] sched: optimize migration by forcing rmb() and updating to be called once
  2015-10-23 16:16 ` [PATCH v4 3/3] sched: optimize migration by forcing rmb() and updating to be called once byungchul.park
@ 2015-11-09 13:29   ` Peter Zijlstra
  2015-11-10  1:09     ` Byungchul Park
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2015-11-09 13:29 UTC (permalink / raw)
  To: byungchul.park; +Cc: mingo, linux-kernel, yuyang.du, pjt, efault, tglx

On Sat, Oct 24, 2015 at 01:16:21AM +0900, byungchul.park@lge.com wrote:
> +++ b/kernel/sched/core.c
> @@ -1264,6 +1264,8 @@ EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
>  
>  void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
>  {
> +	unsigned int prev_cpu = task_cpu(p);
> +
>  #ifdef CONFIG_SCHED_DEBUG
>  	/*
>  	 * We should never call set_task_cpu() on a blocked task,
> @@ -1289,15 +1291,14 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
>  #endif
>  
>  	trace_sched_migrate_task(p, new_cpu);
> +	__set_task_cpu(p, new_cpu);
>  
> -	if (task_cpu(p) != new_cpu) {
> +	if (prev_cpu != new_cpu) {
>  		if (p->sched_class->migrate_task_rq)
> -			p->sched_class->migrate_task_rq(p, new_cpu);
> +			p->sched_class->migrate_task_rq(p, prev_cpu);
>  		p->se.nr_migrations++;
>  		perf_event_task_migrate(p);
>  	}
> -
> -	__set_task_cpu(p, new_cpu);
>  }

I don't think this is safe, see the comment in __set_task_cpu(). We want
that to be last.
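
For reference, the helper being referred to looked roughly like this at the
time (a sketch; the comment is the one Byungchul quotes below, the rest is
reconstructed from the 4.x sources and may not match the code exactly):

	static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
	{
		set_task_rq(p, cpu);
	#ifdef CONFIG_SMP
		/*
		 * After ->cpu is set up to a new value, task_rq_lock(p, ...) can be
		 * successfuly executed on another CPU. We must ensure that updates of
		 * per-task data have been completed by this moment.
		 */
		smp_wmb();
		task_thread_info(p)->cpu = cpu;
		p->wake_cpu = cpu;
	#endif
	}

Once the ->cpu store lands, task_rq(p) on another CPU resolves to the new
runqueue, so everything that must be visible before the task can be locked
there has to happen before this call -- hence the request to keep it last.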



* Re: [PATCH v4 3/3] sched: optimize migration by forcing rmb() and updating to be called once
  2015-11-09 13:29   ` Peter Zijlstra
@ 2015-11-10  1:09     ` Byungchul Park
  2015-11-10 12:16       ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Byungchul Park @ 2015-11-10  1:09 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, yuyang.du, pjt, efault, tglx

On Mon, Nov 09, 2015 at 02:29:14PM +0100, Peter Zijlstra wrote:
> On Sat, Oct 24, 2015 at 01:16:21AM +0900, byungchul.park@lge.com wrote:
> > +++ b/kernel/sched/core.c
> > @@ -1264,6 +1264,8 @@ EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
> >  
> >  void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
> >  {
> > +	unsigned int prev_cpu = task_cpu(p);
> > +
> >  #ifdef CONFIG_SCHED_DEBUG
> >  	/*
> >  	 * We should never call set_task_cpu() on a blocked task,
> > @@ -1289,15 +1291,14 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
> >  #endif
> >  
> >  	trace_sched_migrate_task(p, new_cpu);
> > +	__set_task_cpu(p, new_cpu);
> >  
> > -	if (task_cpu(p) != new_cpu) {
> > +	if (prev_cpu != new_cpu) {
> >  		if (p->sched_class->migrate_task_rq)
> > -			p->sched_class->migrate_task_rq(p, new_cpu);
> > +			p->sched_class->migrate_task_rq(p, prev_cpu);
> >  		p->se.nr_migrations++;
> >  		perf_event_task_migrate(p);
> >  	}
> > -
> > -	__set_task_cpu(p, new_cpu);
> >  }
> 
> I don't think this is safe, see the comment in __set_task_cpu(). We want
> that to be last.

I am sorry but I don't understand what you said. I checked the comment in 
__set_task_cpu().

	/*
	 * After ->cpu is set up to a new value, task_rq_lock(p, ...) can be
	 * successfuly executed on another CPU. We must ensure that updates of
	 * per-task data have been completed by this moment.
	 */

Of course, ->cpu should be set up to a new value for task_rq_lock() to be
executed successfully on another CPU. Is this the case? Is there something
I missed? I think it would be ok if task->pi_lock can work correctly within
the "if" statement in set_task_cpu(). Is there a problem with doing that?

> 

* Re: [PATCH v4 3/3] sched: optimize migration by forcing rmb() and updating to be called once
  2015-11-10  1:09     ` Byungchul Park
@ 2015-11-10 12:16       ` Peter Zijlstra
  2015-11-10 23:51         ` Byungchul Park
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2015-11-10 12:16 UTC (permalink / raw)
  To: Byungchul Park; +Cc: mingo, linux-kernel, yuyang.du, pjt, efault, tglx

On Tue, Nov 10, 2015 at 10:09:05AM +0900, Byungchul Park wrote:
> On Mon, Nov 09, 2015 at 02:29:14PM +0100, Peter Zijlstra wrote:
> > On Sat, Oct 24, 2015 at 01:16:21AM +0900, byungchul.park@lge.com wrote:
> > > +++ b/kernel/sched/core.c
> > > @@ -1264,6 +1264,8 @@ EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
> > >  
> > >  void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
> > >  {
> > > +	unsigned int prev_cpu = task_cpu(p);
> > > +
> > >  #ifdef CONFIG_SCHED_DEBUG
> > >  	/*
> > >  	 * We should never call set_task_cpu() on a blocked task,
> > > @@ -1289,15 +1291,14 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
> > >  #endif
> > >  
> > >  	trace_sched_migrate_task(p, new_cpu);
> > > +	__set_task_cpu(p, new_cpu);
> > >  
> > > -	if (task_cpu(p) != new_cpu) {
> > > +	if (prev_cpu != new_cpu) {
> > >  		if (p->sched_class->migrate_task_rq)
> > > -			p->sched_class->migrate_task_rq(p, new_cpu);
> > > +			p->sched_class->migrate_task_rq(p, prev_cpu);
> > >  		p->se.nr_migrations++;
> > >  		perf_event_task_migrate(p);
> > >  	}
> > > -
> > > -	__set_task_cpu(p, new_cpu);
> > >  }
> > 
> > I don't think this is safe, see the comment in __set_task_cpu(). We want
> > that to be last.
> 
> I am sorry but I don't understand what you said. I checked the comment in 
> __set_task_cpu().
> 
> 	/*
> 	 * After ->cpu is set up to a new value, task_rq_lock(p, ...) can be
> 	 * successfuly executed on another CPU. We must ensure that updates of
> 	 * per-task data have been completed by this moment.
> 	 */
> 
> Of course, ->cpu should be set up to a new value for task_rq_lock() to be
> executed successfully on another CPU. Is this the case? Is there something
> i missed? I think it would be ok if task->pi_lock can work correctly within
> "if" statement in set_task_cpu(). Is there problem to do that?

So the problem is that as soon as that ->cpu store comes through, the
other rq->lock can happen, even though we might still hold a rq->lock
thinking we're serialized.

Take for instance move_queued_tasks(), it does:

	dequeue_task(rq, p, 0);
	p->on_rq = TASK_ON_RQ_MIGRATING;
	set_task_cpu(p, new_cpu) {
	  __set_task_cpu();

^^^ here holding rq->lock is insufficient and the below:

	  p->sched_class->migrate_task_rq()

would no longer be serialized by rq->lock.

	}
	raw_spin_unlock(&rq->lock);




* Re: [PATCH v4 3/3] sched: optimize migration by forcing rmb() and updating to be called once
  2015-11-10 12:16       ` Peter Zijlstra
@ 2015-11-10 23:51         ` Byungchul Park
  2015-11-11 10:15           ` Byungchul Park
  2015-11-16 12:53           ` Peter Zijlstra
  0 siblings, 2 replies; 16+ messages in thread
From: Byungchul Park @ 2015-11-10 23:51 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, yuyang.du, pjt, efault, tglx

On Tue, Nov 10, 2015 at 01:16:47PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 10, 2015 at 10:09:05AM +0900, Byungchul Park wrote:
> > On Mon, Nov 09, 2015 at 02:29:14PM +0100, Peter Zijlstra wrote:
> > > On Sat, Oct 24, 2015 at 01:16:21AM +0900, byungchul.park@lge.com wrote:
> > > > +++ b/kernel/sched/core.c
> > > > @@ -1264,6 +1264,8 @@ EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
> > > >  
> > > >  void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
> > > >  {
> > > > +	unsigned int prev_cpu = task_cpu(p);
> > > > +
> > > >  #ifdef CONFIG_SCHED_DEBUG
> > > >  	/*
> > > >  	 * We should never call set_task_cpu() on a blocked task,
> > > > @@ -1289,15 +1291,14 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
> > > >  #endif
> > > >  
> > > >  	trace_sched_migrate_task(p, new_cpu);
> > > > +	__set_task_cpu(p, new_cpu);
> > > >  
> > > > -	if (task_cpu(p) != new_cpu) {
> > > > +	if (prev_cpu != new_cpu) {
> > > >  		if (p->sched_class->migrate_task_rq)
> > > > -			p->sched_class->migrate_task_rq(p, new_cpu);
> > > > +			p->sched_class->migrate_task_rq(p, prev_cpu);
> > > >  		p->se.nr_migrations++;
> > > >  		perf_event_task_migrate(p);
> > > >  	}
> > > > -
> > > > -	__set_task_cpu(p, new_cpu);
> > > >  }
> > > 
> > > I don't think this is safe, see the comment in __set_task_cpu(). We want
> > > that to be last.
> > 
> > I am sorry but I don't understand what you said. I checked the comment in 
> > __set_task_cpu().
> > 
> > 	/*
> > 	 * After ->cpu is set up to a new value, task_rq_lock(p, ...) can be
> > 	 * successfuly executed on another CPU. We must ensure that updates of
> > 	 * per-task data have been completed by this moment.
> > 	 */
> > 
> > Of course, ->cpu should be set up to a new value for task_rq_lock() to be
> > executed successfully on another CPU. Is this the case? Is there something
> > i missed? I think it would be ok if task->pi_lock can work correctly within
> > "if" statement in set_task_cpu(). Is there problem to do that?
> 
> So the problem is that as soon as that ->cpu store comes through, the
> other rq->lock can happen, even though we might still hold a rq->lock
> thinking we're serialized.
> 
> Take for instance move_queued_tasks(), it does:
> 
> 	dequeue_task(rq, p, 0);
> 	p->on_rq = TASK_ON_RQ_MIGRATING;
> 	set_task_cpu(p, new_cpu) {
> 	  __set_task_cpu();
> 
> ^^^ here holding rq->lock is insufficient and the below:
> 
> 	  p->sched_class->migrate_task_rq()

Thank you for explaining in detail, but this is why I asked you.
Yes, rq->lock is insufficient in this place as you said, but
should migrate_task_rq() be serialized by rq->lock? I might have
agreed with you if the migrate_task_rq() should be serialized by
rq->lock, but I think it's not the case. I think it would be of
if task->pi_lock can work correctly within the *if statement* in
set_task_cpu(). Wrong?

> 
> would no longer be serialized by rq->lock.
> 
> 	}
> 	raw_spin_unlock(&rq->lock);
> 
> 

* Re: [PATCH v4 3/3] sched: optimize migration by forcing rmb() and updating to be called once
  2015-11-10 23:51         ` Byungchul Park
@ 2015-11-11 10:15           ` Byungchul Park
  2015-11-16 12:53           ` Peter Zijlstra
  1 sibling, 0 replies; 16+ messages in thread
From: Byungchul Park @ 2015-11-11 10:15 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, yuyang.du, pjt, efault, tglx

On Wed, Nov 11, 2015 at 08:51:47AM +0900, Byungchul Park wrote:
> On Tue, Nov 10, 2015 at 01:16:47PM +0100, Peter Zijlstra wrote:
> > On Tue, Nov 10, 2015 at 10:09:05AM +0900, Byungchul Park wrote:
> > > On Mon, Nov 09, 2015 at 02:29:14PM +0100, Peter Zijlstra wrote:
> > > > On Sat, Oct 24, 2015 at 01:16:21AM +0900, byungchul.park@lge.com wrote:
> > > > > +++ b/kernel/sched/core.c
> > > > > @@ -1264,6 +1264,8 @@ EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
> > > > >  
> > > > >  void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
> > > > >  {
> > > > > +	unsigned int prev_cpu = task_cpu(p);
> > > > > +
> > > > >  #ifdef CONFIG_SCHED_DEBUG
> > > > >  	/*
> > > > >  	 * We should never call set_task_cpu() on a blocked task,
> > > > > @@ -1289,15 +1291,14 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
> > > > >  #endif
> > > > >  
> > > > >  	trace_sched_migrate_task(p, new_cpu);
> > > > > +	__set_task_cpu(p, new_cpu);
> > > > >  
> > > > > -	if (task_cpu(p) != new_cpu) {
> > > > > +	if (prev_cpu != new_cpu) {
> > > > >  		if (p->sched_class->migrate_task_rq)
> > > > > -			p->sched_class->migrate_task_rq(p, new_cpu);
> > > > > +			p->sched_class->migrate_task_rq(p, prev_cpu);
> > > > >  		p->se.nr_migrations++;
> > > > >  		perf_event_task_migrate(p);
> > > > >  	}
> > > > > -
> > > > > -	__set_task_cpu(p, new_cpu);
> > > > >  }
> > > > 
> > > > I don't think this is safe, see the comment in __set_task_cpu(). We want
> > > > that to be last.
> > > 
> > > I am sorry but I don't understand what you said. I checked the comment in 
> > > __set_task_cpu().
> > > 
> > > 	/*
> > > 	 * After ->cpu is set up to a new value, task_rq_lock(p, ...) can be
> > > 	 * successfuly executed on another CPU. We must ensure that updates of
> > > 	 * per-task data have been completed by this moment.
> > > 	 */
> > > 
> > > Of course, ->cpu should be set up to a new value for task_rq_lock() to be
> > > executed successfully on another CPU. Is this the case? Is there something
> > > i missed? I think it would be ok if task->pi_lock can work correctly within
> > > "if" statement in set_task_cpu(). Is there problem to do that?
> > 
> > So the problem is that as soon as that ->cpu store comes through, the
> > other rq->lock can happen, even though we might still hold a rq->lock
> > thinking we're serialized.
> > 
> > Take for instance move_queued_tasks(), it does:
> > 
> > 	dequeue_task(rq, p, 0);
> > 	p->on_rq = TASK_ON_RQ_MIGRATING;
> > 	set_task_cpu(p, new_cpu) {
> > 	  __set_task_cpu();
> > 
> > ^^^ here holding rq->lock is insufficient and the below:
> > 
> > 	  p->sched_class->migrate_task_rq()
> 
> Thank you for explaning in detail, but this's why i asked you.
> Yes, rq->lock is insufficient in this place as you said, but
> should migrate_task_rq() be serialized by rq->lock? I might have
> agreed with you if the migrate_task_rq() should be serialized by
> rq->lock, but I think it's not the case. I think it would be of

rq->lock, but I think it's not the case. I think it would be *ok*

(sorry for typo)

> if task->pi_lock can work correcly within *if statement* in 
> set_task_cpu(). Wrong?
> 
> > 
> > would no longer be serialized by rq->lock.
> > 
> > 	}
> > 	raw_spin_unlock(&rq->lock);
> > 
> > 

* Re: [PATCH v4 3/3] sched: optimize migration by forcing rmb() and updating to be called once
  2015-11-10 23:51         ` Byungchul Park
  2015-11-11 10:15           ` Byungchul Park
@ 2015-11-16 12:53           ` Peter Zijlstra
  2015-11-17  0:44             ` Byungchul Park
  1 sibling, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2015-11-16 12:53 UTC (permalink / raw)
  To: Byungchul Park; +Cc: mingo, linux-kernel, yuyang.du, pjt, efault, tglx

On Wed, Nov 11, 2015 at 08:51:47AM +0900, Byungchul Park wrote:
> On Tue, Nov 10, 2015 at 01:16:47PM +0100, Peter Zijlstra wrote:
> > So the problem is that as soon as that ->cpu store comes through, the
> > other rq->lock can happen, even though we might still hold a rq->lock
> > thinking we're serialized.
> > 
> > Take for instance move_queued_tasks(), it does:
> > 
> > 	dequeue_task(rq, p, 0);
> > 	p->on_rq = TASK_ON_RQ_MIGRATING;
> > 	set_task_cpu(p, new_cpu) {
> > 	  __set_task_cpu();
> > 
> > ^^^ here holding rq->lock is insufficient and the below:
> > 
> > 	  p->sched_class->migrate_task_rq()
> 
> Thank you for explaning in detail, but this's why i asked you.

> Yes, rq->lock is insufficient in this place as you said, but
> should migrate_task_rq() be serialized by rq->lock? I might have
> agreed with you if the migrate_task_rq() should be serialized by
> rq->lock, but I think it's not the case. I think it would be of
> if task->pi_lock can work correcly within *if statement* in 
> set_task_cpu(). Wrong?

So currently, set_task_cpu() is serialized by:

 - p->pi_lock; on wakeup
 - rq->lock; otherwise

(see the #ifdef CONFIG_LOCKDEP comment in set_task_cpu())
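
That check expresses the "either lock" rule directly; roughly (paraphrased
from the 4.x set_task_cpu(), the exact comment and expression may differ):

	#ifdef CONFIG_LOCKDEP
		WARN_ON_ONCE(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
					      lockdep_is_held(&task_rq(p)->lock)));
	#endif

Note that it only requires one of the two locks to be held.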

This means that sched_class::migrate_task() cannot indeed rely on
rq->lock for full serialization, however it still means that
task_rq_lock() will fully serialize against the thing.

By changing this, it no longer will.

Even without that; I think such a change, if correct, is very fragile
and prone to creating problems later on, and sets bad precedent.


* Re: [PATCH v4 3/3] sched: optimize migration by forcing rmb() and updating to be called once
  2015-11-16 12:53           ` Peter Zijlstra
@ 2015-11-17  0:44             ` Byungchul Park
  2015-11-17 11:21               ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Byungchul Park @ 2015-11-17  0:44 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, yuyang.du, pjt, efault, tglx

On Mon, Nov 16, 2015 at 01:53:51PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 11, 2015 at 08:51:47AM +0900, Byungchul Park wrote:
> > On Tue, Nov 10, 2015 at 01:16:47PM +0100, Peter Zijlstra wrote:
> > > So the problem is that as soon as that ->cpu store comes through, the
> > > other rq->lock can happen, even though we might still hold a rq->lock
> > > thinking we're serialized.
> > > 
> > > Take for instance move_queued_tasks(), it does:
> > > 
> > > 	dequeue_task(rq, p, 0);
> > > 	p->on_rq = TASK_ON_RQ_MIGRATING;
> > > 	set_task_cpu(p, new_cpu) {
> > > 	  __set_task_cpu();
> > > 
> > > ^^^ here holding rq->lock is insufficient and the below:
> > > 
> > > 	  p->sched_class->migrate_task_rq()
> > 
> > Thank you for explaning in detail, but this's why i asked you.
> 
> > Yes, rq->lock is insufficient in this place as you said, but
> > should migrate_task_rq() be serialized by rq->lock? I might have
> > agreed with you if the migrate_task_rq() should be serialized by
> > rq->lock, but I think it's not the case. I think it would be of
> > if task->pi_lock can work correcly within *if statement* in 
> > set_task_cpu(). Wrong?
> 
> So currently, set_task_cpu() is serialized by:
> 
>  - p->pi_lock; on wakeup
>  - rq->lock; otherwise
> 
> (see the #ifdef CONFIG_LOCKDEP comment in set_task_cpu())

I already read the comment... Then do you mean the comment above
migrate_task_rq_fair() is wrong and should be fixed? When I read it, I
thought the comment above migrate_task_rq_fair() was correct rather than
the CONFIG_LOCKDEP comment in set_task_cpu(). I think these two comments
conflict with each other a little bit, so one of them should be fixed.

* the comment above migrate_task_rq_fair() describes it like,
Caller SHOULD HOLD (&p->pi_lock)

* the CONFIG_LOCKDEP comment in set_task_cpu() describes it like,
Caller SHOULD HOLD (&p->pi_lock || &rq->lock)

> 
> This means that sched_class::migrate_task() cannot indeed rely on
> rq->lock for full serialization, however it still means that
> task_rq_lock() will fully serialize against the thing.

Yes I also think this is true.

> 
> By changing this, it no longer will.

???

> 
> Even without that; I think such a change, if correct, is very fragile
> and prone to creating problems later on, and sets bad precedent.

* Re: [PATCH v4 3/3] sched: optimize migration by forcing rmb() and updating to be called once
  2015-11-17  0:44             ` Byungchul Park
@ 2015-11-17 11:21               ` Peter Zijlstra
  2015-11-17 23:37                 ` Byungchul Park
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2015-11-17 11:21 UTC (permalink / raw)
  To: Byungchul Park; +Cc: mingo, linux-kernel, yuyang.du, pjt, efault, tglx

On Tue, Nov 17, 2015 at 09:44:16AM +0900, Byungchul Park wrote:

> > So currently, set_task_cpu() is serialized by:
> > 
> >  - p->pi_lock; on wakeup
> >  - rq->lock; otherwise
> > 
> > (see the #ifdef CONFIG_LOCKDEP comment in set_task_cpu())
> 
> I already read the comment.. Then do you mean the comment above
> migrate_task_rq_fair() is wrong and should be fixed? 

Looks that way, I'm not sure we always hold pi_lock there. But I'm low
on sleep, so I could have overlooked something.

See for example move_queued_task(), we call set_task_cpu() with rq->lock
held, but no pi_lock.

> I thought the comment above migrate_task_rq_fair() is correct rather
> than CONFIG_LOCKDEP comment in set_task_cpu(), when I read it. I think
> these two comments are conflict each other a little bit, so one of
> those should be fixed.

Agreed.

> * the comment above migrate_task_rq_fair() describes it like,
> Caller SHOULD HOLD (&p->pi_lock)
> 
> * the CONFIG_LOCKDEP comment in set_task_cpu() describes it like,
> Caller SHOULD HOLD (&p->pi_lock || &rq->lock)

Indeed.

> > 
> > This means that sched_class::migrate_task() cannot indeed rely on
> > rq->lock for full serialization, however it still means that
> > task_rq_lock() will fully serialize against the thing.
> 
> Yes I also think this is true.
> 
> > 
> > By changing this, it no longer will.
> 
> ???

I meant, if you call __set_task_cpu() before
sched_class::migrate_task_rq(), in that case task_rq_lock() will no
longer fully serialize against set_task_cpu().

Because once you've called __set_task_cpu(), task_rq_lock() will acquire
the _other_ rq->lock. And we cannot rely on our rq->lock to serialize
things.
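
A simplified sketch of task_rq_lock() makes that concrete (reconstructed
from the 4.x sources; the real function also rechecks the task's migrating
state, and other details may differ slightly):

	static struct rq *task_rq_lock(struct task_struct *p, unsigned long *flags)
	{
		struct rq *rq;

		for (;;) {
			raw_spin_lock_irqsave(&p->pi_lock, *flags);
			rq = task_rq(p);	/* reads p's ->cpu */
			raw_spin_lock(&rq->lock);
			if (likely(rq == task_rq(p)))
				return rq;	/* holds pi_lock and the task's rq->lock */
			raw_spin_unlock(&rq->lock);
			raw_spin_unlock_irqrestore(&p->pi_lock, *flags);
		}
	}

Once __set_task_cpu() has stored the new ->cpu, a concurrent task_rq_lock()
resolves task_rq(p) to the new runqueue and returns holding the new rq->lock,
while the CPU doing the migration still holds only the old rq->lock; whatever
migrate_task_rq() does after that point is no longer serialized against it.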


* Re: [PATCH v4 3/3] sched: optimize migration by forcing rmb() and updating to be called once
  2015-11-17 11:21               ` Peter Zijlstra
@ 2015-11-17 23:37                 ` Byungchul Park
  2015-11-17 23:55                   ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Byungchul Park @ 2015-11-17 23:37 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, yuyang.du, pjt, efault, tglx

On Tue, Nov 17, 2015 at 12:21:49PM +0100, Peter Zijlstra wrote:
> 
> Looks that way, I'm not sure we always hold pi_lock there. But I'm low
> on sleep, so I could have overlooked something.
> 
> See for example move_queued_task(), we call set_task_cpu() with rq->lock
> held, but no pi_lock.

Indeed.

> 
> > I thought the comment above migrate_task_rq_fair() is correct rather
> > than CONFIG_LOCKDEP comment in set_task_cpu(), when I read it. I think
> > these two comments are conflict each other a little bit, so one of
> > those should be fixed.
> 
> Agreed.

Which one do you think should be fixed? The one above migrate_task_rq_fair()?
I wonder if it would be ok even if it does not hold pi_lock in
migrate_task_rq_fair(). If you say *no problem*, I will try to fix the
comment.

> 
> I meant, if you call __set_task_cpu() before
> sched_class::migrate_task_rq(), in that case task_rq_lock() will no
> longer fully serialize against set_task_cpu().
> 
> Because once you've called __set_task_cpu(), task_rq_lock() will acquire
> the _other_ rq->lock. And we cannot rely on our rq->lock to serialize
> things.

I agree with you if migrate_task_rq() can be serialized by rq->lock
without holding pi_lock. (even though I am still wondering...)

But I thought it was no problem if migrate_task_rq() was serialized only
by pi_lock, as the comment above migrate_task_rq() describes, because
breaking rq->lock does not affect the serialization by pi_lock.

I would appreciate it if you would answer my questions.


* Re: [PATCH v4 3/3] sched: optimize migration by forcing rmb() and updating to be called once
  2015-11-17 23:37                 ` Byungchul Park
@ 2015-11-17 23:55                   ` Peter Zijlstra
  2015-11-18  0:02                     ` Byungchul Park
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2015-11-17 23:55 UTC (permalink / raw)
  To: Byungchul Park; +Cc: mingo, linux-kernel, yuyang.du, pjt, efault, tglx

On Wed, Nov 18, 2015 at 08:37:00AM +0900, Byungchul Park wrote:

> Which one do you think to be fixed? The one above migrate_task_rq_fair()?
> I wonder if it would be ok even it does not hold pi_lock in
> migrate_task_rq_fair(). If you say *no problem*, I will try to fix the
> comment.

The one above migrate_task_rq_fair() is obviously broken, as
demonstrated by the move_queued_task() case.

Also, pretty much all runnable task migration code will not take
pi_lock, see also {pull,push}_{rt,dl}_task().

Note that this is done very much by design, task_rq_lock() is the thing
that fully serializes a task's scheduler state. Runnable tasks use
rq->lock, waking tasks use pi_lock.

> > I meant, if you call __set_task_cpu() before
> > sched_class::migrate_task_rq(), in that case task_rq_lock() will no
> > longer fully serialize against set_task_cpu().
> > 
> > Because once you've called __set_task_cpu(), task_rq_lock() will acquire
> > the _other_ rq->lock. And we cannot rely on our rq->lock to serialize
> > things.
> 
> I agree with you if migtrate_task_rq() can be serialized by rq->lock
> without holding pi_lock. (even though I am still wondering..)

move_queued_task() illustrates this.

> But I thought it was no problem if migrate_task_rq() was serialized only
> by pi_lock as the comment above the migrate_task_rq() describes, because
> breaking rq->lock does not affect the sericalization by pi_lock.

Right, but per the above, we cannot assume pi_lock is in fact held over
this.



* Re: [PATCH v4 3/3] sched: optimize migration by forcing rmb() and updating to be called once
  2015-11-17 23:55                   ` Peter Zijlstra
@ 2015-11-18  0:02                     ` Byungchul Park
  0 siblings, 0 replies; 16+ messages in thread
From: Byungchul Park @ 2015-11-18  0:02 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, yuyang.du, pjt, efault, tglx

On Wed, Nov 18, 2015 at 12:55:10AM +0100, Peter Zijlstra wrote:
> On Wed, Nov 18, 2015 at 08:37:00AM +0900, Byungchul Park wrote:
> 
> > Which one do you think to be fixed? The one above migrate_task_rq_fair()?
> > I wonder if it would be ok even it does not hold pi_lock in
> > migrate_task_rq_fair(). If you say *no problem*, I will try to fix the
> > comment.
> 
> The one above migrate_task_rq_fair() is obviously broken, as
> demonstrated by the move_queued_task() case.
> 
> Also, pretty much all runnable task migration code will not take
> pi_lock, see also {pull,push}_{rt,dl}_task().
> 
> Note that this is done very much by design, task_rq_lock() is the thing
> that fully serializes a task's scheduler state. Runnable tasks use
> rq->lock, waking tasks use pi_lock.
> 
> > > I meant, if you call __set_task_cpu() before
> > > sched_class::migrate_task_rq(), in that case task_rq_lock() will no
> > > longer fully serialize against set_task_cpu().
> > > 
> > > Because once you've called __set_task_cpu(), task_rq_lock() will acquire
> > > the _other_ rq->lock. And we cannot rely on our rq->lock to serialize
> > > things.
> > 
> > I agree with you if migtrate_task_rq() can be serialized by rq->lock
> > without holding pi_lock. (even though I am still wondering..)
> 
> move_queued_task() illustrates this.
> 
> > But I thought it was no problem if migrate_task_rq() was serialized only
> > by pi_lock as the comment above the migrate_task_rq() describes, because
> > breaking rq->lock does not affect the sericalization by pi_lock.
> 
> Right, but per the above, we cannot assume pi_lock is in fact held over
> this.

Thank you.

> 

* [tip:sched/core] sched/fair: Make it possible to account fair load avg consistently
  2015-10-23 16:16 ` [PATCH v4 1/3] sched/fair: make it possible to " byungchul.park
@ 2015-12-04 11:54   ` tip-bot for Byungchul Park
  0 siblings, 0 replies; 16+ messages in thread
From: tip-bot for Byungchul Park @ 2015-12-04 11:54 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, hpa, byungchul.park, peterz, mingo, efault, tglx, linux-kernel

Commit-ID:  ad936d8658fd348338cb7d42c577dac77892b074
Gitweb:     http://git.kernel.org/tip/ad936d8658fd348338cb7d42c577dac77892b074
Author:     Byungchul Park <byungchul.park@lge.com>
AuthorDate: Sat, 24 Oct 2015 01:16:19 +0900
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 4 Dec 2015 10:34:42 +0100

sched/fair: Make it possible to account fair load avg consistently

The current code accounts for the time a task was absent from the fair
class (per ATTACH_AGE_LOAD). However it does not work correctly when a
task got migrated or moved to another cgroup while outside of the fair
class.

This patch tries to address that by aging on migration. We locklessly
read the 'last_update_time' stamp from both the old and new cfs_rq, age
the load up to the old time, and set it to the new time.

These timestamps should in general not be more than 1 tick apart from
one another, so there is a definite bound on things.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
[ Changelog, a few edits and !SMP build fix ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1445616981-29904-2-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c  |  4 ++++
 kernel/sched/fair.c  | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h | 11 ++++++++++-
 3 files changed, 60 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8969a9a..32d83e4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2120,6 +2120,10 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->se.vruntime			= 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	p->se.cfs_rq			= NULL;
+#endif
+
 #ifdef CONFIG_SCHEDSTATS
 	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ff8ec86..efd664c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2715,6 +2715,52 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
 	}
 }
 
+/*
+ * Called within set_task_rq() right before setting a task's cpu. The
+ * caller only guarantees p->pi_lock is held; no other assumptions,
+ * including the state of rq->lock, should be made.
+ */
+void set_task_rq_fair(struct sched_entity *se,
+		      struct cfs_rq *prev, struct cfs_rq *next)
+{
+	if (!sched_feat(ATTACH_AGE_LOAD))
+		return;
+
+	/*
+	 * We are supposed to update the task to "current" time, then its up to
+	 * date and ready to go to new CPU/cfs_rq. But we have difficulty in
+	 * getting what current time is, so simply throw away the out-of-date
+	 * time. This will result in the wakee task is less decayed, but giving
+	 * the wakee more load sounds not bad.
+	 */
+	if (se->avg.last_update_time && prev) {
+		u64 p_last_update_time;
+		u64 n_last_update_time;
+
+#ifndef CONFIG_64BIT
+		u64 p_last_update_time_copy;
+		u64 n_last_update_time_copy;
+
+		do {
+			p_last_update_time_copy = prev->load_last_update_time_copy;
+			n_last_update_time_copy = next->load_last_update_time_copy;
+
+			smp_rmb();
+
+			p_last_update_time = prev->avg.last_update_time;
+			n_last_update_time = next->avg.last_update_time;
+
+		} while (p_last_update_time != p_last_update_time_copy ||
+			 n_last_update_time != n_last_update_time_copy);
+#else
+		p_last_update_time = prev->avg.last_update_time;
+		n_last_update_time = next->avg.last_update_time;
+#endif
+		__update_load_avg(p_last_update_time, cpu_of(rq_of(prev)),
+				  &se->avg, 0, 0, NULL);
+		se->avg.last_update_time = n_last_update_time;
+	}
+}
 #else /* CONFIG_FAIR_GROUP_SCHED */
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force) {}
 #endif /* CONFIG_FAIR_GROUP_SCHED */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cdae23d..9a029fa 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -335,7 +335,15 @@ extern void sched_move_task(struct task_struct *tsk);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
-#endif
+
+#ifdef CONFIG_SMP
+extern void set_task_rq_fair(struct sched_entity *se,
+			     struct cfs_rq *prev, struct cfs_rq *next);
+#else /* !CONFIG_SMP */
+static inline void set_task_rq_fair(struct sched_entity *se,
+			     struct cfs_rq *prev, struct cfs_rq *next) { }
+#endif /* CONFIG_SMP */
+#endif /* CONFIG_FAIR_GROUP_SCHED */
 
 #else /* CONFIG_CGROUP_SCHED */
 
@@ -933,6 +941,7 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
 #endif
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
+	set_task_rq_fair(&p->se, p->se.cfs_rq, tg->cfs_rq[cpu]);
 	p->se.cfs_rq = tg->cfs_rq[cpu];
 	p->se.parent = tg->se[cpu];
 #endif

