linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] sched/fair: vruntime should normalize when switching from fair
@ 2018-08-17 18:27 Steve Muckle
  2018-08-20 23:54 ` Miguel de Dios
  0 siblings, 1 reply; 29+ messages in thread
From: Steve Muckle @ 2018-08-17 18:27 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, kernel-team, Todd Kjos, Paul Turner,
	Quentin Perret, Patrick Bellasi, Chris Redpath, Morten Rasmussen,
	John Dias, Steve Muckle

From: John Dias <joaodias@google.com>

When rt_mutex_setprio changes a task's scheduling class to RT,
we're seeing cases where the task's vruntime is not updated
correctly upon return to the fair class.
Specifically, the following is being observed:
- task is deactivated while still in the fair class
- task is boosted to RT via rt_mutex_setprio, which changes
  the task to RT and calls check_class_changed.
- check_class_changed leads to detach_task_cfs_rq, at which point
  the vruntime_normalized check sees that the task's state is TASK_WAKING,
  which results in skipping the subtraction of the rq's min_vruntime
  from the task's vruntime
- later, when the prio is deboosted and the task is moved back
  to the fair class, the fair rq's min_vruntime is added to
  the task's vruntime, even though it wasn't subtracted earlier.
The immediate result is inflation of the task's vruntime, giving
it lower priority (starving it if there's enough available work).
The longer-term effect is inflation of all vruntimes because the
task's vruntime becomes the rq's min_vruntime when the higher
priority tasks go idle. That leads to a vicious cycle, where
the vruntime inflation repeatedly doubled.

The change here is to detect when vruntime_normalized is being
called when the task is waking but is waking in another class,
and to conclude that this is a case where vruntime has not
been normalized.

Signed-off-by: John Dias <joaodias@google.com>
Signed-off-by: Steve Muckle <smuckle@google.com>
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b39fb596f6c1..14011d7929d8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9638,7 +9638,8 @@ static inline bool vruntime_normalized(struct task_struct *p)
 	 * - A task which has been woken up by try_to_wake_up() and
 	 *   waiting for actually being woken up by sched_ttwu_pending().
 	 */
-	if (!se->sum_exec_runtime || p->state == TASK_WAKING)
+	if (!se->sum_exec_runtime ||
+	    (p->state == TASK_WAKING && p->sched_class == &fair_sched_class))
 		return true;
 
 	return false;
-- 
2.18.0.865.gffc8e1a3cd6-goog


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
  2018-08-17 18:27 [PATCH] sched/fair: vruntime should normalize when switching from fair Steve Muckle
@ 2018-08-20 23:54 ` Miguel de Dios
  2018-08-23 16:52   ` Dietmar Eggemann
  2018-08-24  9:32   ` Peter Zijlstra
  0 siblings, 2 replies; 29+ messages in thread
From: Miguel de Dios @ 2018-08-20 23:54 UTC (permalink / raw)
  To: Steve Muckle, Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, kernel-team, Todd Kjos, Paul Turner,
	Quentin Perret, Patrick Bellasi, Chris Redpath, Morten Rasmussen,
	John Dias

On 08/17/2018 11:27 AM, Steve Muckle wrote:
> From: John Dias <joaodias@google.com>
>
> When rt_mutex_setprio changes a task's scheduling class to RT,
> we're seeing cases where the task's vruntime is not updated
> correctly upon return to the fair class.
> Specifically, the following is being observed:
> - task is deactivated while still in the fair class
> - task is boosted to RT via rt_mutex_setprio, which changes
>    the task to RT and calls check_class_changed.
> - check_class_changed leads to detach_task_cfs_rq, at which point
>    the vruntime_normalized check sees that the task's state is TASK_WAKING,
>    which results in skipping the subtraction of the rq's min_vruntime
>    from the task's vruntime
> - later, when the prio is deboosted and the task is moved back
>    to the fair class, the fair rq's min_vruntime is added to
>    the task's vruntime, even though it wasn't subtracted earlier.
> The immediate result is inflation of the task's vruntime, giving
> it lower priority (starving it if there's enough available work).
> The longer-term effect is inflation of all vruntimes because the
> task's vruntime becomes the rq's min_vruntime when the higher
> priority tasks go idle. That leads to a vicious cycle, where
> the vruntime inflation repeatedly doubled.
>
> The change here is to detect when vruntime_normalized is being
> called when the task is waking but is waking in another class,
> and to conclude that this is a case where vruntime has not
> been normalized.
>
> Signed-off-by: John Dias <joaodias@google.com>
> Signed-off-by: Steve Muckle <smuckle@google.com>
> ---
>   kernel/sched/fair.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b39fb596f6c1..14011d7929d8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9638,7 +9638,8 @@ static inline bool vruntime_normalized(struct task_struct *p)
>   	 * - A task which has been woken up by try_to_wake_up() and
>   	 *   waiting for actually being woken up by sched_ttwu_pending().
>   	 */
> -	if (!se->sum_exec_runtime || p->state == TASK_WAKING)
> +	if (!se->sum_exec_runtime ||
> +	    (p->state == TASK_WAKING && p->sched_class == &fair_sched_class))
>   		return true;
>   
>   	return false;
The normalization of vruntime used to exist in task_waking but it was 
removed and the normalization was moved into migrate_task_rq_fair. The 
reasoning being that task_waking_fair was only hit when a task is queued 
onto a different core and migrate_task_rq_fair should do the same work.

However, we're finding that there's one case which migrate_task_rq_fair 
doesn't hit: that being the case where rt_mutex_setprio changes a task's 
scheduling class to RT when its scheduled out. The task never hits 
migrate_task_rq_fair because it is switched to RT and migrates as an RT 
task. Because of this we're getting an unbounded addition of 
min_vruntime when the task is re-attached to the CFS runqueue when it 
loses the inherited priority. The patch above works because now the 
kernel specifically checks for this case and normalizes accordingly.

Here's the patch I was talking about: 
https://lore.kernel.org/patchwork/patch/677689/. In our testing we were 
seeing vruntimes nearly double every time after rt_mutex_setprio boosts 
the task to RT.

Signed-off-by: Miguel de Dios <migueldedios@google.com>
Tested-by: Miguel de Dios <migueldedios@google.com>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
  2018-08-20 23:54 ` Miguel de Dios
@ 2018-08-23 16:52   ` Dietmar Eggemann
  2018-08-24  6:54     ` Juri Lelli
  2018-08-24  9:32   ` Peter Zijlstra
  1 sibling, 1 reply; 29+ messages in thread
From: Dietmar Eggemann @ 2018-08-23 16:52 UTC (permalink / raw)
  To: Miguel de Dios, Steve Muckle, Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, kernel-team, Todd Kjos, Paul Turner,
	Quentin Perret, Patrick Bellasi, Chris Redpath, Morten Rasmussen,
	John Dias

Hi,

On 08/21/2018 01:54 AM, Miguel de Dios wrote:
> On 08/17/2018 11:27 AM, Steve Muckle wrote:
>> From: John Dias <joaodias@google.com>
>>
>> When rt_mutex_setprio changes a task's scheduling class to RT,
>> we're seeing cases where the task's vruntime is not updated
>> correctly upon return to the fair class.
>> Specifically, the following is being observed:
>> - task is deactivated while still in the fair class
>> - task is boosted to RT via rt_mutex_setprio, which changes
>>    the task to RT and calls check_class_changed.
>> - check_class_changed leads to detach_task_cfs_rq, at which point
>>    the vruntime_normalized check sees that the task's state is 
>> TASK_WAKING,
>>    which results in skipping the subtraction of the rq's min_vruntime
>>    from the task's vruntime
>> - later, when the prio is deboosted and the task is moved back
>>    to the fair class, the fair rq's min_vruntime is added to
>>    the task's vruntime, even though it wasn't subtracted earlier.
>> The immediate result is inflation of the task's vruntime, giving
>> it lower priority (starving it if there's enough available work).
>> The longer-term effect is inflation of all vruntimes because the
>> task's vruntime becomes the rq's min_vruntime when the higher
>> priority tasks go idle. That leads to a vicious cycle, where
>> the vruntime inflation repeatedly doubled.
>>
>> The change here is to detect when vruntime_normalized is being
>> called when the task is waking but is waking in another class,
>> and to conclude that this is a case where vruntime has not
>> been normalized.
>>
>> Signed-off-by: John Dias <joaodias@google.com>
>> Signed-off-by: Steve Muckle <smuckle@google.com>
>> ---
>>   kernel/sched/fair.c | 3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b39fb596f6c1..14011d7929d8 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -9638,7 +9638,8 @@ static inline bool vruntime_normalized(struct 
>> task_struct *p)
>>        * - A task which has been woken up by try_to_wake_up() and
>>        *   waiting for actually being woken up by sched_ttwu_pending().
>>        */
>> -    if (!se->sum_exec_runtime || p->state == TASK_WAKING)
>> +    if (!se->sum_exec_runtime ||
>> +        (p->state == TASK_WAKING && p->sched_class == 
>> &fair_sched_class))
>>           return true;
>>       return false;
> The normalization of vruntime used to exist in task_waking but it was 
> removed and the normalization was moved into migrate_task_rq_fair. The 
> reasoning being that task_waking_fair was only hit when a task is queued 
> onto a different core and migrate_task_rq_fair should do the same work.
> 
> However, we're finding that there's one case which migrate_task_rq_fair 
> doesn't hit: that being the case where rt_mutex_setprio changes a task's 
> scheduling class to RT when its scheduled out. The task never hits 
> migrate_task_rq_fair because it is switched to RT and migrates as an RT 
> task. Because of this we're getting an unbounded addition of 
> min_vruntime when the task is re-attached to the CFS runqueue when it 
> loses the inherited priority. The patch above works because now the 
> kernel specifically checks for this case and normalizes accordingly.
> 
> Here's the patch I was talking about: 
> https://lore.kernel.org/patchwork/patch/677689/. In our testing we were 
> seeing vruntimes nearly double every time after rt_mutex_setprio boosts 
> the task to RT.
> 
> Signed-off-by: Miguel de Dios <migueldedios@google.com>
> Tested-by: Miguel de Dios <migueldedios@google.com>

I tried to catch this issue on my Arm64 Juno board using pi_test (and a 
slightly adapted pip_test (usleep_val = 1500 and keep low as cfs)) from 
rt-tests but wasn't able to do so.

# pi_stress --inversions=1 --duration=1 --groups=1 --sched id=low,policy=cfs

Starting PI Stress Test
Number of thread groups: 1
Duration of test run: 1 seconds
Number of inversions per group: 1
      Admin thread SCHED_FIFO priority 4
1 groups of 3 threads will be created
       High thread SCHED_FIFO priority 3
        Med thread SCHED_FIFO priority 2
        Low thread SCHED_OTHER nice 0

# ./pip_stress

In both cases, the cfs task entering  rt_mutex_setprio() is queued, so 
dequeue_task_fair()->dequeue_entity(), which subtracts 
cfs_rq->min_vruntime from se->vruntime, is called on it before it gets 
the rt prio.

Maybe it requires a very specific use of the pthread library to provoke 
this issue by making sure that the cfs tasks really blocks/sleeps?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
  2018-08-23 16:52   ` Dietmar Eggemann
@ 2018-08-24  6:54     ` Juri Lelli
  2018-08-24 21:17       ` Steve Muckle
  2018-09-06 23:25       ` Dietmar Eggemann
  0 siblings, 2 replies; 29+ messages in thread
From: Juri Lelli @ 2018-08-24  6:54 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Miguel de Dios, Steve Muckle, Peter Zijlstra, Ingo Molnar,
	linux-kernel, kernel-team, Todd Kjos, Paul Turner,
	Quentin Perret, Patrick Bellasi, Chris Redpath, Morten Rasmussen,
	John Dias

On 23/08/18 18:52, Dietmar Eggemann wrote:
> Hi,
> 
> On 08/21/2018 01:54 AM, Miguel de Dios wrote:
> > On 08/17/2018 11:27 AM, Steve Muckle wrote:
> > > From: John Dias <joaodias@google.com>
> > > 
> > > When rt_mutex_setprio changes a task's scheduling class to RT,
> > > we're seeing cases where the task's vruntime is not updated
> > > correctly upon return to the fair class.
> > > Specifically, the following is being observed:
> > > - task is deactivated while still in the fair class
> > > - task is boosted to RT via rt_mutex_setprio, which changes
> > >    the task to RT and calls check_class_changed.
> > > - check_class_changed leads to detach_task_cfs_rq, at which point
> > >    the vruntime_normalized check sees that the task's state is
> > > TASK_WAKING,
> > >    which results in skipping the subtraction of the rq's min_vruntime
> > >    from the task's vruntime
> > > - later, when the prio is deboosted and the task is moved back
> > >    to the fair class, the fair rq's min_vruntime is added to
> > >    the task's vruntime, even though it wasn't subtracted earlier.
> > > The immediate result is inflation of the task's vruntime, giving
> > > it lower priority (starving it if there's enough available work).
> > > The longer-term effect is inflation of all vruntimes because the
> > > task's vruntime becomes the rq's min_vruntime when the higher
> > > priority tasks go idle. That leads to a vicious cycle, where
> > > the vruntime inflation repeatedly doubled.
> > > 
> > > The change here is to detect when vruntime_normalized is being
> > > called when the task is waking but is waking in another class,
> > > and to conclude that this is a case where vruntime has not
> > > been normalized.
> > > 
> > > Signed-off-by: John Dias <joaodias@google.com>
> > > Signed-off-by: Steve Muckle <smuckle@google.com>
> > > ---
> > >   kernel/sched/fair.c | 3 ++-
> > >   1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index b39fb596f6c1..14011d7929d8 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -9638,7 +9638,8 @@ static inline bool vruntime_normalized(struct
> > > task_struct *p)
> > >        * - A task which has been woken up by try_to_wake_up() and
> > >        *   waiting for actually being woken up by sched_ttwu_pending().
> > >        */
> > > -    if (!se->sum_exec_runtime || p->state == TASK_WAKING)
> > > +    if (!se->sum_exec_runtime ||
> > > +        (p->state == TASK_WAKING && p->sched_class ==
> > > &fair_sched_class))
> > >           return true;
> > >       return false;
> > The normalization of vruntime used to exist in task_waking but it was
> > removed and the normalization was moved into migrate_task_rq_fair. The
> > reasoning being that task_waking_fair was only hit when a task is queued
> > onto a different core and migrate_task_rq_fair should do the same work.
> > 
> > However, we're finding that there's one case which migrate_task_rq_fair
> > doesn't hit: that being the case where rt_mutex_setprio changes a task's
> > scheduling class to RT when its scheduled out. The task never hits
> > migrate_task_rq_fair because it is switched to RT and migrates as an RT
> > task. Because of this we're getting an unbounded addition of
> > min_vruntime when the task is re-attached to the CFS runqueue when it
> > loses the inherited priority. The patch above works because now the
> > kernel specifically checks for this case and normalizes accordingly.
> > 
> > Here's the patch I was talking about:
> > https://lore.kernel.org/patchwork/patch/677689/. In our testing we were
> > seeing vruntimes nearly double every time after rt_mutex_setprio boosts
> > the task to RT.
> > 
> > Signed-off-by: Miguel de Dios <migueldedios@google.com>
> > Tested-by: Miguel de Dios <migueldedios@google.com>
> 
> I tried to catch this issue on my Arm64 Juno board using pi_test (and a
> slightly adapted pip_test (usleep_val = 1500 and keep low as cfs)) from
> rt-tests but wasn't able to do so.
> 
> # pi_stress --inversions=1 --duration=1 --groups=1 --sched id=low,policy=cfs
> 
> Starting PI Stress Test
> Number of thread groups: 1
> Duration of test run: 1 seconds
> Number of inversions per group: 1
>      Admin thread SCHED_FIFO priority 4
> 1 groups of 3 threads will be created
>       High thread SCHED_FIFO priority 3
>        Med thread SCHED_FIFO priority 2
>        Low thread SCHED_OTHER nice 0
> 
> # ./pip_stress
> 
> In both cases, the cfs task entering  rt_mutex_setprio() is queued, so
> dequeue_task_fair()->dequeue_entity(), which subtracts cfs_rq->min_vruntime
> from se->vruntime, is called on it before it gets the rt prio.
> 
> Maybe it requires a very specific use of the pthread library to provoke this
> issue by making sure that the cfs tasks really blocks/sleeps?

Maybe one could play with rt-app to recreate such specific use case?

https://github.com/scheduler-tools/rt-app/blob/master/doc/tutorial.txt#L459

Best,

- Juri

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
  2018-08-20 23:54 ` Miguel de Dios
  2018-08-23 16:52   ` Dietmar Eggemann
@ 2018-08-24  9:32   ` Peter Zijlstra
  2018-08-24  9:47     ` Peter Zijlstra
  1 sibling, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2018-08-24  9:32 UTC (permalink / raw)
  To: Miguel de Dios
  Cc: Steve Muckle, Ingo Molnar, linux-kernel, kernel-team, Todd Kjos,
	Paul Turner, Quentin Perret, Patrick Bellasi, Chris Redpath,
	Morten Rasmussen, John Dias

On Mon, Aug 20, 2018 at 04:54:25PM -0700, Miguel de Dios wrote:
> On 08/17/2018 11:27 AM, Steve Muckle wrote:
> > From: John Dias <joaodias@google.com>
> > 
> > When rt_mutex_setprio changes a task's scheduling class to RT,
> > we're seeing cases where the task's vruntime is not updated
> > correctly upon return to the fair class.
> > Specifically, the following is being observed:
> > - task is deactivated while still in the fair class
> > - task is boosted to RT via rt_mutex_setprio, which changes
> >    the task to RT and calls check_class_changed.
> > - check_class_changed leads to detach_task_cfs_rq, at which point
> >    the vruntime_normalized check sees that the task's state is TASK_WAKING,
> >    which results in skipping the subtraction of the rq's min_vruntime
> >    from the task's vruntime
> > - later, when the prio is deboosted and the task is moved back
> >    to the fair class, the fair rq's min_vruntime is added to
> >    the task's vruntime, even though it wasn't subtracted earlier.
> > The immediate result is inflation of the task's vruntime, giving
> > it lower priority (starving it if there's enough available work).
> > The longer-term effect is inflation of all vruntimes because the
> > task's vruntime becomes the rq's min_vruntime when the higher
> > priority tasks go idle. That leads to a vicious cycle, where
> > the vruntime inflation repeatedly doubled.
> > 
> > The change here is to detect when vruntime_normalized is being
> > called when the task is waking but is waking in another class,
> > and to conclude that this is a case where vruntime has not
> > been normalized.
> > 
> > Signed-off-by: John Dias <joaodias@google.com>
> > Signed-off-by: Steve Muckle <smuckle@google.com>
> > ---
> >   kernel/sched/fair.c | 3 ++-
> >   1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index b39fb596f6c1..14011d7929d8 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9638,7 +9638,8 @@ static inline bool vruntime_normalized(struct task_struct *p)
> >   	 * - A task which has been woken up by try_to_wake_up() and
> >   	 *   waiting for actually being woken up by sched_ttwu_pending().
> >   	 */
> > -	if (!se->sum_exec_runtime || p->state == TASK_WAKING)
> > +	if (!se->sum_exec_runtime ||
> > +	    (p->state == TASK_WAKING && p->sched_class == &fair_sched_class))
> >   		return true;
> >   	return false;

> The normalization of vruntime used to exist in task_waking but it was
> removed and the normalization was moved into migrate_task_rq_fair. The
> reasoning being that task_waking_fair was only hit when a task is queued
> onto a different core and migrate_task_rq_fair should do the same work.
> 
> However, we're finding that there's one case which migrate_task_rq_fair
> doesn't hit: that being the case where rt_mutex_setprio changes a task's
> scheduling class to RT when its scheduled out. The task never hits
> migrate_task_rq_fair because it is switched to RT and migrates as an RT
> task. Because of this we're getting an unbounded addition of min_vruntime
> when the task is re-attached to the CFS runqueue when it loses the inherited
> priority. The patch above works because now the kernel specifically checks
> for this case and normalizes accordingly.
> 
> Here's the patch I was talking about:
> https://lore.kernel.org/patchwork/patch/677689/. In our testing we were
> seeing vruntimes nearly double every time after rt_mutex_setprio boosts the
> task to RT.

Bah, patchwork is such shit... how do you get to the previus patch from
there? Because I think 2/3 is the actual commit that changed things, 3/3
just cleans up a bit.

That would be commit:

  b5179ac70de8 ("sched/fair: Prepare to fix fairness problems on migration")

But I'm still somewhat confused; how would task_waking_fair() have
helped if we're already changed to a different class?



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
  2018-08-24  9:32   ` Peter Zijlstra
@ 2018-08-24  9:47     ` Peter Zijlstra
  2018-08-24 21:24       ` Steve Muckle
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2018-08-24  9:47 UTC (permalink / raw)
  To: Miguel de Dios
  Cc: Steve Muckle, Ingo Molnar, linux-kernel, kernel-team, Todd Kjos,
	Paul Turner, Quentin Perret, Patrick Bellasi, Chris Redpath,
	Morten Rasmussen, John Dias

> > On 08/17/2018 11:27 AM, Steve Muckle wrote:

> > > When rt_mutex_setprio changes a task's scheduling class to RT,
> > > we're seeing cases where the task's vruntime is not updated
> > > correctly upon return to the fair class.

> > > Specifically, the following is being observed:
> > > - task is deactivated while still in the fair class
> > > - task is boosted to RT via rt_mutex_setprio, which changes
> > >    the task to RT and calls check_class_changed.
> > > - check_class_changed leads to detach_task_cfs_rq, at which point
> > >    the vruntime_normalized check sees that the task's state is TASK_WAKING,
> > >    which results in skipping the subtraction of the rq's min_vruntime
> > >    from the task's vruntime
> > > - later, when the prio is deboosted and the task is moved back
> > >    to the fair class, the fair rq's min_vruntime is added to
> > >    the task's vruntime, even though it wasn't subtracted earlier.

I'm thinking that is an incomplete scenario; where do we get to
TASK_WAKING.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
  2018-08-24  6:54     ` Juri Lelli
@ 2018-08-24 21:17       ` Steve Muckle
  2018-09-06 23:25       ` Dietmar Eggemann
  1 sibling, 0 replies; 29+ messages in thread
From: Steve Muckle @ 2018-08-24 21:17 UTC (permalink / raw)
  To: Juri Lelli, Dietmar Eggemann
  Cc: Miguel de Dios, Peter Zijlstra, Ingo Molnar, linux-kernel,
	kernel-team, Todd Kjos, Paul Turner, Quentin Perret,
	Patrick Bellasi, Chris Redpath, Morten Rasmussen, John Dias

On 08/23/2018 11:54 PM, Juri Lelli wrote:
>> I tried to catch this issue on my Arm64 Juno board using pi_test (and a
>> slightly adapted pip_test (usleep_val = 1500 and keep low as cfs)) from
>> rt-tests but wasn't able to do so.
>>
>> # pi_stress --inversions=1 --duration=1 --groups=1 --sched id=low,policy=cfs
>>
>> Starting PI Stress Test
>> Number of thread groups: 1
>> Duration of test run: 1 seconds
>> Number of inversions per group: 1
>>       Admin thread SCHED_FIFO priority 4
>> 1 groups of 3 threads will be created
>>        High thread SCHED_FIFO priority 3
>>         Med thread SCHED_FIFO priority 2
>>         Low thread SCHED_OTHER nice 0
>>
>> # ./pip_stress
>>
>> In both cases, the cfs task entering  rt_mutex_setprio() is queued, so
>> dequeue_task_fair()->dequeue_entity(), which subtracts cfs_rq->min_vruntime
>> from se->vruntime, is called on it before it gets the rt prio.
>>
>> Maybe it requires a very specific use of the pthread library to provoke this
>> issue by making sure that the cfs tasks really blocks/sleeps?
> 
> Maybe one could play with rt-app to recreate such specific use case?
> 
> https://github.com/scheduler-tools/rt-app/blob/master/doc/tutorial.txt#L459

This was reproduced for me on tip of mainline by using the program at 
the end of this mail. It was run in a 2 CPU virtualbox VM. Relevant 
annotated bits of the trace:

low-prio thread vruntime is 752ms
   pi-vruntime-tes-598   [001] d...   520.572459: sched_stat_runtime: 
comm=pi-vruntime-tes pid=598 runtime=29953 [ns] vruntime=752888705 [ns]

low-prio thread waits on a_sem
   pi-vruntime-tes-598   [001] d...   520.572465: sched_switch: 
prev_comm=pi-vruntime-tes prev_pid=598 prev_prio=120 prev_state=D ==> 
next_comm=swapper/1 next_pid=0 next_prio=120

high prio thread finishes wakeup, then sleeps for 1ms
            <idle>-0     [000] dNh.   520.572483: sched_wakeup: 
comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000
            <idle>-0     [000] d...   520.572486: sched_switch: 
prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=S ==> 
next_comm=pi-vruntime-tes next_pid=597 next_prio=19
   pi-vruntime-tes-597   [000] d...   520.572498: sched_switch: 
prev_comm=pi-vruntime-tes prev_pid=597 prev_prio=19 prev_state=D ==> 
next_comm=swapper/0 next_pid=0 next_prio=120

high prio thread wakes up after 1ms sleep, posts a_sem which starts to 
wake low-prio thread, then tries to grab pi_mutex, which low-prio thread has
            <idle>-0     [000] d.h.   520.573876: sched_waking: 
comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000
            <idle>-0     [000] dNh.   520.573879: sched_wakeup: 
comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000
            <idle>-0     [000] d...   520.573887: sched_switch: 
prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=S ==> 
next_comm=pi-vruntime-tes next_pid=597 next_prio=19
   pi-vruntime-tes-597   [000] d...   520.573895: sched_waking: 
comm=pi-vruntime-tes pid=598 prio=120 target_cpu=001

low-prio thread pid 598 gets pi_mutex priority inheritance, this happens 
while low-prio thread is in waking state
   pi-vruntime-tes-597   [000] d...   520.573911: sched_pi_setprio: 
comm=pi-vruntime-tes pid=598 oldprio=120 newprio=19

high-prio thread sleeps on pi_mutex
   pi-vruntime-tes-597   [000] d...   520.573919: sched_switch: 
prev_comm=pi-vruntime-tes prev_pid=597 prev_prio=19 prev_state=D ==> 
next_comm=swapper/0 next_pid=0 next_prio=120

low-prio thread finishes wakeup
            <idle>-0     [001] dNh.   520.573932: sched_wakeup: 
comm=pi-vruntime-tes pid=598 prio=19 target_cpu=001
            <idle>-0     [001] d...   520.573936: sched_switch: 
prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=S ==> 
next_comm=pi-vruntime-tes next_pid=598 next_prio=19

low-prio thread releases pi-mutex, loses pi boost, high-prio thread 
wakes for pi-mutex
   pi-vruntime-tes-598   [001] d...   520.573946: sched_pi_setprio: 
comm=pi-vruntime-tes pid=598 oldprio=19 newprio=120
   pi-vruntime-tes-598   [001] dN..   520.573954: sched_waking: 
comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000

low-prio thread vruntime is 1505ms
   pi-vruntime-tes-598   [001] dN..   520.573966: sched_stat_runtime: 
comm=pi-vruntime-tes pid=598 runtime=20150 [ns] vruntime=1505797560 [ns]

The timing is quite sensitive since the task being boosted has to be 
caught in the TASK_WAKING state. The program:

/*
   * Test case for vruntime management during rtmutex priority inheritance
   * promotion and demotion.
   *
   * build with -lpthread
   */

#define _GNU_SOURCE
#include <pthread.h>
#include <semaphore.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

#define ERROR_CHECK(x) \
          if (x) \
                  fprintf(stderr, "Error at line %d", __LINE__);

pthread_mutex_t pi_mutex;
sem_t a_sem;
sem_t b_sem;

void *rt_thread_func(void *arg) {
          int policy;
          int i = 0;
          cpu_set_t cpuset;

          CPU_ZERO(&cpuset);
          CPU_SET(0, &cpuset);
          ERROR_CHECK(pthread_setaffinity_np(pthread_self(), 
sizeof(cpu_set_t),
                                             &cpuset));

          while(i++ < 100) {
                  sem_wait(&b_sem);
                  usleep(1000);
                  sem_post(&a_sem);
                  pthread_mutex_lock(&pi_mutex);
                  pthread_mutex_unlock(&pi_mutex);
          }
}

void *low_prio_thread_func(void *arg) {
          int i = 0;
          cpu_set_t cpuset;

          CPU_ZERO(&cpuset);
          CPU_SET(1, &cpuset);
          ERROR_CHECK(pthread_setaffinity_np(pthread_self(), 
sizeof(cpu_set_t),
                                             &cpuset));
          while(i++ < 100) {
                  pthread_mutex_lock(&pi_mutex);
                  sem_post(&b_sem);
                  sem_wait(&a_sem);
                  pthread_mutex_unlock(&pi_mutex);
          }
}

int main()
{
          pthread_t rt_thread;
          pthread_t low_prio_thread;
          pthread_attr_t rt_thread_attrs;
          pthread_attr_t low_prio_thread_attrs;
          struct sched_param rt_thread_sched_params;
          struct sched_param low_prio_thread_sched_params;

          pthread_mutexattr_t mutex_attrs;

          sem_init(&a_sem, 0, 0);
          sem_init(&b_sem, 0, 0);

          ERROR_CHECK(pthread_mutexattr_init(&mutex_attrs));
          ERROR_CHECK(pthread_mutexattr_setprotocol(&mutex_attrs,
                                                    PTHREAD_PRIO_INHERIT));
          ERROR_CHECK(pthread_mutex_init(&pi_mutex, &mutex_attrs));

          rt_thread_sched_params.sched_priority = 80;
          low_prio_thread_sched_params.sched_priority = 0;

          pthread_attr_init(&rt_thread_attrs);
          pthread_attr_init(&low_prio_thread_attrs);

          ERROR_CHECK(pthread_attr_setinheritsched(&rt_thread_attrs,
                                                   PTHREAD_EXPLICIT_SCHED));
          ERROR_CHECK(pthread_attr_setinheritsched(&low_prio_thread_attrs,
                                                   PTHREAD_EXPLICIT_SCHED));

          ERROR_CHECK(pthread_attr_setschedpolicy(&rt_thread_attrs,
                                                  SCHED_FIFO));
          ERROR_CHECK(pthread_attr_setschedpolicy(&low_prio_thread_attrs,
                                                  SCHED_OTHER));

          ERROR_CHECK(pthread_attr_setschedparam(&rt_thread_attrs,
                                                 &rt_thread_sched_params));
          ERROR_CHECK(pthread_attr_setschedparam(&low_prio_thread_attrs,
  &low_prio_thread_sched_params));

          ERROR_CHECK(pthread_create(&rt_thread, &rt_thread_attrs,
                                     rt_thread_func, NULL));
          ERROR_CHECK(pthread_create(&low_prio_thread, 
&low_prio_thread_attrs,
                                     low_prio_thread_func, NULL));

          ERROR_CHECK(pthread_join(rt_thread, NULL));
          ERROR_CHECK(pthread_join(low_prio_thread, NULL));

          return 0;
}

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
  2018-08-24  9:47     ` Peter Zijlstra
@ 2018-08-24 21:24       ` Steve Muckle
  2018-08-27 11:14         ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Steve Muckle @ 2018-08-24 21:24 UTC (permalink / raw)
  To: Peter Zijlstra, Miguel de Dios
  Cc: Ingo Molnar, linux-kernel, kernel-team, Todd Kjos, Paul Turner,
	Quentin Perret, Patrick Bellasi, Chris Redpath, Morten Rasmussen,
	John Dias

On 08/24/2018 02:47 AM, Peter Zijlstra wrote:
>>> On 08/17/2018 11:27 AM, Steve Muckle wrote:
> 
>>>> When rt_mutex_setprio changes a task's scheduling class to RT,
>>>> we're seeing cases where the task's vruntime is not updated
>>>> correctly upon return to the fair class.
> 
>>>> Specifically, the following is being observed:
>>>> - task is deactivated while still in the fair class
>>>> - task is boosted to RT via rt_mutex_setprio, which changes
>>>>     the task to RT and calls check_class_changed.
>>>> - check_class_changed leads to detach_task_cfs_rq, at which point
>>>>     the vruntime_normalized check sees that the task's state is TASK_WAKING,
>>>>     which results in skipping the subtraction of the rq's min_vruntime
>>>>     from the task's vruntime
>>>> - later, when the prio is deboosted and the task is moved back
>>>>     to the fair class, the fair rq's min_vruntime is added to
>>>>     the task's vruntime, even though it wasn't subtracted earlier.
> 
> I'm thinking that is an incomplete scenario; where do we get to
> TASK_WAKING.

Yes there's a missing bit of context here at the beginning that the task 
to be boosted had already been put into TASK_WAKING.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
  2018-08-24 21:24       ` Steve Muckle
@ 2018-08-27 11:14         ` Peter Zijlstra
  2018-08-28 14:53           ` Dietmar Eggemann
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2018-08-27 11:14 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Miguel de Dios, Ingo Molnar, linux-kernel, kernel-team,
	Todd Kjos, Paul Turner, Quentin Perret, Patrick Bellasi,
	Chris Redpath, Morten Rasmussen, John Dias

On Fri, Aug 24, 2018 at 02:24:48PM -0700, Steve Muckle wrote:
> On 08/24/2018 02:47 AM, Peter Zijlstra wrote:
> > > > On 08/17/2018 11:27 AM, Steve Muckle wrote:
> > 
> > > > > When rt_mutex_setprio changes a task's scheduling class to RT,
> > > > > we're seeing cases where the task's vruntime is not updated
> > > > > correctly upon return to the fair class.
> > 
> > > > > Specifically, the following is being observed:
> > > > > - task is deactivated while still in the fair class
> > > > > - task is boosted to RT via rt_mutex_setprio, which changes
> > > > >     the task to RT and calls check_class_changed.
> > > > > - check_class_changed leads to detach_task_cfs_rq, at which point
> > > > >     the vruntime_normalized check sees that the task's state is TASK_WAKING,
> > > > >     which results in skipping the subtraction of the rq's min_vruntime
> > > > >     from the task's vruntime
> > > > > - later, when the prio is deboosted and the task is moved back
> > > > >     to the fair class, the fair rq's min_vruntime is added to
> > > > >     the task's vruntime, even though it wasn't subtracted earlier.
> > 
> > I'm thinking that is an incomplete scenario; where do we get to
> > TASK_WAKING.
> 
> Yes there's a missing bit of context here at the beginning that the task to
> be boosted had already been put into TASK_WAKING.

See, I'm confused...

The only time TASK_WAKING is visible, is if we've done a remote wakeup
and it's 'stuck' on the remote wake_list. And in that case we've done
migrate_task_rq_fair() on it.

So by the time either rt_mutex_setprio() or __sched_setscheduler() get
to calling check_class_changed(), under both pi_lock and rq->lock, the
vruntime_normalized() thing should be right.

So please detail the exact scenario. Because I'm not seeing it.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
  2018-08-27 11:14         ` Peter Zijlstra
@ 2018-08-28 14:53           ` Dietmar Eggemann
  2018-08-29 10:54             ` Dietmar Eggemann
  2018-09-26  9:50             ` Wanpeng Li
  0 siblings, 2 replies; 29+ messages in thread
From: Dietmar Eggemann @ 2018-08-28 14:53 UTC (permalink / raw)
  To: Peter Zijlstra, Steve Muckle
  Cc: Miguel de Dios, Ingo Molnar, linux-kernel, kernel-team,
	Todd Kjos, Paul Turner, Quentin Perret, Patrick Bellasi,
	Chris Redpath, Morten Rasmussen, John Dias

On 08/27/2018 12:14 PM, Peter Zijlstra wrote:
> On Fri, Aug 24, 2018 at 02:24:48PM -0700, Steve Muckle wrote:
>> On 08/24/2018 02:47 AM, Peter Zijlstra wrote:
>>>>> On 08/17/2018 11:27 AM, Steve Muckle wrote:
>>>
>>>>>> When rt_mutex_setprio changes a task's scheduling class to RT,
>>>>>> we're seeing cases where the task's vruntime is not updated
>>>>>> correctly upon return to the fair class.
>>>
>>>>>> Specifically, the following is being observed:
>>>>>> - task is deactivated while still in the fair class
>>>>>> - task is boosted to RT via rt_mutex_setprio, which changes
>>>>>>      the task to RT and calls check_class_changed.
>>>>>> - check_class_changed leads to detach_task_cfs_rq, at which point
>>>>>>      the vruntime_normalized check sees that the task's state is TASK_WAKING,
>>>>>>      which results in skipping the subtraction of the rq's min_vruntime
>>>>>>      from the task's vruntime
>>>>>> - later, when the prio is deboosted and the task is moved back
>>>>>>      to the fair class, the fair rq's min_vruntime is added to
>>>>>>      the task's vruntime, even though it wasn't subtracted earlier.
>>>
>>> I'm thinking that is an incomplete scenario; where do we get to
>>> TASK_WAKING.
>>
>> Yes there's a missing bit of context here at the beginning that the task to
>> be boosted had already been put into TASK_WAKING.
> 
> See, I'm confused...
> 
> The only time TASK_WAKING is visible, is if we've done a remote wakeup
> and it's 'stuck' on the remote wake_list. And in that case we've done
> migrate_task_rq_fair() on it.
> 
> So by the time either rt_mutex_setprio() or __sched_setscheduler() get
> to calling check_class_changed(), under both pi_lock and rq->lock, the
> vruntime_normalized() thing should be right.
> 
> So please detail the exact scenario. Because I'm not seeing it.

Using Steve's test program (https://lkml.org/lkml/2018/8/24/686) I see the
issue but only if the two tasks (rt_task, fair_task) run on 2 cpus which
don't share LLC (e.g. CPU0 and CPU4 on hikey960).

So the wakeup goes the TTWU_QUEUE && !share_cache (ttwu_queue_remote) path.

...
rt_task-3579  [000] 35.391573: sched_waking:     comm=fair_task pid=3580 prio=120 target_cpu=004
rt_task-3579  [000] 35.391580: bprint:           try_to_wake_up: try_to_wake_up: task=fair_task pid=3580 task_cpu(p)=4 p->on_rq=0
rt_task-3579  [000] 35.391584: bprint:           try_to_wake_up: ttwu_queue: task=fair_task pid=3580
rt_task-3579  [000] 35.391588: bprint:           try_to_wake_up: ttwu_queue_remote: task=fair_task pid=3580
rt_task-3579  [000] 35.391591: bprint:           try_to_wake_up: ttwu_queue_remote: cpu=4 smp_send_reschedule
rt_task-3579  [000] 35.391627: sched_pi_setprio: comm=fair_task pid=3580 oldprio=120 newprio=19
rt_task-3579  [000] 35.391635: bprint:           rt_mutex_setprio: task=fair_task pid=3580 prio=120->19 queued=0 running=0 state=0x200 vruntime=46922871 cpu=4 cfs_rq->min_vruntime=7807420844
rt_task-3579  [000] 35.391641: bprint:           rt_mutex_setprio: p->prio set: task=fair_task pid=3580 prio=19 queued=0 running=0 state=0x200 vruntime=46922871
rt_task-3579  [000] 35.391646: bprint:           rt_mutex_setprio: queued checked: task=fair_task pid=3580 prio=19 queued=0 running=0 state=0x200 vruntime=46922871
rt_task-3579  [000] 35.391652: bprint:           rt_mutex_setprio: running checked: task=fair_task pid=3580 prio=19 queued=0 running=0 state=0x200 vruntime=46922871
rt_task-3579  [000] 35.391657: bprint:           rt_mutex_setprio: fair_class=0xffff000008da2c80 rt_class=0xffff000008da2d70 prev_class=0xffff000008da2c80 p->sched_class=0xffff000008da2d70 oldprio=120 p->prio=19
rt_task-3579  [000] 35.391661: bprint:           detach_task_cfs_rq: task=fair_task pid=3580 cpu=4 vruntime_normalized=1
rt_task-3579  [000] 35.391706: sched_switch:     rt_task:3579 [19] D ==> swapper/0:0 [120] 
 <idle>-0     [004] 35.391828: bprint:           ttwu_do_activate: ttwu_do_activate: task=fair_task pid=3580
 <idle>-0     [004] 35.391832: bprint:           ttwu_do_activate: ttwu_activate: task=fair_task pid=3580
 <idle>-0     [004] 35.391833: bprint:           ttwu_do_wakeup: ttwu_do_wakeup: task=fair_task pid=3580
 <idle>-0     [004] 35.391834: sched_wakeup:     fair_task:3580 [19] success=1 CPU:004

It doesn't happen on hikey960 when I use two cpus of the same LLC or on my
laptop (i7-4750HQ).

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
  2018-08-28 14:53           ` Dietmar Eggemann
@ 2018-08-29 10:54             ` Dietmar Eggemann
  2018-08-29 11:59               ` Peter Zijlstra
  2018-09-26  9:50             ` Wanpeng Li
  1 sibling, 1 reply; 29+ messages in thread
From: Dietmar Eggemann @ 2018-08-29 10:54 UTC (permalink / raw)
  To: Peter Zijlstra, Steve Muckle
  Cc: Miguel de Dios, Ingo Molnar, linux-kernel, kernel-team,
	Todd Kjos, Paul Turner, Quentin Perret, Patrick Bellasi,
	Chris Redpath, Morten Rasmussen, John Dias

On 08/28/2018 03:53 PM, Dietmar Eggemann wrote:
> On 08/27/2018 12:14 PM, Peter Zijlstra wrote:
>> On Fri, Aug 24, 2018 at 02:24:48PM -0700, Steve Muckle wrote:
>>> On 08/24/2018 02:47 AM, Peter Zijlstra wrote:
>>>>>> On 08/17/2018 11:27 AM, Steve Muckle wrote:
>>>>
>>>>>>> When rt_mutex_setprio changes a task's scheduling class to RT,
>>>>>>> we're seeing cases where the task's vruntime is not updated
>>>>>>> correctly upon return to the fair class.
>>>>
>>>>>>> Specifically, the following is being observed:
>>>>>>> - task is deactivated while still in the fair class
>>>>>>> - task is boosted to RT via rt_mutex_setprio, which changes
>>>>>>>       the task to RT and calls check_class_changed.
>>>>>>> - check_class_changed leads to detach_task_cfs_rq, at which point
>>>>>>>       the vruntime_normalized check sees that the task's state is TASK_WAKING,
>>>>>>>       which results in skipping the subtraction of the rq's min_vruntime
>>>>>>>       from the task's vruntime
>>>>>>> - later, when the prio is deboosted and the task is moved back
>>>>>>>       to the fair class, the fair rq's min_vruntime is added to
>>>>>>>       the task's vruntime, even though it wasn't subtracted earlier.
>>>>
>>>> I'm thinking that is an incomplete scenario; where do we get to
>>>> TASK_WAKING.
>>>
>>> Yes there's a missing bit of context here at the beginning that the task to
>>> be boosted had already been put into TASK_WAKING.
>>
>> See, I'm confused...
>>
>> The only time TASK_WAKING is visible, is if we've done a remote wakeup
>> and it's 'stuck' on the remote wake_list. And in that case we've done
>> migrate_task_rq_fair() on it.
>>
>> So by the time either rt_mutex_setprio() or __sched_setscheduler() get
>> to calling check_class_changed(), under both pi_lock and rq->lock, the
>> vruntime_normalized() thing should be right.
>>
>> So please detail the exact scenario. Because I'm not seeing it.
> 
> Using Steve's test program (https://lkml.org/lkml/2018/8/24/686) I see the
> issue but only if the two tasks (rt_task, fair_task) run on 2 cpus which
> don't share LLC (e.g. CPU0 and CPU4 on hikey960).
> 
> So the wakeup goes the TTWU_QUEUE && !share_cache (ttwu_queue_remote) path.

I forgot to mention that since fair_task's cpu affinity is restricted to 
CPU4, there is no call to set_task_cpu()->migrate_task_rq_fair() since 
if (task_cpu(p) != cpu) fails.

I think the combination of cpu affinity of the fair_task to CPU4 and the 
fact that the scheduler runs on CPU1 when waking fair_task (with the two 
cpus not sharing LLC) while TTWU_QUEUE is enabled is the situation in 
which this vruntime issue can happen.

[...]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
  2018-08-29 10:54             ` Dietmar Eggemann
@ 2018-08-29 11:59               ` Peter Zijlstra
  2018-08-29 15:33                 ` Dietmar Eggemann
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2018-08-29 11:59 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Steve Muckle, Miguel de Dios, Ingo Molnar, linux-kernel,
	kernel-team, Todd Kjos, Paul Turner, Quentin Perret,
	Patrick Bellasi, Chris Redpath, Morten Rasmussen, John Dias

On Wed, Aug 29, 2018 at 11:54:58AM +0100, Dietmar Eggemann wrote:
> I forgot to mention that since fair_task's cpu affinity is restricted to
> CPU4, there is no call to set_task_cpu()->migrate_task_rq_fair() since if
> (task_cpu(p) != cpu) fails.
>
> I think the combination of cpu affinity of the fair_task to CPU4 and the
> fact that the scheduler runs on CPU1 when waking fair_task (with the two
> cpus not sharing LLC) while TTWU_QUEUE is enabled is the situation in which
> this vruntime issue can happen.

Ohhh, D'0h. A remote wakeup that doesn't migrate.

That would suggest something like so:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b39fb596f6c1..b3b62cf37fb6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9638,7 +9638,8 @@ static inline bool vruntime_normalized(struct task_struct *p)
 	 * - A task which has been woken up by try_to_wake_up() and
 	 *   waiting for actually being woken up by sched_ttwu_pending().
 	 */
-	if (!se->sum_exec_runtime || p->state == TASK_WAKING)
+	if (!se->sum_exec_runtime ||
+	    (p->state == TASK_WAKING && p->sched_remote_wakeup))
 		return true;
 
 	return false;




^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
  2018-08-29 11:59               ` Peter Zijlstra
@ 2018-08-29 15:33                 ` Dietmar Eggemann
  2018-08-31 22:24                   ` Steve Muckle
  0 siblings, 1 reply; 29+ messages in thread
From: Dietmar Eggemann @ 2018-08-29 15:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steve Muckle, Miguel de Dios, Ingo Molnar, linux-kernel,
	kernel-team, Todd Kjos, Paul Turner, Quentin Perret,
	Patrick Bellasi, Chris Redpath, Morten Rasmussen, John Dias

On 08/29/2018 12:59 PM, Peter Zijlstra wrote:
> On Wed, Aug 29, 2018 at 11:54:58AM +0100, Dietmar Eggemann wrote:
>> I forgot to mention that since fair_task's cpu affinity is restricted to
>> CPU4, there is no call to set_task_cpu()->migrate_task_rq_fair() since if
>> (task_cpu(p) != cpu) fails.
>>
>> I think the combination of cpu affinity of the fair_task to CPU4 and the
>> fact that the scheduler runs on CPU1 when waking fair_task (with the two
>> cpus not sharing LLC) while TTWU_QUEUE is enabled is the situation in which
>> this vruntime issue can happen.
> 
> Ohhh, D'0h. A remote wakeup that doesn't migrate.

Ah, there is this WF_MIGRATED flag, perfect for the distinction whether 
a task migrated or not.

> That would suggest something like so:
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b39fb596f6c1..b3b62cf37fb6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9638,7 +9638,8 @@ static inline bool vruntime_normalized(struct task_struct *p)
>   	 * - A task which has been woken up by try_to_wake_up() and
>   	 *   waiting for actually being woken up by sched_ttwu_pending().
>   	 */
> -	if (!se->sum_exec_runtime || p->state == TASK_WAKING)
> +	if (!se->sum_exec_runtime ||
> +	    (p->state == TASK_WAKING && p->sched_remote_wakeup))
>   		return true;
>   
>   	return false;
Yes, this solves the issue for the case I described. Using 
'p->sched_remote_wakeup' (WF_MIGRATED) looks more elegant than using 
'p->sched_class == &fair_sched_class'.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
  2018-08-29 15:33                 ` Dietmar Eggemann
@ 2018-08-31 22:24                   ` Steve Muckle
  0 siblings, 0 replies; 29+ messages in thread
From: Steve Muckle @ 2018-08-31 22:24 UTC (permalink / raw)
  To: Dietmar Eggemann, Peter Zijlstra
  Cc: Miguel de Dios, Ingo Molnar, linux-kernel, kernel-team,
	Todd Kjos, Paul Turner, Quentin Perret, Patrick Bellasi,
	Chris Redpath, Morten Rasmussen, John Dias

On 08/29/2018 08:33 AM, Dietmar Eggemann wrote:
> Yes, this solves the issue for the case I described. Using
> 'p->sched_remote_wakeup' (WF_MIGRATED) looks more elegant than using
> 'p->sched_class == &fair_sched_class'.

It's confirmed that this patch solves the original issue we saw (and my 
test case passes for me as well). I will send this version out shortly.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
  2018-08-24  6:54     ` Juri Lelli
  2018-08-24 21:17       ` Steve Muckle
@ 2018-09-06 23:25       ` Dietmar Eggemann
  2018-09-07  7:16         ` Juri Lelli
  1 sibling, 1 reply; 29+ messages in thread
From: Dietmar Eggemann @ 2018-09-06 23:25 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Miguel de Dios, Steve Muckle, Peter Zijlstra, Ingo Molnar,
	linux-kernel, kernel-team, Todd Kjos, Paul Turner,
	Quentin Perret, Patrick Bellasi, Chris Redpath, Morten Rasmussen,
	John Dias

Hi Juri,

On 08/23/2018 11:54 PM, Juri Lelli wrote:
> On 23/08/18 18:52, Dietmar Eggemann wrote:
>> Hi,
>>
>> On 08/21/2018 01:54 AM, Miguel de Dios wrote:
>>> On 08/17/2018 11:27 AM, Steve Muckle wrote:
>>>> From: John Dias <joaodias@google.com>

[...]

>>
>> I tried to catch this issue on my Arm64 Juno board using pi_test (and a
>> slightly adapted pip_test (usleep_val = 1500 and keep low as cfs)) from
>> rt-tests but wasn't able to do so.
>>
>> # pi_stress --inversions=1 --duration=1 --groups=1 --sched id=low,policy=cfs
>>
>> Starting PI Stress Test
>> Number of thread groups: 1
>> Duration of test run: 1 seconds
>> Number of inversions per group: 1
>>       Admin thread SCHED_FIFO priority 4
>> 1 groups of 3 threads will be created
>>        High thread SCHED_FIFO priority 3
>>         Med thread SCHED_FIFO priority 2
>>         Low thread SCHED_OTHER nice 0
>>
>> # ./pip_stress
>>
>> In both cases, the cfs task entering  rt_mutex_setprio() is queued, so
>> dequeue_task_fair()->dequeue_entity(), which subtracts cfs_rq->min_vruntime
>> from se->vruntime, is called on it before it gets the rt prio.
>>
>> Maybe it requires a very specific use of the pthread library to provoke this
>> issue by making sure that the cfs tasks really blocks/sleeps?
> 
> Maybe one could play with rt-app to recreate such specific use case?
> 
> https://github.com/scheduler-tools/rt-app/blob/master/doc/tutorial.txt#L459

I played a little bit with rt-app on hikey960 to re-create Steve's test 
program.
Since there is no semaphore support (sem_wait(), sem_post()) I used 
condition variables (wait: pthread_cond_wait() , signal: 
pthread_cond_signal()). It's not really the same since this is stateless 
but sleeps before the signals help to maintain the state in this easy 
example.

This provokes the vruntime issue e.g. for cpus 0,4 and it doesn't for 0,1:


     "global": {
         "calibration" : 130,
	"pi_enabled" : true
     },
     "tasks": {
         "rt_task": {
	    "loop" : 100,
	    "policy" : "SCHED_FIFO",
	    "cpus" : [0],

	    "lock" : "b_mutex",
	    "wait" : { "ref" : "b_cond", "mutex" : "b_mutex" },
	    "unlock" : "b_mutex",
	    "sleep" : 3000,
	    "lock1" : "a_mutex",
	    "signal" : "a_cond",
	    "unlock1" : "a_mutex",
	    "lock2" : "pi-mutex",
	    "unlock2" : "pi-mutex"
         },
	"cfs_task": {
	    "loop" : 100,
	    "policy" : "SCHED_OTHER",
	    "cpus" : [4],

	    "lock" : "pi-mutex",
	    "sleep" : 3000,
	    "lock1" : "b_mutex",
	    "signal" : "b_cond",
	    "unlock" : "b_mutex",
	    "lock2" : "a_mutex",
	    "wait" : { "ref" : "a_cond", "mutex" : "a_mutex" },
	    "unlock1" : "a_mutex",
	    "unlock2" : "pi-mutex"
	}
     }
}

Adding semaphores is possible but rt-app has no easy way to initialize 
individual objects, e.g. sem_init(..., value). The only way I see is via 
the global section, like "pi_enabled". But then, this is true for all 
objects of this kind (in this case mutexes)?

So the following couple of lines extension to rt-app works because both 
semaphores can be initialized to 0:

  {
     "global": {
         "calibration" : 130,
	"pi_enabled" : true
     },
     "tasks": {
         "rt_task": {
	    "loop" : 100,
	    "policy" : "SCHED_FIFO",
	    "cpus" : [0],

	    "sem_wait" : "b_sem",
	    "sleep" : 1000,
	    "sem_post" : "a_sem",

	    "lock" : "pi-mutex",
	    "unlock" : "pi-mutex"
         },
	"cfs_task": {
	    "loop" : 100,
	    "policy" : "SCHED_OTHER",
	    "cpus" : [4],

	    "lock" : "pi-mutex",
	    "sleep" : 1000,
	    "sem_post" : "b_sem",
	    "sem_wait" : "a_sem",
	    "unlock" : "pi-mutex"
	}
     }
}

Any thoughts on that? I can see something like this as infrastructure to 
create a regression test case based on rt-app and standard ftrace.

[...]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
  2018-09-06 23:25       ` Dietmar Eggemann
@ 2018-09-07  7:16         ` Juri Lelli
  2018-09-07  7:58           ` Vincent Guittot
  0 siblings, 1 reply; 29+ messages in thread
From: Juri Lelli @ 2018-09-07  7:16 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Miguel de Dios, Steve Muckle, Peter Zijlstra, Ingo Molnar,
	linux-kernel, kernel-team, Todd Kjos, Paul Turner,
	Quentin Perret, Patrick Bellasi, Chris Redpath, Morten Rasmussen,
	John Dias

On 06/09/18 16:25, Dietmar Eggemann wrote:
> Hi Juri,
> 
> On 08/23/2018 11:54 PM, Juri Lelli wrote:
> > On 23/08/18 18:52, Dietmar Eggemann wrote:
> > > Hi,
> > > 
> > > On 08/21/2018 01:54 AM, Miguel de Dios wrote:
> > > > On 08/17/2018 11:27 AM, Steve Muckle wrote:
> > > > > From: John Dias <joaodias@google.com>
> 
> [...]
> 
> > > 
> > > I tried to catch this issue on my Arm64 Juno board using pi_test (and a
> > > slightly adapted pip_test (usleep_val = 1500 and keep low as cfs)) from
> > > rt-tests but wasn't able to do so.
> > > 
> > > # pi_stress --inversions=1 --duration=1 --groups=1 --sched id=low,policy=cfs
> > > 
> > > Starting PI Stress Test
> > > Number of thread groups: 1
> > > Duration of test run: 1 seconds
> > > Number of inversions per group: 1
> > >       Admin thread SCHED_FIFO priority 4
> > > 1 groups of 3 threads will be created
> > >        High thread SCHED_FIFO priority 3
> > >         Med thread SCHED_FIFO priority 2
> > >         Low thread SCHED_OTHER nice 0
> > > 
> > > # ./pip_stress
> > > 
> > > In both cases, the cfs task entering  rt_mutex_setprio() is queued, so
> > > dequeue_task_fair()->dequeue_entity(), which subtracts cfs_rq->min_vruntime
> > > from se->vruntime, is called on it before it gets the rt prio.
> > > 
> > > Maybe it requires a very specific use of the pthread library to provoke this
> > > issue by making sure that the cfs tasks really blocks/sleeps?
> > 
> > Maybe one could play with rt-app to recreate such specific use case?
> > 
> > https://github.com/scheduler-tools/rt-app/blob/master/doc/tutorial.txt#L459
> 
> I played a little bit with rt-app on hikey960 to re-create Steve's test
> program.

Oh, nice! Thanks for sharing what you have got.

> Since there is no semaphore support (sem_wait(), sem_post()) I used
> condition variables (wait: pthread_cond_wait() , signal:
> pthread_cond_signal()). It's not really the same since this is stateless but
> sleeps before the signals help to maintain the state in this easy example.
> 
> This provokes the vruntime issue e.g. for cpus 0,4 and it doesn't for 0,1:
> 
> 
>     "global": {
>         "calibration" : 130,
> 	"pi_enabled" : true
>     },
>     "tasks": {
>         "rt_task": {
> 	    "loop" : 100,
> 	    "policy" : "SCHED_FIFO",
> 	    "cpus" : [0],
> 
> 	    "lock" : "b_mutex",
> 	    "wait" : { "ref" : "b_cond", "mutex" : "b_mutex" },
> 	    "unlock" : "b_mutex",
> 	    "sleep" : 3000,
> 	    "lock1" : "a_mutex",
> 	    "signal" : "a_cond",
> 	    "unlock1" : "a_mutex",
> 	    "lock2" : "pi-mutex",
> 	    "unlock2" : "pi-mutex"
>         },
> 	"cfs_task": {
> 	    "loop" : 100,
> 	    "policy" : "SCHED_OTHER",
> 	    "cpus" : [4],
> 
> 	    "lock" : "pi-mutex",
> 	    "sleep" : 3000,
> 	    "lock1" : "b_mutex",
> 	    "signal" : "b_cond",
> 	    "unlock" : "b_mutex",
> 	    "lock2" : "a_mutex",
> 	    "wait" : { "ref" : "a_cond", "mutex" : "a_mutex" },
> 	    "unlock1" : "a_mutex",
> 	    "unlock2" : "pi-mutex"
> 	}
>     }
> }
> 
> Adding semaphores is possible but rt-app has no easy way to initialize
> individual objects, e.g. sem_init(..., value). The only way I see is via the
> global section, like "pi_enabled". But then, this is true for all objects of
> this kind (in this case mutexes)?

Right, global section should work fine. Why do you think this is a
problem/limitation?

> So the following couple of lines extension to rt-app works because both
> semaphores can be initialized to 0:
> 
>  {
>     "global": {
>         "calibration" : 130,
> 	"pi_enabled" : true
>     },
>     "tasks": {
>         "rt_task": {
> 	    "loop" : 100,
> 	    "policy" : "SCHED_FIFO",
> 	    "cpus" : [0],
> 
> 	    "sem_wait" : "b_sem",
> 	    "sleep" : 1000,
> 	    "sem_post" : "a_sem",
> 
> 	    "lock" : "pi-mutex",
> 	    "unlock" : "pi-mutex"
>         },
> 	"cfs_task": {
> 	    "loop" : 100,
> 	    "policy" : "SCHED_OTHER",
> 	    "cpus" : [4],
> 
> 	    "lock" : "pi-mutex",
> 	    "sleep" : 1000,
> 	    "sem_post" : "b_sem",
> 	    "sem_wait" : "a_sem",
> 	    "unlock" : "pi-mutex"
> 	}
>     }
> }
> 
> Any thoughts on that? I can see something like this as infrastructure to
> create a regression test case based on rt-app and standard ftrace.

Agree. I guess we should add your first example to the repo (you'd be
very welcome to create a PR) already and then work to support the second?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
  2018-09-07  7:16         ` Juri Lelli
@ 2018-09-07  7:58           ` Vincent Guittot
  2018-09-11  6:24             ` Dietmar Eggemann
  0 siblings, 1 reply; 29+ messages in thread
From: Vincent Guittot @ 2018-09-07  7:58 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Dietmar Eggemann, migueldedios, Cc: Steve Muckle, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Cc: Android Kernel, Todd Kjos,
	Paul Turner, Quentin Perret, Patrick Bellasi, Chris Redpath,
	Morten Rasmussen, John Dias

On Fri, 7 Sep 2018 at 09:16, Juri Lelli <juri.lelli@gmail.com> wrote:
>
> On 06/09/18 16:25, Dietmar Eggemann wrote:
> > Hi Juri,
> >
> > On 08/23/2018 11:54 PM, Juri Lelli wrote:
> > > On 23/08/18 18:52, Dietmar Eggemann wrote:
> > > > Hi,
> > > >
> > > > On 08/21/2018 01:54 AM, Miguel de Dios wrote:
> > > > > On 08/17/2018 11:27 AM, Steve Muckle wrote:
> > > > > > From: John Dias <joaodias@google.com>
> >
> > [...]
> >
> > > >
> > > > I tried to catch this issue on my Arm64 Juno board using pi_test (and a
> > > > slightly adapted pip_test (usleep_val = 1500 and keep low as cfs)) from
> > > > rt-tests but wasn't able to do so.
> > > >
> > > > # pi_stress --inversions=1 --duration=1 --groups=1 --sched id=low,policy=cfs
> > > >
> > > > Starting PI Stress Test
> > > > Number of thread groups: 1
> > > > Duration of test run: 1 seconds
> > > > Number of inversions per group: 1
> > > >       Admin thread SCHED_FIFO priority 4
> > > > 1 groups of 3 threads will be created
> > > >        High thread SCHED_FIFO priority 3
> > > >         Med thread SCHED_FIFO priority 2
> > > >         Low thread SCHED_OTHER nice 0
> > > >
> > > > # ./pip_stress
> > > >
> > > > In both cases, the cfs task entering  rt_mutex_setprio() is queued, so
> > > > dequeue_task_fair()->dequeue_entity(), which subtracts cfs_rq->min_vruntime
> > > > from se->vruntime, is called on it before it gets the rt prio.
> > > >
> > > > Maybe it requires a very specific use of the pthread library to provoke this
> > > > issue by making sure that the cfs tasks really blocks/sleeps?
> > >
> > > Maybe one could play with rt-app to recreate such specific use case?
> > >
> > > https://github.com/scheduler-tools/rt-app/blob/master/doc/tutorial.txt#L459
> >
> > I played a little bit with rt-app on hikey960 to re-create Steve's test
> > program.
>
> Oh, nice! Thanks for sharing what you have got.
>
> > Since there is no semaphore support (sem_wait(), sem_post()) I used
> > condition variables (wait: pthread_cond_wait() , signal:
> > pthread_cond_signal()). It's not really the same since this is stateless but
> > sleeps before the signals help to maintain the state in this easy example.
> >
> > This provokes the vruntime issue e.g. for cpus 0,4 and it doesn't for 0,1:
> >
> >
> >     "global": {
> >         "calibration" : 130,
> >       "pi_enabled" : true
> >     },
> >     "tasks": {
> >         "rt_task": {
> >           "loop" : 100,
> >           "policy" : "SCHED_FIFO",
> >           "cpus" : [0],
> >
> >           "lock" : "b_mutex",
> >           "wait" : { "ref" : "b_cond", "mutex" : "b_mutex" },
> >           "unlock" : "b_mutex",
> >           "sleep" : 3000,
> >           "lock1" : "a_mutex",
> >           "signal" : "a_cond",
> >           "unlock1" : "a_mutex",
> >           "lock2" : "pi-mutex",
> >           "unlock2" : "pi-mutex"
> >         },
> >       "cfs_task": {
> >           "loop" : 100,
> >           "policy" : "SCHED_OTHER",
> >           "cpus" : [4],
> >
> >           "lock" : "pi-mutex",
> >           "sleep" : 3000,
> >           "lock1" : "b_mutex",
> >           "signal" : "b_cond",
> >           "unlock" : "b_mutex",
> >           "lock2" : "a_mutex",
> >           "wait" : { "ref" : "a_cond", "mutex" : "a_mutex" },
> >           "unlock1" : "a_mutex",
> >           "unlock2" : "pi-mutex"
> >       }
> >     }
> > }
> >
> > Adding semaphores is possible but rt-app has no easy way to initialize
> > individual objects, e.g. sem_init(..., value). The only way I see is via the
> > global section, like "pi_enabled". But then, this is true for all objects of
> > this kind (in this case mutexes)?
>
> Right, global section should work fine. Why do you think this is a
> problem/limitation?

keep in mind that rt-app still have "ressources" section. This one is
optional and almost never used as resources can be created on the fly
but it's still there and can be used to initialize resources if needed
like semaphore

>
> > So the following couple of lines extension to rt-app works because both
> > semaphores can be initialized to 0:
> >
> >  {
> >     "global": {
> >         "calibration" : 130,
> >       "pi_enabled" : true
> >     },
> >     "tasks": {
> >         "rt_task": {
> >           "loop" : 100,
> >           "policy" : "SCHED_FIFO",
> >           "cpus" : [0],
> >
> >           "sem_wait" : "b_sem",
> >           "sleep" : 1000,
> >           "sem_post" : "a_sem",
> >
> >           "lock" : "pi-mutex",
> >           "unlock" : "pi-mutex"
> >         },
> >       "cfs_task": {
> >           "loop" : 100,
> >           "policy" : "SCHED_OTHER",
> >           "cpus" : [4],
> >
> >           "lock" : "pi-mutex",
> >           "sleep" : 1000,
> >           "sem_post" : "b_sem",
> >           "sem_wait" : "a_sem",
> >           "unlock" : "pi-mutex"
> >       }
> >     }
> > }
> >
> > Any thoughts on that? I can see something like this as infrastructure to
> > create a regression test case based on rt-app and standard ftrace.
>
> Agree. I guess we should add your first example to the repo (you'd be
> very welcome to create a PR) already and then work to support the second?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
  2018-09-07  7:58           ` Vincent Guittot
@ 2018-09-11  6:24             ` Dietmar Eggemann
  0 siblings, 0 replies; 29+ messages in thread
From: Dietmar Eggemann @ 2018-09-11  6:24 UTC (permalink / raw)
  To: Vincent Guittot, Juri Lelli
  Cc: migueldedios, Cc: Steve Muckle, Peter Zijlstra, Ingo Molnar,
	linux-kernel, Cc: Android Kernel, Todd Kjos, Paul Turner,
	Quentin Perret, Patrick Bellasi, Chris Redpath, Morten Rasmussen,
	John Dias

On 09/07/2018 12:58 AM, Vincent Guittot wrote:
> On Fri, 7 Sep 2018 at 09:16, Juri Lelli <juri.lelli@gmail.com> wrote:
>>
>> On 06/09/18 16:25, Dietmar Eggemann wrote:
>>> Hi Juri,
>>>
>>> On 08/23/2018 11:54 PM, Juri Lelli wrote:
>>>> On 23/08/18 18:52, Dietmar Eggemann wrote:
>>>>> Hi,
>>>>>
>>>>> On 08/21/2018 01:54 AM, Miguel de Dios wrote:
>>>>>> On 08/17/2018 11:27 AM, Steve Muckle wrote:
>>>>>>> From: John Dias <joaodias@google.com>

[...]

>>> Adding semaphores is possible but rt-app has no easy way to initialize
>>> individual objects, e.g. sem_init(..., value). The only way I see is via the
>>> global section, like "pi_enabled". But then, this is true for all objects of
>>> this kind (in this case mutexes)?
>>
>> Right, global section should work fine. Why do you think this is a
>> problem/limitation?
> 
> keep in mind that rt-app still have "ressources" section. This one is
> optional and almost never used as resources can be created on the fly
> but it's still there and can be used to initialize resources if needed
> like semaphore

I wasn't aware of that but this will do the job AFAICS. I just have to 
re-introduce the direct calls to init_foo_resource() (in this case 
init_sem_resource()) in init_resource_data() and call that instead of 
init_resource_data() for semaphores listed in the global resources section.

Example for a semaphore b_sem with initial value eq. 1:

     "resources" : {
         "b_sem" : { "type" : "sem_wait", "value" : 1 }
     }

[...]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
  2018-08-28 14:53           ` Dietmar Eggemann
  2018-08-29 10:54             ` Dietmar Eggemann
@ 2018-09-26  9:50             ` Wanpeng Li
  2018-09-26 22:38               ` Dietmar Eggemann
  1 sibling, 1 reply; 29+ messages in thread
From: Wanpeng Li @ 2018-09-26  9:50 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra, smuckle, migueldedios, Ingo Molnar, LKML,
	kernel-team, Todd Kjos, Paul Turner, quentin.perret,
	Patrick Bellasi, Chris.Redpath, Morten Rasmussen, joaodias,
	Wanpeng Li

Hi Dietmar,
On Tue, 28 Aug 2018 at 22:55, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 08/27/2018 12:14 PM, Peter Zijlstra wrote:
> > On Fri, Aug 24, 2018 at 02:24:48PM -0700, Steve Muckle wrote:
> >> On 08/24/2018 02:47 AM, Peter Zijlstra wrote:
> >>>>> On 08/17/2018 11:27 AM, Steve Muckle wrote:
> >>>
> >>>>>> When rt_mutex_setprio changes a task's scheduling class to RT,
> >>>>>> we're seeing cases where the task's vruntime is not updated
> >>>>>> correctly upon return to the fair class.
> >>>
> >>>>>> Specifically, the following is being observed:
> >>>>>> - task is deactivated while still in the fair class
> >>>>>> - task is boosted to RT via rt_mutex_setprio, which changes
> >>>>>>      the task to RT and calls check_class_changed.
> >>>>>> - check_class_changed leads to detach_task_cfs_rq, at which point
> >>>>>>      the vruntime_normalized check sees that the task's state is TASK_WAKING,
> >>>>>>      which results in skipping the subtraction of the rq's min_vruntime
> >>>>>>      from the task's vruntime

> >>>>>> - later, when the prio is deboosted and the task is moved back
> >>>>>>      to the fair class, the fair rq's min_vruntime is added to
> >>>>>>      the task's vruntime, even though it wasn't subtracted earlier.

Could you point out when the fair rq's min_vruntime is added to the
task's vruntime in your *later* scenario? attach_task_cfs_rq will not
do that the same reason as detach_task_cfs_rq. fair task's
sched_remote_wakeup is false which results in vruntime will not be
renormalized in enqueue_entity.

Regards,
Wanpeng Li

> >>>
> >>> I'm thinking that is an incomplete scenario; where do we get to
> >>> TASK_WAKING.
> >>
> >> Yes there's a missing bit of context here at the beginning that the task to
> >> be boosted had already been put into TASK_WAKING.
> >
> > See, I'm confused...
> >
> > The only time TASK_WAKING is visible, is if we've done a remote wakeup
> > and it's 'stuck' on the remote wake_list. And in that case we've done
> > migrate_task_rq_fair() on it.
> >
> > So by the time either rt_mutex_setprio() or __sched_setscheduler() get
> > to calling check_class_changed(), under both pi_lock and rq->lock, the
> > vruntime_normalized() thing should be right.
> >
> > So please detail the exact scenario. Because I'm not seeing it.
>
> Using Steve's test program (https://lkml.org/lkml/2018/8/24/686) I see the
> issue but only if the two tasks (rt_task, fair_task) run on 2 cpus which
> don't share LLC (e.g. CPU0 and CPU4 on hikey960).
>
> So the wakeup goes the TTWU_QUEUE && !share_cache (ttwu_queue_remote) path.
>
> ...
> rt_task-3579  [000] 35.391573: sched_waking:     comm=fair_task pid=3580 prio=120 target_cpu=004
> rt_task-3579  [000] 35.391580: bprint:           try_to_wake_up: try_to_wake_up: task=fair_task pid=3580 task_cpu(p)=4 p->on_rq=0
> rt_task-3579  [000] 35.391584: bprint:           try_to_wake_up: ttwu_queue: task=fair_task pid=3580
> rt_task-3579  [000] 35.391588: bprint:           try_to_wake_up: ttwu_queue_remote: task=fair_task pid=3580
> rt_task-3579  [000] 35.391591: bprint:           try_to_wake_up: ttwu_queue_remote: cpu=4 smp_send_reschedule
> rt_task-3579  [000] 35.391627: sched_pi_setprio: comm=fair_task pid=3580 oldprio=120 newprio=19
> rt_task-3579  [000] 35.391635: bprint:           rt_mutex_setprio: task=fair_task pid=3580 prio=120->19 queued=0 running=0 state=0x200 vruntime=46922871 cpu=4 cfs_rq->min_vruntime=7807420844
> rt_task-3579  [000] 35.391641: bprint:           rt_mutex_setprio: p->prio set: task=fair_task pid=3580 prio=19 queued=0 running=0 state=0x200 vruntime=46922871
> rt_task-3579  [000] 35.391646: bprint:           rt_mutex_setprio: queued checked: task=fair_task pid=3580 prio=19 queued=0 running=0 state=0x200 vruntime=46922871
> rt_task-3579  [000] 35.391652: bprint:           rt_mutex_setprio: running checked: task=fair_task pid=3580 prio=19 queued=0 running=0 state=0x200 vruntime=46922871
> rt_task-3579  [000] 35.391657: bprint:           rt_mutex_setprio: fair_class=0xffff000008da2c80 rt_class=0xffff000008da2d70 prev_class=0xffff000008da2c80 p->sched_class=0xffff000008da2d70 oldprio=120 p->prio=19
> rt_task-3579  [000] 35.391661: bprint:           detach_task_cfs_rq: task=fair_task pid=3580 cpu=4 vruntime_normalized=1
> rt_task-3579  [000] 35.391706: sched_switch:     rt_task:3579 [19] D ==> swapper/0:0 [120]
>  <idle>-0     [004] 35.391828: bprint:           ttwu_do_activate: ttwu_do_activate: task=fair_task pid=3580
>  <idle>-0     [004] 35.391832: bprint:           ttwu_do_activate: ttwu_activate: task=fair_task pid=3580
>  <idle>-0     [004] 35.391833: bprint:           ttwu_do_wakeup: ttwu_do_wakeup: task=fair_task pid=3580
>  <idle>-0     [004] 35.391834: sched_wakeup:     fair_task:3580 [19] success=1 CPU:004
>
> It doesn't happen on hikey960 when I use two cpus of the same LLC or on my
> laptop (i7-4750HQ).

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
  2018-09-26  9:50             ` Wanpeng Li
@ 2018-09-26 22:38               ` Dietmar Eggemann
  2018-09-27  1:19                 ` Wanpeng Li
  0 siblings, 1 reply; 29+ messages in thread
From: Dietmar Eggemann @ 2018-09-26 22:38 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Peter Zijlstra, smuckle, migueldedios, Ingo Molnar, LKML,
	kernel-team, Todd Kjos, Paul Turner, quentin.perret,
	Patrick Bellasi, Chris.Redpath, Morten Rasmussen, joaodias,
	Wanpeng Li

Hi,

On 09/26/2018 11:50 AM, Wanpeng Li wrote:
> Hi Dietmar,
> On Tue, 28 Aug 2018 at 22:55, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 08/27/2018 12:14 PM, Peter Zijlstra wrote:
>>> On Fri, Aug 24, 2018 at 02:24:48PM -0700, Steve Muckle wrote:
>>>> On 08/24/2018 02:47 AM, Peter Zijlstra wrote:
>>>>>>> On 08/17/2018 11:27 AM, Steve Muckle wrote:

[...]

>>>>>>>> - later, when the prio is deboosted and the task is moved back
>>>>>>>>       to the fair class, the fair rq's min_vruntime is added to
>>>>>>>>       the task's vruntime, even though it wasn't subtracted earlier.
> 
> Could you point out when the fair rq's min_vruntime is added to the
> task's vruntime in your *later* scenario? attach_task_cfs_rq will not
> do that the same reason as detach_task_cfs_rq. fair task's
> sched_remote_wakeup is false which results in vruntime will not be
> renormalized in enqueue_entity.

The cfs_rq->min_vruntime is still added to the se->vruntime in
enqueue_task_fair().

It's just that without this patch, which adds the '&&
p->sched_remote_wakeup' bit to the condition under which
vruntime_normalized() returns true, detach_task_cfs_rq() won't go into the
'if (!vruntime_normalized(p))' path and not subtract cfs_rq->min_vruntime
from se->vruntime. 

Since 'task_cpu(p) equal cpu' in try_to_wake_up() for the fair task,
WF_MIGRATED is not set and set_task_cpu() -> migrate_task_rq_fair()
is not called which could subtract cfs_rq->min_vruntime from
se->vruntime as well.

My former example with a different set of trace events:

fair_task-3580  [004]    35.389346: sched_stat_runtime:   comm=fair_task pid=3580 runtime=45312 [ns] vruntime=46922871 [ns] <-- se->vruntime=46.922.871
...
  rt_task-3579  [000]    35.391573: sched_waking:         comm=fair_task pid=3580 prio=120 target_cpu=004
...
  rt_task-3579  [000]    35.391627: sched_pi_setprio:     comm=fair_task pid=3580 oldprio=120 newprio=19
...
  rt_task-3579  [000]    35.391661: bprint:               detach_task_cfs_rq: task=fair_task pid=3580 cpu=4 vruntime_normalized=1
  rt_task-3579  [000]    35.391706: sched_switch:         rt_task:3579 [19] D ==> swapper/0:0 [120]
   <idle>-0     [004]    35.391834: sched_wakeup:         fair_task:3580 [19] success=1 CPU:004
   <idle>-0     [004]    35.391840: sched_switch:         swapper/4:0 [120] S ==> fair_task:3580 [19]
fair_task-3580  [004]    35.391853: sched_pi_setprio:     comm=fair_task pid=3580 oldprio=19 newprio=120
...
fair_task-3580  [004]    35.391863: bprint:               enqueue_task_fair: task=fair_task pid=3580 curr=0 se->vruntime=93845742 cpu=4 cfs_rq->min_vruntime=46922871
...
fair_task-3580  [004]    35.391877: sched_waking:         comm=rt_task pid=3579 prio=19 target_cpu=000
...
fair_task-3580  [004]    35.391885: sched_stat_runtime:   comm=fair_task pid=3580 runtime=31250 [ns] vruntime=93876992 [ns] <-- se->vruntime=93.876.992

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
  2018-09-26 22:38               ` Dietmar Eggemann
@ 2018-09-27  1:19                 ` Wanpeng Li
  2018-09-27 13:22                   ` Dietmar Eggemann
  2018-09-28 16:43                   ` Joel Fernandes
  0 siblings, 2 replies; 29+ messages in thread
From: Wanpeng Li @ 2018-09-27  1:19 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra, Steve Muckle, Miguel de Dios, Ingo Molnar, LKML,
	kernel-team, Todd Kjos, Paul Turner, quentin.perret,
	Patrick Bellasi, Chris.Redpath, Morten Rasmussen, John Dias,
	Wanpeng Li

On Thu, 27 Sep 2018 at 06:38, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> Hi,
>
> On 09/26/2018 11:50 AM, Wanpeng Li wrote:
> > Hi Dietmar,
> > On Tue, 28 Aug 2018 at 22:55, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >>
> >> On 08/27/2018 12:14 PM, Peter Zijlstra wrote:
> >>> On Fri, Aug 24, 2018 at 02:24:48PM -0700, Steve Muckle wrote:
> >>>> On 08/24/2018 02:47 AM, Peter Zijlstra wrote:
> >>>>>>> On 08/17/2018 11:27 AM, Steve Muckle wrote:
>
> [...]
>
> >>>>>>>> - later, when the prio is deboosted and the task is moved back
> >>>>>>>>       to the fair class, the fair rq's min_vruntime is added to
> >>>>>>>>       the task's vruntime, even though it wasn't subtracted earlier.
> >
> > Could you point out when the fair rq's min_vruntime is added to the
> > task's vruntime in your *later* scenario? attach_task_cfs_rq will not
> > do that the same reason as detach_task_cfs_rq. fair task's
> > sched_remote_wakeup is false which results in vruntime will not be
> > renormalized in enqueue_entity.
>
> The cfs_rq->min_vruntime is still added to the se->vruntime in
> enqueue_task_fair().

I understand what your patch done,

On your CPU4:
scheduler_ipi()
 -> sched_ttwu_pending()
      -> ttwu_do_activate()    => p->sched_remote_wakeup should be
false, so ENQUEUE_WAKEUP is set, ENQUEUE_MIGRATED is not
           -> ttwu_activate()
                -> activate_task()
                     -> enqueue_task()
                          -> enqueue_task_fair()
                               -> enqueue_entity()
                                    bool renorm = !(flags &
ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATE)
so renorm is false in enqueue_entity(), why you mentioned that the
cfs_rq->min_vruntime is still added to the se->vruntime in
enqueue_task_fair()?

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
  2018-09-27  1:19                 ` Wanpeng Li
@ 2018-09-27 13:22                   ` Dietmar Eggemann
  2018-09-28  0:43                     ` Wanpeng Li
  2018-09-28 16:43                   ` Joel Fernandes
  1 sibling, 1 reply; 29+ messages in thread
From: Dietmar Eggemann @ 2018-09-27 13:22 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Peter Zijlstra, Steve Muckle, Miguel de Dios, Ingo Molnar, LKML,
	kernel-team, Todd Kjos, Paul Turner, quentin.perret,
	Patrick Bellasi, Chris.Redpath, Morten Rasmussen, John Dias,
	Wanpeng Li

On 09/27/2018 03:19 AM, Wanpeng Li wrote:
> On Thu, 27 Sep 2018 at 06:38, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> Hi,
>>
>> On 09/26/2018 11:50 AM, Wanpeng Li wrote:
>>> Hi Dietmar,
>>> On Tue, 28 Aug 2018 at 22:55, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>>
>>>> On 08/27/2018 12:14 PM, Peter Zijlstra wrote:
>>>>> On Fri, Aug 24, 2018 at 02:24:48PM -0700, Steve Muckle wrote:
>>>>>> On 08/24/2018 02:47 AM, Peter Zijlstra wrote:
>>>>>>>>> On 08/17/2018 11:27 AM, Steve Muckle wrote:
>>
>> [...]
>>
>>>>>>>>>> - later, when the prio is deboosted and the task is moved back
>>>>>>>>>>        to the fair class, the fair rq's min_vruntime is added to
>>>>>>>>>>        the task's vruntime, even though it wasn't subtracted earlier.
>>>
>>> Could you point out when the fair rq's min_vruntime is added to the
>>> task's vruntime in your *later* scenario? attach_task_cfs_rq will not
>>> do that the same reason as detach_task_cfs_rq. fair task's
>>> sched_remote_wakeup is false which results in vruntime will not be
>>> renormalized in enqueue_entity.
>>
>> The cfs_rq->min_vruntime is still added to the se->vruntime in
>> enqueue_task_fair().
> 
> I understand what your patch done,

It's not my patch ;-) I just helped to find out under which 
circumstances this issue can happen.

> On your CPU4:
> scheduler_ipi()
>   -> sched_ttwu_pending()
>        -> ttwu_do_activate()    => p->sched_remote_wakeup should be
> false, so ENQUEUE_WAKEUP is set, ENQUEUE_MIGRATED is not
>             -> ttwu_activate()
>                  -> activate_task()
>                       -> enqueue_task()
>                            -> enqueue_task_fair()
>                                 -> enqueue_entity()
>                                      bool renorm = !(flags &
> ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATE)
> so renorm is false in enqueue_entity(), why you mentioned that the
> cfs_rq->min_vruntime is still added to the se->vruntime in
> enqueue_task_fair()?

Maybe this is a misunderstanding on my side but didn't you asked me to 
'... Could you point out when the fair rq's min_vruntime is added to the 
task's vruntime in your *later* scenario? ...'

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
  2018-09-27 13:22                   ` Dietmar Eggemann
@ 2018-09-28  0:43                     ` Wanpeng Li
  2018-09-28 16:10                       ` Steve Muckle
  2018-09-28 17:11                       ` Dietmar Eggemann
  0 siblings, 2 replies; 29+ messages in thread
From: Wanpeng Li @ 2018-09-28  0:43 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra, Steve Muckle, Miguel de Dios, Ingo Molnar, LKML,
	kernel-team, Todd Kjos, Paul Turner, quentin.perret,
	Patrick Bellasi, Chris.Redpath, Morten Rasmussen, John Dias,
	Wanpeng Li

On Thu, 27 Sep 2018 at 21:23, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 09/27/2018 03:19 AM, Wanpeng Li wrote:
> > On Thu, 27 Sep 2018 at 06:38, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >>
> >> Hi,
> >>
> >> On 09/26/2018 11:50 AM, Wanpeng Li wrote:
> >>> Hi Dietmar,
> >>> On Tue, 28 Aug 2018 at 22:55, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >>>>
> >>>> On 08/27/2018 12:14 PM, Peter Zijlstra wrote:
> >>>>> On Fri, Aug 24, 2018 at 02:24:48PM -0700, Steve Muckle wrote:
> >>>>>> On 08/24/2018 02:47 AM, Peter Zijlstra wrote:
> >>>>>>>>> On 08/17/2018 11:27 AM, Steve Muckle wrote:
> >>
> >> [...]
> >>
> >>>>>>>>>> - later, when the prio is deboosted and the task is moved back
> >>>>>>>>>>        to the fair class, the fair rq's min_vruntime is added to
> >>>>>>>>>>        the task's vruntime, even though it wasn't subtracted earlier.
> >>>
> >>> Could you point out when the fair rq's min_vruntime is added to the
> >>> task's vruntime in your *later* scenario? attach_task_cfs_rq will not
> >>> do that the same reason as detach_task_cfs_rq. fair task's
> >>> sched_remote_wakeup is false which results in vruntime will not be
> >>> renormalized in enqueue_entity.
> >>
> >> The cfs_rq->min_vruntime is still added to the se->vruntime in
> >> enqueue_task_fair().
> >
> > I understand what your patch done,
>
> It's not my patch ;-) I just helped to find out under which
> circumstances this issue can happen.
>
> > On your CPU4:
> > scheduler_ipi()
> >   -> sched_ttwu_pending()
> >        -> ttwu_do_activate()    => p->sched_remote_wakeup should be
> > false, so ENQUEUE_WAKEUP is set, ENQUEUE_MIGRATED is not
> >             -> ttwu_activate()
> >                  -> activate_task()
> >                       -> enqueue_task()
> >                            -> enqueue_task_fair()
> >                                 -> enqueue_entity()
> >                                      bool renorm = !(flags &
> > ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATE)
> > so renorm is false in enqueue_entity(), why you mentioned that the
> > cfs_rq->min_vruntime is still added to the se->vruntime in
> > enqueue_task_fair()?
>
> Maybe this is a misunderstanding on my side but didn't you asked me to
> '... Could you point out when the fair rq's min_vruntime is added to the
> task's vruntime in your *later* scenario? ...'

Yeah, if the calltrace above and my analysis is correct, then the fair
rq's min_vruntime will not be added to the task's vruntime in your
*later* scenario, which means that your patch is not necessary.

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
  2018-09-28  0:43                     ` Wanpeng Li
@ 2018-09-28 16:10                       ` Steve Muckle
  2018-09-28 16:45                         ` Joel Fernandes
  2018-09-28 17:35                         ` Dietmar Eggemann
  2018-09-28 17:11                       ` Dietmar Eggemann
  1 sibling, 2 replies; 29+ messages in thread
From: Steve Muckle @ 2018-09-28 16:10 UTC (permalink / raw)
  To: Wanpeng Li, Dietmar Eggemann
  Cc: Peter Zijlstra, Miguel de Dios, Ingo Molnar, LKML, kernel-team,
	Todd Kjos, Paul Turner, quentin.perret, Patrick Bellasi,
	Chris.Redpath, Morten Rasmussen, John Dias, Wanpeng Li

On 09/27/2018 05:43 PM, Wanpeng Li wrote:
>>> On your CPU4:
>>> scheduler_ipi()
>>>    -> sched_ttwu_pending()
>>>         -> ttwu_do_activate()    => p->sched_remote_wakeup should be
>>> false, so ENQUEUE_WAKEUP is set, ENQUEUE_MIGRATED is not
>>>              -> ttwu_activate()
>>>                   -> activate_task()
>>>                        -> enqueue_task()
>>>                             -> enqueue_task_fair()
>>>                                  -> enqueue_entity()
>>>                                       bool renorm = !(flags &
>>> ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATE)
>>> so renorm is false in enqueue_entity(), why you mentioned that the
>>> cfs_rq->min_vruntime is still added to the se->vruntime in
>>> enqueue_task_fair()?
>>
>> Maybe this is a misunderstanding on my side but didn't you asked me to
>> '... Could you point out when the fair rq's min_vruntime is added to the
>> task's vruntime in your *later* scenario? ...'
> 
> Yeah, if the calltrace above and my analysis is correct, then the fair
> rq's min_vruntime will not be added to the task's vruntime in your
> *later* scenario, which means that your patch is not necessary.

In the scenario I observed, the task is not waking - it is running and 
being deboosted from priority inheritance, transitioning from RT to CFS.

Dietmar and I both were able to reproduce the issue with the testcase I 
posted earlier in this thread.

thanks,
Steve

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
  2018-09-27  1:19                 ` Wanpeng Li
  2018-09-27 13:22                   ` Dietmar Eggemann
@ 2018-09-28 16:43                   ` Joel Fernandes
  1 sibling, 0 replies; 29+ messages in thread
From: Joel Fernandes @ 2018-09-28 16:43 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Dietmar Eggemann, Peter Zijlstra, Steve Muckle, Miguel de Dios,
	Ingo Molnar, LKML, Cc: Android Kernel, Todd Kjos, Paul Turner,
	Quentin Perret, Patrick Bellasi, Chris Redpath, Morten Rasmussen,
	John Dias, Wanpeng Li

On Wed, Sep 26, 2018 at 6:19 PM, Wanpeng Li <kernellwp@gmail.com> wrote:
> On Thu, 27 Sep 2018 at 06:38, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> Hi,
>>
>> On 09/26/2018 11:50 AM, Wanpeng Li wrote:
>> > Hi Dietmar,
>> > On Tue, 28 Aug 2018 at 22:55, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>> >>
>> >> On 08/27/2018 12:14 PM, Peter Zijlstra wrote:
>> >>> On Fri, Aug 24, 2018 at 02:24:48PM -0700, Steve Muckle wrote:
>> >>>> On 08/24/2018 02:47 AM, Peter Zijlstra wrote:
>> >>>>>>> On 08/17/2018 11:27 AM, Steve Muckle wrote:
>>
>> [...]
>>
>> >>>>>>>> - later, when the prio is deboosted and the task is moved back
>> >>>>>>>>       to the fair class, the fair rq's min_vruntime is added to
>> >>>>>>>>       the task's vruntime, even though it wasn't subtracted earlier.
>> >
>> > Could you point out when the fair rq's min_vruntime is added to the
>> > task's vruntime in your *later* scenario? attach_task_cfs_rq will not
>> > do that the same reason as detach_task_cfs_rq. fair task's
>> > sched_remote_wakeup is false which results in vruntime will not be
>> > renormalized in enqueue_entity.
>>
>> The cfs_rq->min_vruntime is still added to the se->vruntime in
>> enqueue_task_fair().
>
> I understand what your patch done,
>
> On your CPU4:
> scheduler_ipi()
>  -> sched_ttwu_pending()
>       -> ttwu_do_activate()    => p->sched_remote_wakeup should be
> false, so ENQUEUE_WAKEUP is set, ENQUEUE_MIGRATED is not
>            -> ttwu_activate()
>                 -> activate_task()
>                      -> enqueue_task()
>                           -> enqueue_task_fair()
>                                -> enqueue_entity()
>                                     bool renorm = !(flags &
> ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATE)
> so renorm is false in enqueue_entity(), why you mentioned that the
> cfs_rq->min_vruntime is still added to the se->vruntime in
> enqueue_task_fair()?

If I understand John's original patch correctly, the additional
vruntime is added when the class of the waking task is changed during
priority deboost so that path is different from the one you listed
above? Perhaps you should be looking at rt_mutex_setprio?

 - Joel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
  2018-09-28 16:10                       ` Steve Muckle
@ 2018-09-28 16:45                         ` Joel Fernandes
  2018-09-28 17:35                         ` Dietmar Eggemann
  1 sibling, 0 replies; 29+ messages in thread
From: Joel Fernandes @ 2018-09-28 16:45 UTC (permalink / raw)
  To: Steve Muckle
  Cc: Wanpeng Li, Dietmar Eggemann, Peter Zijlstra, Miguel de Dios,
	Ingo Molnar, LKML, Cc: Android Kernel, Todd Kjos, Paul Turner,
	Quentin Perret, Patrick Bellasi, Chris Redpath, Morten Rasmussen,
	John Dias, Wanpeng Li

On Fri, Sep 28, 2018 at 9:10 AM, 'Steve Muckle' via kernel-team
<kernel-team@android.com> wrote:
> On 09/27/2018 05:43 PM, Wanpeng Li wrote:
>>>>
>>>> On your CPU4:
>>>> scheduler_ipi()
>>>>    -> sched_ttwu_pending()
>>>>         -> ttwu_do_activate()    => p->sched_remote_wakeup should be
>>>> false, so ENQUEUE_WAKEUP is set, ENQUEUE_MIGRATED is not
>>>>              -> ttwu_activate()
>>>>                   -> activate_task()
>>>>                        -> enqueue_task()
>>>>                             -> enqueue_task_fair()
>>>>                                  -> enqueue_entity()
>>>>                                       bool renorm = !(flags &
>>>> ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATE)
>>>> so renorm is false in enqueue_entity(), why you mentioned that the
>>>> cfs_rq->min_vruntime is still added to the se->vruntime in
>>>> enqueue_task_fair()?
>>>
>>>
>>> Maybe this is a misunderstanding on my side but didn't you asked me to
>>> '... Could you point out when the fair rq's min_vruntime is added to the
>>> task's vruntime in your *later* scenario? ...'
>>
>>
>> Yeah, if the calltrace above and my analysis is correct, then the fair
>> rq's min_vruntime will not be added to the task's vruntime in your
>> *later* scenario, which means that your patch is not necessary.
>
>
> In the scenario I observed, the task is not waking - it is running and being
> deboosted from priority inheritance, transitioning from RT to CFS.
>
> Dietmar and I both were able to reproduce the issue with the testcase I
> posted earlier in this thread.

Can this issue still show up say if the wake up was not remote? Say
the task was locally awakened. In that case do we still need to check
the class in vruntime_normalized like John was doing? Just want to
make sure we caught all scenarios.

- Joel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
  2018-09-28  0:43                     ` Wanpeng Li
  2018-09-28 16:10                       ` Steve Muckle
@ 2018-09-28 17:11                       ` Dietmar Eggemann
  1 sibling, 0 replies; 29+ messages in thread
From: Dietmar Eggemann @ 2018-09-28 17:11 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Peter Zijlstra, Steve Muckle, Miguel de Dios, Ingo Molnar, LKML,
	kernel-team, Todd Kjos, Paul Turner, quentin.perret,
	Patrick Bellasi, Chris.Redpath, Morten Rasmussen, John Dias,
	Wanpeng Li

On 09/28/2018 02:43 AM, Wanpeng Li wrote:
> On Thu, 27 Sep 2018 at 21:23, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 09/27/2018 03:19 AM, Wanpeng Li wrote:
>>> On Thu, 27 Sep 2018 at 06:38, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> On 09/26/2018 11:50 AM, Wanpeng Li wrote:
>>>>> Hi Dietmar,
>>>>> On Tue, 28 Aug 2018 at 22:55, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>>>>
>>>>>> On 08/27/2018 12:14 PM, Peter Zijlstra wrote:
>>>>>>> On Fri, Aug 24, 2018 at 02:24:48PM -0700, Steve Muckle wrote:
>>>>>>>> On 08/24/2018 02:47 AM, Peter Zijlstra wrote:
>>>>>>>>>>> On 08/17/2018 11:27 AM, Steve Muckle wrote:
>>>>
>>>> [...]
>>>>
>>>>>>>>>>>> - later, when the prio is deboosted and the task is moved back
>>>>>>>>>>>>         to the fair class, the fair rq's min_vruntime is added to
>>>>>>>>>>>>         the task's vruntime, even though it wasn't subtracted earlier.
>>>>>
>>>>> Could you point out when the fair rq's min_vruntime is added to the
>>>>> task's vruntime in your *later* scenario? attach_task_cfs_rq will not
>>>>> do that the same reason as detach_task_cfs_rq. fair task's
>>>>> sched_remote_wakeup is false which results in vruntime will not be
>>>>> renormalized in enqueue_entity.
>>>>
>>>> The cfs_rq->min_vruntime is still added to the se->vruntime in
>>>> enqueue_task_fair().
>>>
>>> I understand what your patch done,
>>
>> It's not my patch ;-) I just helped to find out under which
>> circumstances this issue can happen.
>>
>>> On your CPU4:
>>> scheduler_ipi()
>>>    -> sched_ttwu_pending()
>>>         -> ttwu_do_activate()    => p->sched_remote_wakeup should be
>>> false, so ENQUEUE_WAKEUP is set, ENQUEUE_MIGRATED is not
>>>              -> ttwu_activate()
>>>                   -> activate_task()
>>>                        -> enqueue_task()
>>>                             -> enqueue_task_fair()
>>>                                  -> enqueue_entity()
>>>                                       bool renorm = !(flags &
>>> ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATE)
>>> so renorm is false in enqueue_entity(), why you mentioned that the
>>> cfs_rq->min_vruntime is still added to the se->vruntime in
>>> enqueue_task_fair()?
>>
>> Maybe this is a misunderstanding on my side but didn't you asked me to
>> '... Could you point out when the fair rq's min_vruntime is added to the
>> task's vruntime in your *later* scenario? ...'
> 
> Yeah, if the calltrace above and my analysis is correct, then the fair
> rq's min_vruntime will not be added to the task's vruntime in your
> *later* scenario, which means that your patch is not necessary.

Ah, ok, both, the ENQUEUE_WAKEUP and the ENQUEUE_MIGRATED are not set in 
the enqueue_entity() call so renorm is 1 (flags is 0xe).





^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
  2018-09-28 16:10                       ` Steve Muckle
  2018-09-28 16:45                         ` Joel Fernandes
@ 2018-09-28 17:35                         ` Dietmar Eggemann
  2018-09-29  1:07                           ` Wanpeng Li
  1 sibling, 1 reply; 29+ messages in thread
From: Dietmar Eggemann @ 2018-09-28 17:35 UTC (permalink / raw)
  To: Steve Muckle, Wanpeng Li
  Cc: Peter Zijlstra, Miguel de Dios, Ingo Molnar, LKML, kernel-team,
	Todd Kjos, Paul Turner, quentin.perret, Patrick Bellasi,
	Chris.Redpath, Morten Rasmussen, John Dias, Wanpeng Li

On 09/28/2018 06:10 PM, Steve Muckle wrote:
> On 09/27/2018 05:43 PM, Wanpeng Li wrote:
>>>> On your CPU4:
>>>> scheduler_ipi()
>>>>    -> sched_ttwu_pending()
>>>>         -> ttwu_do_activate()    => p->sched_remote_wakeup should be
>>>> false, so ENQUEUE_WAKEUP is set, ENQUEUE_MIGRATED is not
>>>>              -> ttwu_activate()
>>>>                   -> activate_task()
>>>>                        -> enqueue_task()
>>>>                             -> enqueue_task_fair()
>>>>                                  -> enqueue_entity()
>>>>                                       bool renorm = !(flags &
>>>> ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATE)
>>>> so renorm is false in enqueue_entity(), why you mentioned that the
>>>> cfs_rq->min_vruntime is still added to the se->vruntime in
>>>> enqueue_task_fair()?
>>>
>>> Maybe this is a misunderstanding on my side but didn't you asked me to
>>> '... Could you point out when the fair rq's min_vruntime is added to the
>>> task's vruntime in your *later* scenario? ...'
>>
>> Yeah, if the calltrace above and my analysis is correct, then the fair
>> rq's min_vruntime will not be added to the task's vruntime in your
>> *later* scenario, which means that your patch is not necessary.
> 
> In the scenario I observed, the task is not waking - it is running and 
> being deboosted from priority inheritance, transitioning from RT to CFS.
> 
> Dietmar and I both were able to reproduce the issue with the testcase I 
> posted earlier in this thread.

Correct, and with the same testcase I got this call stack in this scenario:

[   35.588509] CPU: 1 PID: 2926 Comm: fair_task Not tainted 
4.18.0-rc6-00052-g11b7dafa2edb-dirty #5
[   35.597217] Hardware name: ARM Juno development board (r0) (DT)
[   35.603080] Call trace:
[   35.605509]  dump_backtrace+0x0/0x168
[   35.609138]  show_stack+0x24/0x30
[   35.612424]  dump_stack+0xac/0xe4
[   35.615710]  enqueue_task_fair+0xae0/0x11c0
[   35.619854]  rt_mutex_setprio+0x5a0/0x628
[   35.623827]  mark_wakeup_next_waiter+0x7c/0xc8
[   35.628228]  __rt_mutex_futex_unlock+0x30/0x50
[   35.632630]  do_futex+0x74c/0xb28
[   35.635912]  sys_futex+0x118/0x198
[   35.639280]  el0_svc_naked+0x30/0x34

[...]


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] sched/fair: vruntime should normalize when switching from fair
  2018-09-28 17:35                         ` Dietmar Eggemann
@ 2018-09-29  1:07                           ` Wanpeng Li
  0 siblings, 0 replies; 29+ messages in thread
From: Wanpeng Li @ 2018-09-29  1:07 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Steve Muckle, Peter Zijlstra, Miguel de Dios, Ingo Molnar, LKML,
	kernel-team, Todd Kjos, Paul Turner, quentin.perret,
	Patrick Bellasi, Chris.Redpath, Morten Rasmussen, John Dias,
	Wanpeng Li

On Sat, 29 Sep 2018 at 01:36, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 09/28/2018 06:10 PM, Steve Muckle wrote:
> > On 09/27/2018 05:43 PM, Wanpeng Li wrote:
> >>>> On your CPU4:
> >>>> scheduler_ipi()
> >>>>    -> sched_ttwu_pending()
> >>>>         -> ttwu_do_activate()    => p->sched_remote_wakeup should be
> >>>> false, so ENQUEUE_WAKEUP is set, ENQUEUE_MIGRATED is not
> >>>>              -> ttwu_activate()
> >>>>                   -> activate_task()
> >>>>                        -> enqueue_task()
> >>>>                             -> enqueue_task_fair()
> >>>>                                  -> enqueue_entity()
> >>>>                                       bool renorm = !(flags &
> >>>> ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATE)
> >>>> so renorm is false in enqueue_entity(), why you mentioned that the
> >>>> cfs_rq->min_vruntime is still added to the se->vruntime in
> >>>> enqueue_task_fair()?
> >>>
> >>> Maybe this is a misunderstanding on my side but didn't you asked me to
> >>> '... Could you point out when the fair rq's min_vruntime is added to the
> >>> task's vruntime in your *later* scenario? ...'
> >>
> >> Yeah, if the calltrace above and my analysis is correct, then the fair
> >> rq's min_vruntime will not be added to the task's vruntime in your
> >> *later* scenario, which means that your patch is not necessary.
> >
> > In the scenario I observed, the task is not waking - it is running and
> > being deboosted from priority inheritance, transitioning from RT to CFS.
> >
> > Dietmar and I both were able to reproduce the issue with the testcase I
> > posted earlier in this thread.
>
> Correct, and with the same testcase I got this call stack in this scenario:
>
> [   35.588509] CPU: 1 PID: 2926 Comm: fair_task Not tainted
> 4.18.0-rc6-00052-g11b7dafa2edb-dirty #5
> [   35.597217] Hardware name: ARM Juno development board (r0) (DT)
> [   35.603080] Call trace:
> [   35.605509]  dump_backtrace+0x0/0x168
> [   35.609138]  show_stack+0x24/0x30
> [   35.612424]  dump_stack+0xac/0xe4
> [   35.615710]  enqueue_task_fair+0xae0/0x11c0
> [   35.619854]  rt_mutex_setprio+0x5a0/0x628
> [   35.623827]  mark_wakeup_next_waiter+0x7c/0xc8
> [   35.628228]  __rt_mutex_futex_unlock+0x30/0x50
> [   35.632630]  do_futex+0x74c/0xb28
> [   35.635912]  sys_futex+0x118/0x198
> [   35.639280]  el0_svc_naked+0x30/0x34

Thanks for pointing out. :)

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2018-09-29  1:07 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-08-17 18:27 [PATCH] sched/fair: vruntime should normalize when switching from fair Steve Muckle
2018-08-20 23:54 ` Miguel de Dios
2018-08-23 16:52   ` Dietmar Eggemann
2018-08-24  6:54     ` Juri Lelli
2018-08-24 21:17       ` Steve Muckle
2018-09-06 23:25       ` Dietmar Eggemann
2018-09-07  7:16         ` Juri Lelli
2018-09-07  7:58           ` Vincent Guittot
2018-09-11  6:24             ` Dietmar Eggemann
2018-08-24  9:32   ` Peter Zijlstra
2018-08-24  9:47     ` Peter Zijlstra
2018-08-24 21:24       ` Steve Muckle
2018-08-27 11:14         ` Peter Zijlstra
2018-08-28 14:53           ` Dietmar Eggemann
2018-08-29 10:54             ` Dietmar Eggemann
2018-08-29 11:59               ` Peter Zijlstra
2018-08-29 15:33                 ` Dietmar Eggemann
2018-08-31 22:24                   ` Steve Muckle
2018-09-26  9:50             ` Wanpeng Li
2018-09-26 22:38               ` Dietmar Eggemann
2018-09-27  1:19                 ` Wanpeng Li
2018-09-27 13:22                   ` Dietmar Eggemann
2018-09-28  0:43                     ` Wanpeng Li
2018-09-28 16:10                       ` Steve Muckle
2018-09-28 16:45                         ` Joel Fernandes
2018-09-28 17:35                         ` Dietmar Eggemann
2018-09-29  1:07                           ` Wanpeng Li
2018-09-28 17:11                       ` Dietmar Eggemann
2018-09-28 16:43                   ` Joel Fernandes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).