* [RFC][PATCH] sched: attach extra runtime to the right avg
@ 2017-06-30  1:56 josef
  2017-07-02  9:37 ` Ingo Molnar
  2017-07-03  7:26 ` Vincent Guittot
  0 siblings, 2 replies; 12+ messages in thread
From: josef @ 2017-06-30  1:56 UTC (permalink / raw)
  To: mingo, peterz, linux-kernel, kernel-team; +Cc: Josef Bacik

From: Josef Bacik <jbacik@fb.com>

We only track the load avg of a se in 1024 ns chunks, so in order to
make up for the loss of the < 1024 ns part of a run/sleep delta we only
add the time we processed to the se->avg.last_update_time.  The problem
is there is no way to know if this extra time was while we were asleep
or while we were running.  Instead keep track of the remainder and apply
it in the appropriate place.  If the remainder was while we were
running, add it to the delta the next time we update the load avg while
running, and the same for sleeping.  This (coupled with other fixes)
mostly fixes the regression to my workload introduced by Peter's
experimental runnable load propagation patches.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 include/linux/sched.h |  2 ++
 kernel/sched/fair.c   | 21 ++++++++++++++++++---
 2 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2b69fc6..ed147d4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -316,6 +316,8 @@ struct sched_avg {
 	u64				load_sum;
 	u32				util_sum;
 	u32				period_contrib;
+	u32				sleep_remainder;
+	u32				run_remainder;
 	unsigned long			load_avg;
 	unsigned long			util_avg;
 };
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c77e4b1..573bdcd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -813,6 +813,8 @@ void post_init_entity_util_avg(struct sched_entity *se)
 			 * expected state.
 			 */
 			se->avg.last_update_time = cfs_rq_clock_task(cfs_rq);
+			se->avg.sleep_remainder = 0;
+			se->avg.run_remainder = 0;
 			return;
 		}
 	}
@@ -2882,14 +2884,19 @@ ___update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
 {
 	u64 delta;
+	u32 remainder = running ? sa->run_remainder : sa->sleep_remainder;
 
-	delta = now - sa->last_update_time;
+	delta = now - sa->last_update_time + remainder;
 	/*
 	 * This should only happen when time goes backwards, which it
 	 * unfortunately does during sched clock init when we swap over to TSC.
 	 */
 	if ((s64)delta < 0) {
 		sa->last_update_time = now;
+		if (running)
+			sa->run_remainder = 0;
+		else
+			sa->sleep_remainder = 0;
 		return 0;
 	}
 
@@ -2897,12 +2904,16 @@ ___update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 	 * Use 1024ns as the unit of measurement since it's a reasonable
 	 * approximation of 1us and fast to compute.
 	 */
+	remainder = delta & (1023UL);
+	sa->last_update_time = now;
+	if (running)
+		sa->run_remainder = remainder;
+	else
+		sa->sleep_remainder = remainder;
 	delta >>= 10;
 	if (!delta)
 		return 0;
 
-	sa->last_update_time += delta << 10;
-
 	/*
 	 * Now we know we crossed measurement unit boundaries. The *_avg
 	 * accrues by two steps:
@@ -6102,6 +6113,8 @@ static void migrate_task_rq_fair(struct task_struct *p)
 
 	/* Tell new CPU we are migrated */
 	p->se.avg.last_update_time = 0;
+	p->se.avg.sleep_remainder = 0;
+	p->se.avg.run_remainder = 0;
 
 	/* We have migrated, no longer consider this task hot */
 	p->se.exec_start = 0;
@@ -9259,6 +9272,8 @@ static void task_move_group_fair(struct task_struct *p)
 #ifdef CONFIG_SMP
 	/* Tell se's cfs_rq has been changed -- migrated */
 	p->se.avg.last_update_time = 0;
+	p->se.avg.sleep_remainder = 0;
+	p->se.avg.run_remainder = 0;
 #endif
 	attach_task_cfs_rq(p);
 }
-- 
2.7.4

* Re: [RFC][PATCH] sched: attach extra runtime to the right avg
  2017-06-30  1:56 [RFC][PATCH] sched: attach extra runtime to the right avg josef
@ 2017-07-02  9:37 ` Ingo Molnar
  2017-07-03 13:30   ` Josef Bacik
  2017-07-04  9:41   ` Peter Zijlstra
  2017-07-03  7:26 ` Vincent Guittot
  1 sibling, 2 replies; 12+ messages in thread
From: Ingo Molnar @ 2017-07-02  9:37 UTC (permalink / raw)
  To: josef; +Cc: mingo, peterz, linux-kernel, kernel-team, Josef Bacik

* josef@toxicpanda.com <josef@toxicpanda.com> wrote:

> From: Josef Bacik <jbacik@fb.com>
> 
> We only track the load avg of a se in 1024 ns chunks, so in order to
> make up for the loss of the < 1024 ns part of a run/sleep delta we only
> add the time we processed to the se->avg.last_update_time.  The problem
> is there is no way to know if this extra time was while we were asleep
> or while we were running.  Instead keep track of the remainder and apply
> it in the appropriate place.  If the remainder was while we were
> running, add it to the delta the next time we update the load avg while
> running, and the same for sleeping.  This (coupled with other fixes)
> mostly fixes the regression to my workload introduced by Peter's
> experimental runnable load propagation patches.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>

> @@ -2897,12 +2904,16 @@ ___update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>  	 * Use 1024ns as the unit of measurement since it's a reasonable
>  	 * approximation of 1us and fast to compute.
>  	 */
> +	remainder = delta & (1023UL);
> +	sa->last_update_time = now;
> +	if (running)
> +		sa->run_remainder = remainder;
> +	else
> +		sa->sleep_remainder = remainder;
>  	delta >>= 10;
>  	if (!delta)
>  		return 0;
>  
> -	sa->last_update_time += delta << 10;
> -

So I'm wondering, this chunk changes how sa->last_update_time is maintained in
___update_load_avg(): the new code takes a precise timestamp, while the old code
did not take a timestamp at all, but updated it via deltas - where each
delta was rounded down to the nearest 1024 nsecs boundary.

That, if this is the main code path that updates ->last_update_time, creates a 
constant drift of rounding error that skews ->last_update_time into larger and 
larger distances from the real 'now' - ever increasing the value of 'delta'.

An intermediate approach to improve that skew would be something like below. It 
doesn't track the remainder like your patch does, but doesn't lose precision 
either, just rounds down 'now' to the nearest 1024 boundary.

Does this fix the regression you observed as well? Totally untested.

Thanks,

	Ingo

 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 008c514dc241..b03703cd7989 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2965,7 +2965,7 @@ ___update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 	if (!delta)
 		return 0;
 
-	sa->last_update_time += delta << 10;
+	sa->last_update_time = now & ~1023ULL;
 
 	/*
 	 * Now we know we crossed measurement unit boundaries. The *_avg

* Re: [RFC][PATCH] sched: attach extra runtime to the right avg
  2017-06-30  1:56 [RFC][PATCH] sched: attach extra runtime to the right avg josef
  2017-07-02  9:37 ` Ingo Molnar
@ 2017-07-03  7:26 ` Vincent Guittot
  2017-07-03 13:41   ` Josef Bacik
  1 sibling, 1 reply; 12+ messages in thread
From: Vincent Guittot @ 2017-07-03  7:26 UTC (permalink / raw)
  To: josef; +Cc: mingo, Peter Zijlstra, linux-kernel, kernel-team, Josef Bacik

Hi Josef,

On 30 June 2017 at 03:56,  <josef@toxicpanda.com> wrote:
> From: Josef Bacik <jbacik@fb.com>
>
> We only track the load avg of a se in 1024 ns chunks, so in order to
> make up for the loss of the < 1024 ns part of a run/sleep delta we only
> add the time we processed to the se->avg.last_update_time.  The problem
> is there is no way to know if this extra time was while we were asleep
> or while we were running.  Instead keep track of the remainder and apply
> it in the appropriate place.  If the remainder was while we were
> running, add it to the delta the next time we update the load avg while
> running, and the same for sleeping.  This (coupled with other fixes)
> mostly fixes the regression to my workload introduced by Peter's
> experimental runnable load propagation patches.

IIUC, your workload is sensitive to the fact that the min granularity
of the load tracking is 1us?
The contribution seems too small to have a real impact on the load_avg.
Maybe rounding last_update_time to the closest 1024ns boundary instead
of rounding down would be enough? We would get 512ns precision.

Have you got details about your use case that needs this
sub-microsecond precision?
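
For reference, here is a small stand-alone user-space sketch (not kernel code;
the helper names and sample timestamps are made up) contrasting the current
round-down policy with the round-to-nearest policy suggested above, whose
error is bounded by 512 ns:

#include <stdint.h>
#include <stdio.h>

/* current policy: round 'now' down to the 1024 ns boundary */
static uint64_t round_down_1024(uint64_t now)
{
        return now & ~1023ULL;
}

/* suggested policy: round 'now' to the nearest 1024 ns boundary */
static uint64_t round_nearest_1024(uint64_t now)
{
        return (now + 512) & ~1023ULL;
}

int main(void)
{
        uint64_t samples[] = { 614, 1843, 3481, 5734, 6348 };

        for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
                uint64_t now = samples[i];

                printf("now=%llu  down=%llu (err %lld)  nearest=%llu (err %lld)\n",
                       (unsigned long long)now,
                       (unsigned long long)round_down_1024(now),
                       (long long)now - (long long)round_down_1024(now),
                       (unsigned long long)round_nearest_1024(now),
                       (long long)now - (long long)round_nearest_1024(now));
        }
        return 0;
}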

>
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
>  include/linux/sched.h |  2 ++
>  kernel/sched/fair.c   | 21 ++++++++++++++++++---
>  2 files changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 2b69fc6..ed147d4 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -316,6 +316,8 @@ struct sched_avg {
>         u64                             load_sum;
>         u32                             util_sum;
>         u32                             period_contrib;
> +       u32                             sleep_remainder;
> +       u32                             run_remainder;
>         unsigned long                   load_avg;
>         unsigned long                   util_avg;
>  };
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c77e4b1..573bdcd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -813,6 +813,8 @@ void post_init_entity_util_avg(struct sched_entity *se)
>                          * expected state.
>                          */
>                         se->avg.last_update_time = cfs_rq_clock_task(cfs_rq);
> +                       se->avg.sleep_remainder = 0;
> +                       se->avg.run_remainder = 0;
>                         return;
>                 }
>         }
> @@ -2882,14 +2884,19 @@ ___update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>                   unsigned long weight, int running, struct cfs_rq *cfs_rq)
>  {
>         u64 delta;
> +       u32 remainder = running ? sa->run_remainder : sa->sleep_remainder;
>
> -       delta = now - sa->last_update_time;
> +       delta = now - sa->last_update_time + remainder;

So this remainder comes from the previous phase of the same kind (the
previous running phase or the previous not-running phase).
But that previous phase can be quite old, so you may need to decay the
remainder before applying it.
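
Purely to illustrate that aging concern, a toy user-space model (the numbers,
the saved-remainder age and the crude halving helper are all made up; the
kernel's decay_load() works on ~1 ms periods with a 32-period half-life)
showing how little of a remainder would survive if it is only applied 100 ms
after it was recorded:

#include <stdint.h>
#include <stdio.h>

#define PELT_PERIOD_NS  (1024ULL * 1024ULL)     /* ~1 ms per PELT period */
#define PELT_HALFLIFE   32                      /* periods per half-life */

/* crude stand-in for the kernel's decay_load(): halve every 32 periods */
static uint64_t toy_decay(uint64_t val, uint64_t periods)
{
        uint64_t shift = periods / PELT_HALFLIFE;

        return shift >= 64 ? 0 : val >> shift;
}

int main(void)
{
        uint64_t remainder = 900;                       /* saved remainder, < 1024 ns */
        uint64_t age_ns = 100ULL * 1000 * 1000;         /* applied 100 ms later */
        uint64_t periods = age_ns / PELT_PERIOD_NS;

        printf("remainder %llu ns, %llu periods old -> ~%llu ns after decay\n",
               (unsigned long long)remainder,
               (unsigned long long)periods,
               (unsigned long long)toy_decay(remainder, periods));
        return 0;
}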

>         /*
>          * This should only happen when time goes backwards, which it
>          * unfortunately does during sched clock init when we swap over to TSC.
>          */
>         if ((s64)delta < 0) {
>                 sa->last_update_time = now;
> +               if (running)
> +                       sa->run_remainder = 0;
> +               else
> +                       sa->sleep_remainder = 0;
>                 return 0;
>         }
>
> @@ -2897,12 +2904,16 @@ ___update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>          * Use 1024ns as the unit of measurement since it's a reasonable
>          * approximation of 1us and fast to compute.
>          */
> +       remainder = delta & (1023UL);
> +       sa->last_update_time = now;

This reverts commit bb0bd044e65c ("sched/fair: Increase PELT accuracy for small tasks").

> +       if (running)
> +               sa->run_remainder = remainder;
> +       else
> +               sa->sleep_remainder = remainder;
>         delta >>= 10;
>         if (!delta)
>                 return 0;
>
> -       sa->last_update_time += delta << 10;
> -
>         /*
>          * Now we know we crossed measurement unit boundaries. The *_avg
>          * accrues by two steps:
> @@ -6102,6 +6113,8 @@ static void migrate_task_rq_fair(struct task_struct *p)
>
>         /* Tell new CPU we are migrated */
>         p->se.avg.last_update_time = 0;
> +       p->se.avg.sleep_remainder = 0;
> +       p->se.avg.run_remainder = 0;
>
>         /* We have migrated, no longer consider this task hot */
>         p->se.exec_start = 0;
> @@ -9259,6 +9272,8 @@ static void task_move_group_fair(struct task_struct *p)
>  #ifdef CONFIG_SMP
>         /* Tell se's cfs_rq has been changed -- migrated */
>         p->se.avg.last_update_time = 0;
> +       p->se.avg.sleep_remainder = 0;
> +       p->se.avg.run_remainder = 0;
>  #endif
>         attach_task_cfs_rq(p);
>  }
> --
> 2.7.4
>

* Re: [RFC][PATCH] sched: attach extra runtime to the right avg
  2017-07-02  9:37 ` Ingo Molnar
@ 2017-07-03 13:30   ` Josef Bacik
  2017-07-04  9:41   ` Peter Zijlstra
  1 sibling, 0 replies; 12+ messages in thread
From: Josef Bacik @ 2017-07-03 13:30 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: josef, mingo, peterz, linux-kernel, kernel-team, Josef Bacik

On Sun, Jul 02, 2017 at 11:37:18AM +0200, Ingo Molnar wrote:
> * josef@toxicpanda.com <josef@toxicpanda.com> wrote:
> 
> > From: Josef Bacik <jbacik@fb.com>
> > 
> > We only track the load avg of a se in 1024 ns chunks, so in order to
> > make up for the loss of the < 1024 ns part of a run/sleep delta we only
> > add the time we processed to the se->avg.last_update_time.  The problem
> > is there is no way to know if this extra time was while we were asleep
> > or while we were running.  Instead keep track of the remainder and apply
> > it in the appropriate place.  If the remainder was while we were
> > running, add it to the delta the next time we update the load avg while
> > running, and the same for sleeping.  This (coupled with other fixes)
> > mostly fixes the regression to my workload introduced by Peter's
> > experimental runnable load propagation patches.
> > 
> > Signed-off-by: Josef Bacik <jbacik@fb.com>
> 
> > @@ -2897,12 +2904,16 @@ ___update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> >  	 * Use 1024ns as the unit of measurement since it's a reasonable
> >  	 * approximation of 1us and fast to compute.
> >  	 */
> > +	remainder = delta & (1023UL);
> > +	sa->last_update_time = now;
> > +	if (running)
> > +		sa->run_remainder = remainder;
> > +	else
> > +		sa->sleep_remainder = remainder;
> >  	delta >>= 10;
> >  	if (!delta)
> >  		return 0;
> >  
> > -	sa->last_update_time += delta << 10;
> > -
> 
> So I'm wondering, this chunk changes how sa->last_update_time is maintained in
> ___update_load_avg(): the new code takes a precise timestamp, while the old code
> did not take a timestamp at all, but updated it via deltas - where each
> delta was rounded down to the nearest 1024 nsecs boundary.
> 
> That, if this is the main code path that updates ->last_update_time, creates a 
> constant drift of rounding error that skews ->last_update_time into larger and 
> larger distances from the real 'now' - ever increasing the value of 'delta'.
> 
> An intermediate approach to improve that skew would be something like below. It 
> doesn't track the remainder like your patch does, but doesn't lose precision 
> either, just rounds down 'now' to the nearest 1024 boundary.
> 
> Does this fix the regression you observed as well? Totally untested.
>

Yup this fixes my problem as well, I'm good with this if you prefer this
approach.  Thanks,

Josef 

* Re: [RFC][PATCH] sched: attach extra runtime to the right avg
  2017-07-03  7:26 ` Vincent Guittot
@ 2017-07-03 13:41   ` Josef Bacik
  0 siblings, 0 replies; 12+ messages in thread
From: Josef Bacik @ 2017-07-03 13:41 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: josef, mingo, Peter Zijlstra, linux-kernel, kernel-team, Josef Bacik

On Mon, Jul 03, 2017 at 09:26:10AM +0200, Vincent Guittot wrote:
> Hi Josef,
> 
> On 30 June 2017 at 03:56,  <josef@toxicpanda.com> wrote:
> > From: Josef Bacik <jbacik@fb.com>
> >
> > We only track the load avg of a se in 1024 ns chunks, so in order to
> > make up for the loss of the < 1024 ns part of a run/sleep delta we only
> > add the time we processed to the se->avg.last_update_time.  The problem
> > is there is no way to know if this extra time was while we were asleep
> > or while we were running.  Instead keep track of the remainder and apply
> > it in the appropriate place.  If the remainder was while we were
> > running, add it to the delta the next time we update the load avg while
> > running, and the same for sleeping.  This (coupled with other fixes)
> > mostly fixes the regression to my workload introduced by Peter's
> > experimental runnable load propagation patches.
> 
> IIUC, your workload is sensitive to the fact that the min granularity
> of the load tracking is 1us?
> The contribution seems too small to have a real impact on the load_avg.
> Maybe rounding last_update_time to the closest 1024ns boundary instead
> of rounding down would be enough? We would get 512ns precision.
> 
> Have you got details about your use case that needs this
> sub-microsecond precision?
>

Yup here's the artificial reproducer

https://github.com/josefbacik/debug-scripts/tree/master/unbalanced-reproducer

The problem is we put two sets of tasks in two different cgroups that have equal
weight.  One group is a cpu hog; it's never taken off the runqueue as it
never sleeps.  The other is a process that does actual work; the reproducer has
an rt-app config file that is a rough analog of the real workload.  This one goes
to sleep and wakes up repeatedly.  The task that goes to sleep and wakes up will
end up with about 75% of the CPU time the cpu hog ends up with.  But this patch is
only 1/3 of the solution.  I'm on top of peterz's sched/experimental branch plus
some fixes to fix the regression those patches introduce to my workload.

This patch is needed because the 'interactive' tasks will slowly lose load
average, which means that every time they go onto the cpu they contribute less
and less to the load of the cpu and thus screw up the load balancing.  With this
fix and all of my other fixes in place I get an even 50-50 split between the two
groups.

Note this is only for two groups with disparate levels of interactivity.  If I
put two instances of my sample workload in two different groups everything works
out fine; the same goes for two cpu hogs in the two groups.  We only see
this huge difference if one group is on the CPU more, thus losing less load
average over time.  Thanks,

Josef

* Re: [RFC][PATCH] sched: attach extra runtime to the right avg
  2017-07-02  9:37 ` Ingo Molnar
  2017-07-03 13:30   ` Josef Bacik
@ 2017-07-04  9:41   ` Peter Zijlstra
  2017-07-04 10:13     ` Ingo Molnar
  1 sibling, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2017-07-04  9:41 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: josef, mingo, linux-kernel, kernel-team, Josef Bacik

On Sun, Jul 02, 2017 at 11:37:18AM +0200, Ingo Molnar wrote:
> * josef@toxicpanda.com <josef@toxicpanda.com> wrote:
> 
> > From: Josef Bacik <jbacik@fb.com>
> > 
> > We only track the load avg of a se in 1024 ns chunks, so in order to
> > make up for the loss of the < 1024 ns part of a run/sleep delta we only
> > add the time we processed to the se->avg.last_update_time.  The problem
> > is there is no way to know if this extra time was while we were asleep
> > or while we were running.  Instead keep track of the remainder and apply
> > it in the appropriate place.  If the remainder was while we were
> > running, add it to the delta the next time we update the load avg while
> > running, and the same for sleeping.  This (coupled with other fixes)
> > mostly fixes the regression to my workload introduced by Peter's
> > experimental runnable load propagation patches.
> > 
> > Signed-off-by: Josef Bacik <jbacik@fb.com>
> 
> > @@ -2897,12 +2904,16 @@ ___update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> >  	 * Use 1024ns as the unit of measurement since it's a reasonable
> >  	 * approximation of 1us and fast to compute.
> >  	 */
> > +	remainder = delta & (1023UL);
> > +	sa->last_update_time = now;
> > +	if (running)
> > +		sa->run_remainder = remainder;
> > +	else
> > +		sa->sleep_remainder = remainder;
> >  	delta >>= 10;
> >  	if (!delta)
> >  		return 0;
> >  
> > -	sa->last_update_time += delta << 10;
> > -
> 
> So I'm wondering, this chunk changes how sa->last_update_time is maintained in
> ___update_load_avg(): the new code takes a precise timestamp, while the old code
> did not take a timestamp at all, but updated it via deltas - where each
> delta was rounded down to the nearest 1024 nsecs boundary.

Right..

> That, if this is the main code path that updates ->last_update_time, creates a 
> constant drift of rounding error that skews ->last_update_time into larger and 
> larger distances from the real 'now' - ever increasing the value of 'delta'.

Well, it's a 0-sum. It doesn't drift unbounded. The difference will grow
up to 1023, at which point we'll account for it whole and we're back to
0.

The problem is that there are two states: running and blocked. And the
current scheme does not differentiate. We'll accrue the sub-block and
spill it into whatever state gets lucky.

Now, on average you'd hope that that works out and both running and
blocked get an equal number of spills pro-rata.

But apparently this isn't quite working out for Josef.
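
A toy user-space model of that 'gets lucky' attribution (the 300/900 ns phase
lengths are invented, this is not the kernel code path): with these particular
numbers every 1024 ns boundary happens to be crossed while sleeping, so the
running state never gets credited at all:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint64_t now = 0, last = 0;
        uint64_t real[2] = { 0, 0 }, acct[2] = { 0, 0 };        /* [1] running, [0] sleeping */
        const uint64_t phase_len[2] = { 900, 300 };             /* sleep 900 ns, run 300 ns */
        int running = 1;

        for (int i = 0; i < 10; i++) {
                uint64_t delta;

                now += phase_len[running];
                real[running] += phase_len[running];

                /* account whole 1024 ns chunks against the current state */
                delta = (now - last) >> 10;
                acct[running] += delta << 10;
                last += delta << 10;            /* differential update keeps the sub-chunk rest */

                running = !running;
        }

        printf("running:  real=%llu ns  accounted=%llu ns\n",
               (unsigned long long)real[1], (unsigned long long)acct[1]);
        printf("sleeping: real=%llu ns  accounted=%llu ns\n",
               (unsigned long long)real[0], (unsigned long long)acct[0]);
        return 0;
}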

> An intermediate approach to improve that skew would be something like below. It 
> doesn't track the remainder like your patch does, but doesn't lose precision 
> either, just rounds down 'now' to the nearest 1024 boundary.


> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 008c514dc241..b03703cd7989 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2965,7 +2965,7 @@ ___update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>  	if (!delta)
>  		return 0;
>  
> -	sa->last_update_time += delta << 10;
> +	sa->last_update_time = now & ~1023ULL;
>  

So if we have a task that always runs <1024ns it should still get blocks
of runtime because the difference between now and now&~1023 can be !0
and spill.

I'm just not immediately seeing how it's different from the 0-sum we had.
It should be identical since delta*1024 would equally land us on those
same edges (there's an offset in the differential form between the two,
but since we start with last_update_time=0, the resulting edges are the
same afaict).


*confused*

* Re: [RFC][PATCH] sched: attach extra runtime to the right avg
  2017-07-04  9:41   ` Peter Zijlstra
@ 2017-07-04 10:13     ` Ingo Molnar
  2017-07-04 12:21       ` Peter Zijlstra
  0 siblings, 1 reply; 12+ messages in thread
From: Ingo Molnar @ 2017-07-04 10:13 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: josef, mingo, linux-kernel, kernel-team, Josef Bacik


* Peter Zijlstra <peterz@infradead.org> wrote:

> > An intermediate approach to improve that skew would be something like below. 
> > It doesn't track the remainder like your patch does, but doesn't lose 
> > precision either, just rounds down 'now' to the nearest 1024 boundary.
> 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 008c514dc241..b03703cd7989 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -2965,7 +2965,7 @@ ___update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> >  	if (!delta)
> >  		return 0;
> >  
> > -	sa->last_update_time += delta << 10;
> > +	sa->last_update_time = now & ~1023ULL;
> >  
> 
> So if we have a task that always runs <1024ns it should still get blocks
> of runtime because the difference between now and now&~1023 can be !0
> and spill.

Agreed - in the first approximation I was trying to figure out why Josef was 
seeing an effect from the patch.

> I'm just not immediately seeing how it's different from the 0-sum we had.
> It should be identical since delta*1024 would equally land us on those
> same edges (there's an offset in the differential form between the two,
> but since we start with last_update_time=0, the resulting edges are the
> same afaict).

So I think the difference is that this:

	sa->last_update_time = now & ~1023ULL;

is tracking the absolute value of 'now' (i.e. rq->clock in most cases) by and 
large, with a 1024 ns imprecision.

This code on the other hand:

	sa->last_update_time += delta << 10;

... in essence creates a whole new absolute clock value that slowly but surely is 
drifting away from the real rq->clock, because 'delta' is always rounded down to 
the nearest 1024 ns boundary, so we accumulate the 'remainder' losses.

That is because:

        delta >>= 10;
	...
        sa->last_update_time += delta << 10;

Given enough time, ->last_update_time can drift a long way, and this delta:

	delta = now - sa->last_update_time;

... becomes meaningless AFAICS, because it's essentially two different clocks that 
get compared.

But I might be super confused about this myself ...

Thanks,

	Ingo

* Re: [RFC][PATCH] sched: attach extra runtime to the right avg
  2017-07-04 10:13     ` Ingo Molnar
@ 2017-07-04 12:21       ` Peter Zijlstra
  2017-07-04 12:40         ` Peter Zijlstra
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2017-07-04 12:21 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: josef, mingo, linux-kernel, kernel-team, Josef Bacik

On Tue, Jul 04, 2017 at 12:13:09PM +0200, Ingo Molnar wrote:
> 
> This code on the other hand:
> 
> 	sa->last_update_time += delta << 10;
> 
> ... in essence creates a whole new absolute clock value that slowly but surely is 
> drifting away from the real rq->clock, because 'delta' is always rounded down to 
> the nearest 1024 ns boundary, so we accumulate the 'remainder' losses.
> 
> That is because:
> 
>         delta >>= 10;
> 	...
>         sa->last_update_time += delta << 10;
> 
> Given enough time, ->last_update_time can drift a long way, and this delta:
> 
> 	delta = now - sa->last_update_time;
> 
> ... becomes meaningless AFAICS, because it's essentially two different clocks that 
> get compared.

Thing is, once you drift over 1023 (ns) your delta increases and you
catch up again.



 A  B     C       D          E  F
 |  |     |       |          |  |
 +----+----+----+----+----+----+----+----+----+----+----+


A: now = 0
   sa->last_update_time = 0
   delta := (now - sa->last_update_time) >> 10 = 0

B: now = 614				(+614)
   delta = (614 - 0) >> 10 = 0
   sa->last_update_time += 0		(0)
   sa->last_update_time = now & ~1023	(0)

C: now = 1843				(+1229)
   delta = (1843 - 0) >> 10 = 1
   sa->last_update_time += 1024		(1024)
   sa->last_update_time = now & ~1023	(1024)


D: now = 3481				(+1638)
   delta = (3481 - 1024) >> 10 = 2
   sa->last_update_time += 2048		(3072)
   sa->last_update_time = now & ~1023	(3072)

E: now = 5734				(+2253)
   delta = (5734 - 3072) >> 10 = 2
   sa->last_update_time += 2048		(5120)
   sa->last_update_time = now & ~1023	(5120)

F: now = 6348				(+614)
   delta = (6348 - 5120) >> 10 = 1
   sa->last_update_time += 1024		(6144)
   sa->last_update_time = now & ~1023	(6144)



And you'll see that both are identical, and that both D and F have
gotten a spill from sub-chunk accounting.
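
For completeness, a self-contained user-space sketch of the walk-through above
(same timestamps, both last_update_time variants starting from 0), which prints
identical values for the differential and the absolute form at every step:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint64_t now_values[] = { 0, 614, 1843, 3481, 5734, 6348 };
        uint64_t lut_diff = 0;  /* differential: lut += delta << 10 */
        uint64_t lut_abs = 0;   /* absolute:     lut  = now & ~1023 */

        for (size_t i = 0; i < sizeof(now_values) / sizeof(now_values[0]); i++) {
                uint64_t now = now_values[i];
                uint64_t delta = (now - lut_diff) >> 10;

                lut_diff += delta << 10;

                if ((now - lut_abs) >> 10)
                        lut_abs = now & ~1023ULL;

                printf("now=%4llu  differential=%4llu  absolute=%4llu\n",
                       (unsigned long long)now,
                       (unsigned long long)lut_diff,
                       (unsigned long long)lut_abs);
        }
        return 0;
}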

* Re: [RFC][PATCH] sched: attach extra runtime to the right avg
  2017-07-04 12:21       ` Peter Zijlstra
@ 2017-07-04 12:40         ` Peter Zijlstra
  2017-07-04 12:47           ` Josef Bacik
  2017-07-04 13:51           ` Josef Bacik
  0 siblings, 2 replies; 12+ messages in thread
From: Peter Zijlstra @ 2017-07-04 12:40 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: josef, mingo, linux-kernel, kernel-team, Josef Bacik

On Tue, Jul 04, 2017 at 02:21:50PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 04, 2017 at 12:13:09PM +0200, Ingo Molnar wrote:
> > 
> > This code on the other hand:
> > 
> > 	sa->last_update_time += delta << 10;
> > 
> > ... in essence creates a whole new absolute clock value that slowly but surely is 
> > drifting away from the real rq->clock, because 'delta' is always rounded down to 
> > the nearest 1024 ns boundary, so we accumulate the 'remainder' losses.
> > 
> > That is because:
> > 
> >         delta >>= 10;
> > 	...
> >         sa->last_update_time += delta << 10;
> > 
> > Given enough time, ->last_update_time can drift a long way, and this delta:
> > 
> > 	delta = now - sa->last_update_time;
> > 
> > ... becomes meaningless AFAICS, because it's essentially two different clocks that 
> > get compared.
> 
> Thing is, once you drift over 1023 (ns) your delta increases and you
> catch up again.
> 
> 
> 
>  A  B     C       D          E  F
>  |  |     |       |          |  |
>  +----+----+----+----+----+----+----+----+----+----+----+
> 
> 
> A: now = 0
>    sa->last_update_time = 0
>    delta := (now - sa->last_update_time) >> 10 = 0
> 
> B: now = 614				(+614)
>    delta = (614 - 0) >> 10 = 0
>    sa->last_update_time += 0		(0)
>    sa->last_update_time = now & ~1023	(0)
> 
> C: now = 1843				(+1229)
>    delta = (1843 - 0) >> 10 = 1
>    sa->last_update_time += 1024		(1024)
>    sa->last_update_time = now & ~1023	(1024)
> 
> 
> D: now = 3481				(+1638)
>    delta = (3481 - 1024) >> 10 = 2
>    sa->last_update_time += 2048		(3072)
>    sa->last_update_time = now & ~1023	(3072)
> 
> E: now = 5734				(+2253)
>    delta = (5734 - 3072) >> 10 = 2
>    sa->last_update_time += 2048		(5120)
>    sa->last_update_time = now & ~1023	(5120)
> 
> F: now = 6348				(+614)
>    delta = (6348 - 5120) >> 10 = 1
>    sa->last_update_time += 1024		(6144)
>    sa->last_update_time = now & ~1023	(6144)
> 
> 
> 
> And you'll see that both are identical, and that both D and F have
> gotten a spill from sub-chunk accounting.


Where the two approaches differ is when we have different modifications
to sa->last_update_time (and we do).

The differential (+=) one does not mandate that the initial value of
->last_update_time has the bottom 10 bits cleared. It will simply
continue from wherever.

The absolute (&) one however mandates that ->last_update_time always has
the bottom few bits 0, otherwise we can 'gain' time. The first iteration
will clear those bits and we'll then double account them.

It so happens that we have an explicit assign in migrate
(attach_entity_load_avg / set_task_rq_fair). And on negative delta. In
all those cases we use the immediate 'now' value, no clearing of bottom
bits.

The differential should work fine with that, the absolute one has double
accounting issues in that case.

So it would be very good to find what exactly causes Josef's workload to
get 'fixed'.
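
To make that concrete, a minimal user-space sketch (the timestamps and the
unaligned starting value 1500 are invented) of what happens after an explicit
assignment of an unaligned 'now': the absolute form ends up accounting more
time than actually elapsed, while the differential form never does:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint64_t start = 1500;                  /* unaligned: 1500 & 1023 == 476 */
        uint64_t now_values[] = { 2600, 4000, 4200 };
        uint64_t lut_diff = start, acct_diff = 0;
        uint64_t lut_abs = start, acct_abs = 0;

        for (size_t i = 0; i < sizeof(now_values) / sizeof(now_values[0]); i++) {
                uint64_t now = now_values[i];
                uint64_t delta;

                /* differential: lut += delta << 10 */
                delta = (now - lut_diff) >> 10;
                acct_diff += delta << 10;
                lut_diff += delta << 10;

                /* absolute: lut = now & ~1023, only when a full period elapsed */
                delta = (now - lut_abs) >> 10;
                if (delta) {
                        acct_abs += delta << 10;
                        lut_abs = now & ~1023ULL;
                }

                printf("now=%4llu  elapsed=%4llu  diff accounted=%4llu  abs accounted=%4llu\n",
                       (unsigned long long)now,
                       (unsigned long long)(now - start),
                       (unsigned long long)acct_diff,
                       (unsigned long long)acct_abs);
        }
        return 0;
}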

* Re: [RFC][PATCH] sched: attach extra runtime to the right avg
  2017-07-04 12:40         ` Peter Zijlstra
@ 2017-07-04 12:47           ` Josef Bacik
  2017-07-04 13:51           ` Josef Bacik
  1 sibling, 0 replies; 12+ messages in thread
From: Josef Bacik @ 2017-07-04 12:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, josef, mingo, linux-kernel, kernel-team, Josef Bacik

On Tue, Jul 04, 2017 at 02:40:03PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 04, 2017 at 02:21:50PM +0200, Peter Zijlstra wrote:
> > On Tue, Jul 04, 2017 at 12:13:09PM +0200, Ingo Molnar wrote:
> > > 
> > > This code on the other hand:
> > > 
> > > 	sa->last_update_time += delta << 10;
> > > 
> > > ... in essence creates a whole new absolute clock value that slowly but surely is 
> > > drifting away from the real rq->clock, because 'delta' is always rounded down to 
> > > the nearest 1024 ns boundary, so we accumulate the 'remainder' losses.
> > > 
> > > That is because:
> > > 
> > >         delta >>= 10;
> > > 	...
> > >         sa->last_update_time += delta << 10;
> > > 
> > > Given enough time, ->last_update_time can drift a long way, and this delta:
> > > 
> > > 	delta = now - sa->last_update_time;
> > > 
> > > ... becomes meaningless AFAICS, because it's essentially two different clocks that 
> > > get compared.
> > 
> > Thing is, once you drift over 1023 (ns) your delta increases and you
> > catch up again.
> > 
> > 
> > 
> >  A  B     C       D          E  F
> >  |  |     |       |          |  |
> >  +----+----+----+----+----+----+----+----+----+----+----+
> > 
> > 
> > A: now = 0
> >    sa->last_update_time = 0
> >    delta := (now - sa->last_update_time) >> 10 = 0
> > 
> > B: now = 614				(+614)
> >    delta = (614 - 0) >> 10 = 0
> >    sa->last_update_time += 0		(0)
> >    sa->last_update_time = now & ~1023	(0)
> > 
> > C: now = 1843				(+1229)
> >    delta = (1843 - 0) >> 10 = 1
> >    sa->last_update_time += 1024		(1024)
> >    sa->last_update_time = now & ~1023	(1024)
> > 
> > 
> > D: now = 3481				(+1638)
> >    delta = (3481 - 1024) >> 10 = 2
> >    sa->last_update_time += 2048		(3072)
> >    sa->last_update_time = now & ~1023	(3072)
> > 
> > E: now = 5734				(+2253)
> >    delta = (5734 - 3072) >> 10 = 2
> >    sa->last_update_time += 2048		(5120)
> >    sa->last_update_time = now & ~1023	(5120)
> > 
> > F: now = 6348				(+614)
> >    delta = (6348 - 5120) >> 10 = 1
> >    sa->last_update_time += 1024		(6144)
> >    sa->last_update_time = now & ~1023	(6144)
> > 
> > 
> > 
> > And you'll see that both are identical, and that both D and F have
> > gotten a spill from sub-chunk accounting.
> 
> 
> Where the two approaches differ is when we have different modifications
> to sa->last_update_time (and we do).
> 
> The differential (+=) one does not mandate that the initial value of
> ->last_update_time has the bottom 10 bits cleared. It will simply
> continue from wherever.
> 
> The absolute (&) one however mandates that ->last_update_time always has
> the bottom few bits 0, otherwise we can 'gain' time. The first iteration
> will clear those bits and we'll then double account them.
> 
> It so happens that we have an explicit assign in migrate
> (attach_entity_load_avg / set_task_rq_fair). And on negative delta. In
> all those cases we use the immediate 'now' value, no clearing of bottom
> bits.
> 
> The differential should work fine with that, the absolute one has double
> accounting issues in that case.
> 
> So it would be very good to find what exactly causes Josef's workload to
> get 'fixed'.

Sorry, let me experiment some more.  Like I said, this is one of 4 patches I need
to actually fix my workload, and I've tested so many iterations of this problem
that I may be _thinking_ it affects things when it really doesn't.  I'll re-test
with the normal code and the other 3 patches in place and see if things are ok.
Thanks,

Josef

* Re: [RFC][PATCH] sched: attach extra runtime to the right avg
  2017-07-04 12:40         ` Peter Zijlstra
  2017-07-04 12:47           ` Josef Bacik
@ 2017-07-04 13:51           ` Josef Bacik
  2017-07-04 14:28             ` Peter Zijlstra
  1 sibling, 1 reply; 12+ messages in thread
From: Josef Bacik @ 2017-07-04 13:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, josef, mingo, linux-kernel, kernel-team, Josef Bacik

On Tue, Jul 04, 2017 at 02:40:03PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 04, 2017 at 02:21:50PM +0200, Peter Zijlstra wrote:
> > On Tue, Jul 04, 2017 at 12:13:09PM +0200, Ingo Molnar wrote:
> > > 
> > > This code on the other hand:
> > > 
> > > 	sa->last_update_time += delta << 10;
> > > 
> > > ... in essence creates a whole new absolute clock value that slowly but surely is 
> > > drifting away from the real rq->clock, because 'delta' is always rounded down to 
> > > the nearest 1024 ns boundary, so we accumulate the 'remainder' losses.
> > > 
> > > That is because:
> > > 
> > >         delta >>= 10;
> > > 	...
> > >         sa->last_update_time += delta << 10;
> > > 
> > > Given enough time, ->last_update_time can drift a long way, and this delta:
> > > 
> > > 	delta = now - sa->last_update_time;
> > > 
> > > ... becomes meaningless AFAICS, because it's essentially two different clocks that 
> > > get compared.
> > 
> > Thing is, once you drift over 1023 (ns) your delta increases and you
> > catch up again.
> > 
> > 
> > 
> >  A  B     C       D          E  F
> >  |  |     |       |          |  |
> >  +----+----+----+----+----+----+----+----+----+----+----+
> > 
> > 
> > A: now = 0
> >    sa->last_update_time = 0
> >    delta := (now - sa->last_update_time) >> 10 = 0
> > 
> > B: now = 614				(+614)
> >    delta = (614 - 0) >> 10 = 0
> >    sa->last_update_time += 0		(0)
> >    sa->last_update_time = now & ~1023	(0)
> > 
> > C: now = 1843				(+1229)
> >    delta = (1843 - 0) >> 10 = 1
> >    sa->last_update_time += 1024		(1024)
> >    sa->last_update_time = now & ~1023	(1024)
> > 
> > 
> > D: now = 3481				(+1638)
> >    delta = (3481 - 1024) >> 10 = 2
> >    sa->last_update_time += 2048		(3072)
> >    sa->last_update_time = now & ~1023	(3072)
> > 
> > E: now = 5734				(+2253)
> >    delta = (5734 - 3072) >> 10 = 2
> >    sa->last_update_time += 2048		(5120)
> >    sa->last_update_time = now & ~1023	(5120)
> > 
> > F: now = 6348				(+614)
> >    delta = (6348 - 5120) >> 10 = 1
> >    sa->last_update_time += 1024		(6144)
> >    sa->last_update_time = now & ~1023	(6144)
> > 
> > 
> > 
> > And you'll see that both are identical, and that both D and F have
> > gotten a spill from sub-chunk accounting.
> 
> 
> Where the two approaches differ is when we have different modifications
> to sa->last_update_time (and we do).
> 
> The differential (+=) one does not mandate that the initial value of
> ->last_update_time has the bottom 10 bits cleared. It will simply
> continue from wherever.
> 
> The absolute (&) one however mandates that ->last_update_time always has
> the bottom few bits 0, otherwise we can 'gain' time. The first iteration
> will clear those bits and we'll then double account them.
> 
> It so happens that we have an explicit assign in migrate
> (attach_entity_load_avg / set_task_rq_fair). And on negative delta. In
> all those cases we use the immediate 'now' value, no clearing of bottom
> bits.
> 
> The differential should work fine with that, the absolute one has double
> accounting issues in that case.
> 
> So it would be very good to find what exactly causes Josef's workload to
> get 'fixed'.

Sorry everybody, I thought I had tested all of the patches minus this one, but
apparently I did not.  Re-testing with the original code and my other patches
verified that the problem is still fixed, so this isn't needed.  Sorry for the
noise,

Josef

* Re: [RFC][PATCH] sched: attach extra runtime to the right avg
  2017-07-04 13:51           ` Josef Bacik
@ 2017-07-04 14:28             ` Peter Zijlstra
  0 siblings, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2017-07-04 14:28 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Ingo Molnar, mingo, linux-kernel, kernel-team, Josef Bacik

On Tue, Jul 04, 2017 at 09:51:25AM -0400, Josef Bacik wrote:
> Sorry everybody, I thought I had tested all of the patches minus this one, but
> apparently I did not.  Re-testing with the original code and my other patches
> verified that the problem is still fixed, so this isn't needed.  Sorry for the
> noise,

N/P, thanks for testing!
