linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/3] sched: Enhance/fix sched_avg_update()
@ 2016-06-14 15:27 Frederic Weisbecker
  2016-06-14 15:28 ` [PATCH 1/3] sched: Introduce sched_time_avg_ns minimal value Frederic Weisbecker
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Frederic Weisbecker @ 2016-06-14 15:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Frederic Weisbecker, Ingo Molnar, Mike Galbraith, Thomas Gleixner

Hi,

Here is a fix for the sched_time_avg_ms sysctl value and an enhancement
for nohz scenarios.

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
	sched/core

HEAD: 8e6e088b73c2172c45cc3aadab3d4971ae47215b

Thanks,
	Frederic
---

Frederic Weisbecker (3):
      sched: Introduce sched_time_avg_ns minimal value
      sched: Unloop sched avg decaying
      Debug: Alternate old/new version of sched_avg_update calls for profiling


 kernel/sched/core.c  | 18 +++++++++++++++++-
 kernel/sched/fair.c  |  7 ++++++-
 kernel/sched/sched.h |  8 +++++++-
 kernel/sysctl.c      |  3 ++-
 4 files changed, 32 insertions(+), 4 deletions(-)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/3] sched: Introduce sched_time_avg_ns minimal value
  2016-06-14 15:27 [PATCH 0/3] sched: Enhance/fix sched_avg_update() Frederic Weisbecker
@ 2016-06-14 15:28 ` Frederic Weisbecker
  2016-06-14 15:28 ` [PATCH 2/3] sched: Unloop sched avg decaying Frederic Weisbecker
  2016-06-14 15:28 ` [NOT-FOR-MERGE-PATCH 3/3] Debug: Alternate old/new version of sched_avg_update calls for profiling Frederic Weisbecker
  2 siblings, 0 replies; 8+ messages in thread
From: Frederic Weisbecker @ 2016-06-14 15:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Frederic Weisbecker, Ingo Molnar, Mike Galbraith, Thomas Gleixner

Writing 0 to sysctl value "kernel.sched_time_avg_ms" triggers a lockup:

	NMI watchdog: Watchdog detected hard LOCKUP on cpu 4
	Modules linked in:
	irq event stamp: 81894
	hardirqs last  enabled at (81893): [<ffffffff8111af88>] rcu_idle_exit+0x68/0xa0
	hardirqs last disabled at (81894): [<ffffffff8113536e>] tick_nohz_idle_exit+0x2e/0x110
	softirqs last  enabled at (81870): [<ffffffff810ab8a1>] _local_bh_enable+0x21/0x50
	softirqs last disabled at (81869): [<ffffffff810ace2b>] irq_enter+0x4b/0x70
	CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.6.0-rc4+ #269
	Hardware name: MSI MS-7850/Z87-G41 PC Mate(MS-7850), BIOS V1.3 08/18/2013
	0000000000000000 ffff88021fb05b80 ffffffff8141b023 0000000000000000
	0000000000000004 ffff88021fb05ba0 ffffffff81163fcf 0000000000000000
	ffff88021fb05c40 ffff88021fb05be0 ffffffff8119ae7c 00000001810077ce
	Call Trace:
	<NMI>  [<ffffffff8141b023>] dump_stack+0x85/0xc2
	[<ffffffff81163fcf>] watchdog_overflow_callback+0x13f/0x160
	[<ffffffff8119ae7c>] __perf_event_overflow+0x9c/0x1f0
	[<ffffffff8119baa4>] perf_event_overflow+0x14/0x20
	[<ffffffff8100d235>] intel_pmu_handle_irq+0x1d5/0x4a0
	[<ffffffff810062ad>] perf_event_nmi_handler+0x2d/0x50
	[<ffffffff81174397>] ? tracing_generic_entry_update+0x97/0xb0
	[<ffffffff81039cef>] nmi_handle+0xbf/0x2f0
	[<ffffffff81039c35>] ? nmi_handle+0x5/0x2f0
	[<ffffffff81174397>] ? tracing_generic_entry_update+0x97/0xb0
	[<ffffffff8103a171>] default_do_nmi+0x71/0x1b0
	[<ffffffff8103a3c5>] do_nmi+0x115/0x170
	[<ffffffff8194b511>] end_repeat_nmi+0x1a/0x1e
	[<ffffffff81174397>] ? tracing_generic_entry_update+0x97/0xb0
	[<ffffffff81174397>] ? tracing_generic_entry_update+0x97/0xb0
	[<ffffffff81174397>] ? tracing_generic_entry_update+0x97/0xb0
	<<EOE>>  [<ffffffff81177d3f>] ? trace_buffer_lock_reserve+0x3f/0x60
	[<ffffffff810d3d51>] ? sched_avg_update+0x51/0xc0
	[<ffffffff811781e5>] __trace_bputs+0x75/0x100
	[<ffffffff810d3d79>] sched_avg_update+0x79/0xc0
	[<ffffffff810dedc3>] cpu_load_update+0xa3/0xd0
	[<ffffffff810e694a>] cpu_load_update_nohz_stop+0x7a/0x80
	[<ffffffff811353b7>] tick_nohz_idle_exit+0x77/0x110
	[<ffffffff810f0866>] cpu_startup_entry+0x176/0x420
	[<ffffffff810589f4>] start_secondary+0x104/0x110

This is due to the loop in sched_avg_update() failing to make any
progress when the update delay is 0.

Since setting that value to 0 makes no sense anyway, enforce a minimum
of 1 millisecond.

Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 kernel/sysctl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 87b2fc3..b730dd6 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -345,7 +345,8 @@ static struct ctl_table kern_table[] = {
 		.data		= &sysctl_sched_time_avg,
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &one,
 	},
 	{
 		.procname	= "sched_shares_window_ns",
-- 
2.7.0


* [PATCH 2/3] sched: Unloop sched avg decaying
  2016-06-14 15:27 [PATCH 0/3] sched: Enhance/fix sched_avg_update() Frederic Weisbecker
  2016-06-14 15:28 ` [PATCH 1/3] sched: Introduce sched_time_avg_ns minimal value Frederic Weisbecker
@ 2016-06-14 15:28 ` Frederic Weisbecker
  2016-06-14 15:58   ` Peter Zijlstra
  2016-06-14 15:28 ` [NOT-FOR-MERGE-PATCH 3/3] Debug: Alternate old/new version of sched_avg_update calls for profiling Frederic Weisbecker
  2 siblings, 1 reply; 8+ messages in thread
From: Frederic Weisbecker @ 2016-06-14 15:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Frederic Weisbecker, Ingo Molnar, Mike Galbraith, Thomas Gleixner

The code that computes the periodic sched rt avg decay (every 0.5
seconds) catches up with lost updates within a loop. This is fine when
CPUs sleep in dynticks mode for short periods of time, as they only
miss a few updates that are caught up in a few iterations. But CPUs
can sleep for undefined periods of time, leading to an unbounded number
of loop iterations to catch up. As a result, the computation time of
sched_avg_update() can induce latency issues.

This patch converts the loop into a single division. Testing under idle
load has shown positive results. Below is a function profile comparison
between the upstream version (renamed sched_avg_update_old) and the new
version (sched_avg_update), with the two called alternately.

The new version can be up to 2 times faster on average.

	Function                                         Hit    Time            Avg             s^2
	--------                                         ---    ----            ---             ---
	trace_stat/function0:  sched_avg_update_old   135120    47612.36 us     0.352 us        1049.606 us
	trace_stat/function0:  sched_avg_update       135120    42668.44 us     0.315 us        1617.970 us
	trace_stat/function1:  sched_avg_update_old   132074    46618.41 us     0.352 us        3653.566 us
	trace_stat/function1:  sched_avg_update       132074    41356.76 us     0.313 us        3161.585 us
	trace_stat/function2:  sched_avg_update_old   121335    43381.10 us     0.357 us        1262.240 us
	trace_stat/function2:  sched_avg_update       121335    38240.72 us     0.315 us        970.400 us
	trace_stat/function3:  sched_avg_update_old    37563    17640.67 us     0.469 us        148.206 us
	trace_stat/function3:  sched_avg_update        37563    11059.65 us     0.294 us        126.548 us
	trace_stat/function4:  sched_avg_update_old    24200    13557.05 us     0.560 us        57.124 us
	trace_stat/function4:  sched_avg_update        24200    6687.281 us     0.276 us        9.528 us
	trace_stat/function5:  sched_avg_update_old    28572    15158.76 us     0.530 us        674.049 us
	trace_stat/function5:  sched_avg_update        28573    8012.361 us     0.280 us        181.687 us
	trace_stat/function6:  sched_avg_update_old    23424    12987.64 us     0.554 us        27.639 us
	trace_stat/function6:  sched_avg_update        23425    6264.433 us     0.267 us        2.965 us
	trace_stat/function7:  sched_avg_update_old    22192    13083.69 us     0.589 us        25.605 us
	trace_stat/function7:  sched_avg_update        22191    5947.785 us     0.268 us        1.748 us

Below is a snapshot of the same function profile under full busy load
(perf bench sched messaging). The old and new versions show roughly the
same results there, with a tiny 2-3 ns advantage for the old version,
which shouldn't matter much since the decay only triggers every 0.5
seconds.

	Function                                         Hit    Time            Avg             s^2
	--------                                         ---    ----            ---             ---
	trace_stat/function0:  sched_avg_update        106699    8029.961 us     0.075 us        0.244 us
	trace_stat/function0:  sched_avg_update_old    106698    7852.948 us     0.073 us        0.313 us
	trace_stat/function1:  sched_avg_update        106547    8066.833 us     0.075 us        1.256 us
	trace_stat/function1:  sched_avg_update_old    106547    7794.896 us     0.073 us        1.521 us
	trace_stat/function2:  sched_avg_update        106527    8049.326 us     0.075 us        1.141 us
	trace_stat/function2:  sched_avg_update_old    106528    7818.155 us     0.073 us        1.052 us
	trace_stat/function3:  sched_avg_update        106534    8056.079 us     0.075 us        0.342 us
	trace_stat/function3:  sched_avg_update_old    106535    7815.416 us     0.073 us        0.369 us
	trace_stat/function4:  sched_avg_update        106433    8090.462 us     0.076 us        5.359 us
	trace_stat/function4:  sched_avg_update_old    106433    7879.694 us     0.074 us        0.433 us
	trace_stat/function5:  sched_avg_update        106426    8127.800 us     0.076 us        1.304 us
	trace_stat/function5:  sched_avg_update_old    106425    7854.538 us     0.073 us        2.466 us
	trace_stat/function6:  sched_avg_update        106436    8067.921 us     0.075 us        0.257 us
	trace_stat/function6:  sched_avg_update_old    106436    7830.492 us     0.073 us        0.334 us
	trace_stat/function7:  sched_avg_update        106427    8135.786 us     0.076 us        13.181 us
	trace_stat/function7:  sched_avg_update_old    106428    7940.925 us     0.074 us        0.982 us

In conclusion, the new version behaves roughly the same under busy
load, but is faster and much better bounded under idle load.

Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 kernel/sched/core.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 385c947..0c0578a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -666,17 +666,17 @@ bool sched_can_stop_tick(struct rq *rq)
 void sched_avg_update(struct rq *rq)
 {
 	s64 period = sched_avg_period();
+	s64 delta;
+	u64 rem;
+	int pending;
 
-	while ((s64)(rq_clock(rq) - rq->age_stamp) > period) {
-		/*
-		 * Inline assembly required to prevent the compiler
-		 * optimising this loop into a divmod call.
-		 * See __iter_div_u64_rem() for another example of this.
-		 */
-		asm("" : "+rm" (rq->age_stamp));
-		rq->age_stamp += period;
-		rq->rt_avg /= 2;
-	}
+	delta = (s64)(rq_clock(rq) - rq->age_stamp);
+	if (delta <= period)
+		return;
+
+	pending = div64_u64_rem(delta, period, &rem);
+	rq->age_stamp += delta - rem;
+	rq->rt_avg >>= pending;
 }
 
 #endif /* CONFIG_SMP */
-- 
2.7.0


* [NOT-FOR-MERGE-PATCH 3/3] Debug: Alternate old/new version of sched_avg_update calls for profiling
  2016-06-14 15:27 [PATCH 0/3] sched: Enhance/fix sched_avg_update() Frederic Weisbecker
  2016-06-14 15:28 ` [PATCH 1/3] sched: Introduce sched_time_avg_ns minimal value Frederic Weisbecker
  2016-06-14 15:28 ` [PATCH 2/3] sched: Unloop sched avg decaying Frederic Weisbecker
@ 2016-06-14 15:28 ` Frederic Weisbecker
  2 siblings, 0 replies; 8+ messages in thread
From: Frederic Weisbecker @ 2016-06-14 15:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Frederic Weisbecker, Ingo Molnar, Mike Galbraith, Thomas Gleixner

This way both versions can be compared with the function profiler.

	trace_stat/function0:  sched_avg_update_old                   764    223.936 us      0.293 us        0.017 us
	trace_stat/function0:  sched_avg_update                       764    199.818 us      0.261 us        0.006 us
	trace_stat/function1:  sched_avg_update_old                   252    99.225 us       0.393 us        0.106 us
	trace_stat/function1:  sched_avg_update                       252    59.080 us       0.234 us        0.008 us
	trace_stat/function2:  sched_avg_update_old                   606    202.465 us      0.334 us        0.007 us
	trace_stat/function2:  sched_avg_update                       605    184.083 us      0.304 us        0.002 us
	trace_stat/function3:  sched_avg_update_old                   902    261.790 us      0.290 us        0.010 us
	trace_stat/function3:  sched_avg_update                       901    238.253 us      0.264 us        0.004 us
	trace_stat/function4:  sched_avg_update_old                   120    50.391 us       0.419 us        0.095 us
	trace_stat/function4:  sched_avg_update                       120    35.947 us       0.299 us        0.004 us
	trace_stat/function5:  sched_avg_update_old                   581    195.280 us      0.336 us        0.011 us
	trace_stat/function5:  sched_avg_update                       582    173.594 us      0.298 us        0.003 us
	trace_stat/function6:  sched_avg_update_old                   202    45.201 us       0.223 us        0.049 us
	trace_stat/function6:  sched_avg_update                       201    37.892 us       0.188 us        0.013 us
	trace_stat/function7:  sched_avg_update_old                   200    80.731 us       0.403 us        0.477 us
	trace_stat/function7:  sched_avg_update                       200    41.759 us       0.208 us        0.012 us

Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Not-Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 kernel/sched/core.c  | 16 ++++++++++++++++
 kernel/sched/fair.c  |  7 ++++++-
 kernel/sched/sched.h |  8 +++++++-
 3 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0c0578a..5410e1bb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -663,6 +663,22 @@ bool sched_can_stop_tick(struct rq *rq)
 }
 #endif /* CONFIG_NO_HZ_FULL */
 
+void sched_avg_update_old(struct rq *rq)
+{
+	s64 period = sched_avg_period();
+
+	while ((s64)(rq_clock(rq) - rq->age_stamp) > period) {
+		/*
+		 * Inline assembly required to prevent the compiler
+		 * optimising this loop into a divmod call.
+		 * See __iter_div_u64_rem() for another example of this.
+		 */
+		asm("" : "+rm" (rq->age_stamp));
+		rq->age_stamp += period;
+		rq->rt_avg /= 2;
+	}
+}
+
 void sched_avg_update(struct rq *rq)
 {
 	s64 period = sched_avg_period();
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c6dd8ba..1b487bb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4607,6 +4607,7 @@ decay_load_missed(unsigned long load, unsigned long missed_updates, int idx)
  * see decay_load_misses(). For NOHZ_FULL we get to subtract and add the extra
  * term.
  */
+DEFINE_PER_CPU(int, cpu_lodd);
 static void cpu_load_update(struct rq *this_rq, unsigned long this_load,
 			    unsigned long pending_updates)
 {
@@ -4647,7 +4648,11 @@ static void cpu_load_update(struct rq *this_rq, unsigned long this_load,
 		this_rq->cpu_load[i] = (old_load * (scale - 1) + new_load) >> i;
 	}
 
-	sched_avg_update(this_rq);
+	if (__this_cpu_read(cpu_lodd) % 2)
+		sched_avg_update(this_rq);
+	else
+		sched_avg_update_old(this_rq);
+	__this_cpu_add(cpu_lodd, 1);
 }
 
 /* Used instead of source_load when we know the type == 0 */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 72f1f30..b7e197b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1427,6 +1427,7 @@ static inline int hrtick_enabled(struct rq *rq)
 #endif /* CONFIG_SCHED_HRTICK */
 
 #ifdef CONFIG_SMP
+extern void sched_avg_update_old(struct rq *rq);
 extern void sched_avg_update(struct rq *rq);
 
 #ifndef arch_scale_freq_capacity
@@ -1448,10 +1449,15 @@ unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
 }
 #endif
 
+DECLARE_PER_CPU(int, cpu_lodd);
 static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
 {
 	rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
-	sched_avg_update(rq);
+	if (__this_cpu_read(cpu_lodd) % 2)
+		sched_avg_update(rq);
+	else
+		sched_avg_update_old(rq);
+	__this_cpu_add(cpu_lodd, 1);
 }
 #else
 static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
-- 
2.7.0


* Re: [PATCH 2/3] sched: Unloop sched avg decaying
  2016-06-14 15:28 ` [PATCH 2/3] sched: Unloop sched avg decaying Frederic Weisbecker
@ 2016-06-14 15:58   ` Peter Zijlstra
  2016-06-30 12:52     ` Frederic Weisbecker
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2016-06-14 15:58 UTC (permalink / raw)
  To: Frederic Weisbecker; +Cc: LKML, Ingo Molnar, Mike Galbraith, Thomas Gleixner

On Tue, Jun 14, 2016 at 05:28:01PM +0200, Frederic Weisbecker wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 385c947..0c0578a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -666,17 +666,17 @@ bool sched_can_stop_tick(struct rq *rq)
>  void sched_avg_update(struct rq *rq)
>  {
>  	s64 period = sched_avg_period();
> +	s64 delta;
> +	u64 rem;
> +	int pending;
>  
> +	delta = (s64)(rq_clock(rq) - rq->age_stamp);
> +	if (delta <= period)
> +		return;
> +
> +	pending = div64_u64_rem(delta, period, &rem);
> +	rq->age_stamp += delta - rem;
> +	rq->rt_avg >>= pending;
>  }

Blergh, and now do the profile on a machine that doesn't have a major
transistor count dedicated to division.

Why not add the division to the nohz exit path only?


* Re: [PATCH 2/3] sched: Unloop sched avg decaying
  2016-06-14 15:58   ` Peter Zijlstra
@ 2016-06-30 12:52     ` Frederic Weisbecker
  2016-06-30 13:20       ` Peter Zijlstra
  0 siblings, 1 reply; 8+ messages in thread
From: Frederic Weisbecker @ 2016-06-30 12:52 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: LKML, Ingo Molnar, Mike Galbraith, Thomas Gleixner

On Tue, Jun 14, 2016 at 05:58:42PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 14, 2016 at 05:28:01PM +0200, Frederic Weisbecker wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 385c947..0c0578a 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -666,17 +666,17 @@ bool sched_can_stop_tick(struct rq *rq)
> >  void sched_avg_update(struct rq *rq)
> >  {
> >  	s64 period = sched_avg_period();
> > +	s64 delta;
> > +	u64 rem;
> > +	int pending;
> >  
> > +	delta = (s64)(rq_clock(rq) - rq->age_stamp);
> > +	if (delta <= period)
> > +		return;
> > +
> > +	pending = div64_u64_rem(delta, period, &rem);
> > +	rq->age_stamp += delta - rem;
> > +	rq->rt_avg >>= pending;
> >  }
> 
> Blergh, and now do the profile on machine that doesn't have major
> transistor count dedicated to divisions.

Indeed, I was afraid of something like that. That said, I'm not sure
which is worse: the iterations, or a division without native hardware
support.

Would it be worth testing?

> 
> Why not add the division to the nohz exit path only?

It would be worse I think, because we may exit nohz much more often
than a full sched_avg_period() elapses.

So the only safe optimization I can do for now is:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8b489fc..4632d31 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -626,8 +626,9 @@ bool sched_can_stop_tick(struct rq *rq)
 void sched_avg_update(struct rq *rq)
 {
 	s64 period = sched_avg_period();
+	int i;
 
-	while ((s64)(rq_clock(rq) - rq->age_stamp) > period) {
+	for (i = 0; (s64)(rq_clock(rq) - rq->age_stamp) > period; i++) {
 		/*
 		 * Inline assembly required to prevent the compiler
 		 * optimising this loop into a divmod call.
@@ -635,8 +636,8 @@ void sched_avg_update(struct rq *rq)
 		 */
 		asm("" : "+rm" (rq->age_stamp));
 		rq->age_stamp += period;
-		rq->rt_avg /= 2;
 	}
+	rq->rt_avg >>= i;
 }
 
 #endif /* CONFIG_SMP */


* Re: [PATCH 2/3] sched: Unloop sched avg decaying
  2016-06-30 12:52     ` Frederic Weisbecker
@ 2016-06-30 13:20       ` Peter Zijlstra
  2016-07-06 11:53         ` Frederic Weisbecker
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2016-06-30 13:20 UTC (permalink / raw)
  To: Frederic Weisbecker; +Cc: LKML, Ingo Molnar, Mike Galbraith, Thomas Gleixner

On Thu, Jun 30, 2016 at 02:52:26PM +0200, Frederic Weisbecker wrote:
> On Tue, Jun 14, 2016 at 05:58:42PM +0200, Peter Zijlstra wrote:

> > Why not add the division to the nohz exit path only?
> 
> It would be worse I think because we may exit much more often from nohz
> than we reach a sched_avg_period().
> 
> So the only safe optimization I can do for now is:

How about something like this then?

---

 kernel/sched/core.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3387e4f14fc9..fd1ae4c4105f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -665,9 +665,23 @@ bool sched_can_stop_tick(struct rq *rq)
 
 void sched_avg_update(struct rq *rq)
 {
-	s64 period = sched_avg_period();
+	s64 delta, period = sched_avg_period();
 
-	while ((s64)(rq_clock(rq) - rq->age_stamp) > period) {
+	delta = (s64)(rq_clock(rq) - rq->age_stamp);
+	if (likely(delta < period))
+		return;
+
+	if (unlikely(delta > 3*period)) {
+		int pending;
+		u64 rem;
+
+		pending = div64_u64_rem(delta, period, &rem);
+		rq->age_stamp += delta - rem;
+		rq->rt_avg >>= pending;
+		return;
+	}
+
+	while (delta > period) {
 		/*
 		 * Inline assembly required to prevent the compiler
 		 * optimising this loop into a divmod call.
@@ -675,6 +689,7 @@ void sched_avg_update(struct rq *rq)
 		 */
 		asm("" : "+rm" (rq->age_stamp));
 		rq->age_stamp += period;
+		delta -= period;
 		rq->rt_avg /= 2;
 	}
 }


* Re: [PATCH 2/3] sched: Unloop sched avg decaying
  2016-06-30 13:20       ` Peter Zijlstra
@ 2016-07-06 11:53         ` Frederic Weisbecker
  0 siblings, 0 replies; 8+ messages in thread
From: Frederic Weisbecker @ 2016-07-06 11:53 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: LKML, Ingo Molnar, Mike Galbraith, Thomas Gleixner

On Thu, Jun 30, 2016 at 03:20:37PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 30, 2016 at 02:52:26PM +0200, Frederic Weisbecker wrote:
> > On Tue, Jun 14, 2016 at 05:58:42PM +0200, Peter Zijlstra wrote:
> 
> > > Why not add the division to the nohz exit path only?
> > 
> > It would be worse I think because we may exit much more often from nohz
> > than we reach a sched_avg_period().
> > 
> > So the only safe optimization I can do for now is:
> 
> How about something like this then?
> 
> ---
> 
>  kernel/sched/core.c | 19 +++++++++++++++++--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3387e4f14fc9..fd1ae4c4105f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -665,9 +665,23 @@ bool sched_can_stop_tick(struct rq *rq)
>  
>  void sched_avg_update(struct rq *rq)
>  {
> -	s64 period = sched_avg_period();
> +	s64 delta, period = sched_avg_period();
>  
> -	while ((s64)(rq_clock(rq) - rq->age_stamp) > period) {
> +	delta = (s64)(rq_clock(rq) - rq->age_stamp);
> +	if (likely(delta < period))
> +		return;
> +
> +	if (unlikely(delta > 3*period)) {
> +		int pending;
> +		u64 rem;
> +
> +		pending = div64_u64_rem(delta, period, &rem);
> +		rq->age_stamp += delta - rem;
> +		rq->rt_avg >>= pending;
> +		return;
> +	}
> +
> +	while (delta > period) {
>  		/*
>  		 * Inline assembly required to prevent the compiler
>  		 * optimising this loop into a divmod call.
> @@ -675,6 +689,7 @@ void sched_avg_update(struct rq *rq)
>  		 */
>  		asm("" : "+rm" (rq->age_stamp));
>  		rq->age_stamp += period;
> +		delta -= period;
>  		rq->rt_avg /= 2;
>  	}
>  }

Makes sense. I'm going to run some tests. We might also want to precompute 3*period.

Thanks.

