linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] timers/nohz: Update nohz load even if tick already stopped
@ 2019-10-28 15:07 Frederic Weisbecker
  2019-10-29 10:05 ` Peter Zijlstra
  0 siblings, 1 reply; 15+ messages in thread
From: Frederic Weisbecker @ 2019-10-28 15:07 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
  Cc: LKML, Scott Wood, Frederic Weisbecker

From: Scott Wood <swood@redhat.com>

The way loadavg is tracked during nohz only pays attention to the load
upon entering nohz. This can be particularly noticeable if nohz is
entered while non-idle, and then the cpu goes idle and stays that way for
a long time. We've had reports of a loadavg near 150 on a mostly idle
system.

Calling calc_load_nohz_start() regardless of whether the tick is already
stopped addresses the issue when going idle. Tracking load changes when
not going idle (e.g. multiple SCHED_FIFO tasks coming and going) is not
addressed by this patch.

Signed-off-by: Scott Wood <swood@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/time/tick-sched.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c2748232f607..f55ddeb652a3 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -763,6 +763,9 @@ static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
 		ts->do_timer_last = 0;
 	}
 
+	/* Even if the tick was already stopped, load may have changed */
+	calc_load_nohz_start();
+
 	/* Skip reprogram of event if its not changed */
 	if (ts->tick_stopped && (expires == ts->next_tick)) {
 		/* Sanity check: make sure clockevent is actually programmed */
@@ -783,7 +786,6 @@ static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
 	 * the scheduler tick in nohz_restart_sched_tick.
 	 */
 	if (!ts->tick_stopped) {
-		calc_load_nohz_start();
 		quiet_vmstat();
 
 		ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] timers/nohz: Update nohz load even if tick already stopped
  2019-10-28 15:07 [PATCH] timers/nohz: Update nohz load even if tick already stopped Frederic Weisbecker
@ 2019-10-29 10:05 ` Peter Zijlstra
  2019-10-30  8:48   ` Scott Wood
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2019-10-29 10:05 UTC (permalink / raw)
  To: Frederic Weisbecker; +Cc: Thomas Gleixner, Ingo Molnar, LKML, Scott Wood

On Mon, Oct 28, 2019 at 04:07:16PM +0100, Frederic Weisbecker wrote:
> From: Scott Wood <swood@redhat.com>
> 
> The way loadavg is tracked during nohz only pays attention to the load
> upon entering nohz. This can be particularly noticeable if nohz is
> entered while non-idle, and then the cpu goes idle and stays that way for
> a long time. We've had reports of a loadavg near 150 on a mostly idle
> system.
> 
> Calling calc_load_nohz_start() regardless of whether the tick is already
> stopped addresses the issue when going idle. Tracking load changes when
> not going idle (e.g. multiple SCHED_FIFO tasks coming and going) is not
> addressed by this patch.

Hurph, is that phenomena you describe NOHZ or NOHZ_FULL? Because that
second thing you talk about, multiple SCHED_FIFO tasks running without a
tick is definitely NOHZ_FULL.

I'm thinking all of this is NOHZ_FULL because IIRC we always start the
tick when there is a runnable task. So your example of going idle in
NOHZ already cannot happen for regular NOHZ.

And for loadavg vs NOHZ_FULL I don't much like this patch. It ductapes
one symptom but doesn't address the actual issue. Let me go rediscover
how loadavg-nohz works again, I so hate all this crap :/

Bah, I can't get me head straight, but I'm thinking something like the
below ought to do. Does it actually work?

---
 include/linux/sched/nohz.h |  2 ++
 kernel/sched/core.c        |  3 +++
 kernel/sched/loadavg.c     | 26 +++++++++++++++++++-------
 3 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched/nohz.h b/include/linux/sched/nohz.h
index 1abe91ff6e4a..a2296004351f 100644
--- a/include/linux/sched/nohz.h
+++ b/include/linux/sched/nohz.h
@@ -16,9 +16,11 @@ static inline void nohz_balance_enter_idle(int cpu) { }
 #ifdef CONFIG_NO_HZ_COMMON
 void calc_load_nohz_start(void);
 void calc_load_nohz_stop(void);
+void calc_load_nohz_remote(int cpu);
 #else
 static inline void calc_load_nohz_start(void) { }
 static inline void calc_load_nohz_stop(void) { }
+static inline void calc_load_nohz_remote(int cpu) { }
 #endif /* CONFIG_NO_HZ_COMMON */
 
 #if defined(CONFIG_NO_HZ_COMMON) && defined(CONFIG_SMP)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index eb42b71faab9..209e50d48f80 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3666,6 +3666,8 @@ static void sched_tick_remote(struct work_struct *work)
 	 * having one too much is no big deal because the scheduler tick updates
 	 * statistics and checks timeslices in a time-independent way, regardless
 	 * of when exactly it is running.
+	 *
+	 * XXX should we be checking tick_nohz_tick_stopped_cpu() under rq->lock ?
 	 */
 	if (idle_cpu(cpu) || !tick_nohz_tick_stopped_cpu(cpu))
 		goto out_requeue;
@@ -3686,6 +3688,7 @@ static void sched_tick_remote(struct work_struct *work)
 	curr->sched_class->task_tick(rq, curr, 0);
 
 out_unlock:
+	calc_load_nohz_remote(cpu);
 	rq_unlock_irq(rq, &rf);
 
 out_requeue:
diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
index 28a516575c18..7549bd9b6853 100644
--- a/kernel/sched/loadavg.c
+++ b/kernel/sched/loadavg.c
@@ -231,16 +231,11 @@ static inline int calc_load_read_idx(void)
 	return calc_load_idx & 1;
 }
 
-void calc_load_nohz_start(void)
+static void calc_load_nohz_fold(struct rq *rq)
 {
-	struct rq *this_rq = this_rq();
 	long delta;
 
-	/*
-	 * We're going into NO_HZ mode, if there's any pending delta, fold it
-	 * into the pending NO_HZ delta.
-	 */
-	delta = calc_load_fold_active(this_rq, 0);
+	delta = calc_load_fold_active(rq, 0);
 	if (delta) {
 		int idx = calc_load_write_idx();
 
@@ -248,6 +243,23 @@ void calc_load_nohz_start(void)
 	}
 }
 
+void calc_load_nohz_start(void)
+{
+	/*
+	 * We're going into NO_HZ mode, if there's any pending delta, fold it
+	 * into the pending NO_HZ delta.
+	 */
+	calc_load_nohz_fold(this_rq());
+}
+
+void calc_load_nohz_remote(int cpu)
+{
+	/*
+	 * XXX comment goes here.
+	 */
+	calc_load_nohz_fold(cpu_rq(cpu));
+}
+
 void calc_load_nohz_stop(void)
 {
 	struct rq *this_rq = this_rq();

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] timers/nohz: Update nohz load even if tick already stopped
  2019-10-29 10:05 ` Peter Zijlstra
@ 2019-10-30  8:48   ` Scott Wood
  2019-10-30 13:31     ` Peter Zijlstra
  0 siblings, 1 reply; 15+ messages in thread
From: Scott Wood @ 2019-10-30  8:48 UTC (permalink / raw)
  To: Peter Zijlstra, Frederic Weisbecker; +Cc: Thomas Gleixner, Ingo Molnar, LKML

On Tue, 2019-10-29 at 11:05 +0100, Peter Zijlstra wrote:
> On Mon, Oct 28, 2019 at 04:07:16PM +0100, Frederic Weisbecker wrote:
> > From: Scott Wood <swood@redhat.com>
> > 
> > The way loadavg is tracked during nohz only pays attention to the load
> > upon entering nohz. This can be particularly noticeable if nohz is
> > entered while non-idle, and then the cpu goes idle and stays that way
> > for
> > a long time. We've had reports of a loadavg near 150 on a mostly idle
> > system.
> > 
> > Calling calc_load_nohz_start() regardless of whether the tick is already
> > stopped addresses the issue when going idle. Tracking load changes when
> > not going idle (e.g. multiple SCHED_FIFO tasks coming and going) is not
> > addressed by this patch.
> 
> Hurph, is that phenomena you describe NOHZ or NOHZ_FULL? Because that
> second thing you talk about, multiple SCHED_FIFO tasks running without a
> tick is definitely NOHZ_FULL.
> 
> I'm thinking all of this is NOHZ_FULL because IIRC we always start the
> tick when there is a runnable task. So your example of going idle in
> NOHZ already cannot happen for regular NOHZ.

Yes, NOHZ_FULL (sorry for not stating that clearly).

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index eb42b71faab9..209e50d48f80 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3666,6 +3666,8 @@ static void sched_tick_remote(struct work_struct
> *work)
>  	 * having one too much is no big deal because the scheduler tick
> updates
>  	 * statistics and checks timeslices in a time-independent way,
> regardless
>  	 * of when exactly it is running.
> +	 *
> +	 * XXX should we be checking tick_nohz_tick_stopped_cpu() under rq-
> >lock ?
>  	 */
>  	if (idle_cpu(cpu) || !tick_nohz_tick_stopped_cpu(cpu))
>  		goto out_requeue;
> @@ -3686,6 +3688,7 @@ static void sched_tick_remote(struct work_struct
> *work)
>  	curr->sched_class->task_tick(rq, curr, 0);
>  
>  out_unlock:
> +	calc_load_nohz_remote(cpu);
>  	rq_unlock_irq(rq, &rf);

This gets skipped when the cpu is idle, so it still misses the update.

-Scott



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] timers/nohz: Update nohz load even if tick already stopped
  2019-10-30  8:48   ` Scott Wood
@ 2019-10-30 13:31     ` Peter Zijlstra
  2019-11-01  5:11       ` Scott Wood
  2019-12-11 20:46       ` Scott Wood
  0 siblings, 2 replies; 15+ messages in thread
From: Peter Zijlstra @ 2019-10-30 13:31 UTC (permalink / raw)
  To: Scott Wood; +Cc: Frederic Weisbecker, Thomas Gleixner, Ingo Molnar, LKML

On Wed, Oct 30, 2019 at 03:48:26AM -0500, Scott Wood wrote:
> On Tue, 2019-10-29 at 11:05 +0100, Peter Zijlstra wrote:

> > @@ -3686,6 +3688,7 @@ static void sched_tick_remote(struct work_struct
> > *work)
> >  	curr->sched_class->task_tick(rq, curr, 0);
> >  
> >  out_unlock:
> > +	calc_load_nohz_remote(cpu);
> >  	rq_unlock_irq(rq, &rf);
> 
> This gets skipped when the cpu is idle, so it still misses the update.

Oh argh! that's a bit radical of the remote tick. The normal tick runs
just fine on idle CPUs, so lets mirror that.

How's this then?

---
diff --git a/include/linux/sched/nohz.h b/include/linux/sched/nohz.h
index 1abe91ff6e4a..6d67e9a5af6b 100644
--- a/include/linux/sched/nohz.h
+++ b/include/linux/sched/nohz.h
@@ -15,9 +15,11 @@ static inline void nohz_balance_enter_idle(int cpu) { }
 
 #ifdef CONFIG_NO_HZ_COMMON
 void calc_load_nohz_start(void);
+void calc_load_nohz_remote(struct rq *rq);
 void calc_load_nohz_stop(void);
 #else
 static inline void calc_load_nohz_start(void) { }
+static inline void calc_load_nohz_remote(struct rq *rq) { }
 static inline void calc_load_nohz_stop(void) { }
 #endif /* CONFIG_NO_HZ_COMMON */
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index eb42b71faab9..d02d1b8f40af 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3660,21 +3660,17 @@ static void sched_tick_remote(struct work_struct *work)
 	u64 delta;
 	int os;
 
-	/*
-	 * Handle the tick only if it appears the remote CPU is running in full
-	 * dynticks mode. The check is racy by nature, but missing a tick or
-	 * having one too much is no big deal because the scheduler tick updates
-	 * statistics and checks timeslices in a time-independent way, regardless
-	 * of when exactly it is running.
-	 */
-	if (idle_cpu(cpu) || !tick_nohz_tick_stopped_cpu(cpu))
+	if (!tick_nohz_tick_stopped_cpu(cpu))
 		goto out_requeue;
 
 	rq_lock_irq(rq, &rf);
-	curr = rq->curr;
-	if (is_idle_task(curr) || cpu_is_offline(cpu))
+	/*
+	 * We must not call calc_load_nohz_remote() when not in NOHZ mode.
+	 */
+	if (cpu_is_offline(cpu) || !tick_nohz_tick_stopped(cpu))
 		goto out_unlock;
 
+	curr = rq->curr;
 	update_rq_clock(rq);
 	delta = rq_clock_task(rq) - curr->se.exec_start;
 
@@ -3685,10 +3681,11 @@ static void sched_tick_remote(struct work_struct *work)
 	WARN_ON_ONCE(delta > (u64)NSEC_PER_SEC * 3);
 	curr->sched_class->task_tick(rq, curr, 0);
 
+	calc_load_nohz_remote(rq);
 out_unlock:
 	rq_unlock_irq(rq, &rf);
-
 out_requeue:
+
 	/*
 	 * Run the remote tick once per second (1Hz). This arbitrary
 	 * frequency is large enough to avoid overload but short enough
diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
index 28a516575c18..de22da666ac7 100644
--- a/kernel/sched/loadavg.c
+++ b/kernel/sched/loadavg.c
@@ -231,16 +231,11 @@ static inline int calc_load_read_idx(void)
 	return calc_load_idx & 1;
 }
 
-void calc_load_nohz_start(void)
+static void calc_load_nohz_fold(struct rq *rq)
 {
-	struct rq *this_rq = this_rq();
 	long delta;
 
-	/*
-	 * We're going into NO_HZ mode, if there's any pending delta, fold it
-	 * into the pending NO_HZ delta.
-	 */
-	delta = calc_load_fold_active(this_rq, 0);
+	delta = calc_load_fold_active(rq, 0);
 	if (delta) {
 		int idx = calc_load_write_idx();
 
@@ -248,6 +243,24 @@ void calc_load_nohz_start(void)
 	}
 }
 
+void calc_load_nohz_start(void)
+{
+	/*
+	 * We're going into NO_HZ mode, if there's any pending delta, fold it
+	 * into the pending NO_HZ delta.
+	 */
+	calc_load_nohz_fold(this_rq());
+}
+
+/*
+ * Keep track of the load for NOHZ_FULL, must be called between
+ * calc_load_nohz_{start,stop}().
+ */
+void calc_load_nohz_remote(struct rq *rq)
+{
+	calc_load_nohz_fold(rq);
+}
+
 void calc_load_nohz_stop(void)
 {
 	struct rq *this_rq = this_rq();
@@ -268,7 +281,7 @@ void calc_load_nohz_stop(void)
 		this_rq->calc_load_update += LOAD_FREQ;
 }
 
-static long calc_load_nohz_fold(void)
+static long calc_load_nohz_read(void)
 {
 	int idx = calc_load_read_idx();
 	long delta = 0;
@@ -323,7 +336,7 @@ static void calc_global_nohz(void)
 }
 #else /* !CONFIG_NO_HZ_COMMON */
 
-static inline long calc_load_nohz_fold(void) { return 0; }
+static inline long calc_load_nohz_read(void) { return 0; }
 static inline void calc_global_nohz(void) { }
 
 #endif /* CONFIG_NO_HZ_COMMON */
@@ -346,7 +359,7 @@ void calc_global_load(unsigned long ticks)
 	/*
 	 * Fold the 'old' NO_HZ-delta to include all NO_HZ CPUs.
 	 */
-	delta = calc_load_nohz_fold();
+	delta = calc_load_nohz_read();
 	if (delta)
 		atomic_long_add(delta, &calc_load_tasks);
 

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] timers/nohz: Update nohz load even if tick already stopped
  2019-10-30 13:31     ` Peter Zijlstra
@ 2019-11-01  5:11       ` Scott Wood
  2019-11-04 22:17         ` Thomas Gleixner
  2019-12-11 20:46       ` Scott Wood
  1 sibling, 1 reply; 15+ messages in thread
From: Scott Wood @ 2019-11-01  5:11 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Frederic Weisbecker, Thomas Gleixner, Ingo Molnar, LKML

On Wed, 2019-10-30 at 14:31 +0100, Peter Zijlstra wrote:
> On Wed, Oct 30, 2019 at 03:48:26AM -0500, Scott Wood wrote:
> > On Tue, 2019-10-29 at 11:05 +0100, Peter Zijlstra wrote:
> > > @@ -3686,6 +3688,7 @@ static void sched_tick_remote(struct work_struct
> > > *work)
> > >  	curr->sched_class->task_tick(rq, curr, 0);
> > >  
> > >  out_unlock:
> > > +	calc_load_nohz_remote(cpu);
> > >  	rq_unlock_irq(rq, &rf);
> > 
> > This gets skipped when the cpu is idle, so it still misses the update.
> 
> Oh argh! that's a bit radical of the remote tick. The normal tick runs
> just fine on idle CPUs, so lets mirror that.
> 
> How's this then?
> 
> ---
> diff --git a/include/linux/sched/nohz.h b/include/linux/sched/nohz.h
> index 1abe91ff6e4a..6d67e9a5af6b 100644
> --- a/include/linux/sched/nohz.h
> +++ b/include/linux/sched/nohz.h
> @@ -15,9 +15,11 @@ static inline void nohz_balance_enter_idle(int cpu) { }
>  
>  #ifdef CONFIG_NO_HZ_COMMON
>  void calc_load_nohz_start(void);
> +void calc_load_nohz_remote(struct rq *rq);
>  void calc_load_nohz_stop(void);
>  #else
>  static inline void calc_load_nohz_start(void) { }
> +static inline void calc_load_nohz_remote(struct rq *rq) { }
>  static inline void calc_load_nohz_stop(void) { }
>  #endif /* CONFIG_NO_HZ_COMMON */
>  
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index eb42b71faab9..d02d1b8f40af 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3660,21 +3660,17 @@ static void sched_tick_remote(struct work_struct
> *work)
>  	u64 delta;
>  	int os;
>  
> -	/*
> -	 * Handle the tick only if it appears the remote CPU is running in
> full
> -	 * dynticks mode. The check is racy by nature, but missing a tick or
> -	 * having one too much is no big deal because the scheduler tick
> updates
> -	 * statistics and checks timeslices in a time-independent way,
> regardless
> -	 * of when exactly it is running.
> -	 */
> -	if (idle_cpu(cpu) || !tick_nohz_tick_stopped_cpu(cpu))
> +	if (!tick_nohz_tick_stopped_cpu(cpu))
>  		goto out_requeue;
>  
>  	rq_lock_irq(rq, &rf);
> -	curr = rq->curr;
> -	if (is_idle_task(curr) || cpu_is_offline(cpu))
> +	/*
> +	 * We must not call calc_load_nohz_remote() when not in NOHZ mode.
> +	 */
> +	if (cpu_is_offline(cpu) || !tick_nohz_tick_stopped(cpu))
>  		goto out_unlock;

Needs to be tick_nohz_tick_stopped_cpu(cpu)

After fixing that, I get:

[    7.439068] WARNING: CPU: 20 PID: 7 at /home/root/linux/kernel/sched/core.c:3681 sched_tick_remote+0x132/0x150
[    7.439068] Modules linked in:
[    7.439068] CPU: 20 PID: 7 Comm: kworker/u209:0 Not tainted 5.4.0-rc5.std+ #15
[    7.439068] Hardware name: Intel Corporation S2600BT/S2600BT, BIOS SE5C620.86B.01.00.0763.022420181017 02/24/2018
[    7.439068] Workqueue: events_unbound sched_tick_remote
[    7.446308] pci_bus 0000:9f: resource 1 [mem 0xe6a00000-0xe6bfffff]
[    7.455068] RIP: 0010:sched_tick_remote+0x132/0x150
[    7.455068] Code: 00 e9 b2 fd fe ff 0f 0b e9 46 ff ff ff 83 f8 02 89 c2 74 d3 8d 4a ff 89 d0 f0 0f b1 0e 0f 94 c1 84 c9 0f 85 23 ff ff ff eb e3 <0f> 0b eb 9a 80 3d 9c d6 2c 01 00 0f 1f 00 0f 85 71 ff ff ff e8 05
[    7.455068] RSP: 0000:ffffc9000c683e58 EFLAGS: 00010002
[    7.455068] RAX: 00000000e7061da1 RBX: ffff8897e026e688 RCX: 0000000181f93295
[    7.455068] RDX: 00000000b2d05e00 RSI: ffff8897e0269e50 RDI: 0000000000000004
[    7.455068] RBP: ffff8881004c0000 R08: ffff8e8191a2b423 R09: 0000000000000000
[    7.455068] R10: 0000000000000010 R11: 0000000000000018 R12: ffff8897e0269240
[    7.455068] R13: ffff8897e0240000 R14: 0000000000000000 R15: ffff888107edc2e8
[    7.455068] FS:  0000000000000000(0000) GS:ffff8897e0700000(0000) knlGS:0000000000000000
[    7.455068] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    7.455068] CR2: 0000000000000000 CR3: 000000303e60a001 CR4: 00000000007606e0
[    7.455068] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    7.459338] pci_bus 0000:9f: resource 2 [mem 0x3a0000000000-0x3a00001fffff 64bit pref]
[    7.465068] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    7.465068] PKRU: 55555554
[    7.465068] Call Trace:
[    7.465068]  process_one_work+0x165/0x3c0
[    7.465068]  worker_thread+0x46/0x3d0
[    7.465068]  kthread+0xf8/0x130
[    7.465068]  ? process_one_work+0x3c0/0x3c0
[    7.476788] pci_bus 0000:a0: resource 1 [mem 0xe6c00000-0xe6dfffff]
[    7.465068]  ? kthread_bind+0x10/0x10
[    7.465068]  ret_from_fork+0x35/0x40

-Scott


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] timers/nohz: Update nohz load even if tick already stopped
  2019-11-01  5:11       ` Scott Wood
@ 2019-11-04 22:17         ` Thomas Gleixner
  2019-11-04 23:43           ` Thomas Gleixner
  0 siblings, 1 reply; 15+ messages in thread
From: Thomas Gleixner @ 2019-11-04 22:17 UTC (permalink / raw)
  To: Scott Wood; +Cc: Peter Zijlstra, Frederic Weisbecker, Ingo Molnar, LKML

On Fri, 1 Nov 2019, Scott Wood wrote:
> On Wed, 2019-10-30 at 14:31 +0100, Peter Zijlstra wrote:
> > Oh argh! that's a bit radical of the remote tick. The normal tick runs
> > just fine on idle CPUs, so lets mirror that.
> > 
> > How's this then?

....
 
> 
> Needs to be tick_nohz_tick_stopped_cpu(cpu)
> 
> After fixing that, I get:
> 
> [    7.439068] WARNING: CPU: 20 PID: 7 at /home/root/linux/kernel/sched/core.c:3681 sched_tick_remote+0x132/0x150

So I'm going to apply Scotts patch if nobody comes up with a better idea
until tomorrow.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] timers/nohz: Update nohz load even if tick already stopped
  2019-11-04 22:17         ` Thomas Gleixner
@ 2019-11-04 23:43           ` Thomas Gleixner
  2019-11-05  7:30             ` Scott Wood
  0 siblings, 1 reply; 15+ messages in thread
From: Thomas Gleixner @ 2019-11-04 23:43 UTC (permalink / raw)
  To: Scott Wood; +Cc: Peter Zijlstra, Frederic Weisbecker, Ingo Molnar, LKML

On Mon, 4 Nov 2019, Thomas Gleixner wrote:
> On Fri, 1 Nov 2019, Scott Wood wrote:
> > On Wed, 2019-10-30 at 14:31 +0100, Peter Zijlstra wrote:
> > > Oh argh! that's a bit radical of the remote tick. The normal tick runs
> > > just fine on idle CPUs, so lets mirror that.
> > > 
> > > How's this then?
> 
> ....
>  
> > 
> > Needs to be tick_nohz_tick_stopped_cpu(cpu)
> > 
> > After fixing that, I get:
> > 
> > [    7.439068] WARNING: CPU: 20 PID: 7 at /home/root/linux/kernel/sched/core.c:3681 sched_tick_remote+0x132/0x150
> 
> So I'm going to apply Scotts patch if nobody comes up with a better idea
> until tomorrow.

As Peter pointed out to me privately we should rather go and analyze the
real thing instead of just applying duct tape.

/me drops the patch again.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] timers/nohz: Update nohz load even if tick already stopped
  2019-11-04 23:43           ` Thomas Gleixner
@ 2019-11-05  7:30             ` Scott Wood
  2019-11-05  9:53               ` Thomas Gleixner
  2019-11-05 12:43               ` Peter Zijlstra
  0 siblings, 2 replies; 15+ messages in thread
From: Scott Wood @ 2019-11-05  7:30 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Peter Zijlstra, Frederic Weisbecker, Ingo Molnar, LKML

On Tue, 2019-11-05 at 00:43 +0100, Thomas Gleixner wrote:
> On Mon, 4 Nov 2019, Thomas Gleixner wrote:
> > On Fri, 1 Nov 2019, Scott Wood wrote:
> > > On Wed, 2019-10-30 at 14:31 +0100, Peter Zijlstra wrote:
> > > > Oh argh! that's a bit radical of the remote tick. The normal tick
> > > > runs
> > > > just fine on idle CPUs, so lets mirror that.
> > > > 
> > > > How's this then?
> > 
> > ....
> >  
> > > Needs to be tick_nohz_tick_stopped_cpu(cpu)
> > > 
> > > After fixing that, I get:
> > > 
> > > [    7.439068] WARNING: CPU: 20 PID: 7 at
> > > /home/root/linux/kernel/sched/core.c:3681
> > > sched_tick_remote+0x132/0x150
> > 
> > So I'm going to apply Scotts patch if nobody comes up with a better idea
> > until tomorrow.
> 
> As Peter pointed out to me privately we should rather go and analyze the
> real thing instead of just applying duct tape.
> 
> /me drops the patch again.

The warning is due to kernel/sched/idle.c not updating curr->se.exec_start.

While debugging I noticed an issue with a particular load pattern.  The CPU
goes non-nohz for a brief time at an interval very close to twice 
tick_period.  When the tick is started, the timer expiration is more than
tick_period in the past, so hrtimer_forward() tries to catch up by adding
2*tick_period to the expiration.  Then the tick is stopped before that new
expiration, and when the tick is woken up the expiry is again advanced by
2*tick_period with the timer never actually running.  sched_tick_remote()
does fire every second, but there are streaks of several seconds where it
keeps catching the CPU in a non-nohz state, so neither the normal nor remote
ticks are calling calc_load_nohz_remote().

Is there a reason to not just remove the hrtimer_forward() from
tick_nohz_restart(), letting the timer fire if it's in the past, which will
take care of doing hrtimer_forward()?

As for the warning in sched_tick_remote(), it seems like a test for time
since the last tick on this cpu (remote or otherwise) would be better than
relying on curr->se.exec_start, in order to detect things like this.

-Scott



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] timers/nohz: Update nohz load even if tick already stopped
  2019-11-05  7:30             ` Scott Wood
@ 2019-11-05  9:53               ` Thomas Gleixner
  2019-11-08  8:16                 ` Scott Wood
  2019-11-05 12:43               ` Peter Zijlstra
  1 sibling, 1 reply; 15+ messages in thread
From: Thomas Gleixner @ 2019-11-05  9:53 UTC (permalink / raw)
  To: Scott Wood; +Cc: Peter Zijlstra, Frederic Weisbecker, Ingo Molnar, LKML

On Tue, 5 Nov 2019, Scott Wood wrote:
> On Tue, 2019-11-05 at 00:43 +0100, Thomas Gleixner wrote:
> > As Peter pointed out to me privately we should rather go and analyze the
> > real thing instead of just applying duct tape.
> > 
> > /me drops the patch again.
> 
> The warning is due to kernel/sched/idle.c not updating curr->se.exec_start.
> 
> While debugging I noticed an issue with a particular load pattern.  The CPU
> goes non-nohz for a brief time at an interval very close to twice 
> tick_period.  When the tick is started, the timer expiration is more than
> tick_period in the past, so hrtimer_forward() tries to catch up by adding
> 2*tick_period to the expiration.  Then the tick is stopped before that new
> expiration, and when the tick is woken up the expiry is again advanced by
> 2*tick_period with the timer never actually running.  sched_tick_remote()
> does fire every second, but there are streaks of several seconds where it
> keeps catching the CPU in a non-nohz state, so neither the normal nor remote
> ticks are calling calc_load_nohz_remote().
> 
> Is there a reason to not just remove the hrtimer_forward() from
> tick_nohz_restart(), letting the timer fire if it's in the past, which will
> take care of doing hrtimer_forward()?

Well, no. tick_nohz_restart() can be invoked in a situation where the timer
is armed for something in the far future (or completelt disabled) due to
previously entering an estimated long idle (or user space execution on
NOHZ_FULL) period.

That means if the timer is not canceled, realigned to the current tick and
forwarded to the next due tick, the tick will not fire on time causing
another sort of trouble.

> As for the warning in sched_tick_remote(), it seems like a test for time
> since the last tick on this cpu (remote or otherwise) would be better than
> relying on curr->se.exec_start, in order to detect things like this.

Care to give that a shot?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] timers/nohz: Update nohz load even if tick already stopped
  2019-11-05  7:30             ` Scott Wood
  2019-11-05  9:53               ` Thomas Gleixner
@ 2019-11-05 12:43               ` Peter Zijlstra
  2019-11-06  8:37                 ` Peter Zijlstra
  2019-11-08  8:13                 ` Scott Wood
  1 sibling, 2 replies; 15+ messages in thread
From: Peter Zijlstra @ 2019-11-05 12:43 UTC (permalink / raw)
  To: Scott Wood; +Cc: Thomas Gleixner, Frederic Weisbecker, Ingo Molnar, LKML

On Tue, Nov 05, 2019 at 01:30:58AM -0600, Scott Wood wrote:
> The warning is due to kernel/sched/idle.c not updating curr->se.exec_start.

Ah, indeed so.

> While debugging I noticed an issue with a particular load pattern.  The CPU
> goes non-nohz for a brief time at an interval very close to twice 
> tick_period.  When the tick is started, the timer expiration is more than
> tick_period in the past, so hrtimer_forward() tries to catch up by adding
> 2*tick_period to the expiration.  Then the tick is stopped before that new
> expiration, and when the tick is woken up the expiry is again advanced by
> 2*tick_period with the timer never actually running.  sched_tick_remote()
> does fire every second, but there are streaks of several seconds where it
> keeps catching the CPU in a non-nohz state, so neither the normal nor remote
> ticks are calling calc_load_nohz_remote().
> 
> Is there a reason to not just remove the hrtimer_forward() from
> tick_nohz_restart(), letting the timer fire if it's in the past, which will
> take care of doing hrtimer_forward()?

I'll have to look into that. I always get confused by all that nohz code
:/

> As for the warning in sched_tick_remote(), it seems like a test for time
> since the last tick on this cpu (remote or otherwise) would be better than
> relying on curr->se.exec_start, in order to detect things like this.

I don't think we have a timestamp that is shared between the remote and
local tick. Also, there is a reason this warning uses the task time
accounting, there used to be (as in, I can't find it in a hurry) code
that could not deal with >u32 (~4s) clock updates.

The below should have idle keep the timestamp up-to-date. Keeping
accurate idle->se.sum_exec_runtime doesn't seem too interesting, the
idle code already keeps track of total idle times.

---

--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -381,6 +381,7 @@ static void put_prev_task_idle(struct rq
 
 static void set_next_task_idle(struct rq *rq, struct task_struct *next)
 {
+	curr->se.exec_start = rq_clock_task(rq);
 	update_idle_core(rq);
 	schedstat_inc(rq->sched_goidle);
 }
@@ -417,6 +418,7 @@ dequeue_task_idle(struct rq *rq, struct
  */
 static void task_tick_idle(struct rq *rq, struct task_struct *curr, int queued)
 {
+	curr->se.exec_start = rq_clock_task(rq);
 }
 
 static void switched_to_idle(struct rq *rq, struct task_struct *p)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] timers/nohz: Update nohz load even if tick already stopped
  2019-11-05 12:43               ` Peter Zijlstra
@ 2019-11-06  8:37                 ` Peter Zijlstra
  2019-11-08  8:13                 ` Scott Wood
  1 sibling, 0 replies; 15+ messages in thread
From: Peter Zijlstra @ 2019-11-06  8:37 UTC (permalink / raw)
  To: Scott Wood; +Cc: Thomas Gleixner, Frederic Weisbecker, Ingo Molnar, LKML

On Tue, Nov 05, 2019 at 01:43:51PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 05, 2019 at 01:30:58AM -0600, Scott Wood wrote:
> > The warning is due to kernel/sched/idle.c not updating curr->se.exec_start.
> 
> Ah, indeed so.
> 
> > While debugging I noticed an issue with a particular load pattern.  The CPU
> > goes non-nohz for a brief time at an interval very close to twice 
> > tick_period.  When the tick is started, the timer expiration is more than
> > tick_period in the past, so hrtimer_forward() tries to catch up by adding
> > 2*tick_period to the expiration.  Then the tick is stopped before that new
> > expiration, and when the tick is woken up the expiry is again advanced by
> > 2*tick_period with the timer never actually running.  sched_tick_remote()
> > does fire every second, but there are streaks of several seconds where it
> > keeps catching the CPU in a non-nohz state, so neither the normal nor remote
> > ticks are calling calc_load_nohz_remote().
> > 
> > Is there a reason to not just remove the hrtimer_forward() from
> > tick_nohz_restart(), letting the timer fire if it's in the past, which will
> > take care of doing hrtimer_forward()?
> 
> I'll have to look into that. I always get confused by all that nohz code
> :/
> 
> > As for the warning in sched_tick_remote(), it seems like a test for time
> > since the last tick on this cpu (remote or otherwise) would be better than
> > relying on curr->se.exec_start, in order to detect things like this.
> 
> I don't think we have a timestamp that is shared between the remote and
> local tick. Also, there is a reason this warning uses the task time
> accounting, there used to be (as in, I can't find it in a hurry) code
> that could not deal with >u32 (~4s) clock updates.
> 
> The below should have idle keep the timestamp up-to-date. Keeping
> accurate idle->se.sum_exec_runtime doesn't seem too interesting, the
> idle code already keeps track of total idle times.
> 
> ---

The obvious alternative is something like:

	if (rq->curr != rq->idle) {
		s64 delta = rq_clock_task(rq) - curr->se.exec_start;
		WARN_ON_ONCE(delta > 3ULL * NSEC_PER_SEC);
	}

Which would avoid polluting the idle path with that extra assignment.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] timers/nohz: Update nohz load even if tick already stopped
  2019-11-05 12:43               ` Peter Zijlstra
  2019-11-06  8:37                 ` Peter Zijlstra
@ 2019-11-08  8:13                 ` Scott Wood
  2019-12-11 20:37                   ` Scott Wood
  1 sibling, 1 reply; 15+ messages in thread
From: Scott Wood @ 2019-11-08  8:13 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Thomas Gleixner, Frederic Weisbecker, Ingo Molnar, LKML

On Tue, 2019-11-05 at 13:43 +0100, Peter Zijlstra wrote:
> On Tue, Nov 05, 2019 at 01:30:58AM -0600, Scott Wood wrote:
> > As for the warning in sched_tick_remote(), it seems like a test for time
> > since the last tick on this cpu (remote or otherwise) would be better
> > than
> > relying on curr->se.exec_start, in order to detect things like this.
> 
> I don't think we have a timestamp that is shared between the remote and
> local tick. 

Why wouldn't rq_clock_task() work on the local tick?  It's what
->task_tick() itself uses.

> Also, there is a reason this warning uses the task time
> accounting, there used to be (as in, I can't find it in a hurry) code
> that could not deal with >u32 (~4s) clock updates.

Detecting a 3 second interval between ticks for a given cpu should assert in
a superset of the situations the current check asserts in -- it just avoids
the false negative of exec_runtime getting updated by something other than
the tick.

-Scott



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] timers/nohz: Update nohz load even if tick already stopped
  2019-11-05  9:53               ` Thomas Gleixner
@ 2019-11-08  8:16                 ` Scott Wood
  0 siblings, 0 replies; 15+ messages in thread
From: Scott Wood @ 2019-11-08  8:16 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Peter Zijlstra, Frederic Weisbecker, Ingo Molnar, LKML

On Tue, 2019-11-05 at 10:53 +0100, Thomas Gleixner wrote:
> On Tue, 5 Nov 2019, Scott Wood wrote:
> > On Tue, 2019-11-05 at 00:43 +0100, Thomas Gleixner wrote:
> > > As Peter pointed out to me privately we should rather go and analyze
> > > the
> > > real thing instead of just applying duct tape.
> > > 
> > > /me drops the patch again.
> > 
> > The warning is due to kernel/sched/idle.c not updating curr-
> > >se.exec_start.
> > 
> > While debugging I noticed an issue with a particular load pattern.  The
> > CPU
> > goes non-nohz for a brief time at an interval very close to twice 
> > tick_period.  When the tick is started, the timer expiration is more
> > than
> > tick_period in the past, so hrtimer_forward() tries to catch up by
> > adding
> > 2*tick_period to the expiration.  Then the tick is stopped before that
> > new
> > expiration, and when the tick is woken up the expiry is again advanced
> > by
> > 2*tick_period with the timer never actually
> > running.  sched_tick_remote()
> > does fire every second, but there are streaks of several seconds where
> > it
> > keeps catching the CPU in a non-nohz state, so neither the normal nor
> > remote
> > ticks are calling calc_load_nohz_remote().
> > 
> > Is there a reason to not just remove the hrtimer_forward() from
> > tick_nohz_restart(), letting the timer fire if it's in the past, which
> > will
> > take care of doing hrtimer_forward()?
> 
> Well, no. tick_nohz_restart() can be invoked in a situation where the
> timer
> is armed for something in the far future (or completelt disabled) due to
> previously entering an estimated long idle (or user space execution on
> NOHZ_FULL) period.
> 
> That means if the timer is not canceled, realigned to the current tick and
> forwarded to the next due tick, the tick will not fire on time causing
> another sort of trouble.

That might be true of the expiry on entering tick_nohz_restart(), but it
shouldn't be true of ts->last_tick which the expiry is set to before calling
hrtimer_forward() -- and if it were, hrtimer_forward() is a no-op when the
expiry is in the future.

BTW, the name "last_tick" seems misleading as it's actually the saved
expiry, not the last time the tick ran.

-Scott



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] timers/nohz: Update nohz load even if tick already stopped
  2019-11-08  8:13                 ` Scott Wood
@ 2019-12-11 20:37                   ` Scott Wood
  0 siblings, 0 replies; 15+ messages in thread
From: Scott Wood @ 2019-12-11 20:37 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Thomas Gleixner, Frederic Weisbecker, Ingo Molnar, LKML

On Fri, 2019-11-08 at 02:13 -0600, Scott Wood wrote:
> On Tue, 2019-11-05 at 13:43 +0100, Peter Zijlstra wrote:
> > On Tue, Nov 05, 2019 at 01:30:58AM -0600, Scott Wood wrote:
> > > As for the warning in sched_tick_remote(), it seems like a test for
> > > time
> > > since the last tick on this cpu (remote or otherwise) would be better
> > > than
> > > relying on curr->se.exec_start, in order to detect things like this.
> > 
> > I don't think we have a timestamp that is shared between the remote and
> > local tick. 
> 
> Why wouldn't rq_clock_task() work on the local tick?  It's what
> ->task_tick() itself uses.
> 
> > Also, there is a reason this warning uses the task time
> > accounting, there used to be (as in, I can't find it in a hurry) code
> > that could not deal with >u32 (~4s) clock updates.
> 
> Detecting a 3 second interval between ticks for a given cpu should
> assert in a superset of the situations the current check asserts in --
> it just avoids the false negative of exec_runtime getting updated by
> something other than the tick.

The main difficulty with such a check is that when we're not on a full
nohz cpu, there's no remote tick, and so we can legitimately go more than
3 seconds between ticks when idle.

-Scott


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] timers/nohz: Update nohz load even if tick already stopped
  2019-10-30 13:31     ` Peter Zijlstra
  2019-11-01  5:11       ` Scott Wood
@ 2019-12-11 20:46       ` Scott Wood
  1 sibling, 0 replies; 15+ messages in thread
From: Scott Wood @ 2019-12-11 20:46 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Frederic Weisbecker, Thomas Gleixner, Ingo Molnar, LKML

On Wed, 2019-10-30 at 14:31 +0100, Peter Zijlstra wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index eb42b71faab9..d02d1b8f40af 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3660,21 +3660,17 @@ static void sched_tick_remote(struct work_struct
> *work)
>  	u64 delta;
>  	int os;
>  
> -	/*
> -	 * Handle the tick only if it appears the remote CPU is running in
> full
> -	 * dynticks mode. The check is racy by nature, but missing a tick or
> -	 * having one too much is no big deal because the scheduler tick
> updates
> -	 * statistics and checks timeslices in a time-independent way,
> regardless
> -	 * of when exactly it is running.
> -	 */
> -	if (idle_cpu(cpu) || !tick_nohz_tick_stopped_cpu(cpu))
> +	if (!tick_nohz_tick_stopped_cpu(cpu))
>  		goto out_requeue;
>  
>  	rq_lock_irq(rq, &rf);
> -	curr = rq->curr;
> -	if (is_idle_task(curr) || cpu_is_offline(cpu))
> +	/*
> +	 * We must not call calc_load_nohz_remote() when not in NOHZ mode.
> +	 */
> +	if (cpu_is_offline(cpu) || !tick_nohz_tick_stopped(cpu))
>  		goto out_unlock;

Is it really a problem if calc_load_nohz_remote() gets called in
non-NOHZ?  It won't race due to rq lock -- and we're already mixing
remote and non-remote updates because the normal tick timer can still be
run while "stopped".

-Scott


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2019-12-11 20:46 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-28 15:07 [PATCH] timers/nohz: Update nohz load even if tick already stopped Frederic Weisbecker
2019-10-29 10:05 ` Peter Zijlstra
2019-10-30  8:48   ` Scott Wood
2019-10-30 13:31     ` Peter Zijlstra
2019-11-01  5:11       ` Scott Wood
2019-11-04 22:17         ` Thomas Gleixner
2019-11-04 23:43           ` Thomas Gleixner
2019-11-05  7:30             ` Scott Wood
2019-11-05  9:53               ` Thomas Gleixner
2019-11-08  8:16                 ` Scott Wood
2019-11-05 12:43               ` Peter Zijlstra
2019-11-06  8:37                 ` Peter Zijlstra
2019-11-08  8:13                 ` Scott Wood
2019-12-11 20:37                   ` Scott Wood
2019-12-11 20:46       ` Scott Wood

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).