linux-kernel.vger.kernel.org archive mirror
* [PATCH] tick: Detect and fix jiffies update stall
From: Frederic Weisbecker @ 2022-02-02  0:01 UTC (permalink / raw)
  To: Paul E . McKenney; +Cc: LKML, Frederic Weisbecker, Thomas Gleixner

In some rare cases, the timekeeper CPU may delay its jiffies
update duty for a while. Known causes include:

* The timekeeper is waiting on stop_machine in a MULTI_STOP_DISABLE_IRQ
  or MULTI_STOP_RUN state. Disabled interrupts prevent timekeeping
  updates while waiting for the target CPU to complete its
  stop_machine() callback.

* The timekeeper vcpu has VMEXIT'ed for a long while due to some overload
  on the host.

Detect and fix these situations with emergency timekeeping catchups.

Original-patch-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/tick-sched.c | 17 +++++++++++++++++
 kernel/time/tick-sched.h |  4 ++++
 2 files changed, 21 insertions(+)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 17a283ce2b20..c89f50a7e690 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -169,6 +169,8 @@ static ktime_t tick_init_jiffy_update(void)
 	return period;
 }
 
+#define MAX_STALLED_JIFFIES 5
+
 static void tick_sched_do_timer(struct tick_sched *ts, ktime_t now)
 {
 	int cpu = smp_processor_id();
@@ -196,6 +198,21 @@ static void tick_sched_do_timer(struct tick_sched *ts, ktime_t now)
 	if (tick_do_timer_cpu == cpu)
 		tick_do_update_jiffies64(now);
 
+	/*
+	 * If jiffies update stalled for too long (timekeeper in stop_machine()
+	 * or VMEXIT'ed for several msecs), force an update.
+	 */
+	if (ts->last_tick_jiffies != jiffies) {
+		ts->stalled_jiffies = 0;
+		ts->last_tick_jiffies = READ_ONCE(jiffies);
+	} else {
+		if (++ts->stalled_jiffies == MAX_STALLED_JIFFIES) {
+			tick_do_update_jiffies64(now);
+			ts->stalled_jiffies = 0;
+			ts->last_tick_jiffies = READ_ONCE(jiffies);
+		}
+	}
+
 	if (ts->inidle)
 		ts->got_idle_tick = 1;
 }
diff --git a/kernel/time/tick-sched.h b/kernel/time/tick-sched.h
index d952ae393423..504649513399 100644
--- a/kernel/time/tick-sched.h
+++ b/kernel/time/tick-sched.h
@@ -49,6 +49,8 @@ enum tick_nohz_mode {
  * @timer_expires_base:	Base time clock monotonic for @timer_expires
  * @next_timer:		Expiry time of next expiring timer for debugging purpose only
  * @tick_dep_mask:	Tick dependency mask - is set, if someone needs the tick
+ * @last_tick_jiffies:	Value of jiffies seen on last tick
+ * @stalled_jiffies:	Number of stalled jiffies detected across ticks
  */
 struct tick_sched {
 	struct hrtimer			sched_timer;
@@ -77,6 +79,8 @@ struct tick_sched {
 	u64				next_timer;
 	ktime_t				idle_expires;
 	atomic_t			tick_dep_mask;
+	unsigned long			last_tick_jiffies;
+	unsigned int			stalled_jiffies;
 };
 
 extern struct tick_sched *tick_get_tick_sched(int cpu);
-- 
2.25.1
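
The stall check added to tick_sched_do_timer() above can be tried out in
userspace. What follows is only a minimal sketch under stated assumptions,
not kernel code: sim_jiffies, struct sim_tick_sched, sim_update_jiffies(),
sim_tick(), sim_ticks_until_catchup() and sim_detect_latency_ms() are
hypothetical names made up for illustration, and sim_update_jiffies() merely
stands in for tick_do_update_jiffies64(). It assumes the timekeeper stays
stalled for the whole run and that the ticking CPU had already recorded the
current jiffies value on an earlier tick.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_STALLED_JIFFIES 5

/* Hypothetical stand-in for the kernel's global jiffies counter. */
static unsigned long sim_jiffies;

/* Only the two fields the patch adds to struct tick_sched. */
struct sim_tick_sched {
	unsigned long last_tick_jiffies;
	unsigned int stalled_jiffies;
};

/* Hypothetical stand-in for tick_do_update_jiffies64(): catch up to 'now'. */
static void sim_update_jiffies(unsigned long now)
{
	if (now > sim_jiffies)
		sim_jiffies = now;
}

/*
 * One tick on a non-timekeeper CPU, mirroring the logic in the patch.
 * Returns 1 if this tick forced an emergency jiffies update, 0 otherwise.
 */
static int sim_tick(struct sim_tick_sched *ts, unsigned long now)
{
	assert(ts != NULL);

	if (ts->last_tick_jiffies != sim_jiffies) {
		/* jiffies moved since the last tick: no stall. */
		ts->stalled_jiffies = 0;
		ts->last_tick_jiffies = sim_jiffies;
		return 0;
	}

	if (++ts->stalled_jiffies == MAX_STALLED_JIFFIES) {
		/* Too many ticks saw the same jiffies value: force a catch-up. */
		sim_update_jiffies(now);
		ts->stalled_jiffies = 0;
		ts->last_tick_jiffies = sim_jiffies;
		return 1;
	}
	return 0;
}

/*
 * Run up to max_ticks ticks while the timekeeper never updates jiffies.
 * Returns the number of the tick that forced the catch-up, or -1 if none did.
 */
static int sim_ticks_until_catchup(unsigned long stall_start, int max_ticks)
{
	struct sim_tick_sched ts = {
		.last_tick_jiffies = stall_start,
		.stalled_jiffies = 0,
	};

	sim_jiffies = stall_start;
	for (int i = 1; i <= max_ticks; i++) {
		if (sim_tick(&ts, stall_start + (unsigned long)i))
			return i;
	}
	return -1;
}

/* Worst-case detection delay in milliseconds for a given tick rate (HZ). */
static unsigned int sim_detect_latency_ms(unsigned int hz)
{
	return MAX_STALLED_JIFFIES * 1000u / hz;
}
```

Under those assumptions, sim_ticks_until_catchup(1000, 20) returns 5 and
leaves sim_jiffies at 1005: the fifth consecutive tick that sees an unchanged
jiffies value triggers the forced update. With a threshold of 5 ticks,
sim_detect_latency_ms() gives 5 ms at HZ=1000 and 20 ms at HZ=250, which is
consistent with the "several msecs" wording in the code comment.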



* Re: [PATCH] tick: Detect and fix jiffies update stall
From: Paul E. McKenney @ 2022-02-02  1:49 UTC (permalink / raw)
  To: Frederic Weisbecker; +Cc: LKML, Thomas Gleixner

On Wed, Feb 02, 2022 at 01:01:07AM +0100, Frederic Weisbecker wrote:
> In some rare cases, the timekeeper CPU may delay its jiffies
> update duty for a while. Known causes include:
> 
> * The timekeeper is waiting on stop_machine in a MULTI_STOP_DISABLE_IRQ
>   or MULTI_STOP_RUN state. Disabled interrupts prevent timekeeping
>   updates while waiting for the target CPU to complete its
>   stop_machine() callback.
> 
> * The timekeeper vcpu has VMEXIT'ed for a long while due to some overload
>   on the host.
> 
> Detect and fix these situations with emergency timekeeping catchups.
> 
> Original-patch-by: Paul E. McKenney <paulmck@kernel.org>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>

Nice, thank you!

So I should revert your earlier patch, apply this one, and then test
the result?

							Thanx, Paul



* Re: [PATCH] tick: Detect and fix jiffies update stall
From: Frederic Weisbecker @ 2022-02-02  2:19 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: LKML, Thomas Gleixner

On Tue, Feb 01, 2022 at 05:49:34PM -0800, Paul E. McKenney wrote:
> On Wed, Feb 02, 2022 at 01:01:07AM +0100, Frederic Weisbecker wrote:
> > In some rare cases, the timekeeper CPU may delay its jiffies
> > update duty for a while. Known causes include:
> > 
> > * The timekeeper is waiting on stop_machine in a MULTI_STOP_DISABLE_IRQ
> >   or MULTI_STOP_RUN state. Disabled interrupts prevent timekeeping
> >   updates while waiting for the target CPU to complete its
> >   stop_machine() callback.
> > 
> > * The timekeeper vcpu has VMEXIT'ed for a long while due to some overload
> >   on the host.
> > 
> > Detect and fix these situations with emergency timekeeping catchups.
> > 
> > Original-patch-by: Paul E. McKenney <paulmck@kernel.org>
> > Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> 
> Nice, thank you!
> 
> So I should revert your earlier patch, apply this one, and then test
> the result?

No need to revert the nohz_full fix; this new one deals with non-dynticks
issues. This way we cover every timekeeper stall situation:

_ dynticks-idle is handled on IRQ entry

_ full dynticks is handled on IRQ entry in case of a timekeeping stall on
  CPU 0 (the traditional nohz_full timekeeper). Let's hope we won't need to
  handle syscalls and faults as well, but we'll see...

_ periodic ticks are now handled on the tick.

So you just need to apply this patch on your dev branch for testing.

Thanks!


* Re: [PATCH] tick: Detect and fix jiffies update stall
From: Paul E. McKenney @ 2022-02-02 17:24 UTC (permalink / raw)
  To: Frederic Weisbecker; +Cc: LKML, Thomas Gleixner

On Wed, Feb 02, 2022 at 03:19:51AM +0100, Frederic Weisbecker wrote:
> On Tue, Feb 01, 2022 at 05:49:34PM -0800, Paul E. McKenney wrote:
> > On Wed, Feb 02, 2022 at 01:01:07AM +0100, Frederic Weisbecker wrote:
> > > In some rare cases, the timekeeper CPU may delay its jiffies
> > > update duty for a while. Known causes include:
> > > 
> > > * The timekeeper is waiting on stop_machine in a MULTI_STOP_DISABLE_IRQ
> > >   or MULTI_STOP_RUN state. Disabled interrupts prevent timekeeping
> > >   updates while waiting for the target CPU to complete its
> > >   stop_machine() callback.
> > > 
> > > * The timekeeper vcpu has VMEXIT'ed for a long while due to some overload
> > >   on the host.
> > > 
> > > Detect and fix these situations with emergency timekeeping catchups.
> > > 
> > > Original-patch-by: Paul E. McKenney <paulmck@kernel.org>
> > > Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> > > Cc: Thomas Gleixner <tglx@linutronix.de>
> > 
> > Nice, thank you!
> > 
> > So I should revert your earlier patch, apply this one, and then test
> > the result?
> 
> No need to revert the nohz_full fix; this new one deals with non-dynticks
> issues. This way we cover every timekeeper stall situation:
> 
> _ dynticks-idle is handled on IRQ entry
> 
> _ full dynticks is handled on IRQ entry in case of a timekeeping stall on
>   CPU 0 (the traditional nohz_full timekeeper). Let's hope we won't need to
>   handle syscalls and faults as well, but we'll see...
> 
> _ periodic ticks are now handled on the tick.
> 
> So you just need to apply this patch on your dev branch for testing.

I have pulled it in, thank you!  I will beat on it.

I am guessing that this goes up some other path to mainline, so I have
marked it "EXP".

							Thanx, Paul

