All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2] clocksource: Warn if too many missing ticks are detected
@ 2018-09-19  1:19 Waiman Long
  2018-09-19  7:53 ` Thomas Gleixner
  0 siblings, 1 reply; 5+ messages in thread
From: Waiman Long @ 2018-09-19  1:19 UTC (permalink / raw)
  To: John Stultz, Thomas Gleixner
  Cc: linux-kernel, Stephen Boyd, Peter Zijlstra, Waiman Long

The clocksource watchdog, when running, is scheduled on all the CPUs in
the system sequentially on a round-robin fashion with a period of 0.5s.
A bug in the 4.18 kernel is causing missing ticks when nohz_full
is specified. Under some circumstances, this causes the watchdog to
incorrectly state that the TSC is unstable because of counter overflow
in the hpet watchdog clock source after a few minutes delay.

That particular bug is fixed by the 4.19 commit 7059b36636beab ("sched:
idle: Avoid retaining the tick when it has been stopped"). To make it
easier to catch this kind of bug in the future, a check is added to see
if there is too much delay in the invocation of the watchdog callback
and print a warning once if it happens.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/time/clocksource.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 0e6e97a..64b05ab 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -213,6 +213,18 @@ static void clocksource_watchdog(struct timer_list *unused)
 	if (!watchdog_running)
 		goto out;
 
+	/*
+	 * When the timer tick is incorrectly stopped on a CPU with
+	 * pending events, for example, it is possible that the
+	 * clocksource watchdog will stop running for a sufficiently
+	 * long enough time to cause overflow in the delta
+	 * computation leading to incorrect report of unstable clock
+	 * source. So print a warning if there is unusually large
+	 * delay (> 0.5s) in the invocation of the watchdog. That
+	 * can indicate a hidden bug in the timer tick code.
+	 */
+	WARN_ON_ONCE(jiffies - watchdog_timer.expires > WATCHDOG_INTERVAL);
+
 	reset_pending = atomic_read(&watchdog_reset_pending);
 
 	list_for_each_entry(cs, &watchdog_list, wd_list) {
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] clocksource: Warn if too many missing ticks are detected
  2018-09-19  1:19 [PATCH v2] clocksource: Warn if too many missing ticks are detected Waiman Long
@ 2018-09-19  7:53 ` Thomas Gleixner
  2018-09-19  9:17   ` Peter Zijlstra
  2018-09-24 15:35   ` Waiman Long
  0 siblings, 2 replies; 5+ messages in thread
From: Thomas Gleixner @ 2018-09-19  7:53 UTC (permalink / raw)
  To: Waiman Long; +Cc: John Stultz, linux-kernel, Stephen Boyd, Peter Zijlstra

On Tue, 18 Sep 2018, Waiman Long wrote:

> The clocksource watchdog, when running, is scheduled on all the CPUs in
> the system sequentially on a round-robin fashion with a period of 0.5s.
> A bug in the 4.18 kernel is causing missing ticks when nohz_full
> is specified. Under some circumstances, this causes the watchdog to
> incorrectly state that the TSC is unstable because of counter overflow
> in the hpet watchdog clock source after a few minutes delay.
> 
> That particular bug is fixed by the 4.19 commit 7059b36636beab ("sched:
> idle: Avoid retaining the tick when it has been stopped"). To make it
> easier to catch this kind of bug in the future, a check is added to see
> if there is too much delay in the invocation of the watchdog callback
> and print a warning once if it happens.

Second thoughts on this. Putting the check into the clocksource watchdog is
the wrong place as it's just checking at a place where the symptom
shows. What about putting it right to the source, i.e. in the timer wheel
as it does not depend on the clocksource watchdog being active. The
clocksource watchdog triggering is just one of the symptoms, but in general
timers being massively late is not a good thing.

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] clocksource: Warn if too many missing ticks are detected
  2018-09-19  7:53 ` Thomas Gleixner
@ 2018-09-19  9:17   ` Peter Zijlstra
  2018-09-19 16:32     ` Waiman Long
  2018-09-24 15:35   ` Waiman Long
  1 sibling, 1 reply; 5+ messages in thread
From: Peter Zijlstra @ 2018-09-19  9:17 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Waiman Long, John Stultz, linux-kernel, Stephen Boyd

On Wed, Sep 19, 2018 at 09:53:47AM +0200, Thomas Gleixner wrote:
> On Tue, 18 Sep 2018, Waiman Long wrote:
> 
> > The clocksource watchdog, when running, is scheduled on all the CPUs in
> > the system sequentially on a round-robin fashion with a period of 0.5s.
> > A bug in the 4.18 kernel is causing missing ticks when nohz_full
> > is specified. Under some circumstances, this causes the watchdog to
> > incorrectly state that the TSC is unstable because of counter overflow
> > in the hpet watchdog clock source after a few minutes delay.
> > 
> > That particular bug is fixed by the 4.19 commit 7059b36636beab ("sched:
> > idle: Avoid retaining the tick when it has been stopped"). To make it
> > easier to catch this kind of bug in the future, a check is added to see
> > if there is too much delay in the invocation of the watchdog callback
> > and print a warning once if it happens.
> 
> Second thoughts on this. Putting the check into the clocksource watchdog is
> the wrong place as it's just checking at a place where the symptom
> shows. What about putting it right to the source, i.e. in the timer wheel
> as it does not depend on the clocksource watchdog being active. The
> clocksource watchdog triggering is just one of the symptoms, but in general
> timers being massively late is not a good thing.

Just make sure to think of the virt case; virt can cause all kind of
'fun' lateness.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] clocksource: Warn if too many missing ticks are detected
  2018-09-19  9:17   ` Peter Zijlstra
@ 2018-09-19 16:32     ` Waiman Long
  0 siblings, 0 replies; 5+ messages in thread
From: Waiman Long @ 2018-09-19 16:32 UTC (permalink / raw)
  To: Peter Zijlstra, Thomas Gleixner; +Cc: John Stultz, linux-kernel, Stephen Boyd

On 09/19/2018 05:17 AM, Peter Zijlstra wrote:
> On Wed, Sep 19, 2018 at 09:53:47AM +0200, Thomas Gleixner wrote:
>> On Tue, 18 Sep 2018, Waiman Long wrote:
>>
>>> The clocksource watchdog, when running, is scheduled on all the CPUs in
>>> the system sequentially on a round-robin fashion with a period of 0.5s.
>>> A bug in the 4.18 kernel is causing missing ticks when nohz_full
>>> is specified. Under some circumstances, this causes the watchdog to
>>> incorrectly state that the TSC is unstable because of counter overflow
>>> in the hpet watchdog clock source after a few minutes delay.
>>>
>>> That particular bug is fixed by the 4.19 commit 7059b36636beab ("sched:
>>> idle: Avoid retaining the tick when it has been stopped"). To make it
>>> easier to catch this kind of bug in the future, a check is added to see
>>> if there is too much delay in the invocation of the watchdog callback
>>> and print a warning once if it happens.
>> Second thoughts on this. Putting the check into the clocksource watchdog is
>> the wrong place as it's just checking at a place where the symptom
>> shows. What about putting it right to the source, i.e. in the timer wheel
>> as it does not depend on the clocksource watchdog being active. The
>> clocksource watchdog triggering is just one of the symptoms, but in general
>> timers being massively late is not a good thing.
> Just make sure to think of the virt case; virt can cause all kind of
> 'fun' lateness.

So the question now is how much "lateness" is abnormal? My current patch
just use a 1/2 s threshold. Is that too small or too big?

Cheers,
Longman


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] clocksource: Warn if too many missing ticks are detected
  2018-09-19  7:53 ` Thomas Gleixner
  2018-09-19  9:17   ` Peter Zijlstra
@ 2018-09-24 15:35   ` Waiman Long
  1 sibling, 0 replies; 5+ messages in thread
From: Waiman Long @ 2018-09-24 15:35 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: John Stultz, linux-kernel, Stephen Boyd, Peter Zijlstra

On 09/19/2018 03:53 AM, Thomas Gleixner wrote:
> On Tue, 18 Sep 2018, Waiman Long wrote:
>
>> The clocksource watchdog, when running, is scheduled on all the CPUs in
>> the system sequentially on a round-robin fashion with a period of 0.5s.
>> A bug in the 4.18 kernel is causing missing ticks when nohz_full
>> is specified. Under some circumstances, this causes the watchdog to
>> incorrectly state that the TSC is unstable because of counter overflow
>> in the hpet watchdog clock source after a few minutes delay.
>>
>> That particular bug is fixed by the 4.19 commit 7059b36636beab ("sched:
>> idle: Avoid retaining the tick when it has been stopped"). To make it
>> easier to catch this kind of bug in the future, a check is added to see
>> if there is too much delay in the invocation of the watchdog callback
>> and print a warning once if it happens.
> Second thoughts on this. Putting the check into the clocksource watchdog is
> the wrong place as it's just checking at a place where the symptom
> shows. What about putting it right to the source, i.e. in the timer wheel
> as it does not depend on the clocksource watchdog being active. The
> clocksource watchdog triggering is just one of the symptoms, but in general
> timers being massively late is not a good thing.
>
> Thanks,
>
> 	tglx
>
It looks like there are more complications to deal with in order to do
the proper check. I will leave the check in the watchdog for the time
being and see how thing works out.

-Longman


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2018-09-24 15:35 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-19  1:19 [PATCH v2] clocksource: Warn if too many missing ticks are detected Waiman Long
2018-09-19  7:53 ` Thomas Gleixner
2018-09-19  9:17   ` Peter Zijlstra
2018-09-19 16:32     ` Waiman Long
2018-09-24 15:35   ` Waiman Long

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.