linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH] clocksource: skip check while watchdog hung up or unstable
@ 2021-08-11  9:55 brookxu
  2021-08-11 12:44 ` Thomas Gleixner
  0 siblings, 1 reply; 7+ messages in thread
From: brookxu @ 2021-08-11  9:55 UTC (permalink / raw)
  To: john.stultz, tglx, sboyd; +Cc: linux-kernel

From: Chunguang Xu <brookxu@tencent.com>

After patch 1f45f1f3 (clocksource: Make clocksource validation work
for all clocksources), md_nsec may be 0 in some scenarios, such as
the watchdog is delayed for a long time or the watchdog has a
time-warp.

We found a problem when testing nvme disks with fio, when multiple
queue interrupts of a disk were mapped to a single CPU. IO interrupt
processing will cause the watchdog to be delayed for a long time
(155 seconds), the system reports TSC unstable and switches the clock
to hpet. It seems that this scenario cannot be handled by optimizing
softirq. Therefore, when md_nsec returns 0, the machine or watchdog
should be in unstable state,the verification result not unreliable.
Is it possible for us to skip the current check at this time?
1. If the watchdog is delayed because the system is busy, and the
   clocksource is switched to hpet due to a wrong judgment, the
   performance degradation may directly cause the machine to be
   unavailable and cause more problems.
2. If watchdog has time-warp, we should not rely on hpet to directly
   mark TSC as unstable.

Later we register watchdog to other CPU, if other CPU is not busy, we
can also check the stability of TSC.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
---
 kernel/time/clocksource.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index b89c76e1c02c..9b9014d67f1d 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -399,6 +399,13 @@ static void clocksource_watchdog(struct timer_list *unused)
 		cs->cs_last = csnow;
 		cs->wd_last = wdnow;
 
+		if (!wd_nsec) {
+			pr_warn("timekeeping watchdog on CPU%d seems hung up or unstable:");
+			pr_warn("'%s' wd_now: %llx wd_last: %llx mask: %llx\n",
+				watchdog->name, wdnow, wdlast, watchdog->mask);
+			continue;
+		}
+
 		if (atomic_read(&watchdog_reset_pending))
 			continue;
 
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH] clocksource: skip check while watchdog hung up or unstable
  2021-08-11  9:55 [RFC PATCH] clocksource: skip check while watchdog hung up or unstable brookxu
@ 2021-08-11 12:44 ` Thomas Gleixner
  2021-08-11 13:18   ` brookxu
  0 siblings, 1 reply; 7+ messages in thread
From: Thomas Gleixner @ 2021-08-11 12:44 UTC (permalink / raw)
  To: brookxu, john.stultz, sboyd; +Cc: linux-kernel

On Wed, Aug 11 2021 at 17:55, brookxu wrote:
> From: Chunguang Xu <brookxu@tencent.com>
>
> After patch 1f45f1f3 (clocksource: Make clocksource validation work
> for all clocksources), md_nsec may be 0 in some scenarios, such as
> the watchdog is delayed for a long time or the watchdog has a
> time-warp.

Maybe 0? There is exactly one single possibility for it to be zero:

  cs->wd_last == wdnow, i.e. delta = 0 -> wd_nsec = 0

So how does that condition solve any long delay or wrap around of the
watchdog? It's more than unlikely to hit exactly this case where the
readout is identical to the previous readout unless the watchdog stopped
counting.

> We found a problem when testing nvme disks with fio, when multiple
> queue interrupts of a disk were mapped to a single CPU. IO interrupt
> processing will cause the watchdog to be delayed for a long time
> (155 seconds), the system reports TSC unstable and switches the clock

If you hold off the softirq from running for 155 seconds then the TSC
watchdog is the least of your problems.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH] clocksource: skip check while watchdog hung up or unstable
  2021-08-11 12:44 ` Thomas Gleixner
@ 2021-08-11 13:18   ` brookxu
  2021-08-11 14:01     ` Thomas Gleixner
  0 siblings, 1 reply; 7+ messages in thread
From: brookxu @ 2021-08-11 13:18 UTC (permalink / raw)
  To: Thomas Gleixner, john.stultz, sboyd; +Cc: linux-kernel


Thanks for your time.

Thomas Gleixner wrote on 2021/8/11 8:44 下午:
> On Wed, Aug 11 2021 at 17:55, brookxu wrote:
>> From: Chunguang Xu <brookxu@tencent.com>
>>
>> After patch 1f45f1f3 (clocksource: Make clocksource validation work
>> for all clocksources), md_nsec may be 0 in some scenarios, such as
>> the watchdog is delayed for a long time or the watchdog has a
>> time-warp.
> 
> Maybe 0? There is exactly one single possibility for it to be zero:
> 
>   cs->wd_last == wdnow, i.e. delta = 0 -> wd_nsec = 0
> 
> So how does that condition solve any long delay or wrap around of the
> watchdog? It's more than unlikely to hit exactly this case where the
> readout is identical to the previous readout unless the watchdog stopped
> counting.

Maybe I missed something. Like this example, when watchdog run ,hpet have
wrap around:

'hpet' wd_now: d76e5a69 wd_last: f929eb3c mask: ffffffff

We can calculate the number of elapsed cycles:
cycles = wd_now - wd_last = 0xde446f2d

clocksource_delta() uses the MSB to determine an invalid inteval and returns
0, but for 0xde446f2d, this judgment should be wrong.


>> We found a problem when testing nvme disks with fio, when multiple
>> queue interrupts of a disk were mapped to a single CPU. IO interrupt
>> processing will cause the watchdog to be delayed for a long time
>> (155 seconds), the system reports TSC unstable and switches the clock
> 
> If you hold off the softirq from running for 155 seconds then the TSC
> watchdog is the least of your problems.

To be precise, we are processing interrupts in handle_edge_irq() for a long
time. Since the interrupts of multiple hardware queues are mapped to a single
CPU, multiple cores are continuously issuing IO, and then a single core is
processing IO. Perhaps the test case can be optimized, but shouldn't this lead
to switching clocks in principle?

> Thanks,
> 
>         tglx
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH] clocksource: skip check while watchdog hung up or unstable
  2021-08-11 13:18   ` brookxu
@ 2021-08-11 14:01     ` Thomas Gleixner
  2021-08-11 15:26       ` brookxu
  0 siblings, 1 reply; 7+ messages in thread
From: Thomas Gleixner @ 2021-08-11 14:01 UTC (permalink / raw)
  To: brookxu, john.stultz, sboyd; +Cc: linux-kernel

On Wed, Aug 11 2021 at 21:18, brookxu wrote:
> Thomas Gleixner wrote on 2021/8/11 8:44 下午:
>> On Wed, Aug 11 2021 at 17:55, brookxu wrote:
>>> From: Chunguang Xu <brookxu@tencent.com>
>>>
>>> After patch 1f45f1f3 (clocksource: Make clocksource validation work
>>> for all clocksources), md_nsec may be 0 in some scenarios, such as
>>> the watchdog is delayed for a long time or the watchdog has a
>>> time-warp.
>> 
>> Maybe 0? There is exactly one single possibility for it to be zero:
>> 
>>   cs->wd_last == wdnow, i.e. delta = 0 -> wd_nsec = 0
>> 
>> So how does that condition solve any long delay or wrap around of the
>> watchdog? It's more than unlikely to hit exactly this case where the
>> readout is identical to the previous readout unless the watchdog stopped
>> counting.
>
> Maybe I missed something. Like this example, when watchdog run ,hpet have
> wrap around:
>
> 'hpet' wd_now: d76e5a69 wd_last: f929eb3c mask: ffffffff
>
> We can calculate the number of elapsed cycles:
> cycles = wd_now - wd_last = 0xde446f2d
>
> clocksource_delta() uses the MSB to determine an invalid inteval and returns
> 0, but for 0xde446f2d, this judgment should be wrong.

You're right. I forgot about the MSB check which is enabled on x86.

>>> We found a problem when testing nvme disks with fio, when multiple
>>> queue interrupts of a disk were mapped to a single CPU. IO interrupt
>>> processing will cause the watchdog to be delayed for a long time
>>> (155 seconds), the system reports TSC unstable and switches the clock
>> 
>> If you hold off the softirq from running for 155 seconds then the TSC
>> watchdog is the least of your problems.
>
> To be precise, we are processing interrupts in handle_edge_irq() for a long
> time. Since the interrupts of multiple hardware queues are mapped to a single
> CPU, multiple cores are continuously issuing IO, and then a single core is
> processing IO. Perhaps the test case can be optimized, but shouldn't this lead
> to switching clocks in principle?

The clocksource watchdog failure is only _ONE_ consequence. Processing
hard interrupts for 155 seconds straight will trigger lockup detectors
of all sorts if you have them enabled.

So just papering over the clocksource watchdog does not solve anything,
really. Next week you have to add similar hacks to the lockup detectors,
RCU and whatever.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH] clocksource: skip check while watchdog hung up or unstable
  2021-08-11 14:01     ` Thomas Gleixner
@ 2021-08-11 15:26       ` brookxu
  2021-08-12 10:53         ` Thomas Gleixner
  0 siblings, 1 reply; 7+ messages in thread
From: brookxu @ 2021-08-11 15:26 UTC (permalink / raw)
  To: Thomas Gleixner, john.stultz, sboyd; +Cc: linux-kernel

Thanks for your time.

Thomas Gleixner wrote on 2021/8/11 22:01:
> On Wed, Aug 11 2021 at 21:18, brookxu wrote:
>> Thomas Gleixner wrote on 2021/8/11 8:44 下午:
>>> On Wed, Aug 11 2021 at 17:55, brookxu wrote:
>>>> From: Chunguang Xu <brookxu@tencent.com>
>>>>
>>>> After patch 1f45f1f3 (clocksource: Make clocksource validation work
>>>> for all clocksources), md_nsec may be 0 in some scenarios, such as
>>>> the watchdog is delayed for a long time or the watchdog has a
>>>> time-warp.
>>>
>>> Maybe 0? There is exactly one single possibility for it to be zero:
>>>
>>>   cs->wd_last == wdnow, i.e. delta = 0 -> wd_nsec = 0
>>>
>>> So how does that condition solve any long delay or wrap around of the
>>> watchdog? It's more than unlikely to hit exactly this case where the
>>> readout is identical to the previous readout unless the watchdog stopped
>>> counting.
>>
>> Maybe I missed something. Like this example, when watchdog run ,hpet have
>> wrap around:
>>
>> 'hpet' wd_now: d76e5a69 wd_last: f929eb3c mask: ffffffff
>>
>> We can calculate the number of elapsed cycles:
>> cycles = wd_now - wd_last = 0xde446f2d
>>
>> clocksource_delta() uses the MSB to determine an invalid inteval and returns
>> 0, but for 0xde446f2d, this judgment should be wrong.
> 
> You're right. I forgot about the MSB check which is enabled on x86.
> 
>>>> We found a problem when testing nvme disks with fio, when multiple
>>>> queue interrupts of a disk were mapped to a single CPU. IO interrupt
>>>> processing will cause the watchdog to be delayed for a long time
>>>> (155 seconds), the system reports TSC unstable and switches the clock
>>>
>>> If you hold off the softirq from running for 155 seconds then the TSC
>>> watchdog is the least of your problems.
>>
>> To be precise, we are processing interrupts in handle_edge_irq() for a long
>> time. Since the interrupts of multiple hardware queues are mapped to a single
>> CPU, multiple cores are continuously issuing IO, and then a single core is
>> processing IO. Perhaps the test case can be optimized, but shouldn't this lead
>> to switching clocks in principle?
> 
> The clocksource watchdog failure is only _ONE_ consequence. Processing
> hard interrupts for 155 seconds straight will trigger lockup detectors
> of all sorts if you have them enabled.
> 
> So just papering over the clocksource watchdog does not solve anything,
> really. Next week you have to add similar hacks to the lockup detectors,
> RCU and whatever.

Yeah, we have observed soft lockup and RCU stall, but these behaviors are
expected because the current CPU scheduling is disabled. However, marking
TSC unstable is inconsistent with the actual situation. The worst problem
is that after the clocksource switched to hpet, the abnormal time will be
greatly prolonged due to the degradation of performance. We have not found
that soft lockup and RCU stall will affect the machine for a long time in
this test. Aside from these, as the watchdog is scheduled periodically, when
wd_nsec is 0, it means that something maybe abnormal, do we readlly still
need to continue to verify TSC? and how to ensure the correctness of the
results?

> Thanks,
> 
>         tglx
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH] clocksource: skip check while watchdog hung up or unstable
  2021-08-11 15:26       ` brookxu
@ 2021-08-12 10:53         ` Thomas Gleixner
  2021-08-13  0:54           ` brookxu
  0 siblings, 1 reply; 7+ messages in thread
From: Thomas Gleixner @ 2021-08-12 10:53 UTC (permalink / raw)
  To: brookxu, john.stultz, sboyd; +Cc: linux-kernel

On Wed, Aug 11 2021 at 23:26, brookxu wrote:
> Thomas Gleixner wrote on 2021/8/11 22:01:
>>> To be precise, we are processing interrupts in handle_edge_irq() for a long
>>> time. Since the interrupts of multiple hardware queues are mapped to a single
>>> CPU, multiple cores are continuously issuing IO, and then a single core is
>>> processing IO. Perhaps the test case can be optimized, but shouldn't this lead
>>> to switching clocks in principle?
>> 
>> The clocksource watchdog failure is only _ONE_ consequence. Processing
>> hard interrupts for 155 seconds straight will trigger lockup detectors
>> of all sorts if you have them enabled.
>> 
>> So just papering over the clocksource watchdog does not solve anything,
>> really. Next week you have to add similar hacks to the lockup detectors,
>> RCU and whatever.
>
> Yeah, we have observed soft lockup and RCU stall, but these behaviors are
> expected because the current CPU scheduling is disabled. However, marking
> TSC unstable is inconsistent with the actual situation. The worst problem
> is that after the clocksource switched to hpet, the abnormal time will be
> greatly prolonged due to the degradation of performance. We have not found
> that soft lockup and RCU stall will affect the machine for a long time in
> this test. Aside from these, as the watchdog is scheduled periodically, when
> wd_nsec is 0, it means that something maybe abnormal, do we readlly still
> need to continue to verify TSC? and how to ensure the correctness of the
> results?

Sorry no. While softlockups and RCU stalls might have no long term
effect in the first place, this argumentation vs. the clocksource
watchdog is just a strawman. You're abusing the system in a way which
causes it to malfunction so you have to live with the consequences.

Aside of that this 'workaround' is just duct taping a particular part of
the problem. What guarantees that after the interrupt storm subsided the
clocksource delta of the watchdog becomes 0 (negative)?

Absolutely nothing. The delta can be positive, but then the watchdog and
the TSC are not in sync anymore which will disable the TSC as well.

A 24MHz HPET has a wraparound time of ~178s which means during:

  89s < tdelta < 178s

your hack papers over the problem. Any interrupt storm time outside of
that window results in fail.

Now run the same test on a machine with a 14MHz HPET and you get

 153s < tdelta < 306s

so your 155s interrupt storm barely fits. And what are you doing with
your next test which runs only 80 seconds?

Not to talk about the fact that you wreckage detection of a watchdog
clocksource going stale.

So no, we are not adding hacks to support abuse.

What we really want to do is to add detection for interrupt storms of
this sort and shut those interrupts down for good.

Thanks,

        tglx
---
Patient: "Doctor, it hurts when I hammer on my toe."
Doctor:  "Don't do that then!"



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH] clocksource: skip check while watchdog hung up or unstable
  2021-08-12 10:53         ` Thomas Gleixner
@ 2021-08-13  0:54           ` brookxu
  0 siblings, 0 replies; 7+ messages in thread
From: brookxu @ 2021-08-13  0:54 UTC (permalink / raw)
  To: Thomas Gleixner, john.stultz, sboyd; +Cc: linux-kernel



Thomas Gleixner wrote on 2021/8/12 6:53 下午:
> On Wed, Aug 11 2021 at 23:26, brookxu wrote:
>> Thomas Gleixner wrote on 2021/8/11 22:01:
>>>> To be precise, we are processing interrupts in handle_edge_irq() for a long
>>>> time. Since the interrupts of multiple hardware queues are mapped to a single
>>>> CPU, multiple cores are continuously issuing IO, and then a single core is
>>>> processing IO. Perhaps the test case can be optimized, but shouldn't this lead
>>>> to switching clocks in principle?
>>>
>>> The clocksource watchdog failure is only _ONE_ consequence. Processing
>>> hard interrupts for 155 seconds straight will trigger lockup detectors
>>> of all sorts if you have them enabled.
>>>
>>> So just papering over the clocksource watchdog does not solve anything,
>>> really. Next week you have to add similar hacks to the lockup detectors,
>>> RCU and whatever.
>>
>> Yeah, we have observed soft lockup and RCU stall, but these behaviors are
>> expected because the current CPU scheduling is disabled. However, marking
>> TSC unstable is inconsistent with the actual situation. The worst problem
>> is that after the clocksource switched to hpet, the abnormal time will be
>> greatly prolonged due to the degradation of performance. We have not found
>> that soft lockup and RCU stall will affect the machine for a long time in
>> this test. Aside from these, as the watchdog is scheduled periodically, when
>> wd_nsec is 0, it means that something maybe abnormal, do we readlly still
>> need to continue to verify TSC? and how to ensure the correctness of the
>> results?
> 
> Sorry no. While softlockups and RCU stalls might have no long term
> effect in the first place, this argumentation vs. the clocksource
> watchdog is just a strawman. You're abusing the system in a way which
> causes it to malfunction so you have to live with the consequences.
> 
> Aside of that this 'workaround' is just duct taping a particular part of
> the problem. What guarantees that after the interrupt storm subsided the
> clocksource delta of the watchdog becomes 0 (negative)?
> 
> Absolutely nothing. The delta can be positive, but then the watchdog and
> the TSC are not in sync anymore which will disable the TSC as well.
> 
> A 24MHz HPET has a wraparound time of ~178s which means during:
> 
>   89s < tdelta < 178s
> 
> your hack papers over the problem. Any interrupt storm time outside of
> that window results in fail.
> 
> Now run the same test on a machine with a 14MHz HPET and you get
> 
>  153s < tdelta < 306s
> 
> so your 155s interrupt storm barely fits. And what are you doing with
> your next test which runs only 80 seconds?
> 
> Not to talk about the fact that you wreckage detection of a watchdog
> clocksource going stale.
> 
> So no, we are not adding hacks to support abuse.
> 
> What we really want to do is to add detection for interrupt storms of
> this sort and shut those interrupts down for good.

ok, thanks for your suggestion.

> Thanks,
> 
>         tglx
> ---
> Patient: "Doctor, it hurts when I hammer on my toe."
> Doctor:  "Don't do that then!"
> 
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2021-08-13  0:57 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-11  9:55 [RFC PATCH] clocksource: skip check while watchdog hung up or unstable brookxu
2021-08-11 12:44 ` Thomas Gleixner
2021-08-11 13:18   ` brookxu
2021-08-11 14:01     ` Thomas Gleixner
2021-08-11 15:26       ` brookxu
2021-08-12 10:53         ` Thomas Gleixner
2021-08-13  0:54           ` brookxu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).