* [PATCH] Clocksource: Avoid misjudgment of clocksource
@ 2021-10-08  8:03 yanghui
  2021-10-08 23:45 ` John Stultz
  0 siblings, 1 reply; 29+ messages in thread
From: yanghui @ 2021-10-08  8:03 UTC (permalink / raw)
  To: John.stultz; +Cc: tglx, sboyd, linux-kernel, yanghui

clocksource_watchdog() is executed every WATCHDOG_INTERVAL (0.5s) by a
timer. But sometimes the system is very busy and the timer cannot run
every 0.5s. For example, if clocksource_watchdog() is executed only
after 10s, the calculated value of abs(cs_nsec - wd_nsec) will be
enlarged, and the current clocksource will be misjudged as unstable.
So add conditions to prevent the clocksource from being misjudged.

Signed-off-by: yanghui <yanghui.def@bytedance.com>
---
 kernel/time/clocksource.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index b8a14d2fb5ba..d535beadcbc8 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -136,8 +136,10 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating);
 
 /*
  * Interval: 0.5sec.
+ * MaxInterval: 1s.
  */
 #define WATCHDOG_INTERVAL (HZ >> 1)
+#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
 
 static void clocksource_watchdog_work(struct work_struct *work)
 {
@@ -404,7 +406,9 @@ static void clocksource_watchdog(struct timer_list *unused)
 
 		/* Check the deviation from the watchdog clocksource. */
 		md = cs->uncertainty_margin + watchdog->uncertainty_margin;
-		if (abs(cs_nsec - wd_nsec) > md) {
+		if ((abs(cs_nsec - wd_nsec) > md) &&
+			cs_nsec < WATCHDOG_MAX_INTERVAL_NS &&
+			wd_nsec < WATCHDOG_MAX_INTERVAL_NS) {
 			pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n",
 				smp_processor_id(), cs->name);
 			pr_warn("                      '%s' wd_nsec: %lld wd_now: %llx wd_last: %llx mask: %llx\n",
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH] Clocksource: Avoid misjudgment of clocksource
  2021-10-08  8:03 [PATCH] Clocksource: Avoid misjudgment of clocksource yanghui
@ 2021-10-08 23:45 ` John Stultz
  2021-10-09  3:22   ` [External] " yanghui
                     ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: John Stultz @ 2021-10-08 23:45 UTC (permalink / raw)
  To: yanghui; +Cc: Thomas Gleixner, Stephen Boyd, lkml

On Fri, Oct 8, 2021 at 1:03 AM yanghui <yanghui.def@bytedance.com> wrote:
>
> clocksource_watchdog is executed every WATCHDOG_INTERVAL(0.5s) by
> Timer. But sometimes system is very busy and the Timer cannot be
> executed in 0.5sec. For example,if clocksource_watchdog be executed
> after 10sec, the calculated value of abs(cs_nsec - wd_nsec) will
> be enlarged. Then the current clocksource will be misjudged as
> unstable. So we add conditions to prevent the clocksource from
> being misjudged.
>
> Signed-off-by: yanghui <yanghui.def@bytedance.com>
> ---
>  kernel/time/clocksource.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index b8a14d2fb5ba..d535beadcbc8 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -136,8 +136,10 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating);
>
>  /*
>   * Interval: 0.5sec.
> + * MaxInterval: 1s.
>   */
>  #define WATCHDOG_INTERVAL (HZ >> 1)
> +#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
>
>  static void clocksource_watchdog_work(struct work_struct *work)
>  {
> @@ -404,7 +406,9 @@ static void clocksource_watchdog(struct timer_list *unused)
>
>                 /* Check the deviation from the watchdog clocksource. */
>                 md = cs->uncertainty_margin + watchdog->uncertainty_margin;
> -               if (abs(cs_nsec - wd_nsec) > md) {
> +               if ((abs(cs_nsec - wd_nsec) > md) &&
> +                       cs_nsec < WATCHDOG_MAX_INTERVAL_NS &&

Sorry, it's been awhile since I looked at this code, but why are you
bounding the clocksource delta here?
It seems like if the clocksource being watched was very wrong (with a
delta larger than the MAX_INTERVAL_NS), we'd want to throw it out.

> +                       wd_nsec < WATCHDOG_MAX_INTERVAL_NS) {

Bounding the watchdog interval on the check does seem reasonable.
Though one may want to keep track that if we are seeing too many of
these delayed watchdog checks we provide some feedback via dmesg.

thanks
-john

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [External] Re: [PATCH] Clocksource: Avoid misjudgment of clocksource
  2021-10-08 23:45 ` John Stultz
@ 2021-10-09  3:22   ` yanghui
  2021-10-09  3:38     ` John Stultz
  2021-10-09 14:04   ` brookxu
  2021-12-09 13:14   ` Gang Li
  2 siblings, 1 reply; 29+ messages in thread
From: yanghui @ 2021-10-09  3:22 UTC (permalink / raw)
  To: John Stultz; +Cc: Thomas Gleixner, Stephen Boyd, lkml



On 2021/10/9 7:45 AM, John Stultz wrote:
> On Fri, Oct 8, 2021 at 1:03 AM yanghui <yanghui.def@bytedance.com> wrote:
>>
>> clocksource_watchdog is executed every WATCHDOG_INTERVAL(0.5s) by
>> Timer. But sometimes system is very busy and the Timer cannot be
>> executed in 0.5sec. For example,if clocksource_watchdog be executed
>> after 10sec, the calculated value of abs(cs_nsec - wd_nsec) will
>> be enlarged. Then the current clocksource will be misjudged as
>> unstable. So we add conditions to prevent the clocksource from
>> being misjudged.
>>
>> Signed-off-by: yanghui <yanghui.def@bytedance.com>
>> ---
>>   kernel/time/clocksource.c | 6 +++++-
>>   1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
>> index b8a14d2fb5ba..d535beadcbc8 100644
>> --- a/kernel/time/clocksource.c
>> +++ b/kernel/time/clocksource.c
>> @@ -136,8 +136,10 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating);
>>
>>   /*
>>    * Interval: 0.5sec.
>> + * MaxInterval: 1s.
>>    */
>>   #define WATCHDOG_INTERVAL (HZ >> 1)
>> +#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
>>
>>   static void clocksource_watchdog_work(struct work_struct *work)
>>   {
>> @@ -404,7 +406,9 @@ static void clocksource_watchdog(struct timer_list *unused)
>>
>>                  /* Check the deviation from the watchdog clocksource. */
>>                  md = cs->uncertainty_margin + watchdog->uncertainty_margin;
>> -               if (abs(cs_nsec - wd_nsec) > md) {
>> +               if ((abs(cs_nsec - wd_nsec) > md) &&
>> +                       cs_nsec < WATCHDOG_MAX_INTERVAL_NS &&
> 
> Sorry, it's been awhile since I looked at this code, but why are you
> bounding the clocksource delta here?
> It seems like if the clocksource being watched was very wrong (with a
> delta larger than the MAX_INTERVAL_NS), we'd want to throw it out.
> 
>> +                       wd_nsec < WATCHDOG_MAX_INTERVAL_NS) {
> 
> Bounding the watchdog interval on the check does seem reasonable.
> Though one may want to keep track that if we are seeing too many of
> these delayed watchdog checks we provide some feedback via dmesg.

   Yes, checking only the watchdog delta is more reasonable.
   I think a dmesg warning alone is not enough, because if the TSC is
   misjudged as unstable, the kernel switches to the HPET. The HPET is
   very expensive in terms of performance, so the only way to switch
   back to the TSC is to reboot the server. We need to prevent the
   clocksource switch in case of misjudgment.
   How the misjudgment happens:
   if clocksource_watchdog() runs only after 10s, the values of wd_delta
   and cs_delta are also about 10s, so the value of (cs_nsec - wd_nsec)
   is magnified 20 times (10s / 0.5s). The delta value is magnified, but
   the clocksource is accurate; it is the timer that is late. So we
   misjudge the clocksource. We need to avoid this.

thanks
-Hui
> 
> thanks
> -john
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [External] Re: [PATCH] Clocksource: Avoid misjudgment of clocksource
  2021-10-09  3:22   ` [External] " yanghui
@ 2021-10-09  3:38     ` John Stultz
  2021-10-09  9:02       ` yanghui
  0 siblings, 1 reply; 29+ messages in thread
From: John Stultz @ 2021-10-09  3:38 UTC (permalink / raw)
  To: yanghui; +Cc: Thomas Gleixner, Stephen Boyd, lkml

On Fri, Oct 8, 2021 at 8:22 PM yanghui <yanghui.def@bytedance.com> wrote:
> On 2021/10/9 7:45 AM, John Stultz wrote:
> > On Fri, Oct 8, 2021 at 1:03 AM yanghui <yanghui.def@bytedance.com> wrote:
> >>
> >> clocksource_watchdog is executed every WATCHDOG_INTERVAL(0.5s) by
> >> Timer. But sometimes system is very busy and the Timer cannot be
> >> executed in 0.5sec. For example,if clocksource_watchdog be executed
> >> after 10sec, the calculated value of abs(cs_nsec - wd_nsec) will
> >> be enlarged. Then the current clocksource will be misjudged as
> >> unstable. So we add conditions to prevent the clocksource from
> >> being misjudged.
> >>
> >> Signed-off-by: yanghui <yanghui.def@bytedance.com>
> >> ---
> >>   kernel/time/clocksource.c | 6 +++++-
> >>   1 file changed, 5 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> >> index b8a14d2fb5ba..d535beadcbc8 100644
> >> --- a/kernel/time/clocksource.c
> >> +++ b/kernel/time/clocksource.c
> >> @@ -136,8 +136,10 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating);
> >>
> >>   /*
> >>    * Interval: 0.5sec.
> >> + * MaxInterval: 1s.
> >>    */
> >>   #define WATCHDOG_INTERVAL (HZ >> 1)
> >> +#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
> >>
> >>   static void clocksource_watchdog_work(struct work_struct *work)
> >>   {
> >> @@ -404,7 +406,9 @@ static void clocksource_watchdog(struct timer_list *unused)
> >>
> >>                  /* Check the deviation from the watchdog clocksource. */
> >>                  md = cs->uncertainty_margin + watchdog->uncertainty_margin;
> >> -               if (abs(cs_nsec - wd_nsec) > md) {
> >> +               if ((abs(cs_nsec - wd_nsec) > md) &&
> >> +                       cs_nsec < WATCHDOG_MAX_INTERVAL_NS &&
> >
> > Sorry, it's been awhile since I looked at this code, but why are you
> > bounding the clocksource delta here?
> > It seems like if the clocksource being watched was very wrong (with a
> > delta larger than the MAX_INTERVAL_NS), we'd want to throw it out.
> >
> >> +                       wd_nsec < WATCHDOG_MAX_INTERVAL_NS) {
> >
> > Bounding the watchdog interval on the check does seem reasonable.
> > Though one may want to keep track that if we are seeing too many of
> > these delayed watchdog checks we provide some feedback via dmesg.
>
>    Yes, only to check watchdog delta is more reasonable.
>    I think Only have dmesg is not enough, because if tsc was be misjudged
>    as unstable then switch to hpet. And hpet is very expensive for
>    performance, so if we want to switch to tsc the only way is to reboot
>    the server. We need to prevent the switching of the clock source in
>    case of misjudgment.
>    Circumstances of misjudgment:
>    if clocksource_watchdog is executed after 10sec, the value of wd_delta
>    and cs_delta also be about 10sec, also the value of (cs_nsec- wd_nsec)
>    will be magnified 20 times(10sec/0.5sec).The delta value is magnified.

Yea, it might be worth calculating an error rate instead of assuming
the interval is fixed, but also just skipping the check may be
reasonable assuming timers aren't constantly being delayed (and it's
more of a transient state).

At some point if the watchdog timer is delayed too much, the watchdog
hardware will fully wrap and one can no longer properly compare
intervals. That's why the timer length is chosen as such, so having
that timer delayed is really pushing the system into a potentially bad
state where other subtle problems are likely to crop up.

So I do worry these watchdog robustness fixes are papering over a
problem, pushing expectations closer to the edge of how far the system
should tolerate bad behavior. Because at some point we'll fall off. :)

thanks
-john

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [External] Re: [PATCH] Clocksource: Avoid misjudgment of clocksource
  2021-10-09  3:38     ` John Stultz
@ 2021-10-09  9:02       ` yanghui
  2021-10-12  5:02         ` John Stultz
  0 siblings, 1 reply; 29+ messages in thread
From: yanghui @ 2021-10-09  9:02 UTC (permalink / raw)
  To: John Stultz; +Cc: Thomas Gleixner, Stephen Boyd, lkml



On 2021/10/9 11:38 AM, John Stultz wrote:
> On Fri, Oct 8, 2021 at 8:22 PM yanghui <yanghui.def@bytedance.com> wrote:
>> On 2021/10/9 7:45 AM, John Stultz wrote:
>>> On Fri, Oct 8, 2021 at 1:03 AM yanghui <yanghui.def@bytedance.com> wrote:
>>>>
>>>> clocksource_watchdog is executed every WATCHDOG_INTERVAL(0.5s) by
>>>> Timer. But sometimes system is very busy and the Timer cannot be
>>>> executed in 0.5sec. For example,if clocksource_watchdog be executed
>>>> after 10sec, the calculated value of abs(cs_nsec - wd_nsec) will
>>>> be enlarged. Then the current clocksource will be misjudged as
>>>> unstable. So we add conditions to prevent the clocksource from
>>>> being misjudged.
>>>>
>>>> Signed-off-by: yanghui <yanghui.def@bytedance.com>
>>>> ---
>>>>    kernel/time/clocksource.c | 6 +++++-
>>>>    1 file changed, 5 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
>>>> index b8a14d2fb5ba..d535beadcbc8 100644
>>>> --- a/kernel/time/clocksource.c
>>>> +++ b/kernel/time/clocksource.c
>>>> @@ -136,8 +136,10 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating);
>>>>
>>>>    /*
>>>>     * Interval: 0.5sec.
>>>> + * MaxInterval: 1s.
>>>>     */
>>>>    #define WATCHDOG_INTERVAL (HZ >> 1)
>>>> +#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
>>>>
>>>>    static void clocksource_watchdog_work(struct work_struct *work)
>>>>    {
>>>> @@ -404,7 +406,9 @@ static void clocksource_watchdog(struct timer_list *unused)
>>>>
>>>>                   /* Check the deviation from the watchdog clocksource. */
>>>>                   md = cs->uncertainty_margin + watchdog->uncertainty_margin;
>>>> -               if (abs(cs_nsec - wd_nsec) > md) {
>>>> +               if ((abs(cs_nsec - wd_nsec) > md) &&
>>>> +                       cs_nsec < WATCHDOG_MAX_INTERVAL_NS &&
>>>
>>> Sorry, it's been awhile since I looked at this code, but why are you
>>> bounding the clocksource delta here?
>>> It seems like if the clocksource being watched was very wrong (with a
>>> delta larger than the MAX_INTERVAL_NS), we'd want to throw it out.
>>>
>>>> +                       wd_nsec < WATCHDOG_MAX_INTERVAL_NS) {
>>>
>>> Bounding the watchdog interval on the check does seem reasonable.
>>> Though one may want to keep track that if we are seeing too many of
>>> these delayed watchdog checks we provide some feedback via dmesg.
>>
>>     Yes, only to check watchdog delta is more reasonable.
>>     I think Only have dmesg is not enough, because if tsc was be misjudged
>>     as unstable then switch to hpet. And hpet is very expensive for
>>     performance, so if we want to switch to tsc the only way is to reboot
>>     the server. We need to prevent the switching of the clock source in
>>     case of misjudgment.
>>     Circumstances of misjudgment:
>>     if clocksource_watchdog is executed after 10sec, the value of wd_delta
>>     and cs_delta also be about 10sec, also the value of (cs_nsec- wd_nsec)
>>     will be magnified 20 times(10sec/0.5sec).The delta value is magnified.
> 
> Yea, it might be worth calculating an error rate instead of assuming
> the interval is fixed, but also just skipping the check may be
> reasonable assuming timers aren't constantly being delayed (and it's
> more of a transient state).
> 
> At some point if the watchdog timer is delayed too much, the watchdog
I mean that the execution of this function (static void
clocksource_watchdog(struct timer_list *unused)) has been delayed.

> hardware will fully wrap and one can no longer properly compare
> intervals. That's why the timer length is chosen as such, so having
> that timer delayed is really pushing the system into a potentially bad
> state where other subtle problems are likely to crop up.
> 
> So I do worry these watchdog robustness fixes are papering over a
> problem, pushing expectations closer to the edge of how far the system
> should tolerate bad behavior. Because at some point we'll fall off. :)

Sorry, I don't seem to understand what you mean. Should I send a v2 of
the patch?

thanks
-hui
> 
> thanks
> -john
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] Clocksource: Avoid misjudgment of clocksource
  2021-10-08 23:45 ` John Stultz
  2021-10-09  3:22   ` [External] " yanghui
@ 2021-10-09 14:04   ` brookxu
  2021-10-12  4:52     ` John Stultz
  2021-12-09 13:14   ` Gang Li
  2 siblings, 1 reply; 29+ messages in thread
From: brookxu @ 2021-10-09 14:04 UTC (permalink / raw)
  To: John Stultz, yanghui; +Cc: Thomas Gleixner, Stephen Boyd, lkml

hello

John Stultz wrote on 2021/10/9 7:45:
> On Fri, Oct 8, 2021 at 1:03 AM yanghui <yanghui.def@bytedance.com> wrote:
>>
>> clocksource_watchdog is executed every WATCHDOG_INTERVAL(0.5s) by
>> Timer. But sometimes system is very busy and the Timer cannot be
>> executed in 0.5sec. For example,if clocksource_watchdog be executed
>> after 10sec, the calculated value of abs(cs_nsec - wd_nsec) will
>> be enlarged. Then the current clocksource will be misjudged as
>> unstable. So we add conditions to prevent the clocksource from
>> being misjudged.
>>
>> Signed-off-by: yanghui <yanghui.def@bytedance.com>
>> ---
>>  kernel/time/clocksource.c | 6 +++++-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
>> index b8a14d2fb5ba..d535beadcbc8 100644
>> --- a/kernel/time/clocksource.c
>> +++ b/kernel/time/clocksource.c
>> @@ -136,8 +136,10 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating);
>>
>>  /*
>>   * Interval: 0.5sec.
>> + * MaxInterval: 1s.
>>   */
>>  #define WATCHDOG_INTERVAL (HZ >> 1)
>> +#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
>>
>>  static void clocksource_watchdog_work(struct work_struct *work)
>>  {
>> @@ -404,7 +406,9 @@ static void clocksource_watchdog(struct timer_list *unused)
>>
>>                 /* Check the deviation from the watchdog clocksource. */
>>                 md = cs->uncertainty_margin + watchdog->uncertainty_margin;
>> -               if (abs(cs_nsec - wd_nsec) > md) {
>> +               if ((abs(cs_nsec - wd_nsec) > md) &&
>> +                       cs_nsec < WATCHDOG_MAX_INTERVAL_NS &&
> 
> Sorry, it's been awhile since I looked at this code, but why are you
> bounding the clocksource delta here?
> It seems like if the clocksource being watched was very wrong (with a
> delta larger than the MAX_INTERVAL_NS), we'd want to throw it out.
> 
>> +                       wd_nsec < WATCHDOG_MAX_INTERVAL_NS) {
> 
> Bounding the watchdog interval on the check does seem reasonable.
> Though one may want to keep track that if we are seeing too many of
> these delayed watchdog checks we provide some feedback via dmesg.

For some fast-wrapping timers, such as the acpi_pm timer, checking wd_nsec may
not make much sense, because by the time the watchdog is called, the timer may
have wrapped many times.

> thanks
> -john
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] Clocksource: Avoid misjudgment of clocksource
  2021-10-09 14:04   ` brookxu
@ 2021-10-12  4:52     ` John Stultz
  2021-10-12  5:21       ` brookxu
  0 siblings, 1 reply; 29+ messages in thread
From: John Stultz @ 2021-10-12  4:52 UTC (permalink / raw)
  To: brookxu; +Cc: yanghui, Thomas Gleixner, Stephen Boyd, lkml

On Sat, Oct 9, 2021 at 7:04 AM brookxu <brookxu.cn@gmail.com> wrote:
>
> hello
>
> John Stultz wrote on 2021/10/9 7:45:
> > On Fri, Oct 8, 2021 at 1:03 AM yanghui <yanghui.def@bytedance.com> wrote:
> >>
> >> clocksource_watchdog is executed every WATCHDOG_INTERVAL(0.5s) by
> >> Timer. But sometimes system is very busy and the Timer cannot be
> >> executed in 0.5sec. For example,if clocksource_watchdog be executed
> >> after 10sec, the calculated value of abs(cs_nsec - wd_nsec) will
> >> be enlarged. Then the current clocksource will be misjudged as
> >> unstable. So we add conditions to prevent the clocksource from
> >> being misjudged.
> >>
> >> Signed-off-by: yanghui <yanghui.def@bytedance.com>
> >> ---
> >>  kernel/time/clocksource.c | 6 +++++-
> >>  1 file changed, 5 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> >> index b8a14d2fb5ba..d535beadcbc8 100644
> >> --- a/kernel/time/clocksource.c
> >> +++ b/kernel/time/clocksource.c
> >> @@ -136,8 +136,10 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating);
> >>
> >>  /*
> >>   * Interval: 0.5sec.
> >> + * MaxInterval: 1s.
> >>   */
> >>  #define WATCHDOG_INTERVAL (HZ >> 1)
> >> +#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
> >>
> >>  static void clocksource_watchdog_work(struct work_struct *work)
> >>  {
> >> @@ -404,7 +406,9 @@ static void clocksource_watchdog(struct timer_list *unused)
> >>
> >>                 /* Check the deviation from the watchdog clocksource. */
> >>                 md = cs->uncertainty_margin + watchdog->uncertainty_margin;
> >> -               if (abs(cs_nsec - wd_nsec) > md) {
> >> +               if ((abs(cs_nsec - wd_nsec) > md) &&
> >> +                       cs_nsec < WATCHDOG_MAX_INTERVAL_NS &&
> >
> > Sorry, it's been awhile since I looked at this code, but why are you
> > bounding the clocksource delta here?
> > It seems like if the clocksource being watched was very wrong (with a
> > delta larger than the MAX_INTERVAL_NS), we'd want to throw it out.
> >
> >> +                       wd_nsec < WATCHDOG_MAX_INTERVAL_NS) {
> >
> > Bounding the watchdog interval on the check does seem reasonable.
> > Though one may want to keep track that if we are seeing too many of
> > these delayed watchdog checks we provide some feedback via dmesg.
>
> For some fast timeout timers, such as acpi-timer, checking wd_nsec should not
> make much sense, because when wacthdog is called, the timer may overflow many
> times.

Indeed. But in that case we can't tell which way is up. This is what I
was fretting about when I said:
> So I do worry these watchdog robustness fixes are papering over a
> problem, pushing expectations closer to the edge of how far the system
> should tolerate bad behavior. Because at some point we'll fall off. :)

If the timer is delayed long enough for the watchdog to wrap, we're
way out of tolerable behavior. There's not much we can do because we
can't even tell what happened.

But in the case where the watchdog has not wrapped, I don't see a
major issue with trying to be a bit more robust in the face of just
slightly delayed timers.
(And yes, we can't really distinguish between slightly delayed and
watchdog-wrap-interval + slight delay, but in either case we can
probably skip disqualifying the clocksource as we know something seems
off)

thanks
-john

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [External] Re: [PATCH] Clocksource: Avoid misjudgment of clocksource
  2021-10-09  9:02       ` yanghui
@ 2021-10-12  5:02         ` John Stultz
  2021-10-18 10:41           ` yanghui
  0 siblings, 1 reply; 29+ messages in thread
From: John Stultz @ 2021-10-12  5:02 UTC (permalink / raw)
  To: yanghui; +Cc: Thomas Gleixner, Stephen Boyd, lkml

On Sat, Oct 9, 2021 at 2:02 AM yanghui <yanghui.def@bytedance.com> wrote:
>
>
>
> On 2021/10/9 11:38 AM, John Stultz wrote:
> > On Fri, Oct 8, 2021 at 8:22 PM yanghui <yanghui.def@bytedance.com> wrote:
> >> On 2021/10/9 7:45 AM, John Stultz wrote:
> >>> On Fri, Oct 8, 2021 at 1:03 AM yanghui <yanghui.def@bytedance.com> wrote:
> >>>>
> >>>> clocksource_watchdog is executed every WATCHDOG_INTERVAL(0.5s) by
> >>>> Timer. But sometimes system is very busy and the Timer cannot be
> >>>> executed in 0.5sec. For example,if clocksource_watchdog be executed
> >>>> after 10sec, the calculated value of abs(cs_nsec - wd_nsec) will
> >>>> be enlarged. Then the current clocksource will be misjudged as
> >>>> unstable. So we add conditions to prevent the clocksource from
> >>>> being misjudged.
> >>>>
> >>>> Signed-off-by: yanghui <yanghui.def@bytedance.com>
> >>>> ---
> >>>>    kernel/time/clocksource.c | 6 +++++-
> >>>>    1 file changed, 5 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> >>>> index b8a14d2fb5ba..d535beadcbc8 100644
> >>>> --- a/kernel/time/clocksource.c
> >>>> +++ b/kernel/time/clocksource.c
> >>>> @@ -136,8 +136,10 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating);
> >>>>
> >>>>    /*
> >>>>     * Interval: 0.5sec.
> >>>> + * MaxInterval: 1s.
> >>>>     */
> >>>>    #define WATCHDOG_INTERVAL (HZ >> 1)
> >>>> +#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
> >>>>
> >>>>    static void clocksource_watchdog_work(struct work_struct *work)
> >>>>    {
> >>>> @@ -404,7 +406,9 @@ static void clocksource_watchdog(struct timer_list *unused)
> >>>>
> >>>>                   /* Check the deviation from the watchdog clocksource. */
> >>>>                   md = cs->uncertainty_margin + watchdog->uncertainty_margin;
> >>>> -               if (abs(cs_nsec - wd_nsec) > md) {
> >>>> +               if ((abs(cs_nsec - wd_nsec) > md) &&
> >>>> +                       cs_nsec < WATCHDOG_MAX_INTERVAL_NS &&
> >>>
> >>> Sorry, it's been awhile since I looked at this code, but why are you
> >>> bounding the clocksource delta here?
> >>> It seems like if the clocksource being watched was very wrong (with a
> >>> delta larger than the MAX_INTERVAL_NS), we'd want to throw it out.
> >>>
> >>>> +                       wd_nsec < WATCHDOG_MAX_INTERVAL_NS) {
> >>>
> >>> Bounding the watchdog interval on the check does seem reasonable.
> >>> Though one may want to keep track that if we are seeing too many of
> >>> these delayed watchdog checks we provide some feedback via dmesg.
> >>
> >>     Yes, only to check watchdog delta is more reasonable.
> >>     I think Only have dmesg is not enough, because if tsc was be misjudged
> >>     as unstable then switch to hpet. And hpet is very expensive for
> >>     performance, so if we want to switch to tsc the only way is to reboot
> >>     the server. We need to prevent the switching of the clock source in
> >>     case of misjudgment.
> >>     Circumstances of misjudgment:
> >>     if clocksource_watchdog is executed after 10sec, the value of wd_delta
> >>     and cs_delta also be about 10sec, also the value of (cs_nsec- wd_nsec)
> >>     will be magnified 20 times(10sec/0.5sec).The delta value is magnified.
> >
> > Yea, it might be worth calculating an error rate instead of assuming
> > the interval is fixed, but also just skipping the check may be
> > reasonable assuming timers aren't constantly being delayed (and it's
> > more of a transient state).
> >
> > At some point if the watchdog timer is delayed too much, the watchdog
> I mean the execution cycle of this function(static void
> clocksource_watchdog(struct timer_list *unused)) has been delayed.
>
> > hardware will fully wrap and one can no longer properly compare
> > intervals. That's why the timer length is chosen as such, so having
> > that timer delayed is really pushing the system into a potentially bad
> > state where other subtle problems are likely to crop up.
> >
> > So I do worry these watchdog robustness fixes are papering over a
> > problem, pushing expectations closer to the edge of how far the system
> > should tolerate bad behavior. Because at some point we'll fall off. :)
>
> Sorry,I don't seem to understand what you mean. Should I send your Patch
> v2 ?

Sending a v2 is usually a good step (persistence is key! :)

I'm sorry for being unclear in the above. I'm mostly just fretting
that the watchdog logic has inherent assumptions that the timers won't
be greatly delayed. Unfortunately the reality is that the timers may
be delayed. So we can try to add some robustness (as your patch does),
but at a certain point, the delays may exceed what the logic can
tolerate and produce correct behavior. I worry that by pushing the
robustness up to that limit, folks may not recognize the problematic
behavior (greatly delayed timers - possibly caused by drivers
disabling irqs for too long, or bad SMI logic, or long virtualization
pauses), and think the system is still working as designed, even
though its regularly exceeding the bounds of the assumptions in the
code. So without any feedback that something is wrong, those bounds
will continue to be pushed until things really break in a way we
cannot be robust about.

That's why I was suggesting adding some sort of printk warning when we
do see a number of delayed timers so that folks have some signal that
things are not as they are expected to be.

thanks
-john

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] Clocksource: Avoid misjudgment of clocksource
  2021-10-12  4:52     ` John Stultz
@ 2021-10-12  5:21       ` brookxu
  2021-10-12  5:29         ` John Stultz
  0 siblings, 1 reply; 29+ messages in thread
From: brookxu @ 2021-10-12  5:21 UTC (permalink / raw)
  To: John Stultz; +Cc: yanghui, Thomas Gleixner, Stephen Boyd, lkml



John Stultz wrote on 2021/10/12 12:52 PM:
> On Sat, Oct 9, 2021 at 7:04 AM brookxu <brookxu.cn@gmail.com> wrote:
>>
>> hello
>>
>> John Stultz wrote on 2021/10/9 7:45:
>>> On Fri, Oct 8, 2021 at 1:03 AM yanghui <yanghui.def@bytedance.com> wrote:
>>>>
>>>> clocksource_watchdog is executed every WATCHDOG_INTERVAL(0.5s) by
>>>> Timer. But sometimes system is very busy and the Timer cannot be
>>>> executed in 0.5sec. For example,if clocksource_watchdog be executed
>>>> after 10sec, the calculated value of abs(cs_nsec - wd_nsec) will
>>>> be enlarged. Then the current clocksource will be misjudged as
>>>> unstable. So we add conditions to prevent the clocksource from
>>>> being misjudged.
>>>>
>>>> Signed-off-by: yanghui <yanghui.def@bytedance.com>
>>>> ---
>>>>  kernel/time/clocksource.c | 6 +++++-
>>>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
>>>> index b8a14d2fb5ba..d535beadcbc8 100644
>>>> --- a/kernel/time/clocksource.c
>>>> +++ b/kernel/time/clocksource.c
>>>> @@ -136,8 +136,10 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating);
>>>>
>>>>  /*
>>>>   * Interval: 0.5sec.
>>>> + * MaxInterval: 1s.
>>>>   */
>>>>  #define WATCHDOG_INTERVAL (HZ >> 1)
>>>> +#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
>>>>
>>>>  static void clocksource_watchdog_work(struct work_struct *work)
>>>>  {
>>>> @@ -404,7 +406,9 @@ static void clocksource_watchdog(struct timer_list *unused)
>>>>
>>>>                 /* Check the deviation from the watchdog clocksource. */
>>>>                 md = cs->uncertainty_margin + watchdog->uncertainty_margin;
>>>> -               if (abs(cs_nsec - wd_nsec) > md) {
>>>> +               if ((abs(cs_nsec - wd_nsec) > md) &&
>>>> +                       cs_nsec < WATCHDOG_MAX_INTERVAL_NS &&
>>>
>>> Sorry, it's been awhile since I looked at this code, but why are you
>>> bounding the clocksource delta here?
>>> It seems like if the clocksource being watched was very wrong (with a
>>> delta larger than the MAX_INTERVAL_NS), we'd want to throw it out.
>>>
>>>> +                       wd_nsec < WATCHDOG_MAX_INTERVAL_NS) {
>>>
>>> Bounding the watchdog interval on the check does seem reasonable.
>>> Though one may want to keep track that if we are seeing too many of
>>> these delayed watchdog checks we provide some feedback via dmesg.
>>
>> For some fast timeout timers, such as acpi-timer, checking wd_nsec should not
>> make much sense, because when wacthdog is called, the timer may overflow many
>> times.
> 
> Indeed. But in that case we can't tell which way is up. This is what I
> was fretting about when I said:
>> So I do worry these watchdog robustness fixes are papering over a
>> problem, pushing expectations closer to the edge of how far the system
>> should tolerate bad behavior. Because at some point we'll fall off. :)
> 
> If the timer is delayed long enough for the watchdog to wrap, we're
> way out of tolerable behavior. There's not much we can do because we
> can't even tell what happened.
> 
> But in the case where the watchdog has not wrapped, I don't see a
> major issue with trying to be a bit more robust in the face of just
> slightly delayed timers.
> (And yes, we can't really distinguish between slightly delayed and
> watchdog-wrap-interval + slight delay, but in either case we can
> probably skip disqualifying the clocksource as we know something seems
> off)

What if we record the watchdog's start_time in clocksource_start_watchdog(), and
then, when we verify cycles in clocksource_watchdog(), check whether the
watchdog was blocked? Because of the MSB verification, if the blocked time is
greater than half of the watchdog timer's max_cycles, we can safely ignore the
current verification. Do you think this idea is okay?

> thanks
> -john
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] Clocksource: Avoid misjudgment of clocksource
  2021-10-12  5:21       ` brookxu
@ 2021-10-12  5:29         ` John Stultz
  2021-10-12  8:06           ` brookxu
  0 siblings, 1 reply; 29+ messages in thread
From: John Stultz @ 2021-10-12  5:29 UTC (permalink / raw)
  To: brookxu; +Cc: yanghui, Thomas Gleixner, Stephen Boyd, lkml

On Mon, Oct 11, 2021 at 10:23 PM brookxu <brookxu.cn@gmail.com> wrote:
> John Stultz wrote on 2021/10/12 12:52 PM:
> > On Sat, Oct 9, 2021 at 7:04 AM brookxu <brookxu.cn@gmail.com> wrote:
> >>
> >> hello
> >>
> >> John Stultz wrote on 2021/10/9 7:45:
> >>> On Fri, Oct 8, 2021 at 1:03 AM yanghui <yanghui.def@bytedance.com> wrote:
> >>>>
> >>>> clocksource_watchdog is executed every WATCHDOG_INTERVAL(0.5s) by
> >>>> Timer. But sometimes system is very busy and the Timer cannot be
> >>>> executed in 0.5sec. For example,if clocksource_watchdog be executed
> >>>> after 10sec, the calculated value of abs(cs_nsec - wd_nsec) will
> >>>> be enlarged. Then the current clocksource will be misjudged as
> >>>> unstable. So we add conditions to prevent the clocksource from
> >>>> being misjudged.
> >>>>
> >>>> Signed-off-by: yanghui <yanghui.def@bytedance.com>
> >>>> ---
> >>>>  kernel/time/clocksource.c | 6 +++++-
> >>>>  1 file changed, 5 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> >>>> index b8a14d2fb5ba..d535beadcbc8 100644
> >>>> --- a/kernel/time/clocksource.c
> >>>> +++ b/kernel/time/clocksource.c
> >>>> @@ -136,8 +136,10 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating);
> >>>>
> >>>>  /*
> >>>>   * Interval: 0.5sec.
> >>>> + * MaxInterval: 1s.
> >>>>   */
> >>>>  #define WATCHDOG_INTERVAL (HZ >> 1)
> >>>> +#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
> >>>>
> >>>>  static void clocksource_watchdog_work(struct work_struct *work)
> >>>>  {
> >>>> @@ -404,7 +406,9 @@ static void clocksource_watchdog(struct timer_list *unused)
> >>>>
> >>>>                 /* Check the deviation from the watchdog clocksource. */
> >>>>                 md = cs->uncertainty_margin + watchdog->uncertainty_margin;
> >>>> -               if (abs(cs_nsec - wd_nsec) > md) {
> >>>> +               if ((abs(cs_nsec - wd_nsec) > md) &&
> >>>> +                       cs_nsec < WATCHDOG_MAX_INTERVAL_NS &&
> >>>
> >>> Sorry, it's been awhile since I looked at this code, but why are you
> >>> bounding the clocksource delta here?
> >>> It seems like if the clocksource being watched was very wrong (with a
> >>> delta larger than the MAX_INTERVAL_NS), we'd want to throw it out.
> >>>
> >>>> +                       wd_nsec < WATCHDOG_MAX_INTERVAL_NS) {
> >>>
> >>> Bounding the watchdog interval on the check does seem reasonable.
> >>> Though one may want to keep track that if we are seeing too many of
> >>> these delayed watchdog checks we provide some feedback via dmesg.
> >>
> >> For some fast timeout timers, such as acpi-timer, checking wd_nsec should not
> >> make much sense, because when wacthdog is called, the timer may overflow many
> >> times.
> >
> > Indeed. But in that case we can't tell which way is up. This is what I
> > was fretting about when I said:
> >> So I do worry these watchdog robustness fixes are papering over a
> >> problem, pushing expectations closer to the edge of how far the system
> >> should tolerate bad behavior. Because at some point we'll fall off. :)
> >
> > If the timer is delayed long enough for the watchdog to wrap, we're
> > way out of tolerable behavior. There's not much we can do because we
> > can't even tell what happened.
> >
> > But in the case where the watchdog has not wrapped, I don't see a
> > major issue with trying to be a bit more robust in the face of just
> > slightly delayed timers.
> > (And yes, we can't really distinguish between slightly delayed and
> > watchdog-wrap-interval + slight delay, but in either case we can
> > probably skip disqualifying the clocksource as we know something seems
> > off)
>
> If we record the watchdog's start_time in clocksource_start_watchdog(), and then
> when we verify cycles in clocksource_watchdog(), check whether the clocksource
> watchdog is blocked. Due to MSB verification, if the blocked time is greater than
> half of the watchdog timer max_cycles, then we can safely ignore the current
> verification? Do you think this idea is okay?

I can't say I totally understand the idea. Maybe could you clarify with a patch?

thanks
-john


* Re: [PATCH] Clocksource: Avoid misjudgment of clocksource
  2021-10-12  5:29         ` John Stultz
@ 2021-10-12  8:06           ` brookxu
  2021-10-14  7:03             ` yanghui
  2021-10-18 16:14             ` John Stultz
  0 siblings, 2 replies; 29+ messages in thread
From: brookxu @ 2021-10-12  8:06 UTC (permalink / raw)
  To: John Stultz; +Cc: yanghui, Thomas Gleixner, Stephen Boyd, lkml



John Stultz wrote on 2021/10/12 13:29:
> On Mon, Oct 11, 2021 at 10:23 PM brookxu <brookxu.cn@gmail.com> wrote:
>> John Stultz wrote on 2021/10/12 12:52 PM:
>>> On Sat, Oct 9, 2021 at 7:04 AM brookxu <brookxu.cn@gmail.com> wrote:
>>>>
>>>> hello
>>>>
>>>> John Stultz wrote on 2021/10/9 7:45:
>>>>> On Fri, Oct 8, 2021 at 1:03 AM yanghui <yanghui.def@bytedance.com> wrote:
>>>>>>
>>>>>> clocksource_watchdog is executed every WATCHDOG_INTERVAL(0.5s) by
>>>>>> Timer. But sometimes system is very busy and the Timer cannot be
>>>>>> executed in 0.5sec. For example,if clocksource_watchdog be executed
>>>>>> after 10sec, the calculated value of abs(cs_nsec - wd_nsec) will
>>>>>> be enlarged. Then the current clocksource will be misjudged as
>>>>>> unstable. So we add conditions to prevent the clocksource from
>>>>>> being misjudged.
>>>>>>
>>>>>> Signed-off-by: yanghui <yanghui.def@bytedance.com>
>>>>>> ---
>>>>>>  kernel/time/clocksource.c | 6 +++++-
>>>>>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
>>>>>> index b8a14d2fb5ba..d535beadcbc8 100644
>>>>>> --- a/kernel/time/clocksource.c
>>>>>> +++ b/kernel/time/clocksource.c
>>>>>> @@ -136,8 +136,10 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating);
>>>>>>
>>>>>>  /*
>>>>>>   * Interval: 0.5sec.
>>>>>> + * MaxInterval: 1s.
>>>>>>   */
>>>>>>  #define WATCHDOG_INTERVAL (HZ >> 1)
>>>>>> +#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
>>>>>>
>>>>>>  static void clocksource_watchdog_work(struct work_struct *work)
>>>>>>  {
>>>>>> @@ -404,7 +406,9 @@ static void clocksource_watchdog(struct timer_list *unused)
>>>>>>
>>>>>>                 /* Check the deviation from the watchdog clocksource. */
>>>>>>                 md = cs->uncertainty_margin + watchdog->uncertainty_margin;
>>>>>> -               if (abs(cs_nsec - wd_nsec) > md) {
>>>>>> +               if ((abs(cs_nsec - wd_nsec) > md) &&
>>>>>> +                       cs_nsec < WATCHDOG_MAX_INTERVAL_NS &&
>>>>>
>>>>> Sorry, it's been awhile since I looked at this code, but why are you
>>>>> bounding the clocksource delta here?
>>>>> It seems like if the clocksource being watched was very wrong (with a
>>>>> delta larger than the MAX_INTERVAL_NS), we'd want to throw it out.
>>>>>
>>>>>> +                       wd_nsec < WATCHDOG_MAX_INTERVAL_NS) {
>>>>>
>>>>> Bounding the watchdog interval on the check does seem reasonable.
>>>>> Though one may want to keep track that if we are seeing too many of
>>>>> these delayed watchdog checks we provide some feedback via dmesg.
>>>>
>>>> For some fast timeout timers, such as acpi-timer, checking wd_nsec should not
>>>> make much sense, because when wacthdog is called, the timer may overflow many
>>>> times.
>>>
>>> Indeed. But in that case we can't tell which way is up. This is what I
>>> was fretting about when I said:
>>>> So I do worry these watchdog robustness fixes are papering over a
>>>> problem, pushing expectations closer to the edge of how far the system
>>>> should tolerate bad behavior. Because at some point we'll fall off. :)
>>>
>>> If the timer is delayed long enough for the watchdog to wrap, we're
>>> way out of tolerable behavior. There's not much we can do because we
>>> can't even tell what happened.
>>>
>>> But in the case where the watchdog has not wrapped, I don't see a
>>> major issue with trying to be a bit more robust in the face of just
>>> slightly delayed timers.
>>> (And yes, we can't really distinguish between slightly delayed and
>>> watchdog-wrap-interval + slight delay, but in either case we can
>>> probably skip disqualifying the clocksource as we know something seems
>>> off)
>>
>> If we record the watchdog's start_time in clocksource_start_watchdog(), and then
>> when we verify cycles in clocksource_watchdog(), check whether the clocksource
>> watchdog is blocked. Due to MSB verification, if the blocked time is greater than
>> half of the watchdog timer max_cycles, then we can safely ignore the current
>> verification? Do you think this idea is okay?
> 
> I can't say I totally understand the idea. Maybe could you clarify with a patch?
>

Sorry, it would look roughly as follows:

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index b8a14d2..87f3b67 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -119,6 +119,7 @@
 static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
 static DEFINE_SPINLOCK(watchdog_lock);
 static int watchdog_running;
+static unsigned long watchdog_start_time;
 static atomic_t watchdog_reset_pending;
 
 static inline void clocksource_watchdog_lock(unsigned long *flags)
@@ -356,6 +357,7 @@ static void clocksource_watchdog(struct timer_list *unused)
 	int next_cpu, reset_pending;
 	int64_t wd_nsec, cs_nsec;
 	struct clocksource *cs;
+	unsigned long max_jiffies;
 	u32 md;
 
 	spin_lock(&watchdog_lock);
@@ -402,6 +404,10 @@ static void clocksource_watchdog(struct timer_list *unused)
 		if (atomic_read(&watchdog_reset_pending))
 			continue;
 
+		max_jiffies = nsecs_to_jiffies(cs->max_idle_ns);
+		if (time_is_before_jiffies(watchdog_start_time + max_jiffies))
+			continue;
+
 		/* Check the deviation from the watchdog clocksource. */
 		md = cs->uncertainty_margin + watchdog->uncertainty_margin;
 		if (abs(cs_nsec - wd_nsec) > md) {
@@ -474,6 +480,7 @@ static void clocksource_watchdog(struct timer_list *unused)
 	 * pair clocksource_stop_watchdog() clocksource_start_watchdog().
 	 */
 	if (!timer_pending(&watchdog_timer)) {
+		watchdog_start_time = jiffies;
 		watchdog_timer.expires += WATCHDOG_INTERVAL;
 		add_timer_on(&watchdog_timer, next_cpu);
 	}
@@ -488,6 +495,7 @@ static inline void clocksource_start_watchdog(void)
 	timer_setup(&watchdog_timer, clocksource_watchdog, 0);
 	watchdog_timer.expires = jiffies + WATCHDOG_INTERVAL;
 	add_timer_on(&watchdog_timer, cpumask_first(cpu_online_mask));
+	watchdog_start_time = jiffies;
 	watchdog_running = 1;
 }


> thanks
> -john
> 


* Re: [PATCH] Clocksource: Avoid misjudgment of clocksource
  2021-10-12  8:06           ` brookxu
@ 2021-10-14  7:03             ` yanghui
  2021-10-15  6:56               ` yanghui
  2021-10-18 16:14             ` John Stultz
  1 sibling, 1 reply; 29+ messages in thread
From: yanghui @ 2021-10-14  7:03 UTC (permalink / raw)
  To: brookxu, John Stultz; +Cc: Thomas Gleixner, Stephen Boyd, lkml



On 2021/10/12 4:06 PM, brookxu wrote:
> 
> 
> John Stultz wrote on 2021/10/12 13:29:
>> On Mon, Oct 11, 2021 at 10:23 PM brookxu <brookxu.cn@gmail.com> wrote:
>>> John Stultz wrote on 2021/10/12 12:52 PM:
>>>> On Sat, Oct 9, 2021 at 7:04 AM brookxu <brookxu.cn@gmail.com> wrote:
>>>>>
>>>>> hello
>>>>>
>>>>> John Stultz wrote on 2021/10/9 7:45:
>>>>>> On Fri, Oct 8, 2021 at 1:03 AM yanghui <yanghui.def@bytedance.com> wrote:
>>>>>>>
>>>>>>> clocksource_watchdog is executed every WATCHDOG_INTERVAL(0.5s) by
>>>>>>> Timer. But sometimes system is very busy and the Timer cannot be
>>>>>>> executed in 0.5sec. For example,if clocksource_watchdog be executed
>>>>>>> after 10sec, the calculated value of abs(cs_nsec - wd_nsec) will
>>>>>>> be enlarged. Then the current clocksource will be misjudged as
>>>>>>> unstable. So we add conditions to prevent the clocksource from
>>>>>>> being misjudged.
>>>>>>>
>>>>>>> Signed-off-by: yanghui <yanghui.def@bytedance.com>
>>>>>>> ---
>>>>>>>   kernel/time/clocksource.c | 6 +++++-
>>>>>>>   1 file changed, 5 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
>>>>>>> index b8a14d2fb5ba..d535beadcbc8 100644
>>>>>>> --- a/kernel/time/clocksource.c
>>>>>>> +++ b/kernel/time/clocksource.c
>>>>>>> @@ -136,8 +136,10 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating);
>>>>>>>
>>>>>>>   /*
>>>>>>>    * Interval: 0.5sec.
>>>>>>> + * MaxInterval: 1s.
>>>>>>>    */
>>>>>>>   #define WATCHDOG_INTERVAL (HZ >> 1)
>>>>>>> +#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
>>>>>>>
>>>>>>>   static void clocksource_watchdog_work(struct work_struct *work)
>>>>>>>   {
>>>>>>> @@ -404,7 +406,9 @@ static void clocksource_watchdog(struct timer_list *unused)
>>>>>>>
>>>>>>>                  /* Check the deviation from the watchdog clocksource. */
>>>>>>>                  md = cs->uncertainty_margin + watchdog->uncertainty_margin;
>>>>>>> -               if (abs(cs_nsec - wd_nsec) > md) {
>>>>>>> +               if ((abs(cs_nsec - wd_nsec) > md) &&
>>>>>>> +                       cs_nsec < WATCHDOG_MAX_INTERVAL_NS &&
>>>>>>
>>>>>> Sorry, it's been awhile since I looked at this code, but why are you
>>>>>> bounding the clocksource delta here?
>>>>>> It seems like if the clocksource being watched was very wrong (with a
>>>>>> delta larger than the MAX_INTERVAL_NS), we'd want to throw it out.
>>>>>>
>>>>>>> +                       wd_nsec < WATCHDOG_MAX_INTERVAL_NS) {
>>>>>>
>>>>>> Bounding the watchdog interval on the check does seem reasonable.
>>>>>> Though one may want to keep track that if we are seeing too many of
>>>>>> these delayed watchdog checks we provide some feedback via dmesg.
>>>>>
>>>>> For some fast timeout timers, such as acpi-timer, checking wd_nsec should not
>>>>> make much sense, because when wacthdog is called, the timer may overflow many
>>>>> times.
>>>>
>>>> Indeed. But in that case we can't tell which way is up. This is what I
>>>> was fretting about when I said:
>>>>> So I do worry these watchdog robustness fixes are papering over a
>>>>> problem, pushing expectations closer to the edge of how far the system
>>>>> should tolerate bad behavior. Because at some point we'll fall off. :)
>>>>
>>>> If the timer is delayed long enough for the watchdog to wrap, we're
>>>> way out of tolerable behavior. There's not much we can do because we
>>>> can't even tell what happened.
>>>>
>>>> But in the case where the watchdog has not wrapped, I don't see a
>>>> major issue with trying to be a bit more robust in the face of just
>>>> slightly delayed timers.
>>>> (And yes, we can't really distinguish between slightly delayed and
>>>> watchdog-wrap-interval + slight delay, but in either case we can
>>>> probably skip disqualifying the clocksource as we know something seems
>>>> off)
>>>
>>> If we record the watchdog's start_time in clocksource_start_watchdog(), and then
>>> when we verify cycles in clocksource_watchdog(), check whether the clocksource
>>> watchdog is blocked. Due to MSB verification, if the blocked time is greater than
>>> half of the watchdog timer max_cycles, then we can safely ignore the current
>>> verification? Do you think this idea is okay?
>>
>> I can't say I totally understand the idea. Maybe could you clarify with a patch?
>>
> 
> Sorry, it looks almost as follows:
> 
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index b8a14d2..87f3b67 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -119,6 +119,7 @@
>   static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
>   static DEFINE_SPINLOCK(watchdog_lock);
>   static int watchdog_running;
> +static unsigned long watchdog_start_time;
>   static atomic_t watchdog_reset_pending;
>   
>   static inline void clocksource_watchdog_lock(unsigned long *flags)
> @@ -356,6 +357,7 @@ static void clocksource_watchdog(struct timer_list *unused)
>   	int next_cpu, reset_pending;
>   	int64_t wd_nsec, cs_nsec;
>   	struct clocksource *cs;
> +	unsigned long max_jiffies;
>   	u32 md;
>   
>   	spin_lock(&watchdog_lock);
> @@ -402,6 +404,10 @@ static void clocksource_watchdog(struct timer_list *unused)
>   		if (atomic_read(&watchdog_reset_pending))
>   			continue;
>   
> +		max_jiffies = nsecs_to_jiffies(cs->max_idle_ns);
> +		if (time_is_before_jiffies(watchdog_start_time + max_jiffies))
> +			continue;
> +
>   		/* Check the deviation from the watchdog clocksource. */
>   		md = cs->uncertainty_margin + watchdog->uncertainty_margin;
>   		if (abs(cs_nsec - wd_nsec) > md) {
> @@ -474,6 +480,7 @@ static void clocksource_watchdog(struct timer_list *unused)
>   	 * pair clocksource_stop_watchdog() clocksource_start_watchdog().
>   	 */
>   	if (!timer_pending(&watchdog_timer)) {
> +		watchdog_start_time = jiffies;
>   		watchdog_timer.expires += WATCHDOG_INTERVAL;
>   		add_timer_on(&watchdog_timer, next_cpu);
>   	}
> @@ -488,6 +495,7 @@ static inline void clocksource_start_watchdog(void)
>   	timer_setup(&watchdog_timer, clocksource_watchdog, 0);
>   	watchdog_timer.expires = jiffies + WATCHDOG_INTERVAL;
>   	add_timer_on(&watchdog_timer, cpumask_first(cpu_online_mask));
> +	watchdog_start_time = jiffies;
>   	watchdog_running = 1;
>   }
> 
> 
>> thanks
>> -john
>>

It looks good to me.
thanks.


* [PATCH] Clocksource: Avoid misjudgment of clocksource
  2021-10-14  7:03             ` yanghui
@ 2021-10-15  6:56               ` yanghui
  2021-10-15  7:24                 ` brookxu
  0 siblings, 1 reply; 29+ messages in thread
From: yanghui @ 2021-10-15  6:56 UTC (permalink / raw)
  To: brookxu, John Stultz; +Cc: Thomas Gleixner, Stephen Boyd, lkml



On 2021/10/14 3:03 PM, yanghui wrote:
> 
> 
> On 2021/10/12 4:06 PM, brookxu wrote:
>>
>>
>> John Stultz wrote on 2021/10/12 13:29:
>>> On Mon, Oct 11, 2021 at 10:23 PM brookxu <brookxu.cn@gmail.com> wrote:
>>>> John Stultz wrote on 2021/10/12 12:52 PM:
>>>>> On Sat, Oct 9, 2021 at 7:04 AM brookxu <brookxu.cn@gmail.com> wrote:
>>>>>>
>>>>>> hello
>>>>>>
>>>>>> John Stultz wrote on 2021/10/9 7:45:
>>>>>>> On Fri, Oct 8, 2021 at 1:03 AM yanghui 
>>>>>>> <yanghui.def@bytedance.com> wrote:
>>>>>>>>
>>>>>>>> clocksource_watchdog is executed every WATCHDOG_INTERVAL(0.5s) by
>>>>>>>> Timer. But sometimes system is very busy and the Timer cannot be
>>>>>>>> executed in 0.5sec. For example,if clocksource_watchdog be executed
>>>>>>>> after 10sec, the calculated value of abs(cs_nsec - wd_nsec) will
>>>>>>>> be enlarged. Then the current clocksource will be misjudged as
>>>>>>>> unstable. So we add conditions to prevent the clocksource from
>>>>>>>> being misjudged.
>>>>>>>>
>>>>>>>> Signed-off-by: yanghui <yanghui.def@bytedance.com>
>>>>>>>> ---
>>>>>>>>   kernel/time/clocksource.c | 6 +++++-
>>>>>>>>   1 file changed, 5 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
>>>>>>>> index b8a14d2fb5ba..d535beadcbc8 100644
>>>>>>>> --- a/kernel/time/clocksource.c
>>>>>>>> +++ b/kernel/time/clocksource.c
>>>>>>>> @@ -136,8 +136,10 @@ static void 
>>>>>>>> __clocksource_change_rating(struct clocksource *cs, int rating);
>>>>>>>>
>>>>>>>>   /*
>>>>>>>>    * Interval: 0.5sec.
>>>>>>>> + * MaxInterval: 1s.
>>>>>>>>    */
>>>>>>>>   #define WATCHDOG_INTERVAL (HZ >> 1)
>>>>>>>> +#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
>>>>>>>>
>>>>>>>>   static void clocksource_watchdog_work(struct work_struct *work)
>>>>>>>>   {
>>>>>>>> @@ -404,7 +406,9 @@ static void clocksource_watchdog(struct 
>>>>>>>> timer_list *unused)
>>>>>>>>
>>>>>>>>                  /* Check the deviation from the watchdog 
>>>>>>>> clocksource. */
>>>>>>>>                  md = cs->uncertainty_margin + 
>>>>>>>> watchdog->uncertainty_margin;
>>>>>>>> -               if (abs(cs_nsec - wd_nsec) > md) {
>>>>>>>> +               if ((abs(cs_nsec - wd_nsec) > md) &&
>>>>>>>> +                       cs_nsec < WATCHDOG_MAX_INTERVAL_NS &&
>>>>>>>
>>>>>>> Sorry, it's been awhile since I looked at this code, but why are you
>>>>>>> bounding the clocksource delta here?
>>>>>>> It seems like if the clocksource being watched was very wrong 
>>>>>>> (with a
>>>>>>> delta larger than the MAX_INTERVAL_NS), we'd want to throw it out.
>>>>>>>
>>>>>>>> +                       wd_nsec < WATCHDOG_MAX_INTERVAL_NS) {
>>>>>>>
>>>>>>> Bounding the watchdog interval on the check does seem reasonable.
>>>>>>> Though one may want to keep track that if we are seeing too many of
>>>>>>> these delayed watchdog checks we provide some feedback via dmesg.
>>>>>>
>>>>>> For some fast timeout timers, such as acpi-timer, checking wd_nsec 
>>>>>> should not
>>>>>> make much sense, because when wacthdog is called, the timer may 
>>>>>> overflow many
>>>>>> times.
>>>>>
>>>>> Indeed. But in that case we can't tell which way is up. This is what I
>>>>> was fretting about when I said:
>>>>>> So I do worry these watchdog robustness fixes are papering over a
>>>>>> problem, pushing expectations closer to the edge of how far the 
>>>>>> system
>>>>>> should tolerate bad behavior. Because at some point we'll fall 
>>>>>> off. :)
>>>>>
>>>>> If the timer is delayed long enough for the watchdog to wrap, we're
>>>>> way out of tolerable behavior. There's not much we can do because we
>>>>> can't even tell what happened.
>>>>>
>>>>> But in the case where the watchdog has not wrapped, I don't see a
>>>>> major issue with trying to be a bit more robust in the face of just
>>>>> slightly delayed timers.
>>>>> (And yes, we can't really distinguish between slightly delayed and
>>>>> watchdog-wrap-interval + slight delay, but in either case we can
>>>>> probably skip disqualifying the clocksource as we know something seems
>>>>> off)
>>>>
>>>> If we record the watchdog's start_time in 
>>>> clocksource_start_watchdog(), and then
>>>> when we verify cycles in clocksource_watchdog(), check whether the 
>>>> clocksource
>>>> watchdog is blocked. Due to MSB verification, if the blocked time is 
>>>> greater than
>>>> half of the watchdog timer max_cycles, then we can safely ignore the 
>>>> current
>>>> verification? Do you think this idea is okay?
>>>
>>> I can't say I totally understand the idea. Maybe could you clarify 
>>> with a patch?
>>>
>>
>> Sorry, it looks almost as follows:
>>
>> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
>> index b8a14d2..87f3b67 100644
>> --- a/kernel/time/clocksource.c
>> +++ b/kernel/time/clocksource.c
>> @@ -119,6 +119,7 @@
>>   static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
>>   static DEFINE_SPINLOCK(watchdog_lock);
>>   static int watchdog_running;
>> +static unsigned long watchdog_start_time;
>>   static atomic_t watchdog_reset_pending;
>>   static inline void clocksource_watchdog_lock(unsigned long *flags)
>> @@ -356,6 +357,7 @@ static void clocksource_watchdog(struct timer_list 
>> *unused)
>>       int next_cpu, reset_pending;
>>       int64_t wd_nsec, cs_nsec;
>>       struct clocksource *cs;
>> +    unsigned long max_jiffies;
>>       u32 md;
>>       spin_lock(&watchdog_lock);
>> @@ -402,6 +404,10 @@ static void clocksource_watchdog(struct 
>> timer_list *unused)
>>           if (atomic_read(&watchdog_reset_pending))
>>               continue;
>> +        max_jiffies = nsecs_to_jiffies(cs->max_idle_ns);
>> +        if (time_is_before_jiffies(watchdog_start_time + max_jiffies))
>> +            continue;
>> +

Hi John:
What do you think of this suggestion? If the interval between two
executions of clocksource_watchdog() exceeds max_idle_ns, we consider
the current judgment unreliable, so we skip it.

>>           /* Check the deviation from the watchdog clocksource. */
>>           md = cs->uncertainty_margin + watchdog->uncertainty_margin;
>>           if (abs(cs_nsec - wd_nsec) > md) {
>> @@ -474,6 +480,7 @@ static void clocksource_watchdog(struct timer_list 
>> *unused)
>>        * pair clocksource_stop_watchdog() clocksource_start_watchdog().
>>        */
>>       if (!timer_pending(&watchdog_timer)) {
>> +        watchdog_start_time = jiffies;
>>           watchdog_timer.expires += WATCHDOG_INTERVAL;
>>           add_timer_on(&watchdog_timer, next_cpu);
>>       }
>> @@ -488,6 +495,7 @@ static inline void clocksource_start_watchdog(void)
>>       timer_setup(&watchdog_timer, clocksource_watchdog, 0);
>>       watchdog_timer.expires = jiffies + WATCHDOG_INTERVAL;
>>       add_timer_on(&watchdog_timer, cpumask_first(cpu_online_mask));
>> +    watchdog_start_time = jiffies;
>>       watchdog_running = 1;
>>   }
>>
>>
>>> thanks
>>> -john
>>>
> 
> It looks good to me.
> thanks.



* Re: [PATCH] Clocksource: Avoid misjudgment of clocksource
  2021-10-15  6:56               ` yanghui
@ 2021-10-15  7:24                 ` brookxu
  0 siblings, 0 replies; 29+ messages in thread
From: brookxu @ 2021-10-15  7:24 UTC (permalink / raw)
  To: yanghui, John Stultz; +Cc: Thomas Gleixner, Stephen Boyd, lkml

Hello

>>> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
>>> index b8a14d2..87f3b67 100644
>>> --- a/kernel/time/clocksource.c
>>> +++ b/kernel/time/clocksource.c
>>> @@ -119,6 +119,7 @@
>>>   static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
>>>   static DEFINE_SPINLOCK(watchdog_lock);
>>>   static int watchdog_running;
>>> +static unsigned long watchdog_start_time;
>>>   static atomic_t watchdog_reset_pending;
>>>   static inline void clocksource_watchdog_lock(unsigned long *flags)
>>> @@ -356,6 +357,7 @@ static void clocksource_watchdog(struct timer_list *unused)
>>>       int next_cpu, reset_pending;
>>>       int64_t wd_nsec, cs_nsec;
>>>       struct clocksource *cs;
>>> +    unsigned long max_jiffies;
>>>       u32 md;
>>>       spin_lock(&watchdog_lock);
>>> @@ -402,6 +404,10 @@ static void clocksource_watchdog(struct timer_list *unused)
>>>           if (atomic_read(&watchdog_reset_pending))
>>>               continue;
>>> +        max_jiffies = nsecs_to_jiffies(cs->max_idle_ns);
>>> +        if (time_is_before_jiffies(watchdog_start_time + max_jiffies))
>>> +            continue;
>>> +
> 
> Hi John:
> What do you think of this suggest?If the interval between two executions of the function clocksource_watchdog() exceeds max_idle_ns. We think the current judement is unreasonable,so we skip this judgment.

I feel that there are still some things that need to be discussed. This
solution may still fail in some scenarios. Assume that tick_do_timer_cpu
is CPU1 and the clocksource watchdog runs on CPU2. At some point CPU1
updates jiffies and then has interrupts disabled for some reason, so
jiffies stops being updated. Meanwhile the clocksource watchdog on CPU2
fires, but is still delayed for a while due to high load. Since jiffies
is not updated, this judgment would fail in that case. But then it is
probably necessary to troubleshoot other problems anyway, because
interrupts should not stay disabled for that long. What do you think
about this scenario?

Thanks.
>>>           /* Check the deviation from the watchdog clocksource. */
>>>           md = cs->uncertainty_margin + watchdog->uncertainty_margin;
>>>           if (abs(cs_nsec - wd_nsec) > md) {
>>> @@ -474,6 +480,7 @@ static void clocksource_watchdog(struct timer_list *unused)
>>>        * pair clocksource_stop_watchdog() clocksource_start_watchdog().
>>>        */
>>>       if (!timer_pending(&watchdog_timer)) {
>>> +        watchdog_start_time = jiffies;
>>>           watchdog_timer.expires += WATCHDOG_INTERVAL;
>>>           add_timer_on(&watchdog_timer, next_cpu);
>>>       }
>>> @@ -488,6 +495,7 @@ static inline void clocksource_start_watchdog(void)
>>>       timer_setup(&watchdog_timer, clocksource_watchdog, 0);
>>>       watchdog_timer.expires = jiffies + WATCHDOG_INTERVAL;
>>>       add_timer_on(&watchdog_timer, cpumask_first(cpu_online_mask));
>>> +    watchdog_start_time = jiffies;
>>>       watchdog_running = 1;
>>>   }
>>>
>>>
>>>> thanks
>>>> -john
>>>>
>>
>> It looks good to me.
>> thanks.
> 


* [PATCH] Clocksource: Avoid misjudgment of clocksource
  2021-10-12  5:02         ` John Stultz
@ 2021-10-18 10:41           ` yanghui
  0 siblings, 0 replies; 29+ messages in thread
From: yanghui @ 2021-10-18 10:41 UTC (permalink / raw)
  To: John Stultz; +Cc: Thomas Gleixner, Stephen Boyd, lkml, shli



On 2021/10/12 1:02 PM, John Stultz wrote:
> On Sat, Oct 9, 2021 at 2:02 AM yanghui <yanghui.def@bytedance.com> wrote:
>>
>>
>>
>> On 2021/10/9 11:38 AM, John Stultz wrote:
>>> On Fri, Oct 8, 2021 at 8:22 PM yanghui <yanghui.def@bytedance.com> wrote:
>>>> On 2021/10/9 7:45 AM, John Stultz wrote:
>>>>> On Fri, Oct 8, 2021 at 1:03 AM yanghui <yanghui.def@bytedance.com> wrote:
>>>>>>
>>>>>> clocksource_watchdog is executed every WATCHDOG_INTERVAL(0.5s) by
>>>>>> Timer. But sometimes system is very busy and the Timer cannot be
>>>>>> executed in 0.5sec. For example,if clocksource_watchdog be executed
>>>>>> after 10sec, the calculated value of abs(cs_nsec - wd_nsec) will
>>>>>> be enlarged. Then the current clocksource will be misjudged as
>>>>>> unstable. So we add conditions to prevent the clocksource from
>>>>>> being misjudged.
>>>>>>
>>>>>> Signed-off-by: yanghui <yanghui.def@bytedance.com>
>>>>>> ---
>>>>>>     kernel/time/clocksource.c | 6 +++++-
>>>>>>     1 file changed, 5 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
>>>>>> index b8a14d2fb5ba..d535beadcbc8 100644
>>>>>> --- a/kernel/time/clocksource.c
>>>>>> +++ b/kernel/time/clocksource.c
>>>>>> @@ -136,8 +136,10 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating);
>>>>>>
>>>>>>     /*
>>>>>>      * Interval: 0.5sec.
>>>>>> + * MaxInterval: 1s.
>>>>>>      */
>>>>>>     #define WATCHDOG_INTERVAL (HZ >> 1)
>>>>>> +#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
>>>>>>
>>>>>>     static void clocksource_watchdog_work(struct work_struct *work)
>>>>>>     {
>>>>>> @@ -404,7 +406,9 @@ static void clocksource_watchdog(struct timer_list *unused)
>>>>>>
>>>>>>                    /* Check the deviation from the watchdog clocksource. */
>>>>>>                    md = cs->uncertainty_margin + watchdog->uncertainty_margin;
>>>>>> -               if (abs(cs_nsec - wd_nsec) > md) {
>>>>>> +               if ((abs(cs_nsec - wd_nsec) > md) &&
>>>>>> +                       cs_nsec < WATCHDOG_MAX_INTERVAL_NS &&
>>>>>
>>>>> Sorry, it's been awhile since I looked at this code, but why are you
>>>>> bounding the clocksource delta here?
>>>>> It seems like if the clocksource being watched was very wrong (with a
>>>>> delta larger than the MAX_INTERVAL_NS), we'd want to throw it out.
>>>>>
>>>>>> +                       wd_nsec < WATCHDOG_MAX_INTERVAL_NS) {
>>>>>
>>>>> Bounding the watchdog interval on the check does seem reasonable.
>>>>> Though one may want to keep track that if we are seeing too many of
>>>>> these delayed watchdog checks we provide some feedback via dmesg.
>>>>
>>>>      Yes, only to check watchdog delta is more reasonable.
>>>>      I think Only have dmesg is not enough, because if tsc was be misjudged
>>>>      as unstable then switch to hpet. And hpet is very expensive for
>>>>      performance, so if we want to switch to tsc the only way is to reboot
>>>>      the server. We need to prevent the switching of the clock source in
>>>>      case of misjudgment.
>>>>      Circumstances of misjudgment:
>>>>      if clocksource_watchdog is executed after 10sec, the value of wd_delta
>>>>      and cs_delta also be about 10sec, also the value of (cs_nsec- wd_nsec)
>>>>      will be magnified 20 times(10sec/0.5sec).The delta value is magnified.
>>>
>>> Yea, it might be worth calculating an error rate instead of assuming
>>> the interval is fixed, but also just skipping the check may be
>>> reasonable assuming timers aren't constantly being delayed (and it's
>>> more of a transient state).
>>>
>>> At some point if the watchdog timer is delayed too much, the watchdog
>> I mean the execution cycle of this function(static void
>> clocksource_watchdog(struct timer_list *unused)) has been delayed.
>>
>>> hardware will fully wrap and one can no longer properly compare
>>> intervals. That's why the timer length is chosen as such, so having
>>> that timer delayed is really pushing the system into a potentially bad
>>> state where other subtle problems are likely to crop up.
>>>
>>> So I do worry these watchdog robustness fixes are papering over a
>>> problem, pushing expectations closer to the edge of how far the system
>>> should tolerate bad behavior. Because at some point we'll fall off. :)
>>
>> Sorry,I don't seem to understand what you mean. Should I send your Patch
>> v2 ?
> 
> Sending a v2 is usually a good step (persistence is key! :)
> 
> I'm sorry for being unclear in the above. I'm mostly just fretting
> that the watchdog logic has inherent assumptions that the timers won't
> be greatly delayed. Unfortunately the reality is that the timers may
> be delayed. So we can try to add some robustness (as your patch does),
> but at a certain point, the delays may exceed what the logic can
> tolerate and produce correct behavior. I worry that by pushing the
> robustness up to that limit, folks may not recognize the problematic
> behavior (greatly delayed timers - possibly caused by drivers
> disabling irqs for too long, or bad SMI logic, or long virtualization
> pauses), and think the system is still working as designed, even

I think we can increase the value of WATCHDOG_MAX_INTERVAL_NS to
20 sec (the soft-lockup time) or even longer. That way we can filter
out those timer delays that are not caused by a soft lockup, as you
said (drivers disabling irqs, bad SMI logic, ...).
I think this method can solve the problem of a long soft lockup
causing the clocksource to be incorrectly switched, which results
in performance degradation.
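The magnification being discussed can be made concrete with a little arithmetic. The sketch below is userspace C, not kernel code, and the 50 ppm error rate is purely illustrative:

```c
#include <stdint.h>

/*
 * Absolute skew accumulated over one watchdog interval by a clocksource
 * running with a constant relative frequency error (parts per million).
 */
static int64_t skew_ns(int64_t interval_ns, int64_t error_ppm)
{
	return interval_ns * error_ppm / 1000000;
}
```

At 50 ppm, the nominal 0.5 s interval accumulates 25 µs of skew, but a timer delayed to 10 s accumulates 500 µs from the same, perfectly constant error rate: the factor-of-20 magnification that can push abs(cs_nsec - wd_nsec) past a fixed margin.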
> though its regularly exceeding the bounds of the assumptions in the
> code. So without any feedback that something is wrong, those bounds
> will continue to be pushed until things really break in a way we
> cannot be robust about.
> 
> That's why I was suggesting adding some sort of printk warning when we
> do see a number of delayed timers so that folks have some signal that
> things are not as they are expected to be.
> 
> thanks
> -john
> 


* Re: [PATCH] Clocksource: Avoid misjudgment of clocksource
  2021-10-12  8:06           ` brookxu
  2021-10-14  7:03             ` yanghui
@ 2021-10-18 16:14             ` John Stultz
  2021-10-19  4:14               ` yanghui
  1 sibling, 1 reply; 29+ messages in thread
From: John Stultz @ 2021-10-18 16:14 UTC (permalink / raw)
  To: brookxu; +Cc: yanghui, Thomas Gleixner, Stephen Boyd, lkml

On Tue, Oct 12, 2021 at 1:06 AM brookxu <brookxu.cn@gmail.com> wrote:
> John Stultz wrote on 2021/10/12 13:29:
> > On Mon, Oct 11, 2021 at 10:23 PM brookxu <brookxu.cn@gmail.com> wrote:
> >> John Stultz wrote on 2021/10/12 12:52 PM:
> >>> On Sat, Oct 9, 2021 at 7:04 AM brookxu <brookxu.cn@gmail.com> wrote:
> >> If we record the watchdog's start_time in clocksource_start_watchdog(), and then
> >> when we verify cycles in clocksource_watchdog(), check whether the clocksource
> >> watchdog is blocked. Due to MSB verification, if the blocked time is greater than
> >> half of the watchdog timer max_cycles, then we can safely ignore the current
> >> verification? Do you think this idea is okay?
> >
> > I can't say I totally understand the idea. Maybe could you clarify with a patch?
> >
>
> Sorry, it looks almost as follows:
>
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index b8a14d2..87f3b67 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -119,6 +119,7 @@
>  static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
>  static DEFINE_SPINLOCK(watchdog_lock);
>  static int watchdog_running;
> +static unsigned long watchdog_start_time;
>  static atomic_t watchdog_reset_pending;
>
>  static inline void clocksource_watchdog_lock(unsigned long *flags)
> @@ -356,6 +357,7 @@ static void clocksource_watchdog(struct timer_list *unused)
>         int next_cpu, reset_pending;
>         int64_t wd_nsec, cs_nsec;
>         struct clocksource *cs;
> +       unsigned long max_jiffies;
>         u32 md;
>
>         spin_lock(&watchdog_lock);
> @@ -402,6 +404,10 @@ static void clocksource_watchdog(struct timer_list *unused)
>                 if (atomic_read(&watchdog_reset_pending))
>                         continue;
>
> +               max_jiffies = nsecs_to_jiffies(cs->max_idle_ns);
> +               if (time_is_before_jiffies(watchdog_start_time + max_jiffies))
> +                       continue;
> +

Sorry, what is the benefit of using jiffies here?   Jiffies are
updated by counting the number of tick intervals on the current
clocksource.

This seems like circular logic, where we're trying to judge the
current clocksource by using something we derived from the current
clocksource.
That's why the watchdog clocksource is important, as it's supposed to
be a separate counter that is more reliable (but likely slower) than
the preferred clocksource.

So I'm not really sure how this helps.

The earlier patch by yanghui at least used the watchdog interval to
decide if the watchdog timer had expired late, which seemed
reasonable. But I thought it might be helpful to add some sort of a
counter, so that if the case is happening repeatedly (timers
constantly being delayed) we have a better signal that the watchdog
and current clocksource are out of sync.  Because again, timers are
fired based on the current clocksource, so constant delays likely
mean things are wrong.

thanks
-john



* [PATCH] Clocksource: Avoid misjudgment of clocksource
  2021-10-18 16:14             ` John Stultz
@ 2021-10-19  4:14               ` yanghui
  2021-10-19  5:00                 ` John Stultz
  0 siblings, 1 reply; 29+ messages in thread
From: yanghui @ 2021-10-19  4:14 UTC (permalink / raw)
  To: John Stultz, brookxu; +Cc: Thomas Gleixner, Stephen Boyd, lkml, shli



On 2021/10/19 12:14 AM, John Stultz wrote:
> On Tue, Oct 12, 2021 at 1:06 AM brookxu <brookxu.cn@gmail.com> wrote:
>> John Stultz wrote on 2021/10/12 13:29:
>>> On Mon, Oct 11, 2021 at 10:23 PM brookxu <brookxu.cn@gmail.com> wrote:
>>>> John Stultz wrote on 2021/10/12 12:52 PM:
>>>>> On Sat, Oct 9, 2021 at 7:04 AM brookxu <brookxu.cn@gmail.com> wrote:
>>>> If we record the watchdog's start_time in clocksource_start_watchdog(), and then
>>>> when we verify cycles in clocksource_watchdog(), check whether the clocksource
>>>> watchdog is blocked. Due to MSB verification, if the blocked time is greater than
>>>> half of the watchdog timer max_cycles, then we can safely ignore the current
>>>> verification? Do you think this idea is okay?
>>>
>>> I can't say I totally understand the idea. Maybe could you clarify with a patch?
>>>
>>
>> Sorry, it looks almost as follows:
>>
>> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
>> index b8a14d2..87f3b67 100644
>> --- a/kernel/time/clocksource.c
>> +++ b/kernel/time/clocksource.c
>> @@ -119,6 +119,7 @@
>>   static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
>>   static DEFINE_SPINLOCK(watchdog_lock);
>>   static int watchdog_running;
>> +static unsigned long watchdog_start_time;
>>   static atomic_t watchdog_reset_pending;
>>
>>   static inline void clocksource_watchdog_lock(unsigned long *flags)
>> @@ -356,6 +357,7 @@ static void clocksource_watchdog(struct timer_list *unused)
>>          int next_cpu, reset_pending;
>>          int64_t wd_nsec, cs_nsec;
>>          struct clocksource *cs;
>> +       unsigned long max_jiffies;
>>          u32 md;
>>
>>          spin_lock(&watchdog_lock);
>> @@ -402,6 +404,10 @@ static void clocksource_watchdog(struct timer_list *unused)
>>                  if (atomic_read(&watchdog_reset_pending))
>>                          continue;
>>
>> +               max_jiffies = nsecs_to_jiffies(cs->max_idle_ns);
>> +               if (time_is_before_jiffies(watchdog_start_time + max_jiffies))
>> +                       continue;
>> +
> 
> Sorry, what is the benefit of using jiffies here?   Jiffies are
> updated by counting the number of tick intervals on the current
> clocksource.
> 
> This seems like circular logic, where we're trying to judge the
> current clocksource by using something we derived from the current
> clocksource.
> That's why the watchdog clocksource is important, as it's supposed to
> be a separate counter that is more reliable (but likely slower) then
> the preferred clocksource.
> 
> So I'm not really sure how this helps.
> 
> The earlier patch by yanghui at least used the watchdog interval to
> decide if the watchdog timer had expired late. Which seemed
> reasonable, but I thought it might be helpful to add some sort of a
> counter so if the case is happening repeatedly (timers constantly
> being delayed) we have a better signal that the watchdog and current
> clocksource are out of sync.  Because again, timers are fired based on

I think only having a signal is not enough; we need to prevent the
clocksource from being incorrectly switched.

The timer callback function clocksource_watchdog() is executed in
softirq context (run_timer_softirq()). So if softirqs are disabled
for a long time (one such situation is a long soft lockup),
clocksource_watchdog() will be executed late.
> the current clocksource. So constant delays likely mean things are
> wrong.
> > thanks
> -john
> 
> thanks
> -john
> 
I think it would be better to add this to my patch:
  /*
   * Interval: 0.5sec.
- * MaxInterval: 1s.
+ * MaxInterval: 20s.
   */
  #define WATCHDOG_INTERVAL (HZ >> 1)
-#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
+#define WATCHDOG_MAX_INTERVAL_NS (20 * NSEC_PER_SEC)

thanks




* Re: [PATCH] Clocksource: Avoid misjudgment of clocksource
  2021-10-19  4:14               ` yanghui
@ 2021-10-19  5:00                 ` John Stultz
  2021-10-20 10:09                   ` Luming Yu
  0 siblings, 1 reply; 29+ messages in thread
From: John Stultz @ 2021-10-19  5:00 UTC (permalink / raw)
  To: yanghui; +Cc: brookxu, Thomas Gleixner, Stephen Boyd, lkml, Shaohua Li

On Mon, Oct 18, 2021 at 9:14 PM yanghui <yanghui.def@bytedance.com> wrote:
> On 2021/10/19 12:14 AM, John Stultz wrote:
> > On Tue, Oct 12, 2021 at 1:06 AM brookxu <brookxu.cn@gmail.com> wrote:
> >> John Stultz wrote on 2021/10/12 13:29:
> >>> On Mon, Oct 11, 2021 at 10:23 PM brookxu <brookxu.cn@gmail.com> wrote:
> >>>> John Stultz wrote on 2021/10/12 12:52 PM:
> >>>>> On Sat, Oct 9, 2021 at 7:04 AM brookxu <brookxu.cn@gmail.com> wrote:
> >>>> If we record the watchdog's start_time in clocksource_start_watchdog(), and then
> >>>> when we verify cycles in clocksource_watchdog(), check whether the clocksource
> >>>> watchdog is blocked. Due to MSB verification, if the blocked time is greater than
> >>>> half of the watchdog timer max_cycles, then we can safely ignore the current
> >>>> verification? Do you think this idea is okay?
> >>>
> >>> I can't say I totally understand the idea. Maybe could you clarify with a patch?
> >>>
> >>
> >> Sorry, it looks almost as follows:
> >>
> >> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> >> index b8a14d2..87f3b67 100644
> >> --- a/kernel/time/clocksource.c
> >> +++ b/kernel/time/clocksource.c
> >> @@ -119,6 +119,7 @@
> >>   static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
> >>   static DEFINE_SPINLOCK(watchdog_lock);
> >>   static int watchdog_running;
> >> +static unsigned long watchdog_start_time;
> >>   static atomic_t watchdog_reset_pending;
> >>
> >>   static inline void clocksource_watchdog_lock(unsigned long *flags)
> >> @@ -356,6 +357,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> >>          int next_cpu, reset_pending;
> >>          int64_t wd_nsec, cs_nsec;
> >>          struct clocksource *cs;
> >> +       unsigned long max_jiffies;
> >>          u32 md;
> >>
> >>          spin_lock(&watchdog_lock);
> >> @@ -402,6 +404,10 @@ static void clocksource_watchdog(struct timer_list *unused)
> >>                  if (atomic_read(&watchdog_reset_pending))
> >>                          continue;
> >>
> >> +               max_jiffies = nsecs_to_jiffies(cs->max_idle_ns);
> >> +               if (time_is_before_jiffies(watchdog_start_time + max_jiffies))
> >> +                       continue;
> >> +
> >
> > Sorry, what is the benefit of using jiffies here?   Jiffies are
> > updated by counting the number of tick intervals on the current
> > clocksource.
> >
> > This seems like circular logic, where we're trying to judge the
> > current clocksource by using something we derived from the current
> > clocksource.
> > That's why the watchdog clocksource is important, as it's supposed to
> > be a separate counter that is more reliable (but likely slower) then
> > the preferred clocksource.
> >
> > So I'm not really sure how this helps.
> >
> > The earlier patch by yanghui at least used the watchdog interval to
> > decide if the watchdog timer had expired late. Which seemed
> > reasonable, but I thought it might be helpful to add some sort of a
> > counter so if the case is happening repeatedly (timers constantly
> > being delayed) we have a better signal that the watchdog and current
> > clocksource are out of sync.  Because again, timers are fired based on
>
> I think only have a signal ls not enough. we need to prevent
> clocksource from being incorrectly switched.

Right, but we also have to ensure that we properly disqualify
clocksources that are misbehaving.

In the case that the current clocksource is running very slowly
(imagine old TSCs that lowered freq with cpufreq), system time slows
down, so timers fire late.
It would then constantly seem like the irqs are being delayed, so
with your logic we would not disqualify a clearly malfunctioning
clocksource.

> The Timer callback function clocksource_watchdog() is executed in the
> context of softirq(run_timer_softirq()). So if softirq is disabled for
> long time(One situation is long time softlockup), clocksource_watchdog()
> will be delay executed.

Yes. The reality is that timers are often spuriously delayed. We don't
want a short burst of timer misbehavior to disqualify a good
clocksource.

But the problem is that this situation and the one above (with the
freq-changing TSC) will look exactly the same.

So in the case where the watchdog clocksource thinks too much time
has passed between watchdog timers, we can skip judgement, assuming
it's a spurious delay. But I think we need to keep a counter, so that
if this happens 3-5 times in a row we stop ignoring the misbehavior
and judge the current clocksource, as it may be running slowly.
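A userspace sketch of that skip-but-count idea, with made-up names and bounds (the real kernel logic differs):

```c
#include <stdint.h>
#include <stdlib.h>

/* Illustrative bounds, not the kernel's. */
#define WD_MAX_INTERVAL_NS	1000000000LL	/* longest credible interval */
#define WD_MAX_SKIPS		4		/* consecutive skips tolerated */

/*
 * Returns 1 if the clocksource should be judged unstable.  Skew seen
 * together with an implausibly long interval is skipped as a spurious
 * timer delay, but only WD_MAX_SKIPS times in a row: a slow-running
 * clocksource delays every timer and must not escape judgement.
 */
static int wd_judge(int64_t cs_nsec, int64_t wd_nsec, int64_t md, int *skips)
{
	if (llabs(cs_nsec - wd_nsec) <= md) {
		*skips = 0;	/* intervals agree: reset the counter */
		return 0;
	}
	if (wd_nsec > WD_MAX_INTERVAL_NS && ++*skips < WD_MAX_SKIPS)
		return 0;	/* assume a spurious timer delay */
	*skips = 0;
	return 1;		/* skew with a sane interval, or repeated delays */
}
```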

> >
> I think it will be better to add this to my patch:
>   /*
>    * Interval: 0.5sec.
> - * MaxInterval: 1s.
> + * MaxInterval: 20s.
>    */
>   #define WATCHDOG_INTERVAL (HZ >> 1)
> -#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
> +#define WATCHDOG_MAX_INTERVAL_NS (20 * NSEC_PER_SEC)
>

Some watchdog counters wrap within 20 seconds, so I don't think this
is a good idea.

The other proposal, to calculate an error rate rather than use a
fixed error boundary, might be useful too: if the current clocksource
and watchdog are close, a long timer delay won't disqualify them if
we scale the error bounds to stay within a given error rate.
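The error-rate idea could look roughly like this userspace sketch (the names, the nominal interval, and the linear scaling are all assumptions, not the kernel's code):

```c
#include <stdint.h>
#include <stdlib.h>

#define WD_NOMINAL_INTERVAL_NS	500000000LL	/* expected 0.5s between checks */

/*
 * Instead of comparing the raw skew against a margin sized for a 0.5s
 * interval, scale the margin by how long the interval actually was, so
 * a delayed timer does not disqualify a clocksource whose *rate* error
 * is still within bounds.  (A real implementation would guard the
 * multiply below against overflow.)
 */
static int wd_skew_too_large(int64_t cs_nsec, int64_t wd_nsec, int64_t md)
{
	if (wd_nsec > WD_NOMINAL_INTERVAL_NS)
		md = md * wd_nsec / WD_NOMINAL_INTERVAL_NS;
	return llabs(cs_nsec - wd_nsec) > md;
}
```

With a 1 ms margin, a 5 µs-per-second drift passes whether the interval was 0.5 s or a delayed 10 s, while a genuinely broken clocksource still exceeds even the scaled margin.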

thanks
-john


* Re: [PATCH] Clocksource: Avoid misjudgment of clocksource
  2021-10-19  5:00                 ` John Stultz
@ 2021-10-20 10:09                   ` Luming Yu
  2021-10-20 17:49                     ` Paul E. McKenney
  0 siblings, 1 reply; 29+ messages in thread
From: Luming Yu @ 2021-10-20 10:09 UTC (permalink / raw)
  To: John Stultz
  Cc: yanghui, Thomas Gleixner, Stephen Boyd, lkml, Shaohua Li, paulmck

On Tue, Oct 19, 2021 at 1:04 AM John Stultz <john.stultz@linaro.org> wrote:
>
> On Mon, Oct 18, 2021 at 9:14 PM yanghui <yanghui.def@bytedance.com> wrote:
> > On 2021/10/19 12:14 AM, John Stultz wrote:
> > > On Tue, Oct 12, 2021 at 1:06 AM brookxu <brookxu.cn@gmail.com> wrote:
> > >> John Stultz wrote on 2021/10/12 13:29:
> > >>> On Mon, Oct 11, 2021 at 10:23 PM brookxu <brookxu.cn@gmail.com> wrote:
> > >>>> John Stultz wrote on 2021/10/12 12:52 PM:
> > >>>>> On Sat, Oct 9, 2021 at 7:04 AM brookxu <brookxu.cn@gmail.com> wrote:
> > >>>> If we record the watchdog's start_time in clocksource_start_watchdog(), and then
> > >>>> when we verify cycles in clocksource_watchdog(), check whether the clocksource
> > >>>> watchdog is blocked. Due to MSB verification, if the blocked time is greater than
> > >>>> half of the watchdog timer max_cycles, then we can safely ignore the current
> > >>>> verification? Do you think this idea is okay?
> > >>>
> > >>> I can't say I totally understand the idea. Maybe could you clarify with a patch?
> > >>>
> > >>
> > >> Sorry, it looks almost as follows:
> > >>
> > >> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> > >> index b8a14d2..87f3b67 100644
> > >> --- a/kernel/time/clocksource.c
> > >> +++ b/kernel/time/clocksource.c
> > >> @@ -119,6 +119,7 @@
> > >>   static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
> > >>   static DEFINE_SPINLOCK(watchdog_lock);
> > >>   static int watchdog_running;
> > >> +static unsigned long watchdog_start_time;
> > >>   static atomic_t watchdog_reset_pending;
> > >>
> > >>   static inline void clocksource_watchdog_lock(unsigned long *flags)
> > >> @@ -356,6 +357,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> > >>          int next_cpu, reset_pending;
> > >>          int64_t wd_nsec, cs_nsec;
> > >>          struct clocksource *cs;
> > >> +       unsigned long max_jiffies;
> > >>          u32 md;
> > >>
> > >>          spin_lock(&watchdog_lock);
> > >> @@ -402,6 +404,10 @@ static void clocksource_watchdog(struct timer_list *unused)
> > >>                  if (atomic_read(&watchdog_reset_pending))
> > >>                          continue;
> > >>
> > >> +               max_jiffies = nsecs_to_jiffies(cs->max_idle_ns);
> > >> +               if (time_is_before_jiffies(watchdog_start_time + max_jiffies))
> > >> +                       continue;
> > >> +
> > >
> > > Sorry, what is the benefit of using jiffies here?   Jiffies are
> > > updated by counting the number of tick intervals on the current
> > > clocksource.
> > >
> > > This seems like circular logic, where we're trying to judge the
> > > current clocksource by using something we derived from the current
> > > clocksource.
> > > That's why the watchdog clocksource is important, as it's supposed to
> > > be a separate counter that is more reliable (but likely slower) then
> > > the preferred clocksource.
> > >
> > > So I'm not really sure how this helps.
> > >
> > > The earlier patch by yanghui at least used the watchdog interval to
> > > decide if the watchdog timer had expired late. Which seemed
> > > reasonable, but I thought it might be helpful to add some sort of a
> > > counter so if the case is happening repeatedly (timers constantly
> > > being delayed) we have a better signal that the watchdog and current
> > > clocksource are out of sync.  Because again, timers are fired based on
> >
> > I think only have a signal ls not enough. we need to prevent
> > clocksource from being incorrectly switched.
>
> Right, but we also have to ensure that we also properly disqualify
> clocksources that are misbehaving.
>
> In the case that the current clocksource is running very slow (imagine
> old TSCs that lowered freq with cpufreq), then system time slows down,
> so timers fire late.
> So it would constantly seem like the irqs are being delayed, so with
> your logic we would not disqualify a clearly malfunctioning
> clocksource..
>
> > The Timer callback function clocksource_watchdog() is executed in the
> > context of softirq(run_timer_softirq()). So if softirq is disabled for
> > long time(One situation is long time softlockup), clocksource_watchdog()
> > will be delay executed.
>
> Yes. The reality is that timers are often spuriously delayed. We don't
> want a short burst of timer misbehavior to disqualify a good
> clocksource.
>
> But the problem is that this situation and the one above (with the
> freq changing TSC), will look exactly the same.
>
> So having a situation where if the watchdog clocksource thinks too
> much time has passed between watchdog timers, we can skip judgement,
> assuming its a spurious delay. But I think we need to keep a counter
> so that if this happens 3-5 times in a row, we stop ignoring the
> misbehavior and judge the current clocksource, as it may be running
> slowly.
>
> > >
> > I think it will be better to add this to my patch:
> >   /*
> >    * Interval: 0.5sec.
> > - * MaxInterval: 1s.
> > + * MaxInterval: 20s.
> >    */
> >   #define WATCHDOG_INTERVAL (HZ >> 1)
> > -#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
> > +#define WATCHDOG_MAX_INTERVAL_NS (20 * NSEC_PER_SEC)
> >
>
> Some watchdog counters wrap within 20 seconds, so I don't think this
> is a good idea.
>
> The other proposal to calculate the error rate, rather than a fixed
> error boundary might be useful too, as if the current clocksource and
> watchdog are close, a long timer delay won't disqualify them if we
> scale the error bounds to be within an given error rate.

In most of the TSC-unstable troubleshooting we have experienced on
modern servers, it usually ends up being a false alarm triggered by
the clocksource watchdog for the TSC.

I think Paul has a proposal to make the clocksource watchdog more
intelligent: its job is to find real problems instead of causing
them.

So disabling it for a known-good TSC might be a reasonably good idea
that saves manpower for other, more valuable problems, or at least
makes such false alarms statistically less likely to happen.
>
> thanks
> -john


* Re: [PATCH] Clocksource: Avoid misjudgment of clocksource
  2021-10-20 10:09                   ` Luming Yu
@ 2021-10-20 17:49                     ` Paul E. McKenney
  2021-10-21  9:37                       ` Luming Yu
  0 siblings, 1 reply; 29+ messages in thread
From: Paul E. McKenney @ 2021-10-20 17:49 UTC (permalink / raw)
  To: Luming Yu
  Cc: John Stultz, yanghui, Thomas Gleixner, Stephen Boyd, lkml, Shaohua Li

On Wed, Oct 20, 2021 at 06:09:58AM -0400, Luming Yu wrote:
> On Tue, Oct 19, 2021 at 1:04 AM John Stultz <john.stultz@linaro.org> wrote:
> >
> > On Mon, Oct 18, 2021 at 9:14 PM yanghui <yanghui.def@bytedance.com> wrote:
> > > On 2021/10/19 12:14 AM, John Stultz wrote:
> > > > On Tue, Oct 12, 2021 at 1:06 AM brookxu <brookxu.cn@gmail.com> wrote:
> > > >> John Stultz wrote on 2021/10/12 13:29:
> > > >>> On Mon, Oct 11, 2021 at 10:23 PM brookxu <brookxu.cn@gmail.com> wrote:
> > > >>>> John Stultz wrote on 2021/10/12 12:52 PM:
> > > >>>>> On Sat, Oct 9, 2021 at 7:04 AM brookxu <brookxu.cn@gmail.com> wrote:
> > > >>>> If we record the watchdog's start_time in clocksource_start_watchdog(), and then
> > > >>>> when we verify cycles in clocksource_watchdog(), check whether the clocksource
> > > >>>> watchdog is blocked. Due to MSB verification, if the blocked time is greater than
> > > >>>> half of the watchdog timer max_cycles, then we can safely ignore the current
> > > >>>> verification? Do you think this idea is okay?
> > > >>>
> > > >>> I can't say I totally understand the idea. Maybe could you clarify with a patch?
> > > >>>
> > > >>
> > > >> Sorry, it looks almost as follows:
> > > >>
> > > >> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> > > >> index b8a14d2..87f3b67 100644
> > > >> --- a/kernel/time/clocksource.c
> > > >> +++ b/kernel/time/clocksource.c
> > > >> @@ -119,6 +119,7 @@
> > > >>   static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
> > > >>   static DEFINE_SPINLOCK(watchdog_lock);
> > > >>   static int watchdog_running;
> > > >> +static unsigned long watchdog_start_time;
> > > >>   static atomic_t watchdog_reset_pending;
> > > >>
> > > >>   static inline void clocksource_watchdog_lock(unsigned long *flags)
> > > >> @@ -356,6 +357,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > >>          int next_cpu, reset_pending;
> > > >>          int64_t wd_nsec, cs_nsec;
> > > >>          struct clocksource *cs;
> > > >> +       unsigned long max_jiffies;
> > > >>          u32 md;
> > > >>
> > > >>          spin_lock(&watchdog_lock);
> > > >> @@ -402,6 +404,10 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > >>                  if (atomic_read(&watchdog_reset_pending))
> > > >>                          continue;
> > > >>
> > > >> +               max_jiffies = nsecs_to_jiffies(cs->max_idle_ns);
> > > >> +               if (time_is_before_jiffies(watchdog_start_time + max_jiffies))
> > > >> +                       continue;
> > > >> +
> > > >
> > > > Sorry, what is the benefit of using jiffies here?   Jiffies are
> > > > updated by counting the number of tick intervals on the current
> > > > clocksource.
> > > >
> > > > This seems like circular logic, where we're trying to judge the
> > > > current clocksource by using something we derived from the current
> > > > clocksource.
> > > > That's why the watchdog clocksource is important, as it's supposed to
> > > > be a separate counter that is more reliable (but likely slower) then
> > > > the preferred clocksource.
> > > >
> > > > So I'm not really sure how this helps.
> > > >
> > > > The earlier patch by yanghui at least used the watchdog interval to
> > > > decide if the watchdog timer had expired late. Which seemed
> > > > reasonable, but I thought it might be helpful to add some sort of a
> > > > counter so if the case is happening repeatedly (timers constantly
> > > > being delayed) we have a better signal that the watchdog and current
> > > > clocksource are out of sync.  Because again, timers are fired based on
> > >
> > > I think only have a signal ls not enough. we need to prevent
> > > clocksource from being incorrectly switched.
> >
> > Right, but we also have to ensure that we also properly disqualify
> > clocksources that are misbehaving.
> >
> > In the case that the current clocksource is running very slow (imagine
> > old TSCs that lowered freq with cpufreq), then system time slows down,
> > so timers fire late.
> > So it would constantly seem like the irqs are being delayed, so with
> > your logic we would not disqualify a clearly malfunctioning
> > clocksource..
> >
> > > The Timer callback function clocksource_watchdog() is executed in the
> > > context of softirq(run_timer_softirq()). So if softirq is disabled for
> > > long time(One situation is long time softlockup), clocksource_watchdog()
> > > will be delay executed.
> >
> > Yes. The reality is that timers are often spuriously delayed. We don't
> > want a short burst of timer misbehavior to disqualify a good
> > clocksource.
> >
> > But the problem is that this situation and the one above (with the
> > freq changing TSC), will look exactly the same.
> >
> > So having a situation where if the watchdog clocksource thinks too
> > much time has passed between watchdog timers, we can skip judgement,
> > assuming its a spurious delay. But I think we need to keep a counter
> > so that if this happens 3-5 times in a row, we stop ignoring the
> > misbehavior and judge the current clocksource, as it may be running
> > slowly.
> >
> > > >
> > > I think it will be better to add this to my patch:
> > >   /*
> > >    * Interval: 0.5sec.
> > > - * MaxInterval: 1s.
> > > + * MaxInterval: 20s.
> > >    */
> > >   #define WATCHDOG_INTERVAL (HZ >> 1)
> > > -#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
> > > +#define WATCHDOG_MAX_INTERVAL_NS (20 * NSEC_PER_SEC)
> > >
> >
> > Some watchdog counters wrap within 20 seconds, so I don't think this
> > is a good idea.
> >
> > The other proposal to calculate the error rate, rather than a fixed
> > error boundary might be useful too, as if the current clocksource and
> > watchdog are close, a long timer delay won't disqualify them if we
> > scale the error bounds to be within an given error rate.
> 
> In most of tsc unstable trouble shooting on modern servers we experienced,
> it usually ends up in a false alarm triggered by the  clock source
> watchdog for tsc.
> 
> I think Paul has a proposal to make a clock source watchdog to be more
> intelligent.
> Its job is to find a real problem instead of causing a problem.

And that proposal is now in mainline:

22a223833716 clocksource: Print deviation in nanoseconds when a clocksource becomes unstable
1253b9b87e42 clocksource: Provide kernel module to test clocksource watchdog
2e27e793e280 clocksource: Reduce clocksource-skew threshold
fa218f1cce6b clocksource: Limit number of CPUs checked for clock synchronization
7560c02bdffb clocksource: Check per-CPU clock synchronization when marked unstable
db3a34e17433 clocksource: Retry clock read if long delays detected

The strategy is to disqualify a clock comparison if the reads took too
long, and to retry the comparison.  If four consecutive comparisons
take too long, clock skew is reported.  The number of consecutive
comparisons may be adjusted by the usual kernel boot parameter.
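In rough userspace C (the names and the thresholds below are assumed, not the mainline identifiers), that retry strategy reads the watchdog on both sides of the clocksource read so a comparison can be discarded when the reads themselves were slow:

```c
#include <stdint.h>

#define WD_READ_LIMIT_NS	125000LL	/* illustrative "too slow" bound */
#define WD_READ_RETRIES		4		/* consecutive slow reads tolerated */

enum wd_read { WD_READ_OK, WD_READ_SKIP, WD_READ_UNSTABLE };

/*
 * wd_before/wd_after are watchdog timestamps taken just before and just
 * after reading the clocksource under test.  If the pair of reads took
 * too long, the skew measurement is meaningless, so it is discarded and
 * retried; only persistently slow reads end in a skew report.
 */
static enum wd_read classify_read(int64_t wd_before, int64_t wd_after, int *slow)
{
	if (wd_after - wd_before < WD_READ_LIMIT_NS) {
		*slow = 0;
		return WD_READ_OK;	/* fast read: trust the comparison */
	}
	if (++*slow < WD_READ_RETRIES)
		return WD_READ_SKIP;	/* discard and retry */
	return WD_READ_UNSTABLE;	/* four in a row: report skew */
}
```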

> so disabling it for known good-tsc might be a reasonable good idea
> that can save manpower for other
> more valuable problems to solve, or at least make it statistically a
> problem less chance to happen.

One additional piece that is still in prototype state in -rcu is to give
clocksources some opportunity to resynchronize if there are too many
slow comparisons.  This is intended to handle cases where clocks often
take hundreds of microseconds to read out due to cache-buster workloads
interfering with all and sundry.  Testing indicates that even in the
midst of such cache-busting, there will almost certainly be at least one
fast comparison within a five-second period.  So the current thought is
to give a clocksource 100 seconds in which to get a fast comparison.

Those who have only one clock always know what time it is.  Those with
more than one can never be certain.  Except that the universe is
well-supplied with a great many clocks outside of the computer system
in question.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] Clocksource: Avoid misjudgment of clocksource
  2021-10-20 17:49                     ` Paul E. McKenney
@ 2021-10-21  9:37                       ` Luming Yu
  2021-10-22 23:36                         ` Paul E. McKenney
  0 siblings, 1 reply; 29+ messages in thread
From: Luming Yu @ 2021-10-21  9:37 UTC (permalink / raw)
  To: paulmck
  Cc: John Stultz, yanghui, Thomas Gleixner, Stephen Boyd, lkml, Shaohua Li

On Wed, Oct 20, 2021 at 1:49 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Wed, Oct 20, 2021 at 06:09:58AM -0400, Luming Yu wrote:
> > On Tue, Oct 19, 2021 at 1:04 AM John Stultz <john.stultz@linaro.org> wrote:
> > >
> > > On Mon, Oct 18, 2021 at 9:14 PM yanghui <yanghui.def@bytedance.com> wrote:
> > > > > On 2021/10/19 at 12:14 AM, John Stultz wrote:
> > > > > On Tue, Oct 12, 2021 at 1:06 AM brookxu <brookxu.cn@gmail.com> wrote:
> > > > >> John Stultz wrote on 2021/10/12 13:29:
> > > > >>> On Mon, Oct 11, 2021 at 10:23 PM brookxu <brookxu.cn@gmail.com> wrote:
> > > > >>>> John Stultz wrote on 2021/10/12 12:52 PM:
> > > > >>>>> On Sat, Oct 9, 2021 at 7:04 AM brookxu <brookxu.cn@gmail.com> wrote:
> > > > >>>> If we record the watchdog's start_time in clocksource_start_watchdog(), and then
> > > > >>>> when we verify cycles in clocksource_watchdog(), check whether the clocksource
> > > > >>>> watchdog was blocked. Because of the MSB verification, if the blocked time is greater
> > > > >>>> than half of the watchdog timer's max_cycles, can we then safely ignore the current
> > > > >>>> verification? Do you think this idea is okay?
> > > > >>>
> > > > >>> I can't say I totally understand the idea. Maybe could you clarify with a patch?
> > > > >>>
> > > > >>
> > > > >> Sorry, it looks almost as follows:
> > > > >>
> > > > >> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> > > > >> index b8a14d2..87f3b67 100644
> > > > >> --- a/kernel/time/clocksource.c
> > > > >> +++ b/kernel/time/clocksource.c
> > > > >> @@ -119,6 +119,7 @@
> > > > >>   static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
> > > > >>   static DEFINE_SPINLOCK(watchdog_lock);
> > > > >>   static int watchdog_running;
> > > > >> +static unsigned long watchdog_start_time;
> > > > >>   static atomic_t watchdog_reset_pending;
> > > > >>
> > > > >>   static inline void clocksource_watchdog_lock(unsigned long *flags)
> > > > >> @@ -356,6 +357,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > > >>          int next_cpu, reset_pending;
> > > > >>          int64_t wd_nsec, cs_nsec;
> > > > >>          struct clocksource *cs;
> > > > >> +       unsigned long max_jiffies;
> > > > >>          u32 md;
> > > > >>
> > > > >>          spin_lock(&watchdog_lock);
> > > > >> @@ -402,6 +404,10 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > > >>                  if (atomic_read(&watchdog_reset_pending))
> > > > >>                          continue;
> > > > >>
> > > > >> +               max_jiffies = nsecs_to_jiffies(cs->max_idle_ns);
> > > > >> +               if (time_is_before_jiffies(watchdog_start_time + max_jiffies))
> > > > >> +                       continue;
> > > > >> +
> > > > >
> > > > > Sorry, what is the benefit of using jiffies here?   Jiffies are
> > > > > updated by counting the number of tick intervals on the current
> > > > > clocksource.
> > > > >
> > > > > This seems like circular logic, where we're trying to judge the
> > > > > current clocksource by using something we derived from the current
> > > > > clocksource.
> > > > > That's why the watchdog clocksource is important, as it's supposed to
> > > > > be a separate counter that is more reliable (but likely slower) than
> > > > > the preferred clocksource.
> > > > >
> > > > > So I'm not really sure how this helps.
> > > > >
> > > > > The earlier patch by yanghui at least used the watchdog interval to
> > > > > decide if the watchdog timer had expired late. Which seemed
> > > > > reasonable, but I thought it might be helpful to add some sort of a
> > > > > counter so if the case is happening repeatedly (timers constantly
> > > > > being delayed) we have a better signal that the watchdog and current
> > > > > clocksource are out of sync.  Because again, timers are fired based on
> > > >
> > > > I think only having a signal is not enough. We need to prevent the
> > > > clocksource from being incorrectly switched.
> > >
> > > Right, but we also have to ensure that we also properly disqualify
> > > clocksources that are misbehaving.
> > >
> > > In the case that the current clocksource is running very slow (imagine
> > > old TSCs that lowered freq with cpufreq), then system time slows down,
> > > so timers fire late.
> > > So it would constantly seem like the irqs are being delayed, so with
> > > your logic we would not disqualify a clearly malfunctioning
> > > clocksource.
> > >
> > > > The timer callback function clocksource_watchdog() is executed in
> > > > softirq context (run_timer_softirq()). So if softirq is disabled for a
> > > > long time (one situation is a long softlockup), clocksource_watchdog()
> > > > will be executed late.
> > >
> > > Yes. The reality is that timers are often spuriously delayed. We don't
> > > want a short burst of timer misbehavior to disqualify a good
> > > clocksource.
> > >
> > > But the problem is that this situation and the one above (with the
> > > freq changing TSC), will look exactly the same.
> > >
> > > So having a situation where if the watchdog clocksource thinks too
> > > much time has passed between watchdog timers, we can skip judgement,
> > > assuming it's a spurious delay. But I think we need to keep a counter
> > > so that if this happens 3-5 times in a row, we stop ignoring the
> > > misbehavior and judge the current clocksource, as it may be running
> > > slowly.
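The skip-with-counter idea in the quoted paragraph could be sketched as follows; the interval, the limit of four, and the names are illustrative assumptions, not a proposed patch.

```c
#include <assert.h>
#include <stdint.h>

#define WD_INTERVAL_NS	500000000LL	/* nominal 0.5 s watchdog period */
#define WD_MAX_SKIPS	4

static int consecutive_delays;

/*
 * Sketch of the counter idea: skip judgement when the watchdog interval
 * looks stretched, but only a bounded number of consecutive times, so
 * that a genuinely slow clocksource is still disqualified eventually.
 */
static int wd_should_skip(int64_t wd_nsec)
{
	if (wd_nsec > 2 * WD_INTERVAL_NS && consecutive_delays < WD_MAX_SKIPS) {
		consecutive_delays++;
		return 1;	/* looks like a spurious timer delay */
	}
	consecutive_delays = 0;
	return 0;	/* judge the clocksource normally */
}
```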
> > >
> > > > >
> > > > I think it will be better to add this to my patch:
> > > >   /*
> > > >    * Interval: 0.5sec.
> > > > - * MaxInterval: 1s.
> > > > + * MaxInterval: 20s.
> > > >    */
> > > >   #define WATCHDOG_INTERVAL (HZ >> 1)
> > > > -#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
> > > > +#define WATCHDOG_MAX_INTERVAL_NS (20 * NSEC_PER_SEC)
> > > >
> > >
> > > Some watchdog counters wrap within 20 seconds, so I don't think this
> > > is a good idea.
> > >
> > > The other proposal to calculate the error rate, rather than a fixed
> > > error boundary might be useful too, as if the current clocksource and
> > > watchdog are close, a long timer delay won't disqualify them if we
> > > > scale the error bounds to be within a given error rate.
> >
> > In most of the TSC-unstable troubleshooting we have done on modern
> > servers, the issue usually turns out to be a false alarm triggered by
> > the clocksource watchdog for the TSC.
> >
> > I think Paul has a proposal to make the clocksource watchdog more
> > intelligent.
> > Its job is to find real problems instead of causing them.
>
> And that proposal is now in mainline:

Great! :-)
>
> 22a223833716 clocksource: Print deviation in nanoseconds when a clocksource becomes unstable
> 1253b9b87e42 clocksource: Provide kernel module to test clocksource watchdog
> 2e27e793e280 clocksource: Reduce clocksource-skew threshold
> fa218f1cce6b clocksource: Limit number of CPUs checked for clock synchronization
> 7560c02bdffb clocksource: Check per-CPU clock synchronization when marked unstable
> db3a34e17433 clocksource: Retry clock read if long delays detected
>
> The strategy is to disqualify a clock comparison if the reads took too
> long, and to retry the comparison.  If four consecutive comparisons take
> too long, clock skew is reported.  The number of consecutive comparisons
> may be adjusted by the usual kernel boot parameter.
>
> > so disabling it for a known-good TSC might be a reasonably good idea
> > that can free up manpower for more valuable problems to solve, or at
> > least make such false alarms statistically less likely to happen.
>
> One additional piece that is still in prototype state in -rcu is to give
> clocksources some opportunity to resynchronize if there are too many
> slow comparisons.  This is intended to handle cases where clocks often

If such a TSC-sync algorithm existed in software, it could really help
system software engineers solve some rare power-good-signal
synchronization problems caused by the BIOS. Those problems lead to
boot-time TSC sync-check failures that usually consume a huge amount of
debugging effort when bringing up a qualified Linux system.

Depending less on platform quirks would be a good thing for Linux TSC
and RCU support.

> take hundreds of microseconds to read out due to cache-buster workloads
> interfering with all and sundry.  Testing indicates that even in the
> midst of such cache-busting, there will almost certainly be at least one
> fast comparison within a five-second period.  So the current thought is
> to give a clocksource 100 seconds in which to get a fast comparison.
>
> Those who have only one clock always know what time it is.  Those with
> more than one can never be certain.  Except that the universe is
> well-supplied with a great many clocks outside of the computer system
> in question.  ;-)
>
>                                                         Thanx, Paul

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] Clocksource: Avoid misjudgment of clocksource
  2021-10-21  9:37                       ` Luming Yu
@ 2021-10-22 23:36                         ` Paul E. McKenney
  2021-11-01  9:59                           ` Luming Yu
  0 siblings, 1 reply; 29+ messages in thread
From: Paul E. McKenney @ 2021-10-22 23:36 UTC (permalink / raw)
  To: Luming Yu
  Cc: John Stultz, yanghui, Thomas Gleixner, Stephen Boyd, lkml, Shaohua Li

On Thu, Oct 21, 2021 at 05:37:24AM -0400, Luming Yu wrote:
> On Wed, Oct 20, 2021 at 1:49 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Wed, Oct 20, 2021 at 06:09:58AM -0400, Luming Yu wrote:
> > > On Tue, Oct 19, 2021 at 1:04 AM John Stultz <john.stultz@linaro.org> wrote:
> > > >
> > > > On Mon, Oct 18, 2021 at 9:14 PM yanghui <yanghui.def@bytedance.com> wrote:
> > > > > > On 2021/10/19 at 12:14 AM, John Stultz wrote:
> > > > > > On Tue, Oct 12, 2021 at 1:06 AM brookxu <brookxu.cn@gmail.com> wrote:
> > > > > >> John Stultz wrote on 2021/10/12 13:29:
> > > > > >>> On Mon, Oct 11, 2021 at 10:23 PM brookxu <brookxu.cn@gmail.com> wrote:
> > > > > >>>> John Stultz wrote on 2021/10/12 12:52 PM:
> > > > > >>>>> On Sat, Oct 9, 2021 at 7:04 AM brookxu <brookxu.cn@gmail.com> wrote:
> > > > > >>>> If we record the watchdog's start_time in clocksource_start_watchdog(), and then
> > > > > >>>> when we verify cycles in clocksource_watchdog(), check whether the clocksource
> > > > > >>>> watchdog was blocked. Because of the MSB verification, if the blocked time is greater
> > > > > >>>> than half of the watchdog timer's max_cycles, can we then safely ignore the current
> > > > > >>>> verification? Do you think this idea is okay?
> > > > > >>>
> > > > > >>> I can't say I totally understand the idea. Maybe could you clarify with a patch?
> > > > > >>>
> > > > > >>
> > > > > >> Sorry, it looks almost as follows:
> > > > > >>
> > > > > >> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> > > > > >> index b8a14d2..87f3b67 100644
> > > > > >> --- a/kernel/time/clocksource.c
> > > > > >> +++ b/kernel/time/clocksource.c
> > > > > >> @@ -119,6 +119,7 @@
> > > > > >>   static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
> > > > > >>   static DEFINE_SPINLOCK(watchdog_lock);
> > > > > >>   static int watchdog_running;
> > > > > >> +static unsigned long watchdog_start_time;
> > > > > >>   static atomic_t watchdog_reset_pending;
> > > > > >>
> > > > > >>   static inline void clocksource_watchdog_lock(unsigned long *flags)
> > > > > >> @@ -356,6 +357,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > > > >>          int next_cpu, reset_pending;
> > > > > >>          int64_t wd_nsec, cs_nsec;
> > > > > >>          struct clocksource *cs;
> > > > > >> +       unsigned long max_jiffies;
> > > > > >>          u32 md;
> > > > > >>
> > > > > >>          spin_lock(&watchdog_lock);
> > > > > >> @@ -402,6 +404,10 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > > > >>                  if (atomic_read(&watchdog_reset_pending))
> > > > > >>                          continue;
> > > > > >>
> > > > > >> +               max_jiffies = nsecs_to_jiffies(cs->max_idle_ns);
> > > > > >> +               if (time_is_before_jiffies(watchdog_start_time + max_jiffies))
> > > > > >> +                       continue;
> > > > > >> +
> > > > > >
> > > > > > Sorry, what is the benefit of using jiffies here?   Jiffies are
> > > > > > updated by counting the number of tick intervals on the current
> > > > > > clocksource.
> > > > > >
> > > > > > This seems like circular logic, where we're trying to judge the
> > > > > > current clocksource by using something we derived from the current
> > > > > > clocksource.
> > > > > > That's why the watchdog clocksource is important, as it's supposed to
> > > > > > be a separate counter that is more reliable (but likely slower) than
> > > > > > the preferred clocksource.
> > > > > >
> > > > > > So I'm not really sure how this helps.
> > > > > >
> > > > > > The earlier patch by yanghui at least used the watchdog interval to
> > > > > > decide if the watchdog timer had expired late. Which seemed
> > > > > > reasonable, but I thought it might be helpful to add some sort of a
> > > > > > counter so if the case is happening repeatedly (timers constantly
> > > > > > being delayed) we have a better signal that the watchdog and current
> > > > > > clocksource are out of sync.  Because again, timers are fired based on
> > > > >
> > > > > I think only having a signal is not enough. We need to prevent the
> > > > > clocksource from being incorrectly switched.
> > > >
> > > > Right, but we also have to ensure that we also properly disqualify
> > > > clocksources that are misbehaving.
> > > >
> > > > In the case that the current clocksource is running very slow (imagine
> > > > old TSCs that lowered freq with cpufreq), then system time slows down,
> > > > so timers fire late.
> > > > So it would constantly seem like the irqs are being delayed, so with
> > > > your logic we would not disqualify a clearly malfunctioning
> > > > clocksource.
> > > >
> > > > > The timer callback function clocksource_watchdog() is executed in
> > > > > softirq context (run_timer_softirq()). So if softirq is disabled for a
> > > > > long time (one situation is a long softlockup), clocksource_watchdog()
> > > > > will be executed late.
> > > >
> > > > Yes. The reality is that timers are often spuriously delayed. We don't
> > > > want a short burst of timer misbehavior to disqualify a good
> > > > clocksource.
> > > >
> > > > But the problem is that this situation and the one above (with the
> > > > freq changing TSC), will look exactly the same.
> > > >
> > > > So having a situation where if the watchdog clocksource thinks too
> > > > much time has passed between watchdog timers, we can skip judgement,
> > > > assuming it's a spurious delay. But I think we need to keep a counter
> > > > so that if this happens 3-5 times in a row, we stop ignoring the
> > > > misbehavior and judge the current clocksource, as it may be running
> > > > slowly.
> > > >
> > > > > >
> > > > > I think it will be better to add this to my patch:
> > > > >   /*
> > > > >    * Interval: 0.5sec.
> > > > > - * MaxInterval: 1s.
> > > > > + * MaxInterval: 20s.
> > > > >    */
> > > > >   #define WATCHDOG_INTERVAL (HZ >> 1)
> > > > > -#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
> > > > > +#define WATCHDOG_MAX_INTERVAL_NS (20 * NSEC_PER_SEC)
> > > > >
> > > >
> > > > Some watchdog counters wrap within 20 seconds, so I don't think this
> > > > is a good idea.
> > > >
> > > > The other proposal to calculate the error rate, rather than a fixed
> > > > error boundary might be useful too, as if the current clocksource and
> > > > watchdog are close, a long timer delay won't disqualify them if we
> > > > > scale the error bounds to be within a given error rate.
> > >
> > > In most of the TSC-unstable troubleshooting we have done on modern
> > > servers, the issue usually turns out to be a false alarm triggered by
> > > the clocksource watchdog for the TSC.
> > >
> > > I think Paul has a proposal to make the clocksource watchdog more
> > > intelligent.
> > > Its job is to find real problems instead of causing them.
> >
> > And that proposal is now in mainline:
> 
> Great! :-)
> >
> > 22a223833716 clocksource: Print deviation in nanoseconds when a clocksource becomes unstable
> > 1253b9b87e42 clocksource: Provide kernel module to test clocksource watchdog
> > 2e27e793e280 clocksource: Reduce clocksource-skew threshold
> > fa218f1cce6b clocksource: Limit number of CPUs checked for clock synchronization
> > 7560c02bdffb clocksource: Check per-CPU clock synchronization when marked unstable
> > db3a34e17433 clocksource: Retry clock read if long delays detected
> >
> > The strategy is to disqualify a clock comparison if the reads took too
> > long, and to retry the comparison.  If four consecutive comparisons take
> > too long, clock skew is reported.  The number of consecutive comparisons
> > may be adjusted by the usual kernel boot parameter.
> >
> > > so disabling it for a known-good TSC might be a reasonably good idea
> > > that can free up manpower for more valuable problems to solve, or at
> > > least make such false alarms statistically less likely to happen.
> >
> > One additional piece that is still in prototype state in -rcu is to give
> > clocksources some opportunity to resynchronize if there are too many
> > slow comparisons.  This is intended to handle cases where clocks often
> 
> If such a TSC-sync algorithm existed in software, it could really help
> system software engineers solve some rare power-good-signal
> synchronization problems caused by the BIOS. Those problems lead to
> boot-time TSC sync-check failures that usually consume a huge amount of
> debugging effort when bringing up a qualified Linux system.
> 
> Depending less on platform quirks would be a good thing for Linux TSC
> and RCU support.

Good point, I have procrastinated long enough.

How about like this?

							Thanx, Paul

------------------------------------------------------------------------

commit 9ec2a03bbf4bee3d9fbc02a402dee36efafc5a2d
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Thu May 27 11:03:28 2021 -0700

    clocksource: Forgive repeated long-latency watchdog clocksource reads
    
    Currently, the clocksource watchdog reacts to repeated long-latency
    clocksource reads by marking that clocksource unstable on the theory that
    these long-latency reads are a sign of a serious problem.  And this theory
    does in fact have real-world support in the form of firmware issues [1].
    
    However, it is also possible to trigger this using stress-ng on what
    the stress-ng man page terms "poorly designed hardware" [2].  And it
    is not necessarily a bad thing for the kernel to diagnose cases where
    heavy memory-contention workloads are being run on hardware that is not
    designed for this sort of use.
    
    Nevertheless, it is quite possible that real-world use will result in
    some situation requiring that high-stress workloads run on hardware
    not designed to accommodate them, and also requiring that the kernel
    refrain from marking clocksources unstable.
    
    Therefore, react to persistent long-latency reads by leaving the
    clocksource alone, but using the old 62.5-millisecond skew-detection
    threshold.  In addition, the offending clocksource is marked for
    re-initialization, which both restarts that clocksource with a clean bill
    of health and avoids false-positive skew reports on later watchdog checks.
    Once marked for re-initialization, the clocksource is not subjected to
    further watchdog checking until a subsequent successful read from that
    clocksource that is free of excessive delays.
    
    However, if clocksource.max_cswd_coarse_reads consecutive clocksource read
    attempts result in long latencies, a warning (splat) will be emitted.
    This kernel boot parameter defaults to 100, and this warning can be
    disabled by setting it to zero or to a negative value.
    
    [ paulmck: Apply feedback from Chao Gao ]
    
    Link: https://lore.kernel.org/lkml/20210513155515.GB23902@xsang-OptiPlex-9020/ # [1]
    Link: https://lore.kernel.org/lkml/20210521083322.GG25531@xsang-OptiPlex-9020/ # [2]
    Link: https://lore.kernel.org/lkml/20210521084405.GH25531@xsang-OptiPlex-9020/
    Link: https://lore.kernel.org/lkml/20210511233403.GA2896757@paulmck-ThinkPad-P17-Gen-1/
    Tested-by: Chao Gao <chao.gao@intel.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 316027c3aadc..61d2436ae9df 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -600,6 +600,14 @@
 			loops can be debugged more effectively on production
 			systems.
 
+	clocksource.max_cswd_coarse_reads= [KNL]
+			Number of consecutive clocksource_watchdog()
+			coarse reads (that is, clocksource reads that
+			were unduly delayed) that are permitted before
+			the kernel complains (gently).	Set to a value
+			less than or equal to zero to suppress these
+			complaints.
+
 	clocksource.max_cswd_read_retries= [KNL]
 			Number of clocksource_watchdog() retries due to
 			external delays before the clock will be marked
diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 1d42d4b17327..3e925d9ffc31 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -110,6 +110,7 @@ struct clocksource {
 	int			rating;
 	enum clocksource_ids	id;
 	enum vdso_clock_mode	vdso_clock_mode;
+	unsigned int		n_coarse_reads;
 	unsigned long		flags;
 
 	int			(*enable)(struct clocksource *cs);
@@ -291,6 +292,7 @@ static inline void timer_probe(void) {}
 #define TIMER_ACPI_DECLARE(name, table_id, fn)		\
 	ACPI_DECLARE_PROBE_ENTRY(timer, name, table_id, 0, NULL, 0, fn)
 
+extern int max_cswd_coarse_reads;
 extern ulong max_cswd_read_retries;
 void clocksource_verify_percpu(struct clocksource *cs);
 
diff --git a/kernel/time/clocksource-wdtest.c b/kernel/time/clocksource-wdtest.c
index df922f49d171..7e82500c400b 100644
--- a/kernel/time/clocksource-wdtest.c
+++ b/kernel/time/clocksource-wdtest.c
@@ -145,13 +145,12 @@ static int wdtest_func(void *arg)
 		else if (i <= max_cswd_read_retries)
 			s = ", expect message";
 		else
-			s = ", expect clock skew";
+			s = ", expect coarse-grained clock skew check and re-initialization";
 		pr_info("--- Watchdog with %dx error injection, %lu retries%s.\n", i, max_cswd_read_retries, s);
 		WRITE_ONCE(wdtest_ktime_read_ndelays, i);
 		schedule_timeout_uninterruptible(2 * HZ);
 		WARN_ON_ONCE(READ_ONCE(wdtest_ktime_read_ndelays));
-		WARN_ON_ONCE((i <= max_cswd_read_retries) !=
-			     !(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE));
+		WARN_ON_ONCE(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE);
 		wdtest_ktime_clocksource_reset();
 	}
 
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index b8a14d2fb5ba..796a127aabb9 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -199,6 +199,9 @@ void clocksource_mark_unstable(struct clocksource *cs)
 	spin_unlock_irqrestore(&watchdog_lock, flags);
 }
 
+int max_cswd_coarse_reads = 100;
+module_param(max_cswd_coarse_reads, int, 0644);
+EXPORT_SYMBOL_GPL(max_cswd_coarse_reads);
 ulong max_cswd_read_retries = 3;
 module_param(max_cswd_read_retries, ulong, 0644);
 EXPORT_SYMBOL_GPL(max_cswd_read_retries);
@@ -226,13 +229,22 @@ static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
 				pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
 					smp_processor_id(), watchdog->name, nretries);
 			}
-			return true;
+			cs->n_coarse_reads = 0;
+			return false;
 		}
+		WARN_ONCE(max_cswd_coarse_reads > 0 &&
+			  !(++cs->n_coarse_reads % max_cswd_coarse_reads),
+			  "timekeeping watchdog on CPU%d: %s %u consecutive coarse-grained reads\n", smp_processor_id(), watchdog->name, cs->n_coarse_reads);
 	}
 
-	pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, marking unstable\n",
-		smp_processor_id(), watchdog->name, wd_delay, nretries);
-	return false;
+	if ((cs->flags & CLOCK_SOURCE_WATCHDOG) && !atomic_read(&watchdog_reset_pending)) {
+		pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, coarse-grained skew check followed by re-initialization\n",
+			smp_processor_id(), watchdog->name, wd_delay, nretries);
+	} else {
+		pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, awaiting re-initialization\n",
+			smp_processor_id(), watchdog->name, wd_delay, nretries);
+	}
+	return true;
 }
 
 static u64 csnow_mid;
@@ -356,6 +368,7 @@ static void clocksource_watchdog(struct timer_list *unused)
 	int next_cpu, reset_pending;
 	int64_t wd_nsec, cs_nsec;
 	struct clocksource *cs;
+	bool coarse;
 	u32 md;
 
 	spin_lock(&watchdog_lock);
@@ -373,16 +386,13 @@ static void clocksource_watchdog(struct timer_list *unused)
 			continue;
 		}
 
-		if (!cs_watchdog_read(cs, &csnow, &wdnow)) {
-			/* Clock readout unreliable, so give it up. */
-			__clocksource_unstable(cs);
-			continue;
-		}
+		coarse = cs_watchdog_read(cs, &csnow, &wdnow);
 
 		/* Clocksource initialized ? */
 		if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
 		    atomic_read(&watchdog_reset_pending)) {
-			cs->flags |= CLOCK_SOURCE_WATCHDOG;
+			if (!coarse)
+				cs->flags |= CLOCK_SOURCE_WATCHDOG;
 			cs->wd_last = wdnow;
 			cs->cs_last = csnow;
 			continue;
@@ -403,7 +413,13 @@ static void clocksource_watchdog(struct timer_list *unused)
 			continue;
 
 		/* Check the deviation from the watchdog clocksource. */
-		md = cs->uncertainty_margin + watchdog->uncertainty_margin;
+		if (coarse) {
+			md = 62500 * NSEC_PER_USEC;
+			cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
+			pr_warn("timekeeping watchdog on CPU%d: %s coarse-grained %lu.%03lu ms clock-skew check followed by re-initialization\n", smp_processor_id(), watchdog->name, md / NSEC_PER_MSEC, md % NSEC_PER_MSEC / NSEC_PER_USEC);
+		} else {
+			md = cs->uncertainty_margin + watchdog->uncertainty_margin;
+		}
 		if (abs(cs_nsec - wd_nsec) > md) {
 			pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n",
 				smp_processor_id(), cs->name);

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH] Clocksource: Avoid misjudgment of clocksource
  2021-10-22 23:36                         ` Paul E. McKenney
@ 2021-11-01  9:59                           ` Luming Yu
  2021-11-01 16:57                             ` Paul E. McKenney
  0 siblings, 1 reply; 29+ messages in thread
From: Luming Yu @ 2021-11-01  9:59 UTC (permalink / raw)
  To: paulmck
  Cc: John Stultz, yanghui, Thomas Gleixner, Stephen Boyd, lkml, Shaohua Li

On Fri, Oct 22, 2021 at 7:36 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Thu, Oct 21, 2021 at 05:37:24AM -0400, Luming Yu wrote:
> > On Wed, Oct 20, 2021 at 1:49 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > On Wed, Oct 20, 2021 at 06:09:58AM -0400, Luming Yu wrote:
> > > > On Tue, Oct 19, 2021 at 1:04 AM John Stultz <john.stultz@linaro.org> wrote:
> > > > >
> > > > > On Mon, Oct 18, 2021 at 9:14 PM yanghui <yanghui.def@bytedance.com> wrote:
> > > > > > > On 2021/10/19 at 12:14 AM, John Stultz wrote:
> > > > > > > On Tue, Oct 12, 2021 at 1:06 AM brookxu <brookxu.cn@gmail.com> wrote:
> > > > > > >> John Stultz wrote on 2021/10/12 13:29:
> > > > > > >>> On Mon, Oct 11, 2021 at 10:23 PM brookxu <brookxu.cn@gmail.com> wrote:
> > > > > > >>>> John Stultz wrote on 2021/10/12 12:52 PM:
> > > > > > >>>>> On Sat, Oct 9, 2021 at 7:04 AM brookxu <brookxu.cn@gmail.com> wrote:
> > > > > > >>>> If we record the watchdog's start_time in clocksource_start_watchdog(), and then
> > > > > > >>>> when we verify cycles in clocksource_watchdog(), check whether the clocksource
> > > > > > >>>> watchdog was blocked. Because of the MSB verification, if the blocked time is greater
> > > > > > >>>> than half of the watchdog timer's max_cycles, can we then safely ignore the current
> > > > > > >>>> verification? Do you think this idea is okay?
> > > > > > >>>
> > > > > > >>> I can't say I totally understand the idea. Maybe could you clarify with a patch?
> > > > > > >>>
> > > > > > >>
> > > > > > >> Sorry, it looks almost as follows:
> > > > > > >>
> > > > > > >> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> > > > > > >> index b8a14d2..87f3b67 100644
> > > > > > >> --- a/kernel/time/clocksource.c
> > > > > > >> +++ b/kernel/time/clocksource.c
> > > > > > >> @@ -119,6 +119,7 @@
> > > > > > >>   static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
> > > > > > >>   static DEFINE_SPINLOCK(watchdog_lock);
> > > > > > >>   static int watchdog_running;
> > > > > > >> +static unsigned long watchdog_start_time;
> > > > > > >>   static atomic_t watchdog_reset_pending;
> > > > > > >>
> > > > > > >>   static inline void clocksource_watchdog_lock(unsigned long *flags)
> > > > > > >> @@ -356,6 +357,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > > > > >>          int next_cpu, reset_pending;
> > > > > > >>          int64_t wd_nsec, cs_nsec;
> > > > > > >>          struct clocksource *cs;
> > > > > > >> +       unsigned long max_jiffies;
> > > > > > >>          u32 md;
> > > > > > >>
> > > > > > >>          spin_lock(&watchdog_lock);
> > > > > > >> @@ -402,6 +404,10 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > > > > >>                  if (atomic_read(&watchdog_reset_pending))
> > > > > > >>                          continue;
> > > > > > >>
> > > > > > >> +               max_jiffies = nsecs_to_jiffies(cs->max_idle_ns);
> > > > > > >> +               if (time_is_before_jiffies(watchdog_start_time + max_jiffies))
> > > > > > >> +                       continue;
> > > > > > >> +
> > > > > > >
> > > > > > > Sorry, what is the benefit of using jiffies here?   Jiffies are
> > > > > > > updated by counting the number of tick intervals on the current
> > > > > > > clocksource.
> > > > > > >
> > > > > > > This seems like circular logic, where we're trying to judge the
> > > > > > > current clocksource by using something we derived from the current
> > > > > > > clocksource.
> > > > > > > That's why the watchdog clocksource is important, as it's supposed to
> > > > > > > be a separate counter that is more reliable (but likely slower) than
> > > > > > > the preferred clocksource.
> > > > > > >
> > > > > > > So I'm not really sure how this helps.
> > > > > > >
> > > > > > > The earlier patch by yanghui at least used the watchdog interval to
> > > > > > > decide if the watchdog timer had expired late. Which seemed
> > > > > > > reasonable, but I thought it might be helpful to add some sort of a
> > > > > > > counter so if the case is happening repeatedly (timers constantly
> > > > > > > being delayed) we have a better signal that the watchdog and current
> > > > > > > clocksource are out of sync.  Because again, timers are fired based on
> > > > > >
> > > > > > I think only having a signal is not enough. We need to prevent the
> > > > > > clocksource from being incorrectly switched.
> > > > >
> > > > > Right, but we also have to ensure that we properly disqualify
> > > > > clocksources that are misbehaving.
> > > > >
> > > > > In the case that the current clocksource is running very slow (imagine
> > > > > old TSCs that lowered freq with cpufreq), then system time slows down,
> > > > > so timers fire late.
> > > > > So it would constantly seem like the irqs are being delayed, so with
> > > > > your logic we would not disqualify a clearly malfunctioning
> > > > > clocksource.
> > > > >
> > > > > > The timer callback function clocksource_watchdog() is executed in
> > > > > > softirq context (run_timer_softirq()). So if softirq is disabled for a
> > > > > > long time (one such situation is a long softlockup), clocksource_watchdog()
> > > > > > will be executed late.
> > > > >
> > > > > Yes. The reality is that timers are often spuriously delayed. We don't
> > > > > want a short burst of timer misbehavior to disqualify a good
> > > > > clocksource.
> > > > >
> > > > > But the problem is that this situation and the one above (with the
> > > > > freq changing TSC), will look exactly the same.
> > > > >
> > > > > So if the watchdog clocksource thinks too much time has passed
> > > > > between watchdog timers, we can skip judgement, assuming it's a
> > > > > spurious delay. But I think we need to keep a counter so that if
> > > > > this happens 3-5 times in a row, we stop ignoring the misbehavior
> > > > > and judge the current clocksource, as it may be running slowly.
> > > > >
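[The delayed-expiry counter described above could be sketched in user space roughly as follows; the threshold values and helper name are illustrative, not the kernel's actual implementation:]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WD_MAX_DELAY_NS 1000000000LL /* treat > 1s between expiries as "late" */
#define WD_MAX_SKIPS    4            /* stop forgiving after 4 late expiries in a row */

static int wd_consecutive_delays;

/*
 * Decide whether the skew check should run for this watchdog interval.
 * A timely expiry resets the streak; a late expiry is forgiven unless it
 * is part of a persistent pattern, which may mean the clocksource itself
 * is running slow rather than the timer merely firing late.
 */
static bool wd_should_judge(int64_t wd_nsec)
{
	if (wd_nsec < WD_MAX_DELAY_NS) {
		wd_consecutive_delays = 0;
		return true;
	}
	return ++wd_consecutive_delays >= WD_MAX_SKIPS;
}
```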
> > > > > > >
> > > > > > I think it will be better to add this to my patch:
> > > > > >   /*
> > > > > >    * Interval: 0.5sec.
> > > > > > - * MaxInterval: 1s.
> > > > > > + * MaxInterval: 20s.
> > > > > >    */
> > > > > >   #define WATCHDOG_INTERVAL (HZ >> 1)
> > > > > > -#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
> > > > > > +#define WATCHDOG_MAX_INTERVAL_NS (20 * NSEC_PER_SEC)
> > > > > >
> > > > >
> > > > > Some watchdog counters wrap within 20 seconds, so I don't think this
> > > > > is a good idea.
> > > > >
> > > > > The other proposal, to calculate the error rate rather than a fixed
> > > > > error boundary, might be useful too: if the current clocksource and
> > > > > watchdog are close, a long timer delay won't disqualify them if we
> > > > > scale the error bounds to be within a given error rate.
> > > >
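[The rate-scaled bound suggested above might look like this in a user-space sketch; the ppm figure and helper name are illustrative assumptions:]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Scale the allowed skew with the measured interval instead of using a
 * fixed bound: accept up to rate_ppm parts per million of relative
 * drift, so a long (delayed) interval does not by itself trip the check.
 */
static bool skew_within_rate(int64_t cs_nsec, int64_t wd_nsec, int64_t rate_ppm)
{
	int64_t delta = cs_nsec - wd_nsec;

	if (delta < 0)
		delta = -delta;
	/* delta / wd_nsec <= rate_ppm / 1e6, kept in integer arithmetic */
	return delta * 1000000 <= wd_nsec * rate_ppm;
}
```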
> > > > In most of the TSC-unstable troubleshooting we have done on modern
> > > > servers, it usually ends up being a false alarm triggered by the
> > > > clocksource watchdog for the TSC.
> > > >
> > > > I think Paul has a proposal to make the clocksource watchdog more
> > > > intelligent: its job is to find real problems instead of causing
> > > > problems.
> > >
> > > And that proposal is now in mainline:
> >
> > Great! :-)
> > >
> > > 22a223833716 clocksource: Print deviation in nanoseconds when a clocksource becomes unstable
> > > 1253b9b87e42 clocksource: Provide kernel module to test clocksource watchdog
> > > 2e27e793e280 clocksource: Reduce clocksource-skew threshold
> > > fa218f1cce6b clocksource: Limit number of CPUs checked for clock synchronization
> > > 7560c02bdffb clocksource: Check per-CPU clock synchronization when marked unstable
> > > db3a34e17433 clocksource: Retry clock read if long delays detected
> > >
> > > The strategy is to disqualify a clock comparison if the reads took too
> > > long, and to retry the comparison.  If four consecutive comparisons take
> > > too long, clock skew is reported.  The number of consecutive comparisons
> > > may be adjusted by the usual kernel boot parameter.
> > >
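[That retry-then-report read strategy can be sketched in user space like this; the constants are illustrative and the real cs_watchdog_read() differs in detail:]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_CSWD_READ_RETRIES 3       /* retries before giving up, as in mainline */
#define WD_READ_DELAY_LIMIT_NS 125000LL /* illustrative "read took too long" bound */

/*
 * Walk a sequence of measured read delays: a fast read means the clock
 * comparison is trustworthy; if every attempt up to the retry limit was
 * slow, the comparison is disqualified rather than trusted.
 */
static bool cs_read_ok(const int64_t *delays, int n, int *nretries)
{
	for (*nretries = 0;
	     *nretries < n && *nretries <= MAX_CSWD_READ_RETRIES;
	     (*nretries)++) {
		if (delays[*nretries] < WD_READ_DELAY_LIMIT_NS)
			return true;
	}
	return false;
}
```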
> > > > So disabling it for a known-good TSC might be a reasonably good idea
> > > > that can save manpower for other, more valuable problems to solve, or
> > > > at least make the false alarms statistically less likely to happen.
> > >
> > > One additional piece that is still in prototype state in -rcu is to give
> > > clocksources some opportunity to resynchronize if there are too many
> > > slow comparisons.  This is intended to handle cases where clocks often
> >
> > If such a TSC-sync algorithm existed in software, it really could help
> > system software engineers solve some rare power-good signal
> > synchronization problems caused by the BIOS, which lead to boot-time
> > TSC sync-check failures that usually consume a huge amount of
> > debugging effort when bringing up a qualified Linux system.
> >
> > Depending less on platform quirks should be a good thing for Linux
> > TSC && RCU support.
>
> Good point, I have procrastinated long enough.
>
> How about like this?

Sorry, I meant a better algorithm that uses the TSC adjust register,
like the one tried in arch/x86/kernel/tsc_sync.c.


>
>                                                         Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit 9ec2a03bbf4bee3d9fbc02a402dee36efafc5a2d
> Author: Paul E. McKenney <paulmck@kernel.org>
> Date:   Thu May 27 11:03:28 2021 -0700
>
>     clocksource: Forgive repeated long-latency watchdog clocksource reads
>
>     Currently, the clocksource watchdog reacts to repeated long-latency
>     clocksource reads by marking that clocksource unstable on the theory that
>     these long-latency reads are a sign of a serious problem.  And this theory

Maybe we need to use another core's TSC as the reference clock instead
of the HPET, which, to my knowledge, is where the problem happens.

Ruling out the HPET and other slow clock devices as the obviously wrong
choice of reference clock for the TSC, I guess there will be less
chance that we (in kernel code) get bothered by other latency problems
perceived in the clocksource watchdog.

>     does in fact have real-world support in the form of firmware issues [1].
>
>     However, it is also possible to trigger this using stress-ng on what
>     the stress-ng man page terms "poorly designed hardware" [2].  And it
>     is not necessarily a bad thing for the kernel to diagnose cases where
>     heavy memory-contention workloads are being run on hardware that is not
>     designed for this sort of use.
>
>     Nevertheless, it is quite possible that real-world use will result in
>     some situation requiring that high-stress workloads run on hardware
>     not designed to accommodate them, and also requiring that the kernel
>     refrain from marking clocksources unstable.
>
>     Therefore, react to persistent long-latency reads by leaving the
>     clocksource alone, but using the old 62.5-millisecond skew-detection
>     threshold.  In addition, the offending clocksource is marked for
>     re-initialization, which both restarts that clocksource with a clean bill
>     of health and avoids false-positive skew reports on later watchdog checks.
>     Once marked for re-initialization, the clocksource is not subjected to
>     further watchdog checking until a subsequent successful read from that
>     clocksource that is free of excessive delays.
>
>     However, if clocksource.max_cswd_coarse_reads consecutive clocksource read
>     attempts result in long latencies, a warning (splat) will be emitted.
>     This kernel boot parameter defaults to 100, and this warning can be
>     disabled by setting it to zero or to a negative value.
>
>     [ paulmck: Apply feedback from Chao Gao ]
>
>     Link: https://lore.kernel.org/lkml/20210513155515.GB23902@xsang-OptiPlex-9020/ # [1]
>     Link: https://lore.kernel.org/lkml/20210521083322.GG25531@xsang-OptiPlex-9020/ # [2]
>     Link: https://lore.kernel.org/lkml/20210521084405.GH25531@xsang-OptiPlex-9020/
>     Link: https://lore.kernel.org/lkml/20210511233403.GA2896757@paulmck-ThinkPad-P17-Gen-1/
>     Tested-by: Chao Gao <chao.gao@intel.com>
>     Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 316027c3aadc..61d2436ae9df 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -600,6 +600,14 @@
>                         loops can be debugged more effectively on production
>                         systems.
>
> +       clocksource.max_cswd_coarse_reads= [KNL]
> +                       Number of consecutive clocksource_watchdog()
> +                       coarse reads (that is, clocksource reads that
> +                       were unduly delayed) that are permitted before
> +                       the kernel complains (gently).  Set to a value
> +                       less than or equal to zero to suppress these
> +                       complaints.
> +
>         clocksource.max_cswd_read_retries= [KNL]
>                         Number of clocksource_watchdog() retries due to
>                         external delays before the clock will be marked
> diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
> index 1d42d4b17327..3e925d9ffc31 100644
> --- a/include/linux/clocksource.h
> +++ b/include/linux/clocksource.h
> @@ -110,6 +110,7 @@ struct clocksource {
>         int                     rating;
>         enum clocksource_ids    id;
>         enum vdso_clock_mode    vdso_clock_mode;
> +       unsigned int            n_coarse_reads;
>         unsigned long           flags;
>
>         int                     (*enable)(struct clocksource *cs);
> @@ -291,6 +292,7 @@ static inline void timer_probe(void) {}
>  #define TIMER_ACPI_DECLARE(name, table_id, fn)         \
>         ACPI_DECLARE_PROBE_ENTRY(timer, name, table_id, 0, NULL, 0, fn)
>
> +extern int max_cswd_coarse_reads;
>  extern ulong max_cswd_read_retries;
>  void clocksource_verify_percpu(struct clocksource *cs);
>
> diff --git a/kernel/time/clocksource-wdtest.c b/kernel/time/clocksource-wdtest.c
> index df922f49d171..7e82500c400b 100644
> --- a/kernel/time/clocksource-wdtest.c
> +++ b/kernel/time/clocksource-wdtest.c
> @@ -145,13 +145,12 @@ static int wdtest_func(void *arg)
>                 else if (i <= max_cswd_read_retries)
>                         s = ", expect message";
>                 else
> -                       s = ", expect clock skew";
> +                       s = ", expect coarse-grained clock skew check and re-initialization";
>                 pr_info("--- Watchdog with %dx error injection, %lu retries%s.\n", i, max_cswd_read_retries, s);
>                 WRITE_ONCE(wdtest_ktime_read_ndelays, i);
>                 schedule_timeout_uninterruptible(2 * HZ);
>                 WARN_ON_ONCE(READ_ONCE(wdtest_ktime_read_ndelays));
> -               WARN_ON_ONCE((i <= max_cswd_read_retries) !=
> -                            !(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE));
> +               WARN_ON_ONCE(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE);
>                 wdtest_ktime_clocksource_reset();
>         }
>
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index b8a14d2fb5ba..796a127aabb9 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -199,6 +199,9 @@ void clocksource_mark_unstable(struct clocksource *cs)
>         spin_unlock_irqrestore(&watchdog_lock, flags);
>  }
>
> +int max_cswd_coarse_reads = 100;
> +module_param(max_cswd_coarse_reads, int, 0644);
> +EXPORT_SYMBOL_GPL(max_cswd_coarse_reads);
>  ulong max_cswd_read_retries = 3;
>  module_param(max_cswd_read_retries, ulong, 0644);
>  EXPORT_SYMBOL_GPL(max_cswd_read_retries);
> @@ -226,13 +229,22 @@ static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
>                                 pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
>                                         smp_processor_id(), watchdog->name, nretries);
>                         }
> -                       return true;
> +                       cs->n_coarse_reads = 0;
> +                       return false;
>                 }
> +               WARN_ONCE(max_cswd_coarse_reads > 0 &&
> +                         !(++cs->n_coarse_reads % max_cswd_coarse_reads),
> +                         "timekeeping watchdog on CPU%d: %s %u consecutive coarse-grained reads\n", smp_processor_id(), watchdog->name, cs->n_coarse_reads);
>         }
>
> -       pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, marking unstable\n",
> -               smp_processor_id(), watchdog->name, wd_delay, nretries);
> -       return false;
> +       if ((cs->flags & CLOCK_SOURCE_WATCHDOG) && !atomic_read(&watchdog_reset_pending)) {
> +               pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, coarse-grained skew check followed by re-initialization\n",
> +                       smp_processor_id(), watchdog->name, wd_delay, nretries);
> +       } else {
> +               pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, awaiting re-initialization\n",
> +                       smp_processor_id(), watchdog->name, wd_delay, nretries);
> +       }
> +       return true;
>  }
>
>  static u64 csnow_mid;
> @@ -356,6 +368,7 @@ static void clocksource_watchdog(struct timer_list *unused)
>         int next_cpu, reset_pending;
>         int64_t wd_nsec, cs_nsec;
>         struct clocksource *cs;
> +       bool coarse;
>         u32 md;
>
>         spin_lock(&watchdog_lock);
> @@ -373,16 +386,13 @@ static void clocksource_watchdog(struct timer_list *unused)
>                         continue;
>                 }
>
> -               if (!cs_watchdog_read(cs, &csnow, &wdnow)) {
> -                       /* Clock readout unreliable, so give it up. */
> -                       __clocksource_unstable(cs);
> -                       continue;
> -               }
> +               coarse = cs_watchdog_read(cs, &csnow, &wdnow);
>
>                 /* Clocksource initialized ? */
>                 if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
>                     atomic_read(&watchdog_reset_pending)) {
> -                       cs->flags |= CLOCK_SOURCE_WATCHDOG;
> +                       if (!coarse)
> +                               cs->flags |= CLOCK_SOURCE_WATCHDOG;
>                         cs->wd_last = wdnow;
>                         cs->cs_last = csnow;
>                         continue;
> @@ -403,7 +413,13 @@ static void clocksource_watchdog(struct timer_list *unused)
>                         continue;
>
>                 /* Check the deviation from the watchdog clocksource. */
> -               md = cs->uncertainty_margin + watchdog->uncertainty_margin;
> +               if (coarse) {
> +                       md = 62500 * NSEC_PER_USEC;
> +                       cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
> +                       pr_warn("timekeeping watchdog on CPU%d: %s coarse-grained %lu.%03lu ms clock-skew check followed by re-initialization\n", smp_processor_id(), watchdog->name, md / NSEC_PER_MSEC, md % NSEC_PER_MSEC / NSEC_PER_USEC);
> +               } else {
> +                       md = cs->uncertainty_margin + watchdog->uncertainty_margin;
> +               }
>                 if (abs(cs_nsec - wd_nsec) > md) {
>                         pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n",
>                                 smp_processor_id(), cs->name);

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] Clocksource: Avoid misjudgment of clocksource
  2021-11-01  9:59                           ` Luming Yu
@ 2021-11-01 16:57                             ` Paul E. McKenney
  2021-11-03  8:27                               ` Luming Yu
  0 siblings, 1 reply; 29+ messages in thread
From: Paul E. McKenney @ 2021-11-01 16:57 UTC (permalink / raw)
  To: Luming Yu
  Cc: John Stultz, yanghui, Thomas Gleixner, Stephen Boyd, lkml, Shaohua Li

On Mon, Nov 01, 2021 at 05:59:52AM -0400, Luming Yu wrote:
> On Fri, Oct 22, 2021 at 7:36 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Thu, Oct 21, 2021 at 05:37:24AM -0400, Luming Yu wrote:
> > > On Wed, Oct 20, 2021 at 1:49 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > >
> > > > On Wed, Oct 20, 2021 at 06:09:58AM -0400, Luming Yu wrote:
> > > > > On Tue, Oct 19, 2021 at 1:04 AM John Stultz <john.stultz@linaro.org> wrote:
> > > > > >
> > > > > > On Mon, Oct 18, 2021 at 9:14 PM yanghui <yanghui.def@bytedance.com> wrote:
> > > > > > > 在 2021/10/19 上午12:14, John Stultz 写道:
> > > > > > > > On Tue, Oct 12, 2021 at 1:06 AM brookxu <brookxu.cn@gmail.com> wrote:
> > > > > > > >> John Stultz wrote on 2021/10/12 13:29:
> > > > > > > >>> On Mon, Oct 11, 2021 at 10:23 PM brookxu <brookxu.cn@gmail.com> wrote:
> > > > > > > >>>> John Stultz wrote on 2021/10/12 12:52 下午:
> > > > > > > >>>>> On Sat, Oct 9, 2021 at 7:04 AM brookxu <brookxu.cn@gmail.com> wrote:
> > > > > > > >>>> If we record the watchdog's start_time in clocksource_start_watchdog(), and then
> > > > > > > >>>> when we verify cycles in clocksource_watchdog(), check whether the clocksource
> > > > > > > >>>> watchdog is blocked. Due to MSB verification, if the blocked time is greater than
> > > > > > > >>>> half of the watchdog timer max_cycles, then we can safely ignore the current
> > > > > > > >>>> verification? Do you think this idea is okay?
> > > > > > > >>>
> > > > > > > >>> I can't say I totally understand the idea. Maybe could you clarify with a patch?
> > > > > > > >>>
> > > > > > > >>
> > > > > > > >> Sorry, it looks almost as follows:
> > > > > > > >>
> > > > > > > >> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> > > > > > > >> index b8a14d2..87f3b67 100644
> > > > > > > >> --- a/kernel/time/clocksource.c
> > > > > > > >> +++ b/kernel/time/clocksource.c
> > > > > > > >> @@ -119,6 +119,7 @@
> > > > > > > >>   static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
> > > > > > > >>   static DEFINE_SPINLOCK(watchdog_lock);
> > > > > > > >>   static int watchdog_running;
> > > > > > > >> +static unsigned long watchdog_start_time;
> > > > > > > >>   static atomic_t watchdog_reset_pending;
> > > > > > > >>
> > > > > > > >>   static inline void clocksource_watchdog_lock(unsigned long *flags)
> > > > > > > >> @@ -356,6 +357,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > > > > > >>          int next_cpu, reset_pending;
> > > > > > > >>          int64_t wd_nsec, cs_nsec;
> > > > > > > >>          struct clocksource *cs;
> > > > > > > >> +       unsigned long max_jiffies;
> > > > > > > >>          u32 md;
> > > > > > > >>
> > > > > > > >>          spin_lock(&watchdog_lock);
> > > > > > > >> @@ -402,6 +404,10 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > > > > > >>                  if (atomic_read(&watchdog_reset_pending))
> > > > > > > >>                          continue;
> > > > > > > >>
> > > > > > > >> +               max_jiffies = nsecs_to_jiffies(cs->max_idle_ns);
> > > > > > > >> +               if (time_is_before_jiffies(watchdog_start_time + max_jiffies))
> > > > > > > >> +                       continue;
> > > > > > > >> +
> > > > > > > >
> > > > > > > > Sorry, what is the benefit of using jiffies here?   Jiffies are
> > > > > > > > updated by counting the number of tick intervals on the current
> > > > > > > > clocksource.
> > > > > > > >
> > > > > > > > This seems like circular logic, where we're trying to judge the
> > > > > > > > current clocksource by using something we derived from the current
> > > > > > > > clocksource.
> > > > > > > > That's why the watchdog clocksource is important, as it's supposed to
> > > > > > > > be a separate counter that is more reliable (but likely slower) then
> > > > > > > > the preferred clocksource.
> > > > > > > >
> > > > > > > > So I'm not really sure how this helps.
> > > > > > > >
> > > > > > > > The earlier patch by yanghui at least used the watchdog interval to
> > > > > > > > decide if the watchdog timer had expired late. Which seemed
> > > > > > > > reasonable, but I thought it might be helpful to add some sort of a
> > > > > > > > counter so if the case is happening repeatedly (timers constantly
> > > > > > > > being delayed) we have a better signal that the watchdog and current
> > > > > > > > clocksource are out of sync.  Because again, timers are fired based on
> > > > > > >
> > > > > > > I think only having a signal is not enough. We need to prevent the
> > > > > > > clocksource from being incorrectly switched.
> > > > > >
> > > > > > Right, but we also have to ensure that we properly disqualify
> > > > > > clocksources that are misbehaving.
> > > > > >
> > > > > > In the case that the current clocksource is running very slow (imagine
> > > > > > old TSCs that lowered freq with cpufreq), then system time slows down,
> > > > > > so timers fire late.
> > > > > > So it would constantly seem like the irqs are being delayed, so with
> > > > > > your logic we would not disqualify a clearly malfunctioning
> > > > > > clocksource.
> > > > > >
> > > > > > > The timer callback function clocksource_watchdog() is executed in
> > > > > > > softirq context (run_timer_softirq()). So if softirq is disabled for a
> > > > > > > long time (one such situation is a long softlockup), clocksource_watchdog()
> > > > > > > will be executed late.
> > > > > >
> > > > > > Yes. The reality is that timers are often spuriously delayed. We don't
> > > > > > want a short burst of timer misbehavior to disqualify a good
> > > > > > clocksource.
> > > > > >
> > > > > > But the problem is that this situation and the one above (with the
> > > > > > freq changing TSC), will look exactly the same.
> > > > > >
> > > > > > So if the watchdog clocksource thinks too much time has passed
> > > > > > between watchdog timers, we can skip judgement, assuming it's a
> > > > > > spurious delay. But I think we need to keep a counter so that if
> > > > > > this happens 3-5 times in a row, we stop ignoring the misbehavior
> > > > > > and judge the current clocksource, as it may be running slowly.
> > > > > >
> > > > > > > >
> > > > > > > I think it will be better to add this to my patch:
> > > > > > >   /*
> > > > > > >    * Interval: 0.5sec.
> > > > > > > - * MaxInterval: 1s.
> > > > > > > + * MaxInterval: 20s.
> > > > > > >    */
> > > > > > >   #define WATCHDOG_INTERVAL (HZ >> 1)
> > > > > > > -#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
> > > > > > > +#define WATCHDOG_MAX_INTERVAL_NS (20 * NSEC_PER_SEC)
> > > > > > >
> > > > > >
> > > > > > Some watchdog counters wrap within 20 seconds, so I don't think this
> > > > > > is a good idea.
> > > > > >
> > > > > > The other proposal, to calculate the error rate rather than a fixed
> > > > > > error boundary, might be useful too: if the current clocksource and
> > > > > > watchdog are close, a long timer delay won't disqualify them if we
> > > > > > scale the error bounds to be within a given error rate.
> > > > >
> > > > > In most of the TSC-unstable troubleshooting we have done on modern
> > > > > servers, it usually ends up being a false alarm triggered by the
> > > > > clocksource watchdog for the TSC.
> > > > >
> > > > > I think Paul has a proposal to make the clocksource watchdog more
> > > > > intelligent: its job is to find real problems instead of causing
> > > > > problems.
> > > >
> > > > And that proposal is now in mainline:
> > >
> > > Great! :-)
> > > >
> > > > 22a223833716 clocksource: Print deviation in nanoseconds when a clocksource becomes unstable
> > > > 1253b9b87e42 clocksource: Provide kernel module to test clocksource watchdog
> > > > 2e27e793e280 clocksource: Reduce clocksource-skew threshold
> > > > fa218f1cce6b clocksource: Limit number of CPUs checked for clock synchronization
> > > > 7560c02bdffb clocksource: Check per-CPU clock synchronization when marked unstable
> > > > db3a34e17433 clocksource: Retry clock read if long delays detected
> > > >
> > > > The strategy is to disqualify a clock comparison if the reads took too
> > > > long, and to retry the comparison.  If four consecutive comparisons take
> > > > too long, clock skew is reported.  The number of consecutive comparisons
> > > > may be adjusted by the usual kernel boot parameter.
> > > >
> > > > > So disabling it for a known-good TSC might be a reasonably good idea
> > > > > that can save manpower for other, more valuable problems to solve, or
> > > > > at least make the false alarms statistically less likely to happen.
> > > >
> > > > One additional piece that is still in prototype state in -rcu is to give
> > > > clocksources some opportunity to resynchronize if there are too many
> > > > slow comparisons.  This is intended to handle cases where clocks often
> > >
> > > If such a TSC-sync algorithm existed in software, it really could help
> > > system software engineers solve some rare power-good signal
> > > synchronization problems caused by the BIOS, which lead to boot-time
> > > TSC sync-check failures that usually consume a huge amount of
> > > debugging effort when bringing up a qualified Linux system.
> > >
> > > Depending less on platform quirks should be a good thing for Linux
> > > TSC && RCU support.
> >
> > Good point, I have procrastinated long enough.
> >
> > How about like this?
> 
> Sorry, I meant a better algorithm that uses the TSC adjust register,
> like the one tried in arch/x86/kernel/tsc_sync.c.

Ah, at this time I must leave that sort of task to others.

                                                        Thanx, Paul

> > ------------------------------------------------------------------------
> >
> > commit 9ec2a03bbf4bee3d9fbc02a402dee36efafc5a2d
> > Author: Paul E. McKenney <paulmck@kernel.org>
> > Date:   Thu May 27 11:03:28 2021 -0700
> >
> >     clocksource: Forgive repeated long-latency watchdog clocksource reads
> >
> >     Currently, the clocksource watchdog reacts to repeated long-latency
> >     clocksource reads by marking that clocksource unstable on the theory that
> >     these long-latency reads are a sign of a serious problem.  And this theory
> 
> Maybe we need to use another core's TSC as the reference clock instead
> of the HPET, which, to my knowledge, is where the problem happens.
> 
> Ruling out the HPET and other slow clock devices as the obviously wrong
> choice of reference clock for the TSC, I guess there will be less
> chance that we (in kernel code) get bothered by other latency problems
> perceived in the clocksource watchdog.

There have been a number of patches that identify systems that tend to
have unreliable HPET implementations.  Perhaps more systems are needed
there, or maybe some of those patches need to be backported to the
version of Linux that you are using.

There was also a patch that identified systems that are expected to have
reliable TSCs, and that de-emphasized HPET on such systems.
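[A rough user-space sketch of that de-emphasis idea; the struct, helper, and rating values are illustrative assumptions, not the kernel's actual quirk machinery, which works through X86_FEATURE_TSC_RELIABLE, per-platform quirks, and clocksource ratings:]

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal stand-in for a clocksource: higher rating is preferred. */
struct clk {
	const char *name;
	int rating;
};

/*
 * Lower the HPET's rating when the TSC is known reliable, so the HPET
 * is neither preferred as the system clocksource nor trusted as a skew
 * reference for the TSC.
 */
static void deemphasize_hpet(struct clk *hpet, bool tsc_reliable)
{
	if (tsc_reliable && hpet->rating > 10)
		hpet->rating = 10;
}
```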

							Thanx, Paul

> >     does in fact have real-world support in the form of firmware issues [1].
> >
> >     However, it is also possible to trigger this using stress-ng on what
> >     the stress-ng man page terms "poorly designed hardware" [2].  And it
> >     is not necessarily a bad thing for the kernel to diagnose cases where
> >     heavy memory-contention workloads are being run on hardware that is not
> >     designed for this sort of use.
> >
> >     Nevertheless, it is quite possible that real-world use will result in
> >     some situation requiring that high-stress workloads run on hardware
> >     not designed to accommodate them, and also requiring that the kernel
> >     refrain from marking clocksources unstable.
> >
> >     Therefore, react to persistent long-latency reads by leaving the
> >     clocksource alone, but using the old 62.5-millisecond skew-detection
> >     threshold.  In addition, the offending clocksource is marked for
> >     re-initialization, which both restarts that clocksource with a clean bill
> >     of health and avoids false-positive skew reports on later watchdog checks.
> >     Once marked for re-initialization, the clocksource is not subjected to
> >     further watchdog checking until a subsequent successful read from that
> >     clocksource that is free of excessive delays.
> >
> >     However, if clocksource.max_cswd_coarse_reads consecutive clocksource read
> >     attempts result in long latencies, a warning (splat) will be emitted.
> >     This kernel boot parameter defaults to 100, and this warning can be
> >     disabled by setting it to zero or to a negative value.
> >
> >     [ paulmck: Apply feedback from Chao Gao ]
> >
> >     Link: https://lore.kernel.org/lkml/20210513155515.GB23902@xsang-OptiPlex-9020/ # [1]
> >     Link: https://lore.kernel.org/lkml/20210521083322.GG25531@xsang-OptiPlex-9020/ # [2]
> >     Link: https://lore.kernel.org/lkml/20210521084405.GH25531@xsang-OptiPlex-9020/
> >     Link: https://lore.kernel.org/lkml/20210511233403.GA2896757@paulmck-ThinkPad-P17-Gen-1/
> >     Tested-by: Chao Gao <chao.gao@intel.com>
> >     Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 316027c3aadc..61d2436ae9df 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -600,6 +600,14 @@
> >                         loops can be debugged more effectively on production
> >                         systems.
> >
> > +       clocksource.max_cswd_coarse_reads= [KNL]
> > +                       Number of consecutive clocksource_watchdog()
> > +                       coarse reads (that is, clocksource reads that
> > +                       were unduly delayed) that are permitted before
> > +                       the kernel complains (gently).  Set to a value
> > +                       less than or equal to zero to suppress these
> > +                       complaints.
> > +
> >         clocksource.max_cswd_read_retries= [KNL]
> >                         Number of clocksource_watchdog() retries due to
> >                         external delays before the clock will be marked
> > diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
> > index 1d42d4b17327..3e925d9ffc31 100644
> > --- a/include/linux/clocksource.h
> > +++ b/include/linux/clocksource.h
> > @@ -110,6 +110,7 @@ struct clocksource {
> >         int                     rating;
> >         enum clocksource_ids    id;
> >         enum vdso_clock_mode    vdso_clock_mode;
> > +       unsigned int            n_coarse_reads;
> >         unsigned long           flags;
> >
> >         int                     (*enable)(struct clocksource *cs);
> > @@ -291,6 +292,7 @@ static inline void timer_probe(void) {}
> >  #define TIMER_ACPI_DECLARE(name, table_id, fn)         \
> >         ACPI_DECLARE_PROBE_ENTRY(timer, name, table_id, 0, NULL, 0, fn)
> >
> > +extern int max_cswd_coarse_reads;
> >  extern ulong max_cswd_read_retries;
> >  void clocksource_verify_percpu(struct clocksource *cs);
> >
> > diff --git a/kernel/time/clocksource-wdtest.c b/kernel/time/clocksource-wdtest.c
> > index df922f49d171..7e82500c400b 100644
> > --- a/kernel/time/clocksource-wdtest.c
> > +++ b/kernel/time/clocksource-wdtest.c
> > @@ -145,13 +145,12 @@ static int wdtest_func(void *arg)
> >                 else if (i <= max_cswd_read_retries)
> >                         s = ", expect message";
> >                 else
> > -                       s = ", expect clock skew";
> > +                       s = ", expect coarse-grained clock skew check and re-initialization";
> >                 pr_info("--- Watchdog with %dx error injection, %lu retries%s.\n", i, max_cswd_read_retries, s);
> >                 WRITE_ONCE(wdtest_ktime_read_ndelays, i);
> >                 schedule_timeout_uninterruptible(2 * HZ);
> >                 WARN_ON_ONCE(READ_ONCE(wdtest_ktime_read_ndelays));
> > -               WARN_ON_ONCE((i <= max_cswd_read_retries) !=
> > -                            !(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE));
> > +               WARN_ON_ONCE(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE);
> >                 wdtest_ktime_clocksource_reset();
> >         }
> >
> > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> > index b8a14d2fb5ba..796a127aabb9 100644
> > --- a/kernel/time/clocksource.c
> > +++ b/kernel/time/clocksource.c
> > @@ -199,6 +199,9 @@ void clocksource_mark_unstable(struct clocksource *cs)
> >         spin_unlock_irqrestore(&watchdog_lock, flags);
> >  }
> >
> > +int max_cswd_coarse_reads = 100;
> > +module_param(max_cswd_coarse_reads, int, 0644);
> > +EXPORT_SYMBOL_GPL(max_cswd_coarse_reads);
> >  ulong max_cswd_read_retries = 3;
> >  module_param(max_cswd_read_retries, ulong, 0644);
> >  EXPORT_SYMBOL_GPL(max_cswd_read_retries);
> > @@ -226,13 +229,22 @@ static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
> >                                 pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
> >                                         smp_processor_id(), watchdog->name, nretries);
> >                         }
> > -                       return true;
> > +                       cs->n_coarse_reads = 0;
> > +                       return false;
> >                 }
> > +               WARN_ONCE(max_cswd_coarse_reads > 0 &&
> > +                         !(++cs->n_coarse_reads % max_cswd_coarse_reads),
> > +                         "timekeeping watchdog on CPU%d: %s %u consecutive coarse-grained reads\n", smp_processor_id(), watchdog->name, cs->n_coarse_reads);
> >         }
> >
> > -       pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, marking unstable\n",
> > -               smp_processor_id(), watchdog->name, wd_delay, nretries);
> > -       return false;
> > +       if ((cs->flags & CLOCK_SOURCE_WATCHDOG) && !atomic_read(&watchdog_reset_pending)) {
> > +               pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, coarse-grained skew check followed by re-initialization\n",
> > +                       smp_processor_id(), watchdog->name, wd_delay, nretries);
> > +       } else {
> > +               pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, awaiting re-initialization\n",
> > +                       smp_processor_id(), watchdog->name, wd_delay, nretries);
> > +       }
> > +       return true;
> >  }
> >
> >  static u64 csnow_mid;
> > @@ -356,6 +368,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> >         int next_cpu, reset_pending;
> >         int64_t wd_nsec, cs_nsec;
> >         struct clocksource *cs;
> > +       bool coarse;
> >         u32 md;
> >
> >         spin_lock(&watchdog_lock);
> > @@ -373,16 +386,13 @@ static void clocksource_watchdog(struct timer_list *unused)
> >                         continue;
> >                 }
> >
> > -               if (!cs_watchdog_read(cs, &csnow, &wdnow)) {
> > -                       /* Clock readout unreliable, so give it up. */
> > -                       __clocksource_unstable(cs);
> > -                       continue;
> > -               }
> > +               coarse = cs_watchdog_read(cs, &csnow, &wdnow);
> >
> >                 /* Clocksource initialized ? */
> >                 if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
> >                     atomic_read(&watchdog_reset_pending)) {
> > -                       cs->flags |= CLOCK_SOURCE_WATCHDOG;
> > +                       if (!coarse)
> > +                               cs->flags |= CLOCK_SOURCE_WATCHDOG;
> >                         cs->wd_last = wdnow;
> >                         cs->cs_last = csnow;
> >                         continue;
> > @@ -403,7 +413,13 @@ static void clocksource_watchdog(struct timer_list *unused)
> >                         continue;
> >
> >                 /* Check the deviation from the watchdog clocksource. */
> > -               md = cs->uncertainty_margin + watchdog->uncertainty_margin;
> > +               if (coarse) {
> > +                       md = 62500 * NSEC_PER_USEC;
> > +                       cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
> > +                       pr_warn("timekeeping watchdog on CPU%d: %s coarse-grained %lu.%03lu ms clock-skew check followed by re-initialization\n", smp_processor_id(), watchdog->name, md / NSEC_PER_MSEC, md % NSEC_PER_MSEC / NSEC_PER_USEC);
> > +               } else {
> > +                       md = cs->uncertainty_margin + watchdog->uncertainty_margin;
> > +               }
> >                 if (abs(cs_nsec - wd_nsec) > md) {
> >                         pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n",
> >                                 smp_processor_id(), cs->name);

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] Clocksource: Avoid misjudgment of clocksource
  2021-11-01 16:57                             ` Paul E. McKenney
@ 2021-11-03  8:27                               ` Luming Yu
  2021-11-03 15:54                                 ` Paul E. McKenney
  0 siblings, 1 reply; 29+ messages in thread
From: Luming Yu @ 2021-11-03  8:27 UTC (permalink / raw)
  To: paulmck
  Cc: John Stultz, yanghui, Thomas Gleixner, Stephen Boyd, lkml, Shaohua Li

On Mon, Nov 1, 2021 at 12:57 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Mon, Nov 01, 2021 at 05:59:52AM -0400, Luming Yu wrote:
> > On Fri, Oct 22, 2021 at 7:36 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > On Thu, Oct 21, 2021 at 05:37:24AM -0400, Luming Yu wrote:
> > > > On Wed, Oct 20, 2021 at 1:49 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > >
> > > > > On Wed, Oct 20, 2021 at 06:09:58AM -0400, Luming Yu wrote:
> > > > > > On Tue, Oct 19, 2021 at 1:04 AM John Stultz <john.stultz@linaro.org> wrote:
> > > > > > >
> > > > > > > On Mon, Oct 18, 2021 at 9:14 PM yanghui <yanghui.def@bytedance.com> wrote:
> > > > > > > > 在 2021/10/19 上午12:14, John Stultz 写道:
> > > > > > > > > On Tue, Oct 12, 2021 at 1:06 AM brookxu <brookxu.cn@gmail.com> wrote:
> > > > > > > > >> John Stultz wrote on 2021/10/12 13:29:
> > > > > > > > >>> On Mon, Oct 11, 2021 at 10:23 PM brookxu <brookxu.cn@gmail.com> wrote:
> > > > > > > > >>>> John Stultz wrote on 2021/10/12 12:52 下午:
> > > > > > > > >>>>> On Sat, Oct 9, 2021 at 7:04 AM brookxu <brookxu.cn@gmail.com> wrote:
> > > > > > > > >>>> If we record the watchdog's start_time in clocksource_start_watchdog(), and then
> > > > > > > > >>>> when we verify cycles in clocksource_watchdog(), check whether the clocksource
> > > > > > > > >>>> watchdog is blocked. Due to MSB verification, if the blocked time is greater than
> > > > > > > > >>>> half of the watchdog timer max_cycles, then we can safely ignore the current
> > > > > > > > >>>> verification? Do you think this idea is okay?
> > > > > > > > >>>
> > > > > > > > >>> I can't say I totally understand the idea. Maybe could you clarify with a patch?
> > > > > > > > >>>
> > > > > > > > >>
> > > > > > > > >> Sorry, it looks almost as follows:
> > > > > > > > >>
> > > > > > > > >> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> > > > > > > > >> index b8a14d2..87f3b67 100644
> > > > > > > > >> --- a/kernel/time/clocksource.c
> > > > > > > > >> +++ b/kernel/time/clocksource.c
> > > > > > > > >> @@ -119,6 +119,7 @@
> > > > > > > > >>   static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
> > > > > > > > >>   static DEFINE_SPINLOCK(watchdog_lock);
> > > > > > > > >>   static int watchdog_running;
> > > > > > > > >> +static unsigned long watchdog_start_time;
> > > > > > > > >>   static atomic_t watchdog_reset_pending;
> > > > > > > > >>
> > > > > > > > >>   static inline void clocksource_watchdog_lock(unsigned long *flags)
> > > > > > > > >> @@ -356,6 +357,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > > > > > > >>          int next_cpu, reset_pending;
> > > > > > > > >>          int64_t wd_nsec, cs_nsec;
> > > > > > > > >>          struct clocksource *cs;
> > > > > > > > >> +       unsigned long max_jiffies;
> > > > > > > > >>          u32 md;
> > > > > > > > >>
> > > > > > > > >>          spin_lock(&watchdog_lock);
> > > > > > > > >> @@ -402,6 +404,10 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > > > > > > >>                  if (atomic_read(&watchdog_reset_pending))
> > > > > > > > >>                          continue;
> > > > > > > > >>
> > > > > > > > >> +               max_jiffies = nsecs_to_jiffies(cs->max_idle_ns);
> > > > > > > > >> +               if (time_is_before_jiffies(watchdog_start_time + max_jiffies))
> > > > > > > > >> +                       continue;
> > > > > > > > >> +
> > > > > > > > >
> > > > > > > > > Sorry, what is the benefit of using jiffies here?   Jiffies are
> > > > > > > > > updated by counting the number of tick intervals on the current
> > > > > > > > > clocksource.
> > > > > > > > >
> > > > > > > > > This seems like circular logic, where we're trying to judge the
> > > > > > > > > current clocksource by using something we derived from the current
> > > > > > > > > clocksource.
> > > > > > > > > That's why the watchdog clocksource is important, as it's supposed to
> > > > > > > > > be a separate counter that is more reliable (but likely slower) than
> > > > > > > > > the preferred clocksource.
> > > > > > > > >
> > > > > > > > > So I'm not really sure how this helps.
> > > > > > > > >
> > > > > > > > > The earlier patch by yanghui at least used the watchdog interval to
> > > > > > > > > decide if the watchdog timer had expired late. Which seemed
> > > > > > > > > reasonable, but I thought it might be helpful to add some sort of a
> > > > > > > > > counter so if the case is happening repeatedly (timers constantly
> > > > > > > > > being delayed) we have a better signal that the watchdog and current
> > > > > > > > > clocksource are out of sync.  Because again, timers are fired based on
> > > > > > > >
> > > > > > > > I think only having a signal is not enough. We need to prevent the
> > > > > > > > clocksource from being incorrectly switched.
> > > > > > >
> > > > > > > Right, but we also have to ensure that we also properly disqualify
> > > > > > > clocksources that are misbehaving.
> > > > > > >
> > > > > > > In the case that the current clocksource is running very slow (imagine
> > > > > > > old TSCs that lowered freq with cpufreq), then system time slows down,
> > > > > > > so timers fire late.
> > > > > > > So it would constantly seem like the irqs are being delayed, so with
> > > > > > > your logic we would not disqualify a clearly malfunctioning
> > > > > > > > clocksource.
> > > > > > >
> > > > > > > > The timer callback function clocksource_watchdog() is executed in the
> > > > > > > > context of softirq (run_timer_softirq()). So if softirq is disabled for
> > > > > > > > a long time (one situation is a long softlockup), clocksource_watchdog()
> > > > > > > > will be executed late.
> > > > > > >
> > > > > > > Yes. The reality is that timers are often spuriously delayed. We don't
> > > > > > > want a short burst of timer misbehavior to disqualify a good
> > > > > > > clocksource.
> > > > > > >
> > > > > > > But the problem is that this situation and the one above (with the
> > > > > > > freq changing TSC), will look exactly the same.
> > > > > > >
> > > > > > > So having a situation where if the watchdog clocksource thinks too
> > > > > > > much time has passed between watchdog timers, we can skip judgement,
> > > > > > > assuming its a spurious delay. But I think we need to keep a counter
> > > > > > > so that if this happens 3-5 times in a row, we stop ignoring the
> > > > > > > misbehavior and judge the current clocksource, as it may be running
> > > > > > > slowly.
> > > > > > >
> > > > > > > > >
> > > > > > > > I think it will be better to add this to my patch:
> > > > > > > >   /*
> > > > > > > >    * Interval: 0.5sec.
> > > > > > > > - * MaxInterval: 1s.
> > > > > > > > + * MaxInterval: 20s.
> > > > > > > >    */
> > > > > > > >   #define WATCHDOG_INTERVAL (HZ >> 1)
> > > > > > > > -#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
> > > > > > > > +#define WATCHDOG_MAX_INTERVAL_NS (20 * NSEC_PER_SEC)
> > > > > > > >
> > > > > > >
> > > > > > > Some watchdog counters wrap within 20 seconds, so I don't think this
> > > > > > > is a good idea.
> > > > > > >
> > > > > > > The other proposal to calculate the error rate, rather than a fixed
> > > > > > > error boundary might be useful too, as if the current clocksource and
> > > > > > > watchdog are close, a long timer delay won't disqualify them if we
> > > > > > > scale the error bounds to be within a given error rate.
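[The error-rate idea mentioned above — scaling the allowed deviation with the
observed watchdog interval instead of using a fixed bound — might look like the
following sketch. This is an illustration of the proposal, not existing kernel
code; rate_ppm is an assumed tuning parameter:]

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Proportional skew check: allow a deviation of up to rate_ppm parts
 * per million of the watchdog interval.  A long (delayed) interval
 * then gets a proportionally larger allowance, so a spurious timer
 * delay by itself does not fail the check, while a clocksource that
 * consistently runs fast or slow still gets caught.
 */
static int skew_exceeds_rate(int64_t cs_nsec, int64_t wd_nsec,
                             int64_t rate_ppm)
{
    int64_t allowed_ns = wd_nsec / 1000000 * rate_ppm;

    return llabs(cs_nsec - wd_nsec) > allowed_ns;
}
```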
> > > > > >
> > > > > > In most of the tsc-unstable troubleshooting we have experienced on
> > > > > > modern servers, it usually ends up being a false alarm triggered by the
> > > > > > clock source watchdog for tsc.
> > > > > >
> > > > > > I think Paul has a proposal to make the clock source watchdog more
> > > > > > intelligent.
> > > > > > Its job is to find real problems instead of causing them.
> > > > >
> > > > > And that proposal is now in mainline:
> > > >
> > > > Great! : -)
> > > > >
> > > > > 22a223833716 clocksource: Print deviation in nanoseconds when a clocksource becomes unstable
> > > > > 1253b9b87e42 clocksource: Provide kernel module to test clocksource watchdog
> > > > > 2e27e793e280 clocksource: Reduce clocksource-skew threshold
> > > > > fa218f1cce6b clocksource: Limit number of CPUs checked for clock synchronization
> > > > > 7560c02bdffb clocksource: Check per-CPU clock synchronization when marked unstable
> > > > > db3a34e17433 clocksource: Retry clock read if long delays detected
> > > > >
> > > > > The strategy is to disqualify a clock comparison if the reads took too
> > > > > long, and to retry the comparison.  If four consecutive comparisons take
> > > > > too long, clock skew is reported.  The number of consecutive comparisons
> > > > > may be adjusted by the usual kernel boot parameter.
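[The retry part of that strategy can be sketched as follows — illustrative
userspace C, not the kernel code; in the kernel, cs_watchdog_read() brackets
each clocksource read with two watchdog reads and measures the delay between
them:]

```c
#include <stdint.h>

/* Mirrors the default of the clocksource.max_cswd_read_retries parameter. */
#define MAX_READ_RETRIES 3

/*
 * Walk a sequence of measured read delays (in ns) and report whether a
 * trustworthy comparison was obtained: the first read whose delay is
 * within max_delay_ns wins; if every attempt in the retry budget was
 * delayed, the comparison is disqualified.
 */
static int read_with_retries(const int64_t *delays_ns, int ndelays,
                             int64_t max_delay_ns)
{
    int nretries;

    for (nretries = 0; nretries <= MAX_READ_RETRIES && nretries < ndelays;
         nretries++) {
        if (delays_ns[nretries] <= max_delay_ns)
            return 1;   /* fast read: comparison is trustworthy */
    }
    return 0;           /* all attempts delayed: disqualify this check */
}
```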
> > > > >
> > > > > > so disabling it for a known good tsc might be a reasonably good idea
> > > > > > that can save manpower for other
> > > > > > more valuable problems to solve, or at least make the problem
> > > > > > statistically less likely to happen.
> > > > >
> > > > > One additional piece that is still in prototype state in -rcu is to give
> > > > > clocksources some opportunity to resynchronize if there are too many
> > > > > slow comparisons.  This is intended to handle cases where clocks often
> > > >
> > > > if such a tsc-sync algorithm existed in software, it really could help
> > > > system software engineers
> > > > solve some rare power-good-signal synchronization problems caused by the
> > > > BIOS, which lead to
> > > > boot-time tsc sync check failures that usually consume a huge
> > > > debugging effort when bringing up a qualified linux system.
> > > >
> > > > Depending less on platform quirks should be a good thing for linux
> > > > tsc && rcu support.
> > >
> > > Good point, I have procrastinated long enough.
> > >
> > > How about like this?
> >
> > sorry, I meant a better algorithm that uses the tsc adjust register,
> > like the one tried in arch/x86/kernel/tsc_sync.c
>
> Ah, at this time I must leave that sort of task to others.

sure.
if anyone is still debugging a clock-source watchdog issue, it would be
highly appreciated, and more helpful to others, if we could truly decouple
the TSC from slow clock sources.
we need to be aware of that when looking into clock source watchdog issues.

And many thanks for the value of your patch: I'm seeing another report
that carried "attempt 4, marking unstable"
once or twice at the beginning, followed by fewer occurrences after
hundreds of boots.
: high priority --> medium

>
>                                                         Thanx, Paul
>
> > > ------------------------------------------------------------------------
> > >
> > > commit 9ec2a03bbf4bee3d9fbc02a402dee36efafc5a2d
> > > Author: Paul E. McKenney <paulmck@kernel.org>
> > > Date:   Thu May 27 11:03:28 2021 -0700
> > >
> > >     clocksource: Forgive repeated long-latency watchdog clocksource reads
> > >
> > >     Currently, the clocksource watchdog reacts to repeated long-latency
> > >     clocksource reads by marking that clocksource unstable on the theory that
> > >     these long-latency reads are a sign of a serious problem.  And this theory
> >
> > Maybe we need to use another core's tsc as a reference clock instead of
> > using HPET, which, to my knowledge, is where the problem happens.
> >
> > Ruling out HPET and other slow clock devices as the obviously wrong choice
> > of a reference clock for tsc, I guess there will be less chance that we
> > (in kernel code)
> > will be bothered by other latency problems perceived in the clock
> > source watchdog.
>
> There have been a number of patches that identify systems that tend to
> have unreliable HPET implementations.  Perhaps more systems are needed
> there, or maybe some of those patches need to be backported to the
> version of Linux that you are using.
>
> There was also a patch that identified systems that are expected to have
> reliable TSCs, and that de-emphasized HPET on such systems.
>
>                                                         Thanx, Paul
>
> > >     does in fact have real-world support in the form of firmware issues [1].
> > >
> > >     However, it is also possible to trigger this using stress-ng on what
> > >     the stress-ng man page terms "poorly designed hardware" [2].  And it
> > >     is not necessarily a bad thing for the kernel to diagnose cases where
> > >     heavy memory-contention workloads are being run on hardware that is not
> > >     designed for this sort of use.
> > >
> > >     Nevertheless, it is quite possible that real-world use will result in
> > >     some situation requiring that high-stress workloads run on hardware
> > >     not designed to accommodate them, and also requiring that the kernel
> > >     refrain from marking clocksources unstable.
> > >
> > >     Therefore, react to persistent long-latency reads by leaving the
> > >     clocksource alone, but using the old 62.5-millisecond skew-detection
> > >     threshold.  In addition, the offending clocksource is marked for
> > >     re-initialization, which both restarts that clocksource with a clean bill
> > >     of health and avoids false-positive skew reports on later watchdog checks.
> > >     Once marked for re-initialization, the clocksource is not subjected to
> > >     further watchdog checking until a subsequent successful read from that
> > >     clocksource that is free of excessive delays.
> > >
> > >     However, if clocksource.max_cswd_coarse_reads consecutive clocksource read
> > >     attempts result in long latencies, a warning (splat) will be emitted.
> > >     This kernel boot parameter defaults to 100, and this warning can be
> > >     disabled by setting it to zero or to a negative value.
> > >
> > >     [ paulmck: Apply feedback from Chao Gao ]
> > >
> > >     Link: https://lore.kernel.org/lkml/20210513155515.GB23902@xsang-OptiPlex-9020/ # [1]
> > >     Link: https://lore.kernel.org/lkml/20210521083322.GG25531@xsang-OptiPlex-9020/ # [2]
> > >     Link: https://lore.kernel.org/lkml/20210521084405.GH25531@xsang-OptiPlex-9020/
> > >     Link: https://lore.kernel.org/lkml/20210511233403.GA2896757@paulmck-ThinkPad-P17-Gen-1/
> > >     Tested-by: Chao Gao <chao.gao@intel.com>
> > >     Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > >
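[The consecutive-coarse-read accounting described in the log message above —
reset on a successful read, complain every clocksource.max_cswd_coarse_reads
delayed reads — reduces to a small counter, sketched here as userspace C rather
than the kernel code:]

```c
/*
 * Track consecutive coarse (delayed) reads, mirroring the patch's
 * n_coarse_reads / max_cswd_coarse_reads logic: a fast read resets the
 * counter; a coarse read increments it, and the function returns 1
 * (emit a warning) whenever the count reaches a multiple of the limit.
 */
static int note_coarse_read(unsigned int *n_coarse_reads, int fast_read,
                            int max_coarse_reads)
{
    if (fast_read) {
        *n_coarse_reads = 0;
        return 0;
    }
    ++*n_coarse_reads;
    return max_coarse_reads > 0 &&
           (*n_coarse_reads % max_coarse_reads) == 0;
}
```

[With the default limit of 100, the first warning fires on the 100th
consecutive delayed read, and any successful read starts the count over.]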
> > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > > index 316027c3aadc..61d2436ae9df 100644
> > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > @@ -600,6 +600,14 @@
> > >                         loops can be debugged more effectively on production
> > >                         systems.
> > >
> > > +       clocksource.max_cswd_coarse_reads= [KNL]
> > > +                       Number of consecutive clocksource_watchdog()
> > > +                       coarse reads (that is, clocksource reads that
> > > +                       were unduly delayed) that are permitted before
> > > +                       the kernel complains (gently).  Set to a value
> > > +                       less than or equal to zero to suppress these
> > > +                       complaints.
> > > +
> > >         clocksource.max_cswd_read_retries= [KNL]
> > >                         Number of clocksource_watchdog() retries due to
> > >                         external delays before the clock will be marked
> > > diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
> > > index 1d42d4b17327..3e925d9ffc31 100644
> > > --- a/include/linux/clocksource.h
> > > +++ b/include/linux/clocksource.h
> > > @@ -110,6 +110,7 @@ struct clocksource {
> > >         int                     rating;
> > >         enum clocksource_ids    id;
> > >         enum vdso_clock_mode    vdso_clock_mode;
> > > +       unsigned int            n_coarse_reads;
> > >         unsigned long           flags;
> > >
> > >         int                     (*enable)(struct clocksource *cs);
> > > @@ -291,6 +292,7 @@ static inline void timer_probe(void) {}
> > >  #define TIMER_ACPI_DECLARE(name, table_id, fn)         \
> > >         ACPI_DECLARE_PROBE_ENTRY(timer, name, table_id, 0, NULL, 0, fn)
> > >
> > > +extern int max_cswd_coarse_reads;
> > >  extern ulong max_cswd_read_retries;
> > >  void clocksource_verify_percpu(struct clocksource *cs);
> > >
> > > diff --git a/kernel/time/clocksource-wdtest.c b/kernel/time/clocksource-wdtest.c
> > > index df922f49d171..7e82500c400b 100644
> > > --- a/kernel/time/clocksource-wdtest.c
> > > +++ b/kernel/time/clocksource-wdtest.c
> > > @@ -145,13 +145,12 @@ static int wdtest_func(void *arg)
> > >                 else if (i <= max_cswd_read_retries)
> > >                         s = ", expect message";
> > >                 else
> > > -                       s = ", expect clock skew";
> > > +                       s = ", expect coarse-grained clock skew check and re-initialization";
> > >                 pr_info("--- Watchdog with %dx error injection, %lu retries%s.\n", i, max_cswd_read_retries, s);
> > >                 WRITE_ONCE(wdtest_ktime_read_ndelays, i);
> > >                 schedule_timeout_uninterruptible(2 * HZ);
> > >                 WARN_ON_ONCE(READ_ONCE(wdtest_ktime_read_ndelays));
> > > -               WARN_ON_ONCE((i <= max_cswd_read_retries) !=
> > > -                            !(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE));
> > > +               WARN_ON_ONCE(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE);
> > >                 wdtest_ktime_clocksource_reset();
> > >         }
> > >
> > > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> > > index b8a14d2fb5ba..796a127aabb9 100644
> > > --- a/kernel/time/clocksource.c
> > > +++ b/kernel/time/clocksource.c
> > > @@ -199,6 +199,9 @@ void clocksource_mark_unstable(struct clocksource *cs)
> > >         spin_unlock_irqrestore(&watchdog_lock, flags);
> > >  }
> > >
> > > +int max_cswd_coarse_reads = 100;
> > > +module_param(max_cswd_coarse_reads, int, 0644);
> > > +EXPORT_SYMBOL_GPL(max_cswd_coarse_reads);
> > >  ulong max_cswd_read_retries = 3;
> > >  module_param(max_cswd_read_retries, ulong, 0644);
> > >  EXPORT_SYMBOL_GPL(max_cswd_read_retries);
> > > @@ -226,13 +229,22 @@ static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
> > >                                 pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
> > >                                         smp_processor_id(), watchdog->name, nretries);
> > >                         }
> > > -                       return true;
> > > +                       cs->n_coarse_reads = 0;
> > > +                       return false;
> > >                 }
> > > +               WARN_ONCE(max_cswd_coarse_reads > 0 &&
> > > +                         !(++cs->n_coarse_reads % max_cswd_coarse_reads),
> > > +                         "timekeeping watchdog on CPU%d: %s %u consecutive coarse-grained reads\n", smp_processor_id(), watchdog->name, cs->n_coarse_reads);
> > >         }
> > >
> > > -       pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, marking unstable\n",
> > > -               smp_processor_id(), watchdog->name, wd_delay, nretries);
> > > -       return false;
> > > +       if ((cs->flags & CLOCK_SOURCE_WATCHDOG) && !atomic_read(&watchdog_reset_pending)) {
> > > +               pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, coarse-grained skew check followed by re-initialization\n",
> > > +                       smp_processor_id(), watchdog->name, wd_delay, nretries);
> > > +       } else {
> > > +               pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, awaiting re-initialization\n",
> > > +                       smp_processor_id(), watchdog->name, wd_delay, nretries);
> > > +       }
> > > +       return true;
> > >  }
> > >
> > >  static u64 csnow_mid;
> > > @@ -356,6 +368,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> > >         int next_cpu, reset_pending;
> > >         int64_t wd_nsec, cs_nsec;
> > >         struct clocksource *cs;
> > > +       bool coarse;
> > >         u32 md;
> > >
> > >         spin_lock(&watchdog_lock);
> > > @@ -373,16 +386,13 @@ static void clocksource_watchdog(struct timer_list *unused)
> > >                         continue;
> > >                 }
> > >
> > > -               if (!cs_watchdog_read(cs, &csnow, &wdnow)) {
> > > -                       /* Clock readout unreliable, so give it up. */
> > > -                       __clocksource_unstable(cs);
> > > -                       continue;
> > > -               }
> > > +               coarse = cs_watchdog_read(cs, &csnow, &wdnow);
> > >
> > >                 /* Clocksource initialized ? */
> > >                 if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
> > >                     atomic_read(&watchdog_reset_pending)) {
> > > -                       cs->flags |= CLOCK_SOURCE_WATCHDOG;
> > > +                       if (!coarse)
> > > +                               cs->flags |= CLOCK_SOURCE_WATCHDOG;
> > >                         cs->wd_last = wdnow;
> > >                         cs->cs_last = csnow;
> > >                         continue;
> > > @@ -403,7 +413,13 @@ static void clocksource_watchdog(struct timer_list *unused)
> > >                         continue;
> > >
> > >                 /* Check the deviation from the watchdog clocksource. */
> > > -               md = cs->uncertainty_margin + watchdog->uncertainty_margin;
> > > +               if (coarse) {
> > > +                       md = 62500 * NSEC_PER_USEC;
> > > +                       cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
> > > +                       pr_warn("timekeeping watchdog on CPU%d: %s coarse-grained %lu.%03lu ms clock-skew check followed by re-initialization\n", smp_processor_id(), watchdog->name, md / NSEC_PER_MSEC, md % NSEC_PER_MSEC / NSEC_PER_USEC);
> > > +               } else {
> > > +                       md = cs->uncertainty_margin + watchdog->uncertainty_margin;
> > > +               }
> > >                 if (abs(cs_nsec - wd_nsec) > md) {
> > >                         pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n",
> > >                                 smp_processor_id(), cs->name);


* Re: [PATCH] Clocksource: Avoid misjudgment of clocksource
  2021-11-03  8:27                               ` Luming Yu
@ 2021-11-03 15:54                                 ` Paul E. McKenney
  2021-11-04 10:56                                   ` Luming Yu
  0 siblings, 1 reply; 29+ messages in thread
From: Paul E. McKenney @ 2021-11-03 15:54 UTC (permalink / raw)
  To: Luming Yu
  Cc: John Stultz, yanghui, Thomas Gleixner, Stephen Boyd, lkml, Shaohua Li

On Wed, Nov 03, 2021 at 04:27:42AM -0400, Luming Yu wrote:
> On Mon, Nov 1, 2021 at 12:57 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Mon, Nov 01, 2021 at 05:59:52AM -0400, Luming Yu wrote:
> > > On Fri, Oct 22, 2021 at 7:36 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > >
> > > > On Thu, Oct 21, 2021 at 05:37:24AM -0400, Luming Yu wrote:
> > > > > On Wed, Oct 20, 2021 at 1:49 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > > >
> > > > > > On Wed, Oct 20, 2021 at 06:09:58AM -0400, Luming Yu wrote:
> > > > > > > On Tue, Oct 19, 2021 at 1:04 AM John Stultz <john.stultz@linaro.org> wrote:
> > > > > > > >
> > > > > > > > On Mon, Oct 18, 2021 at 9:14 PM yanghui <yanghui.def@bytedance.com> wrote:
> > > > > > > > > 在 2021/10/19 上午12:14, John Stultz 写道:
> > > > > > > > > > On Tue, Oct 12, 2021 at 1:06 AM brookxu <brookxu.cn@gmail.com> wrote:
> > > > > > > > > >> John Stultz wrote on 2021/10/12 13:29:
> > > > > > > > > >>> On Mon, Oct 11, 2021 at 10:23 PM brookxu <brookxu.cn@gmail.com> wrote:
> > > > > > > > > >>>> John Stultz wrote on 2021/10/12 12:52 下午:
> > > > > > > > > >>>>> On Sat, Oct 9, 2021 at 7:04 AM brookxu <brookxu.cn@gmail.com> wrote:
> > > > > > > > > >>>> If we record the watchdog's start_time in clocksource_start_watchdog(), and then
> > > > > > > > > >>>> when we verify cycles in clocksource_watchdog(), check whether the clocksource
> > > > > > > > > >>>> watchdog is blocked. Due to MSB verification, if the blocked time is greater than
> > > > > > > > > >>>> half of the watchdog timer max_cycles, then we can safely ignore the current
> > > > > > > > > >>>> verification? Do you think this idea is okay?
> > > > > > > > > >>>
> > > > > > > > > >>> I can't say I totally understand the idea. Maybe could you clarify with a patch?
> > > > > > > > > >>>
> > > > > > > > > >>
> > > > > > > > > >> Sorry, it looks almost as follows:
> > > > > > > > > >>
> > > > > > > > > >> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> > > > > > > > > >> index b8a14d2..87f3b67 100644
> > > > > > > > > >> --- a/kernel/time/clocksource.c
> > > > > > > > > >> +++ b/kernel/time/clocksource.c
> > > > > > > > > >> @@ -119,6 +119,7 @@
> > > > > > > > > >>   static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
> > > > > > > > > >>   static DEFINE_SPINLOCK(watchdog_lock);
> > > > > > > > > >>   static int watchdog_running;
> > > > > > > > > >> +static unsigned long watchdog_start_time;
> > > > > > > > > >>   static atomic_t watchdog_reset_pending;
> > > > > > > > > >>
> > > > > > > > > >>   static inline void clocksource_watchdog_lock(unsigned long *flags)
> > > > > > > > > >> @@ -356,6 +357,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > > > > > > > >>          int next_cpu, reset_pending;
> > > > > > > > > >>          int64_t wd_nsec, cs_nsec;
> > > > > > > > > >>          struct clocksource *cs;
> > > > > > > > > >> +       unsigned long max_jiffies;
> > > > > > > > > >>          u32 md;
> > > > > > > > > >>
> > > > > > > > > >>          spin_lock(&watchdog_lock);
> > > > > > > > > >> @@ -402,6 +404,10 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > > > > > > > >>                  if (atomic_read(&watchdog_reset_pending))
> > > > > > > > > >>                          continue;
> > > > > > > > > >>
> > > > > > > > > >> +               max_jiffies = nsecs_to_jiffies(cs->max_idle_ns);
> > > > > > > > > >> +               if (time_is_before_jiffies(watchdog_start_time + max_jiffies))
> > > > > > > > > >> +                       continue;
> > > > > > > > > >> +
> > > > > > > > > >
> > > > > > > > > > Sorry, what is the benefit of using jiffies here?   Jiffies are
> > > > > > > > > > updated by counting the number of tick intervals on the current
> > > > > > > > > > clocksource.
> > > > > > > > > >
> > > > > > > > > > This seems like circular logic, where we're trying to judge the
> > > > > > > > > > current clocksource by using something we derived from the current
> > > > > > > > > > clocksource.
> > > > > > > > > > That's why the watchdog clocksource is important, as it's supposed to
> > > > > > > > > > be a separate counter that is more reliable (but likely slower) than
> > > > > > > > > > the preferred clocksource.
> > > > > > > > > >
> > > > > > > > > > So I'm not really sure how this helps.
> > > > > > > > > >
> > > > > > > > > > The earlier patch by yanghui at least used the watchdog interval to
> > > > > > > > > > decide if the watchdog timer had expired late. Which seemed
> > > > > > > > > > reasonable, but I thought it might be helpful to add some sort of a
> > > > > > > > > > counter so if the case is happening repeatedly (timers constantly
> > > > > > > > > > being delayed) we have a better signal that the watchdog and current
> > > > > > > > > > clocksource are out of sync.  Because again, timers are fired based on
> > > > > > > > >
> > > > > > > > > I think only having a signal is not enough. We need to prevent the
> > > > > > > > > clocksource from being incorrectly switched.
> > > > > > > >
> > > > > > > > Right, but we also have to ensure that we also properly disqualify
> > > > > > > > clocksources that are misbehaving.
> > > > > > > >
> > > > > > > > In the case that the current clocksource is running very slow (imagine
> > > > > > > > old TSCs that lowered freq with cpufreq), then system time slows down,
> > > > > > > > so timers fire late.
> > > > > > > > So it would constantly seem like the irqs are being delayed, so with
> > > > > > > > your logic we would not disqualify a clearly malfunctioning
> > > > > > > > clocksource..
> > > > > > > >
> > > > > > > > > The Timer callback function clocksource_watchdog() is executed in the
> > > > > > > > > context of softirq (run_timer_softirq()). So if softirq is disabled for a
> > > > > > > > > long time (one situation is a long softlockup), clocksource_watchdog()
> > > > > > > > > will be executed late.
> > > > > > > >
> > > > > > > > Yes. The reality is that timers are often spuriously delayed. We don't
> > > > > > > > want a short burst of timer misbehavior to disqualify a good
> > > > > > > > clocksource.
> > > > > > > >
> > > > > > > > But the problem is that this situation and the one above (with the
> > > > > > > > freq changing TSC), will look exactly the same.
> > > > > > > >
> > > > > > > > So having a situation where if the watchdog clocksource thinks too
> > > > > > > > much time has passed between watchdog timers, we can skip judgement,
> > > > > > > > assuming it's a spurious delay. But I think we need to keep a counter
> > > > > > > > so that if this happens 3-5 times in a row, we stop ignoring the
> > > > > > > > misbehavior and judge the current clocksource, as it may be running
> > > > > > > > slowly.
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > I think it will be better to add this to my patch:
> > > > > > > > >   /*
> > > > > > > > >    * Interval: 0.5sec.
> > > > > > > > > - * MaxInterval: 1s.
> > > > > > > > > + * MaxInterval: 20s.
> > > > > > > > >    */
> > > > > > > > >   #define WATCHDOG_INTERVAL (HZ >> 1)
> > > > > > > > > -#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
> > > > > > > > > +#define WATCHDOG_MAX_INTERVAL_NS (20 * NSEC_PER_SEC)
> > > > > > > > >
> > > > > > > >
> > > > > > > > Some watchdog counters wrap within 20 seconds, so I don't think this
> > > > > > > > is a good idea.
> > > > > > > >
> > > > > > > > The other proposal to calculate the error rate, rather than a fixed
> > > > > > > > error boundary might be useful too, as if the current clocksource and
> > > > > > > > watchdog are close, a long timer delay won't disqualify them if we
> > > > > > > > scale the error bounds to be within a given error rate.
> > > > > > >
> > > > > > > In most of the tsc-unstable troubleshooting we have done on modern
> > > > > > > servers, it usually ends up being a false alarm triggered by the
> > > > > > > clocksource watchdog for tsc.
> > > > > > >
> > > > > > > I think Paul has a proposal to make the clocksource watchdog more
> > > > > > > intelligent. Its job is to find real problems instead of causing them.
> > > > > >
> > > > > > And that proposal is now in mainline:
> > > > >
> > > > > Great! : -)
> > > > > >
> > > > > > 22a223833716 clocksource: Print deviation in nanoseconds when a clocksource becomes unstable
> > > > > > 1253b9b87e42 clocksource: Provide kernel module to test clocksource watchdog
> > > > > > 2e27e793e280 clocksource: Reduce clocksource-skew threshold
> > > > > > fa218f1cce6b clocksource: Limit number of CPUs checked for clock synchronization
> > > > > > 7560c02bdffb clocksource: Check per-CPU clock synchronization when marked unstable
> > > > > > db3a34e17433 clocksource: Retry clock read if long delays detected
> > > > > >
> > > > > > The strategy is to disqualify a clock comparison if the reads took too
> > > > > > long, and to retry the comparison.  If four consecutive comparisons take
> > > > > > too long, clock skew is reported.  The number of consecutive comparisons
> > > > > > may be adjusted by the usual kernel boot parameter.
> > > > > >
> > > > > > > so disabling it for a known-good tsc might be a reasonably good idea
> > > > > > > that could save manpower for other, more valuable problems to solve,
> > > > > > > or at least make the problem statistically less likely to happen.
> > > > > >
> > > > > > One additional piece that is still in prototype state in -rcu is to give
> > > > > > clocksources some opportunity to resynchronize if there are too many
> > > > > > slow comparisons.  This is intended to handle cases where clocks often
> > > > >
> > > > > if such a tsc-sync algorithm existed in software, it really could help
> > > > > system software engineers solve some rare power-good-signal
> > > > > synchronization problems, caused by the BIOS, that lead to boot-time
> > > > > tsc sync check failures, which usually consume a huge amount of
> > > > > debugging effort when bringing up a qualified linux system.
> > > > >
> > > > > Less dependence on platform quirks would be a good thing for linux
> > > > > tsc && rcu support.
> > > >
> > > > Good point, I have procrastinated long enough.
> > > >
> > > > How about like this?
> > >
> > > sorry, I meant a better algorithm to use tsc adjust register
> > > like the tried one in  arch/x86/kernel/tsc_sync.c
> >
> > Ah, at this time I must leave that sort of task to others.
> 
> sure.
> if anyone is still debugging a clocksource watchdog issue, it would be
> appreciated, and more helpful to others, to keep in mind that we truly
> need to decouple the TSC from slow clocksources. We need to be aware of
> that when looking into clocksource watchdog issues.

There is an old saying that those with only one clock always know what
time it is, but those with more than one clock never can be sure.  ;-)

In the testing and validation phases, one should supply a known good
external clock against which to check everything.  I could imagine doing
something similar in production using any of a number of Internet time
services, but that project is beyond what I can commit to any time soon.
Note that it is not so simple because it must deal with arbitrary delays
across the Internet, the possibility of broken Internet time services,
and so on.

But in the common case, I bet it would greatly simplify your debugging
efforts!

> And many thanks for the value of your patch: I'm seeing another report
> that carried "attempt 4, marking unstable" one or two times at the
> beginning, followed by fewer occurrences after hundreds of boots.
> : high priority --> medium

I am glad that it helped!

                                                        Thanx, Paul

> > > > ------------------------------------------------------------------------
> > > >
> > > > commit 9ec2a03bbf4bee3d9fbc02a402dee36efafc5a2d
> > > > Author: Paul E. McKenney <paulmck@kernel.org>
> > > > Date:   Thu May 27 11:03:28 2021 -0700
> > > >
> > > >     clocksource: Forgive repeated long-latency watchdog clocksource reads
> > > >
> > > >     Currently, the clocksource watchdog reacts to repeated long-latency
> > > >     clocksource reads by marking that clocksource unstable on the theory that
> > > >     these long-latency reads are a sign of a serious problem.  And this theory
> > >
> > > Maybe we need to use other core's tsc as a reference clock instead of
> > > using HPET, which , to my knowledge ,  is the place where the problem happens.
> > >
> > > Ruling out HPET and other slow clock devices as the obvious wrong choice
> > > of a reference clock for  tsc,  I guess there will be less chance we
> > > (in kernel code)
> > > will get bothered by other latency problems perceived in the clock
> > > source watchdog.
> >
> > There have been a number of patches that identify systems that tend to
> > have unreliable HPET implementations.  Perhaps more systems are needed
> > there, or maybe some of those patches need to be backported to the
> > version of Linux that you are using.
> >
> > There was also a patch that identified systems that are expected to have
> > reliable TSCs, and that de-emphasized HPET on such systems.
> >
> >                                                         Thanx, Paul
> >
> > > >     does in fact have real-world support in the form of firmware issues [1].
> > > >
> > > >     However, it is also possible to trigger this using stress-ng on what
> > > >     the stress-ng man page terms "poorly designed hardware" [2].  And it
> > > >     is not necessarily a bad thing for the kernel to diagnose cases where
> > > >     heavy memory-contention workloads are being run on hardware that is not
> > > >     designed for this sort of use.
> > > >
> > > >     Nevertheless, it is quite possible that real-world use will result in
> > > >     some situation requiring that high-stress workloads run on hardware
> > > >     not designed to accommodate them, and also requiring that the kernel
> > > >     refrain from marking clocksources unstable.
> > > >
> > > >     Therefore, react to persistent long-latency reads by leaving the
> > > >     clocksource alone, but using the old 62.5-millisecond skew-detection
> > > >     threshold.  In addition, the offending clocksource is marked for
> > > >     re-initialization, which both restarts that clocksource with a clean bill
> > > >     of health and avoids false-positive skew reports on later watchdog checks.
> > > >     Once marked for re-initialization, the clocksource is not subjected to
> > > >     further watchdog checking until a subsequent successful read from that
> > > >     clocksource that is free of excessive delays.
> > > >
> > > >     However, if clocksource.max_cswd_coarse_reads consecutive clocksource read
> > > >     attempts result in long latencies, a warning (splat) will be emitted.
> > > >     This kernel boot parameter defaults to 100, and this warning can be
> > > >     disabled by setting it to zero or to a negative value.
> > > >
> > > >     [ paulmck: Apply feedback from Chao Gao ]
> > > >
> > > >     Link: https://lore.kernel.org/lkml/20210513155515.GB23902@xsang-OptiPlex-9020/ # [1]
> > > >     Link: https://lore.kernel.org/lkml/20210521083322.GG25531@xsang-OptiPlex-9020/ # [2]
> > > >     Link: https://lore.kernel.org/lkml/20210521084405.GH25531@xsang-OptiPlex-9020/
> > > >     Link: https://lore.kernel.org/lkml/20210511233403.GA2896757@paulmck-ThinkPad-P17-Gen-1/
> > > >     Tested-by: Chao Gao <chao.gao@intel.com>
> > > >     Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > > >
> > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > > > index 316027c3aadc..61d2436ae9df 100644
> > > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > > @@ -600,6 +600,14 @@
> > > >                         loops can be debugged more effectively on production
> > > >                         systems.
> > > >
> > > > +       clocksource.max_cswd_coarse_reads= [KNL]
> > > > +                       Number of consecutive clocksource_watchdog()
> > > > +                       coarse reads (that is, clocksource reads that
> > > > +                       were unduly delayed) that are permitted before
> > > > +                       the kernel complains (gently).  Set to a value
> > > > +                       less than or equal to zero to suppress these
> > > > +                       complaints.
> > > > +
> > > >         clocksource.max_cswd_read_retries= [KNL]
> > > >                         Number of clocksource_watchdog() retries due to
> > > >                         external delays before the clock will be marked
> > > > diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
> > > > index 1d42d4b17327..3e925d9ffc31 100644
> > > > --- a/include/linux/clocksource.h
> > > > +++ b/include/linux/clocksource.h
> > > > @@ -110,6 +110,7 @@ struct clocksource {
> > > >         int                     rating;
> > > >         enum clocksource_ids    id;
> > > >         enum vdso_clock_mode    vdso_clock_mode;
> > > > +       unsigned int            n_coarse_reads;
> > > >         unsigned long           flags;
> > > >
> > > >         int                     (*enable)(struct clocksource *cs);
> > > > @@ -291,6 +292,7 @@ static inline void timer_probe(void) {}
> > > >  #define TIMER_ACPI_DECLARE(name, table_id, fn)         \
> > > >         ACPI_DECLARE_PROBE_ENTRY(timer, name, table_id, 0, NULL, 0, fn)
> > > >
> > > > +extern int max_cswd_coarse_reads;
> > > >  extern ulong max_cswd_read_retries;
> > > >  void clocksource_verify_percpu(struct clocksource *cs);
> > > >
> > > > diff --git a/kernel/time/clocksource-wdtest.c b/kernel/time/clocksource-wdtest.c
> > > > index df922f49d171..7e82500c400b 100644
> > > > --- a/kernel/time/clocksource-wdtest.c
> > > > +++ b/kernel/time/clocksource-wdtest.c
> > > > @@ -145,13 +145,12 @@ static int wdtest_func(void *arg)
> > > >                 else if (i <= max_cswd_read_retries)
> > > >                         s = ", expect message";
> > > >                 else
> > > > -                       s = ", expect clock skew";
> > > > +                       s = ", expect coarse-grained clock skew check and re-initialization";
> > > >                 pr_info("--- Watchdog with %dx error injection, %lu retries%s.\n", i, max_cswd_read_retries, s);
> > > >                 WRITE_ONCE(wdtest_ktime_read_ndelays, i);
> > > >                 schedule_timeout_uninterruptible(2 * HZ);
> > > >                 WARN_ON_ONCE(READ_ONCE(wdtest_ktime_read_ndelays));
> > > > -               WARN_ON_ONCE((i <= max_cswd_read_retries) !=
> > > > -                            !(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE));
> > > > +               WARN_ON_ONCE(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE);
> > > >                 wdtest_ktime_clocksource_reset();
> > > >         }
> > > >
> > > > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> > > > index b8a14d2fb5ba..796a127aabb9 100644
> > > > --- a/kernel/time/clocksource.c
> > > > +++ b/kernel/time/clocksource.c
> > > > @@ -199,6 +199,9 @@ void clocksource_mark_unstable(struct clocksource *cs)
> > > >         spin_unlock_irqrestore(&watchdog_lock, flags);
> > > >  }
> > > >
> > > > +int max_cswd_coarse_reads = 100;
> > > > +module_param(max_cswd_coarse_reads, int, 0644);
> > > > +EXPORT_SYMBOL_GPL(max_cswd_coarse_reads);
> > > >  ulong max_cswd_read_retries = 3;
> > > >  module_param(max_cswd_read_retries, ulong, 0644);
> > > >  EXPORT_SYMBOL_GPL(max_cswd_read_retries);
> > > > @@ -226,13 +229,22 @@ static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
> > > >                                 pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
> > > >                                         smp_processor_id(), watchdog->name, nretries);
> > > >                         }
> > > > -                       return true;
> > > > +                       cs->n_coarse_reads = 0;
> > > > +                       return false;
> > > >                 }
> > > > +               WARN_ONCE(max_cswd_coarse_reads > 0 &&
> > > > +                         !(++cs->n_coarse_reads % max_cswd_coarse_reads),
> > > > +                         "timekeeping watchdog on CPU%d: %s %u consecutive coarse-grained reads\n", smp_processor_id(), watchdog->name, cs->n_coarse_reads);
> > > >         }
> > > >
> > > > -       pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, marking unstable\n",
> > > > -               smp_processor_id(), watchdog->name, wd_delay, nretries);
> > > > -       return false;
> > > > +       if ((cs->flags & CLOCK_SOURCE_WATCHDOG) && !atomic_read(&watchdog_reset_pending)) {
> > > > +               pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, coarse-grained skew check followed by re-initialization\n",
> > > > +                       smp_processor_id(), watchdog->name, wd_delay, nretries);
> > > > +       } else {
> > > > +               pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, awaiting re-initialization\n",
> > > > +                       smp_processor_id(), watchdog->name, wd_delay, nretries);
> > > > +       }
> > > > +       return true;
> > > >  }
> > > >
> > > >  static u64 csnow_mid;
> > > > @@ -356,6 +368,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > >         int next_cpu, reset_pending;
> > > >         int64_t wd_nsec, cs_nsec;
> > > >         struct clocksource *cs;
> > > > +       bool coarse;
> > > >         u32 md;
> > > >
> > > >         spin_lock(&watchdog_lock);
> > > > @@ -373,16 +386,13 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > >                         continue;
> > > >                 }
> > > >
> > > > -               if (!cs_watchdog_read(cs, &csnow, &wdnow)) {
> > > > -                       /* Clock readout unreliable, so give it up. */
> > > > -                       __clocksource_unstable(cs);
> > > > -                       continue;
> > > > -               }
> > > > +               coarse = cs_watchdog_read(cs, &csnow, &wdnow);
> > > >
> > > >                 /* Clocksource initialized ? */
> > > >                 if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
> > > >                     atomic_read(&watchdog_reset_pending)) {
> > > > -                       cs->flags |= CLOCK_SOURCE_WATCHDOG;
> > > > +                       if (!coarse)
> > > > +                               cs->flags |= CLOCK_SOURCE_WATCHDOG;
> > > >                         cs->wd_last = wdnow;
> > > >                         cs->cs_last = csnow;
> > > >                         continue;
> > > > @@ -403,7 +413,13 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > >                         continue;
> > > >
> > > >                 /* Check the deviation from the watchdog clocksource. */
> > > > -               md = cs->uncertainty_margin + watchdog->uncertainty_margin;
> > > > +               if (coarse) {
> > > > +                       md = 62500 * NSEC_PER_USEC;
> > > > +                       cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
> > > > +                       pr_warn("timekeeping watchdog on CPU%d: %s coarse-grained %lu.%03lu ms clock-skew check followed by re-initialization\n", smp_processor_id(), watchdog->name, md / NSEC_PER_MSEC, md % NSEC_PER_MSEC / NSEC_PER_USEC);
> > > > +               } else {
> > > > +                       md = cs->uncertainty_margin + watchdog->uncertainty_margin;
> > > > +               }
> > > >                 if (abs(cs_nsec - wd_nsec) > md) {
> > > >                         pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n",
> > > >                                 smp_processor_id(), cs->name);

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] Clocksource: Avoid misjudgment of clocksource
  2021-11-03 15:54                                 ` Paul E. McKenney
@ 2021-11-04 10:56                                   ` Luming Yu
  0 siblings, 0 replies; 29+ messages in thread
From: Luming Yu @ 2021-11-04 10:56 UTC (permalink / raw)
  To: paulmck
  Cc: John Stultz, yanghui, Thomas Gleixner, Stephen Boyd, lkml, Shaohua Li

On Wed, Nov 3, 2021 at 11:54 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Wed, Nov 03, 2021 at 04:27:42AM -0400, Luming Yu wrote:
> > On Mon, Nov 1, 2021 at 12:57 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > On Mon, Nov 01, 2021 at 05:59:52AM -0400, Luming Yu wrote:
> > > > On Fri, Oct 22, 2021 at 7:36 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > >
> > > > > On Thu, Oct 21, 2021 at 05:37:24AM -0400, Luming Yu wrote:
> > > > > > On Wed, Oct 20, 2021 at 1:49 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > > > >
> > > > > > > On Wed, Oct 20, 2021 at 06:09:58AM -0400, Luming Yu wrote:
> > > > > > > > On Tue, Oct 19, 2021 at 1:04 AM John Stultz <john.stultz@linaro.org> wrote:
> > > > > > > > >
> > > > > > > > > On Mon, Oct 18, 2021 at 9:14 PM yanghui <yanghui.def@bytedance.com> wrote:
> > > > > > > > > > 在 2021/10/19 上午12:14, John Stultz 写道:
> > > > > > > > > > > On Tue, Oct 12, 2021 at 1:06 AM brookxu <brookxu.cn@gmail.com> wrote:
> > > > > > > > > > >> John Stultz wrote on 2021/10/12 13:29:
> > > > > > > > > > >>> On Mon, Oct 11, 2021 at 10:23 PM brookxu <brookxu.cn@gmail.com> wrote:
> > > > > > > > > > >>>> John Stultz wrote on 2021/10/12 12:52 下午:
> > > > > > > > > > >>>>> On Sat, Oct 9, 2021 at 7:04 AM brookxu <brookxu.cn@gmail.com> wrote:
> > > > > > > > > > >>>> If we record the watchdog's start_time in clocksource_start_watchdog(), and then
> > > > > > > > > > >>>> when we verify cycles in clocksource_watchdog(), check whether the clocksource
> > > > > > > > > > >>>> watchdog is blocked. Due to MSB verification, if the blocked time is greater than
> > > > > > > > > > >>>> half of the watchdog timer max_cycles, then we can safely ignore the current
> > > > > > > > > > >>>> verification? Do you think this idea is okay?
> > > > > > > > > > >>>
> > > > > > > > > > >>> I can't say I totally understand the idea. Maybe could you clarify with a patch?
> > > > > > > > > > >>>
> > > > > > > > > > >>
> > > > > > > > > > >> Sorry, it looks almost as follows:
> > > > > > > > > > >>
> > > > > > > > > > >> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> > > > > > > > > > >> index b8a14d2..87f3b67 100644
> > > > > > > > > > >> --- a/kernel/time/clocksource.c
> > > > > > > > > > >> +++ b/kernel/time/clocksource.c
> > > > > > > > > > >> @@ -119,6 +119,7 @@
> > > > > > > > > > >>   static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
> > > > > > > > > > >>   static DEFINE_SPINLOCK(watchdog_lock);
> > > > > > > > > > >>   static int watchdog_running;
> > > > > > > > > > >> +static unsigned long watchdog_start_time;
> > > > > > > > > > >>   static atomic_t watchdog_reset_pending;
> > > > > > > > > > >>
> > > > > > > > > > >>   static inline void clocksource_watchdog_lock(unsigned long *flags)
> > > > > > > > > > >> @@ -356,6 +357,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > > > > > > > > >>          int next_cpu, reset_pending;
> > > > > > > > > > >>          int64_t wd_nsec, cs_nsec;
> > > > > > > > > > >>          struct clocksource *cs;
> > > > > > > > > > >> +       unsigned long max_jiffies;
> > > > > > > > > > >>          u32 md;
> > > > > > > > > > >>
> > > > > > > > > > >>          spin_lock(&watchdog_lock);
> > > > > > > > > > >> @@ -402,6 +404,10 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > > > > > > > > >>                  if (atomic_read(&watchdog_reset_pending))
> > > > > > > > > > >>                          continue;
> > > > > > > > > > >>
> > > > > > > > > > >> +               max_jiffies = nsecs_to_jiffies(cs->max_idle_ns);
> > > > > > > > > > >> +               if (time_is_before_jiffies(watchdog_start_time + max_jiffies))
> > > > > > > > > > >> +                       continue;
> > > > > > > > > > >> +
> > > > > > > > > > >
> > > > > > > > > > > Sorry, what is the benefit of using jiffies here?   Jiffies are
> > > > > > > > > > > updated by counting the number of tick intervals on the current
> > > > > > > > > > > clocksource.
> > > > > > > > > > >
> > > > > > > > > > > This seems like circular logic, where we're trying to judge the
> > > > > > > > > > > current clocksource by using something we derived from the current
> > > > > > > > > > > clocksource.
> > > > > > > > > > > That's why the watchdog clocksource is important, as it's supposed to
> > > > > > > > > > > be a separate counter that is more reliable (but likely slower) than
> > > > > > > > > > > the preferred clocksource.
> > > > > > > > > > >
> > > > > > > > > > > So I'm not really sure how this helps.
> > > > > > > > > > >
> > > > > > > > > > > The earlier patch by yanghui at least used the watchdog interval to
> > > > > > > > > > > decide if the watchdog timer had expired late. Which seemed
> > > > > > > > > > > reasonable, but I thought it might be helpful to add some sort of a
> > > > > > > > > > > counter so if the case is happening repeatedly (timers constantly
> > > > > > > > > > > being delayed) we have a better signal that the watchdog and current
> > > > > > > > > > > clocksource are out of sync.  Because again, timers are fired based on
> > > > > > > > > >
> > > > > > > > > > I think only having a signal is not enough. We need to prevent the
> > > > > > > > > > clocksource from being incorrectly switched.
> > > > > > > > >
> > > > > > > > > Right, but we also have to ensure that we also properly disqualify
> > > > > > > > > clocksources that are misbehaving.
> > > > > > > > >
> > > > > > > > > In the case that the current clocksource is running very slow (imagine
> > > > > > > > > old TSCs that lowered freq with cpufreq), then system time slows down,
> > > > > > > > > so timers fire late.
> > > > > > > > > So it would constantly seem like the irqs are being delayed, so with
> > > > > > > > > your logic we would not disqualify a clearly malfunctioning
> > > > > > > > > clocksource..
> > > > > > > > >
> > > > > > > > > > The Timer callback function clocksource_watchdog() is executed in the
> > > > > > > > > > context of softirq (run_timer_softirq()). So if softirq is disabled for a
> > > > > > > > > > long time (one situation is a long softlockup), clocksource_watchdog()
> > > > > > > > > > will be executed late.
> > > > > > > > >
> > > > > > > > > Yes. The reality is that timers are often spuriously delayed. We don't
> > > > > > > > > want a short burst of timer misbehavior to disqualify a good
> > > > > > > > > clocksource.
> > > > > > > > >
> > > > > > > > > But the problem is that this situation and the one above (with the
> > > > > > > > > freq changing TSC), will look exactly the same.
> > > > > > > > >
> > > > > > > > > So having a situation where if the watchdog clocksource thinks too
> > > > > > > > > much time has passed between watchdog timers, we can skip judgement,
> > > > > > > > > assuming it's a spurious delay. But I think we need to keep a counter
> > > > > > > > > so that if this happens 3-5 times in a row, we stop ignoring the
> > > > > > > > > misbehavior and judge the current clocksource, as it may be running
> > > > > > > > > slowly.
> > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > I think it will be better to add this to my patch:
> > > > > > > > > >   /*
> > > > > > > > > >    * Interval: 0.5sec.
> > > > > > > > > > - * MaxInterval: 1s.
> > > > > > > > > > + * MaxInterval: 20s.
> > > > > > > > > >    */
> > > > > > > > > >   #define WATCHDOG_INTERVAL (HZ >> 1)
> > > > > > > > > > -#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
> > > > > > > > > > +#define WATCHDOG_MAX_INTERVAL_NS (20 * NSEC_PER_SEC)
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Some watchdog counters wrap within 20 seconds, so I don't think this
> > > > > > > > > is a good idea.
> > > > > > > > >
> > > > > > > > > The other proposal to calculate the error rate, rather than a fixed
> > > > > > > > > error boundary might be useful too, as if the current clocksource and
> > > > > > > > > watchdog are close, a long timer delay won't disqualify them if we
> > > > > > > > > scale the error bounds to be within a given error rate.
> > > > > > > >
> > > > > > > > In most of the tsc-unstable troubleshooting we have done on modern
> > > > > > > > servers, it usually ends up being a false alarm triggered by the
> > > > > > > > clocksource watchdog for tsc.
> > > > > > > >
> > > > > > > > I think Paul has a proposal to make the clocksource watchdog more
> > > > > > > > intelligent. Its job is to find real problems instead of causing them.
> > > > > > >
> > > > > > > And that proposal is now in mainline:
> > > > > >
> > > > > > Great! : -)
> > > > > > >
> > > > > > > 22a223833716 clocksource: Print deviation in nanoseconds when a clocksource becomes unstable
> > > > > > > 1253b9b87e42 clocksource: Provide kernel module to test clocksource watchdog
> > > > > > > 2e27e793e280 clocksource: Reduce clocksource-skew threshold
> > > > > > > fa218f1cce6b clocksource: Limit number of CPUs checked for clock synchronization
> > > > > > > 7560c02bdffb clocksource: Check per-CPU clock synchronization when marked unstable
> > > > > > > db3a34e17433 clocksource: Retry clock read if long delays detected
> > > > > > >
> > > > > > > The strategy is to disqualify a clock comparison if the reads took too
> > > > > > > long, and to retry the comparison.  If four consecutive comparisons take
> > > > > > > too long, clock skew is reported.  The number of consecutive comparisons
> > > > > > > may be adjusted by the usual kernel boot parameter.
> > > > > > >
> > > > > > > > so disabling it for a known-good tsc might be a reasonably good idea
> > > > > > > > that could save manpower for other, more valuable problems to solve,
> > > > > > > > or at least make the problem statistically less likely to happen.
> > > > > > >
> > > > > > > One additional piece that is still in prototype state in -rcu is to give
> > > > > > > clocksources some opportunity to resynchronize if there are too many
> > > > > > > slow comparisons.  This is intended to handle cases where clocks often
> > > > > >
> > > > > > if such a tsc-sync algorithm existed in software, it really could help
> > > > > > system software engineers solve some rare power-good-signal
> > > > > > synchronization problems, caused by the BIOS, that lead to boot-time
> > > > > > tsc sync check failures, which usually consume a huge amount of
> > > > > > debugging effort when bringing up a qualified linux system.
> > > > > >
> > > > > > Less dependence on platform quirks would be a good thing for linux
> > > > > > tsc && rcu support.
> > > > >
> > > > > Good point, I have procrastinated long enough.
> > > > >
> > > > > How about like this?
> > > >
> > > > sorry, I meant a better algorithm to use tsc adjust register
> > > > like the tried one in  arch/x86/kernel/tsc_sync.c
> > >
> > > Ah, at this time I must leave that sort of task to others.
> >
> > sure.
> > if anyone is still debugging a clocksource watchdog issue, it would be
> > highly appreciated, and more helpful to others, to truly decouple the
> > TSC from slow clock sources.
> > We need to be aware of that when looking into clocksource watchdog issues.
>
> There is an old saying that those with only one clock always know what
> time it is, but those with more than one clock never can be sure.  ;-)

We do have many TSC clocks on these systems that can watch each other.
However, you are right that these TSC clocks get their ticks from the
same base clock on the board.

We just need an algorithm that is better than the existing one.

And its value is that it could finally put an end to the false alarms
from the current clocksource watchdog, though it introduces a next level
of ns-diff complexity, where the watchdog code itself can be a
high-latency source that needs to be eliminated. :-(

So we need a genuinely ingenious idea for such code to surface from
lkml's brave hearts; otherwise we will live with your clocksource
watchdog code for quite a long time in the foreseeable future.

Thanks again for your code. :-)
>
> In the testing and validation phases, one should supply a known good
> external clock against which to check everything.  I could imagine doing
> something similar in production using any of a number of Internet time
> services, but that project is beyond what I can commit to any time soon.
> Note that it is not so simple because it must deal with arbitrary delays
> across the Internet, the possibility of broken Internet time services,
> and so on.
>
> But in the common case, I bet it would greatly simplify your debugging
> efforts!
>
> > And many thanks for the value of your patch; I'm seeing another report
> > that clearly carried "attempt 4, marking unstable" once or twice at the
> > beginning, followed by fewer occurrences after hundreds of boots.
> > : high priority --> medium
>
> I am glad that it helped!
>
>                                                         Thanx, Paul
>
> > > > > ------------------------------------------------------------------------
> > > > >
> > > > > commit 9ec2a03bbf4bee3d9fbc02a402dee36efafc5a2d
> > > > > Author: Paul E. McKenney <paulmck@kernel.org>
> > > > > Date:   Thu May 27 11:03:28 2021 -0700
> > > > >
> > > > >     clocksource: Forgive repeated long-latency watchdog clocksource reads
> > > > >
> > > > >     Currently, the clocksource watchdog reacts to repeated long-latency
> > > > >     clocksource reads by marking that clocksource unstable on the theory that
> > > > >     these long-latency reads are a sign of a serious problem.  And this theory
> > > >
> > > > Maybe we need to use another core's tsc as a reference clock instead
> > > > of using HPET, which, to my knowledge, is where the problem happens.
> > > >
> > > > Ruling out HPET and other slow clock devices as the obviously wrong
> > > > choice of a reference clock for tsc, I guess there will be less chance
> > > > that we (in kernel code) will be bothered by other latency problems
> > > > perceived in the clocksource watchdog.
> > >
> > > There have been a number of patches that identify systems that tend to
> > > have unreliable HPET implementations.  Perhaps more systems are needed
> > > there, or maybe some of those patches need to be backported to the
> > > version of Linux that you are using.
> > >
> > > There was also a patch that identified systems that are expected to have
> > > reliable TSCs, and that de-emphasized HPET on such systems.
> > >
> > >                                                         Thanx, Paul
> > >
> > > > >     does in fact have real-world support in the form of firmware issues [1].
> > > > >
> > > > >     However, it is also possible to trigger this using stress-ng on what
> > > > >     the stress-ng man page terms "poorly designed hardware" [2].  And it
> > > > >     is not necessarily a bad thing for the kernel to diagnose cases where
> > > > >     heavy memory-contention workloads are being run on hardware that is not
> > > > >     designed for this sort of use.
> > > > >
> > > > >     Nevertheless, it is quite possible that real-world use will result in
> > > > >     some situation requiring that high-stress workloads run on hardware
> > > > >     not designed to accommodate them, and also requiring that the kernel
> > > > >     refrain from marking clocksources unstable.
> > > > >
> > > > >     Therefore, react to persistent long-latency reads by leaving the
> > > > >     clocksource alone, but using the old 62.5-millisecond skew-detection
> > > > >     threshold.  In addition, the offending clocksource is marked for
> > > > >     re-initialization, which both restarts that clocksource with a clean bill
> > > > >     of health and avoids false-positive skew reports on later watchdog checks.
> > > > >     Once marked for re-initialization, the clocksource is not subjected to
> > > > >     further watchdog checking until a subsequent successful read from that
> > > > >     clocksource that is free of excessive delays.
> > > > >
> > > > >     However, if clocksource.max_cswd_coarse_reads consecutive clocksource read
> > > > >     attempts result in long latencies, a warning (splat) will be emitted.
> > > > >     This kernel boot parameter defaults to 100, and this warning can be
> > > > >     disabled by setting it to zero or to a negative value.
> > > > >
> > > > >     [ paulmck: Apply feedback from Chao Gao ]
> > > > >
> > > > >     Link: https://lore.kernel.org/lkml/20210513155515.GB23902@xsang-OptiPlex-9020/ # [1]
> > > > >     Link: https://lore.kernel.org/lkml/20210521083322.GG25531@xsang-OptiPlex-9020/ # [2]
> > > > >     Link: https://lore.kernel.org/lkml/20210521084405.GH25531@xsang-OptiPlex-9020/
> > > > >     Link: https://lore.kernel.org/lkml/20210511233403.GA2896757@paulmck-ThinkPad-P17-Gen-1/
> > > > >     Tested-by: Chao Gao <chao.gao@intel.com>
> > > > >     Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > > > >
> > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > > > > index 316027c3aadc..61d2436ae9df 100644
> > > > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > > > @@ -600,6 +600,14 @@
> > > > >                         loops can be debugged more effectively on production
> > > > >                         systems.
> > > > >
> > > > > +       clocksource.max_cswd_coarse_reads= [KNL]
> > > > > +                       Number of consecutive clocksource_watchdog()
> > > > > +                       coarse reads (that is, clocksource reads that
> > > > > +                       were unduly delayed) that are permitted before
> > > > > +                       the kernel complains (gently).  Set to a value
> > > > > +                       less than or equal to zero to suppress these
> > > > > +                       complaints.
> > > > > +
> > > > >         clocksource.max_cswd_read_retries= [KNL]
> > > > >                         Number of clocksource_watchdog() retries due to
> > > > >                         external delays before the clock will be marked
> > > > > diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
> > > > > index 1d42d4b17327..3e925d9ffc31 100644
> > > > > --- a/include/linux/clocksource.h
> > > > > +++ b/include/linux/clocksource.h
> > > > > @@ -110,6 +110,7 @@ struct clocksource {
> > > > >         int                     rating;
> > > > >         enum clocksource_ids    id;
> > > > >         enum vdso_clock_mode    vdso_clock_mode;
> > > > > +       unsigned int            n_coarse_reads;
> > > > >         unsigned long           flags;
> > > > >
> > > > >         int                     (*enable)(struct clocksource *cs);
> > > > > @@ -291,6 +292,7 @@ static inline void timer_probe(void) {}
> > > > >  #define TIMER_ACPI_DECLARE(name, table_id, fn)         \
> > > > >         ACPI_DECLARE_PROBE_ENTRY(timer, name, table_id, 0, NULL, 0, fn)
> > > > >
> > > > > +extern int max_cswd_coarse_reads;
> > > > >  extern ulong max_cswd_read_retries;
> > > > >  void clocksource_verify_percpu(struct clocksource *cs);
> > > > >
> > > > > diff --git a/kernel/time/clocksource-wdtest.c b/kernel/time/clocksource-wdtest.c
> > > > > index df922f49d171..7e82500c400b 100644
> > > > > --- a/kernel/time/clocksource-wdtest.c
> > > > > +++ b/kernel/time/clocksource-wdtest.c
> > > > > @@ -145,13 +145,12 @@ static int wdtest_func(void *arg)
> > > > >                 else if (i <= max_cswd_read_retries)
> > > > >                         s = ", expect message";
> > > > >                 else
> > > > > -                       s = ", expect clock skew";
> > > > > +                       s = ", expect coarse-grained clock skew check and re-initialization";
> > > > >                 pr_info("--- Watchdog with %dx error injection, %lu retries%s.\n", i, max_cswd_read_retries, s);
> > > > >                 WRITE_ONCE(wdtest_ktime_read_ndelays, i);
> > > > >                 schedule_timeout_uninterruptible(2 * HZ);
> > > > >                 WARN_ON_ONCE(READ_ONCE(wdtest_ktime_read_ndelays));
> > > > > -               WARN_ON_ONCE((i <= max_cswd_read_retries) !=
> > > > > -                            !(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE));
> > > > > +               WARN_ON_ONCE(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE);
> > > > >                 wdtest_ktime_clocksource_reset();
> > > > >         }
> > > > >
> > > > > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> > > > > index b8a14d2fb5ba..796a127aabb9 100644
> > > > > --- a/kernel/time/clocksource.c
> > > > > +++ b/kernel/time/clocksource.c
> > > > > @@ -199,6 +199,9 @@ void clocksource_mark_unstable(struct clocksource *cs)
> > > > >         spin_unlock_irqrestore(&watchdog_lock, flags);
> > > > >  }
> > > > >
> > > > > +int max_cswd_coarse_reads = 100;
> > > > > +module_param(max_cswd_coarse_reads, int, 0644);
> > > > > +EXPORT_SYMBOL_GPL(max_cswd_coarse_reads);
> > > > >  ulong max_cswd_read_retries = 3;
> > > > >  module_param(max_cswd_read_retries, ulong, 0644);
> > > > >  EXPORT_SYMBOL_GPL(max_cswd_read_retries);
> > > > > @@ -226,13 +229,22 @@ static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
> > > > >                                 pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
> > > > >                                         smp_processor_id(), watchdog->name, nretries);
> > > > >                         }
> > > > > -                       return true;
> > > > > +                       cs->n_coarse_reads = 0;
> > > > > +                       return false;
> > > > >                 }
> > > > > +               WARN_ONCE(max_cswd_coarse_reads > 0 &&
> > > > > +                         !(++cs->n_coarse_reads % max_cswd_coarse_reads),
> > > > > +                         "timekeeping watchdog on CPU%d: %s %u consecutive coarse-grained reads\n", smp_processor_id(), watchdog->name, cs->n_coarse_reads);
> > > > >         }
> > > > >
> > > > > -       pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, marking unstable\n",
> > > > > -               smp_processor_id(), watchdog->name, wd_delay, nretries);
> > > > > -       return false;
> > > > > +       if ((cs->flags & CLOCK_SOURCE_WATCHDOG) && !atomic_read(&watchdog_reset_pending)) {
> > > > > +               pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, coarse-grained skew check followed by re-initialization\n",
> > > > > +                       smp_processor_id(), watchdog->name, wd_delay, nretries);
> > > > > +       } else {
> > > > > +               pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, awaiting re-initialization\n",
> > > > > +                       smp_processor_id(), watchdog->name, wd_delay, nretries);
> > > > > +       }
> > > > > +       return true;
> > > > >  }
> > > > >
> > > > >  static u64 csnow_mid;
> > > > > @@ -356,6 +368,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > > >         int next_cpu, reset_pending;
> > > > >         int64_t wd_nsec, cs_nsec;
> > > > >         struct clocksource *cs;
> > > > > +       bool coarse;
> > > > >         u32 md;
> > > > >
> > > > >         spin_lock(&watchdog_lock);
> > > > > @@ -373,16 +386,13 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > > >                         continue;
> > > > >                 }
> > > > >
> > > > > -               if (!cs_watchdog_read(cs, &csnow, &wdnow)) {
> > > > > -                       /* Clock readout unreliable, so give it up. */
> > > > > -                       __clocksource_unstable(cs);
> > > > > -                       continue;
> > > > > -               }
> > > > > +               coarse = cs_watchdog_read(cs, &csnow, &wdnow);
> > > > >
> > > > >                 /* Clocksource initialized ? */
> > > > >                 if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
> > > > >                     atomic_read(&watchdog_reset_pending)) {
> > > > > -                       cs->flags |= CLOCK_SOURCE_WATCHDOG;
> > > > > +                       if (!coarse)
> > > > > +                               cs->flags |= CLOCK_SOURCE_WATCHDOG;
> > > > >                         cs->wd_last = wdnow;
> > > > >                         cs->cs_last = csnow;
> > > > >                         continue;
> > > > > @@ -403,7 +413,13 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > > >                         continue;
> > > > >
> > > > >                 /* Check the deviation from the watchdog clocksource. */
> > > > > -               md = cs->uncertainty_margin + watchdog->uncertainty_margin;
> > > > > +               if (coarse) {
> > > > > +                       md = 62500 * NSEC_PER_USEC;
> > > > > +                       cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
> > > > > +                       pr_warn("timekeeping watchdog on CPU%d: %s coarse-grained %lu.%03lu ms clock-skew check followed by re-initialization\n", smp_processor_id(), watchdog->name, md / NSEC_PER_MSEC, md % NSEC_PER_MSEC / NSEC_PER_USEC);
> > > > > +               } else {
> > > > > +                       md = cs->uncertainty_margin + watchdog->uncertainty_margin;
> > > > > +               }
> > > > >                 if (abs(cs_nsec - wd_nsec) > md) {
> > > > >                         pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n",
> > > > >                                 smp_processor_id(), cs->name);


* Re: [PATCH] Clocksource: Avoid misjudgment of clocksource
  2021-10-08 23:45 ` John Stultz
  2021-10-09  3:22   ` [External] " yanghui
  2021-10-09 14:04   ` brookxu
@ 2021-12-09 13:14   ` Gang Li
  2021-12-09 16:44     ` Paul E. McKenney
  2 siblings, 1 reply; 29+ messages in thread
From: Gang Li @ 2021-12-09 13:14 UTC (permalink / raw)
  To: john.stultz
  Cc: linux-kernel, sboyd, tglx, yanghui.def, brookxu.cn, paulmck,
	luming.yu, songmuchun, zhengqi.arch

Hi all,

How about changing clocksource_watchdog from classical timer to hrtimer?
hrtimer is more accurate and stable, and makes it much easier for us to 
deal with misjudgments.

-- 
Thanks
Gang Li



* Re: [PATCH] Clocksource: Avoid misjudgment of clocksource
  2021-12-09 13:14   ` Gang Li
@ 2021-12-09 16:44     ` Paul E. McKenney
  0 siblings, 0 replies; 29+ messages in thread
From: Paul E. McKenney @ 2021-12-09 16:44 UTC (permalink / raw)
  To: Gang Li
  Cc: john.stultz, linux-kernel, sboyd, tglx, yanghui.def, brookxu.cn,
	luming.yu, songmuchun, zhengqi.arch

On Thu, Dec 09, 2021 at 09:14:05PM +0800, Gang Li wrote:
> Hi all,
> 
> How about changing clocksource_watchdog from classical timer to hrtimer?
> hrtimer is more accurate and stable, and makes it much easier for us to deal
> with misjudgments.

To make hrtimer more accurate in the busy-system scenarios that are
under test, we would need to specify HRTIMER_MODE_HARD.  Otherwise,
hrtimer handlers execute in softirq, just like timers do.  So without
the HRTIMER_MODE_HARD, the hrtimers would be delayed just as badly as
we are seeing for the normal timers.

Unfortunately, use of HRTIMER_MODE_HARD has response-time consequences,
especially when the timer reads need to be retried.  So my guess is
that hrtimers are not a helpful path forward here.
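
[For readers weighing this trade-off, a minimal sketch of what an
HRTIMER_MODE_HARD variant would look like follows.  This is a
hypothetical fragment, not a proposed patch; the 500 ms period simply
mirrors WATCHDOG_INTERVAL, and it cannot run outside a kernel build.]

```c
/* Sketch only: a watchdog driven by a hard-interrupt hrtimer.
 * HRTIMER_MODE_REL_HARD keeps the handler out of softirq context,
 * which is exactly the response-time trade-off discussed above. */
static struct hrtimer wd_hrtimer;

static enum hrtimer_restart wd_hrtimer_fn(struct hrtimer *t)
{
	/* the clocksource comparison would run here, in hard-irq context */
	hrtimer_forward_now(t, ms_to_ktime(500));	/* ~WATCHDOG_INTERVAL */
	return HRTIMER_RESTART;
}

static void wd_hrtimer_setup(void)
{
	hrtimer_init(&wd_hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
	wd_hrtimer.function = wd_hrtimer_fn;
	hrtimer_start(&wd_hrtimer, ms_to_ktime(500), HRTIMER_MODE_REL_HARD);
}
```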

							Thanx, Paul


end of thread, other threads:[~2021-12-09 16:44 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-08  8:03 [PATCH] Clocksource: Avoid misjudgment of clocksource yanghui
2021-10-08 23:45 ` John Stultz
2021-10-09  3:22   ` [External] " yanghui
2021-10-09  3:38     ` John Stultz
2021-10-09  9:02       ` yanghui
2021-10-12  5:02         ` John Stultz
2021-10-18 10:41           ` yanghui
2021-10-09 14:04   ` brookxu
2021-10-12  4:52     ` John Stultz
2021-10-12  5:21       ` brookxu
2021-10-12  5:29         ` John Stultz
2021-10-12  8:06           ` brookxu
2021-10-14  7:03             ` yanghui
2021-10-15  6:56               ` yanghui
2021-10-15  7:24                 ` brookxu
2021-10-18 16:14             ` John Stultz
2021-10-19  4:14               ` yanghui
2021-10-19  5:00                 ` John Stultz
2021-10-20 10:09                   ` Luming Yu
2021-10-20 17:49                     ` Paul E. McKenney
2021-10-21  9:37                       ` Luming Yu
2021-10-22 23:36                         ` Paul E. McKenney
2021-11-01  9:59                           ` Luming Yu
2021-11-01 16:57                             ` Paul E. McKenney
2021-11-03  8:27                               ` Luming Yu
2021-11-03 15:54                                 ` Paul E. McKenney
2021-11-04 10:56                                   ` Luming Yu
2021-12-09 13:14   ` Gang Li
2021-12-09 16:44     ` Paul E. McKenney
