linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] clocksource, prevent overflow in clocksource_cyc2ns
@ 2012-04-04 15:11 Prarit Bhargava
  2012-04-04 18:00 ` John Stultz
  0 siblings, 1 reply; 22+ messages in thread
From: Prarit Bhargava @ 2012-04-04 15:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: Prarit Bhargava, John Stultz, Thomas Gleixner, Salman Qazi, stable

The clocksource code has a watchdog which runs once a minute.  The
clocksource watchdog calculates the number of nanoseconds since the last
time the watchdog ran and compares that value to the number of nanoseconds
that have passed on another clocksource.  If these values do not match (to
within .0625s) then the watchdog marks the current clocksource as unstable
and switches to another clocksource.

This works so long as the delta between calls of the watchdog is small
(maximum delta is ~18 seconds with the tsc).  After that, the
clocksource_cyc2ns() calculation will overflow and the calculated number
of ns returned is incorrect.

This causes the watchdog to erroneously mark the tsc as unstable and
switch to the hpet.

A long delay on the system is not usual, however, it can be reproduced
simply by doing

echo 1 > /proc/sys/kernel/sysrq
for i in `seq 10000`; do sleep 1000 & done
echo t > /proc/sysrq-trigger

on a 32-cpu system (which is not an unusual number of processes for this
size of system).  This floods the printk buffer and results in a
pause of approximately 600 seconds which prevents the clocksource watchdog
from running during that time.  On the next call, the watchdog erroneously
marks the tsc as unstable and switches to the hpet because the checked
values for the tsc overflow.

Fixing this is simple -- use mult_frac() in clocksource_cyc2ns().

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Salman Qazi <sqazi@google.com>
Cc: stable@kernel.org
---
 include/linux/clocksource.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index fbe89e1..1625ddb 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -273,7 +273,7 @@ static inline u32 clocksource_hz2mult(u32 hz, u32 shift_constant)
  */
 static inline s64 clocksource_cyc2ns(cycle_t cycles, u32 mult, u32 shift)
 {
-	return ((u64) cycles * mult) >> shift;
+	return mult_frac(cycles, mult, (1UL << shift));
 }
 
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns
  2012-04-04 15:11 [PATCH] clocksource, prevent overflow in clocksource_cyc2ns Prarit Bhargava
@ 2012-04-04 18:00 ` John Stultz
  2012-04-04 18:33   ` Prarit Bhargava
  0 siblings, 1 reply; 22+ messages in thread
From: John Stultz @ 2012-04-04 18:00 UTC (permalink / raw)
  To: Prarit Bhargava; +Cc: linux-kernel, Thomas Gleixner, Salman Qazi, stable

On 04/04/2012 08:11 AM, Prarit Bhargava wrote:
> The clocksource code has a watchdog which runs once a minute.  The
> clocksource watchdog calculates the number of nanoseconds since the last
> time the watchdog ran and compares that value to the number of nanoseconds
> that have passed on another clocksource.  If these values do not match (to
> within .0625s) then the watchdog marks the current clocksource as unstable
> and switches to another clocksource.
>
> This works so long as the delta between calls of the watchdog is small
> (maximum delta is ~18 seconds with the tsc).  After that, the
> clocksource_cyc2ns() calculation will overflow and the calculated number
> of ns returned is incorrect.
>
> This causes the watchdog to erroneously mark the tsc as unstable and
> switch to the hpet.
>
> A long delay on the system is not usual, however, it can be reproduced
> simply by doing
>
> echo 1>  /proc/sys/kernel/sysrq
> for i in `seq 10000`; do sleep 1000&  done
> echo t>  /proc/sysrq-trigger
>
> on a 32-cpu system (which is not an unusual number of processes for this
> size of system).  This floods the printk buffer and results in a
> pause of approximately 600 seconds which prevents the clocksource watchdog
> from running during that time.  On the next call, the watchdog erroneously
> marks the tsc as unstable and switches to the hpet because the checked
> values for the tsc overflow.

Hmm. This is a pretty good mix of existing problems. :)

We've seen clocksource watchdog false-positives w/ virtualization 
migrations before, and I think that recently got resolved . Also the 
huge dumps to printk's which runs with irqs off (usually over slow 
serial ports) causing lost time also is an outstanding issue.


> Fixing this is simple -- use mult_frac() in clocksource_cyc2ns().
>
> Signed-off-by: Prarit Bhargava<prarit@redhat.com>
> Cc: John Stultz<johnstul@us.ibm.com>
> Cc: Thomas Gleixner<tglx@linutronix.de>
> Cc: Salman Qazi<sqazi@google.com>
> Cc: stable@kernel.org
> ---
>   include/linux/clocksource.h |    2 +-
>   1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
> index fbe89e1..1625ddb 100644
> --- a/include/linux/clocksource.h
> +++ b/include/linux/clocksource.h
> @@ -273,7 +273,7 @@ static inline u32 clocksource_hz2mult(u32 hz, u32 shift_constant)
>    */
>   static inline s64 clocksource_cyc2ns(cycle_t cycles, u32 mult, u32 shift)
>   {
> -	return ((u64) cycles * mult)>>  shift;
> +	return mult_frac(cycles, mult, (1UL<<  shift));
>   }
>
>
So I'm not sure this is actually the right fix.

The first issue is that the overflow is expected at a specific 
frequency, and the clocksource code and clockevent code work together to 
make sure the timekeeping core can accumulate time within that 
frequency. This is important not just for multiplication overflows, but 
also because some clocksource hardware wraps after just a few seconds, 
requiring we accumulate frequently.

The rest of the timekeeping code is designed with the requirement that 
the overflow you're solving with the above doesn't happen.  It doesn't 
mean it can't happen (as you've experience), but that you're running 
outside the expected bounds of the timekeeping code. So just fixing the 
multiply in clocksource_cyc2ns doesn't also change the assumptions 
elsewhere in the timekeeping code.

Additionally, gettimeofday/clock_gettime is a hotpath, and this adds 
extra overhead to the calculation.  So that's not great either.

So as I said above, I think there's really two issues at play here:
1) If you starve the timekeeping code from being able to periodically 
accumulate time bad things happen.
2) The clocksource watchdog in particular cannot suffer much delay at 
all, and falsely triggers on occasion.

Item #1 is really an issue of degree. We can try to be more flexible by 
stretching the expected length of the period out further. But some 
hardware that wraps quickly just cannot accommodate long intervals.  On 
the TSC we have quite a bit of time till it wraps, but by keeping the 
periods shorter, we allow for finer grained clock adjustments for NTP.  
So its somewhat of a balancing act between allowing for really precise 
timekeeping and robustness in the face of interrupt starvation (and we 
have only been getting better here, the periodic accumulation was only a 
tick back in the day, so SMIs or any long period with irqs off would 
cause lost time). Currently the TSC period should be 10 minutes before 
we overflow (which aligns to your 600 seconds above). I believe this to 
be a reasonable length of time where we should expect interrupts to be 
able to occur.   For example, other common clock hardware, like the 
acpi_pm can really only last ~5 seconds before it wraps.

Item #2 is where I think the real fix is needed. The watchdog has been 
useful in being able to detect flakey TSCs, but because running with a 
bad TSC can be very problematic for the kernel, its is quick on the 
trigger to disqualify it. Also, over the years, the expected use cases 
have grown for systems, and that has run into problems as it hits the 
bounds of the watchdogs' expectations. In fact, I suspect its not that 
the TSC wrapping but the HPET or acpi_pm that is being used as the 
"stable" watchdog clocksource.  One likely way to trigger a false 
positive is if the watchdog is the acpi_pm, but the clocksoruce is the 
TSC, we can delay/starve interrupts for quite some time, since the TSC 
won't wrap. But if the watchdog arrives late, and the acpi_pm clock has 
wrapped, then it assumes the TSC is broken. The problem with these 
issues is from the kernel's perspective you have to trust one of the two 
clocks. A similar issue can happen w/ the HPET but not as quickly (since 
it won't wrap as fast,  although maybe you're hitting the multiplication 
overflow there... honestly your 18 second interval time in the above 
isn't lining up for me right now, have to think more on that).

The hard problem is that the TSC can go wrong in *many* ways. Its a 
terrible architecture feature,  but there's nothing close to it 
performance wise, so we do what we can to try to make use of it when its 
safe.   Its variation of problems makes it hard to detect false 
positives when the "trusted" watchdog fails.

One idea I've had is to replace the watchdog clocksource with the 
read_persistent_clock() RTC  interface. This would allow us to avoid 
wrapping issues with watchdog clocksources, however it would also mean 
we couldn't really check the TSC accuracy for a number of seconds to 
minutes (due to the RTC granularity being just seconds itself). Or maybe 
it would be good to still use the watchdog, but double check against the 
RTC if we get a false positive long after boot?

Another idea I had tried to work on a few years ago was to try to catch 
watchdog overflows was checking if the TSC delta ==  watchdog delta + 
some multiple of watchdog intervals, but I didn't manage to get the math 
right there, and with small enough wrapping intervals and a long enough 
delay, a % error bound becomes impossible to trigger.

One idea might be to replace the cyc2ns w/ mult_frac in only the 
watchdog code.  I need to think on that some more (and maybe have you 
provide some debug output) to really understand how that's solving the 
issue for you, but it would be able to be done w/o affecting the other 
assumptions of the timekeeping core.

Thomas: Any other ideas? Looking at the watchdog code, since its not an 
hrtimer, could it also possibly be delayed past its .0625s threshold 
even just due to long NOHZ deep idle? I suspect quite a few assumptions 
have changed since that code was written. :)

thanks
-john


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns
  2012-04-04 18:00 ` John Stultz
@ 2012-04-04 18:33   ` Prarit Bhargava
  2012-04-05  1:08     ` John Stultz
  0 siblings, 1 reply; 22+ messages in thread
From: Prarit Bhargava @ 2012-04-04 18:33 UTC (permalink / raw)
  To: John Stultz; +Cc: linux-kernel, Thomas Gleixner, Salman Qazi, stable


> One idea might be to replace the cyc2ns w/ mult_frac in only the watchdog code. 
> I need to think on that some more (and maybe have you provide some debug output)
> to really understand how that's solving the issue for you, but it would be able
> to be done w/o affecting the other assumptions of the timekeeping core.
> 

Hey John,

After reading the initial part of your reply I was thinking about calling
mult_frac() directly from the watchdog code as well.

Here's some debug output I cobbled together to get an idea of how quickly the
overflow was happening.

[    5.435323] clocksource_watchdog: {0} cs tsc csfirst 227349443638728 mask
0xFFFFFFFFFFFFFFFF mult 797281036 shift 31
[    5.444930] clocksource_watchdog: {0} wd hpet wdfirst 78332535 mask
0xFFFFFFFF mult 292935555 shift 22

These, of course, are just the basic data from the clocksources tsc and hpet.

This next output shows the overflow of the clocksource_cys2ns() function.

The output is

clocksource_watchdog: {cpu_number} cs cs->read() clocksource_cyc2ns(csnow -
csfirst) clocksource_cyc2ns_calling_mult_frac()

and

clocksource_watchdog: {cpu_number} wd wd->read() clocksource_cyc2ns(wdnow - wdfirst)

[   19.429080] clocksource_watchdog: {28} cs 37709489529 5410200359 14000134951
[   19.436964] clocksource_watchdog: {28} wd 200456437 14000133902
[   19.928803] clocksource_watchdog: {29} cs 39056111109 5910151011 14500085603
[   19.936678] clocksource_watchdog: {29} wd 207614817 14500084315
[   20.428600] clocksource_watchdog: {30} cs 40402933983 6410176395 15000110987
[   20.436477] clocksource_watchdog: {30} wd 214774270 15000109668
[   20.928376] clocksource_watchdog: {31} cs 41749696692 6910179442 15500114034
[   20.936260] clocksource_watchdog: {31} wd 221933402 15500112602
[   21.428091] clocksource_watchdog: {0} cs 43096297329 7410122318 16000056910
[   21.435878] clocksource_watchdog: {0} wd 229091680 16000055891
[   21.927855] clocksource_watchdog: {1} cs 44443031499 7910114770 16500049362
[   21.935642] clocksource_watchdog: {1} wd 236250654 16500047790
[   22.427640] clocksource_watchdog: {2} cs 45789820092 8410127427 17000062019
[   22.435426] clocksource_watchdog: {2} wd 243409925 17000060432
[   22.927479] clocksource_watchdog: {3} cs 47136754215 320259522 17500128706

^^^^ Right here.  The output of clocksource_cyc2ns() overflows, so in theory
anything with a delay of ~18 seconds or greater would cause an overflow in the
watchdog calculation.

[   22.935168] clocksource_watchdog: {3} wd 250569969 17500127061
[   23.427221] clocksource_watchdog: {4} cs 48483425124 820228487 18000097671
[   23.434916] clocksource_watchdog: {4} wd 257728618 18000096262
[   23.926970] clocksource_watchdog: {5} cs 49830120549 1320206554 18500075738
[   23.934762] clocksource_watchdog: {5} wd 264887389 18500073983
[   24.426772] clocksource_watchdog: {6} cs 51176954790 1820236158 19000105342
[   24.434564] clocksource_watchdog: {6} wd 272046903 19000103596
[   24.926565] clocksource_watchdog: {7} cs 52523765154 2320256898 19500126082
[   24.934343] clocksource_watchdog: {7} wd 279206289 19500124270

[Aside: Eventually, the hpet overflows too!]

So on to the reproducer ... in which I did

echo 1 > /proc/sys/kernel/sysrq
for i in `seq 10000`; do sleep 1000 &  done
echo t > /proc/sysrq-trigger


Apr  2 20:05:17 intel-canoepass-02 kernel: [  104.429864]  [<ffffffff814fb7db>]
do_nanosleep+0x8b/0xc0
Apr  2 20:05:17 intel-canoepass-02 kernel: [  104.429866]  [<ffffffff81097324>]
hrtimer_nanosleep+0xc4/0x180

<snip lots of backtraces for 14000 tasks>

[  621.639952] Clocksource tsc unstable (delta = -4589931578 ns)

And then obviously the system switches to hpet.  If, however, I switch to using
the mult_func() the problem goes away so I'm sure my theory is right.

Let me know if you'd like to see any other debug from this.  I can certainly
dump anything you want.

P.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns
  2012-04-04 18:33   ` Prarit Bhargava
@ 2012-04-05  1:08     ` John Stultz
  2012-04-05 11:00       ` Prarit Bhargava
  2012-04-05 12:27       ` Prarit Bhargava
  0 siblings, 2 replies; 22+ messages in thread
From: John Stultz @ 2012-04-05  1:08 UTC (permalink / raw)
  To: Prarit Bhargava; +Cc: linux-kernel, Thomas Gleixner, Salman Qazi, stable

On 04/04/2012 11:33 AM, Prarit Bhargava wrote:
>> One idea might be to replace the cyc2ns w/ mult_frac in only the watchdog code.
>> I need to think on that some more (and maybe have you provide some debug output)
>> to really understand how that's solving the issue for you, but it would be able
>> to be done w/o affecting the other assumptions of the timekeeping core.
>>
> Hey John,
>
> After reading the initial part of your reply I was thinking about calling
> mult_frac() directly from the watchdog code as well.
>
> Here's some debug output I cobbled together to get an idea of how quickly the
> overflow was happening.
>
> [    5.435323] clocksource_watchdog: {0} cs tsc csfirst 227349443638728 mask
> 0xFFFFFFFFFFFFFFFF mult 797281036 shift 31
> [    5.444930] clocksource_watchdog: {0} wd hpet wdfirst 78332535 mask
> 0xFFFFFFFF mult 292935555 shift 22
>
> These, of course, are just the basic data from the clocksources tsc and hpet.

If I'm doing the math right, these are ~2.7 Ghz cpus?

So what kernel version are you using?

In trying to reproduce this locally against Linus' HEAD on a much 
smaller system (single core + HT 1.6Ghz), I got:
[    6.611366] clocksource_watchdog: {0} cs tsc csfirst 36177888648 mask 
ffffffffffffffff mult 10485747 shift 24
[    6.611596] clocksource_watchdog: {0} wd hpet wdfirst 169168400 mask 
ffffffff mult 2684354560 shift 26

Note the smaller shift values. Not too long ago the shift calculation 
was adjusted to allow for longer periods between interrupts,  so I 
suspect you're on an older kernel.

Further, using your debug patch on my system, it was well beyond 10 
minutes before the debug overflow occurred.  And similarly I couldn't 
trip the watchdog trigger using sysrq-t (but again, only two threads 
here, so not nearly as much data to print as you have).

Could you verify that the issue you're seeing is still is present w/ 
current mainline?  Please don't take this as me dismissing your 
problem!  As I mentioned earlier there are some known issues w/ the 
clocksource watchdog code. But I want to narrow down if you're  problem  
is currently present in mainline or only in older kernels, as that will 
help us find the proper fix.

thanks
-john


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns
  2012-04-05  1:08     ` John Stultz
@ 2012-04-05 11:00       ` Prarit Bhargava
  2012-04-05 16:23         ` John Stultz
  2012-04-05 12:27       ` Prarit Bhargava
  1 sibling, 1 reply; 22+ messages in thread
From: Prarit Bhargava @ 2012-04-05 11:00 UTC (permalink / raw)
  To: John Stultz; +Cc: linux-kernel, Thomas Gleixner, Salman Qazi, stable



On 04/04/2012 09:08 PM, John Stultz wrote:
> On 04/04/2012 11:33 AM, Prarit Bhargava wrote:
>>> One idea might be to replace the cyc2ns w/ mult_frac in only the watchdog code.
>>> I need to think on that some more (and maybe have you provide some debug output)
>>> to really understand how that's solving the issue for you, but it would be able
>>> to be done w/o affecting the other assumptions of the timekeeping core.
>>>
>> Hey John,
>>
>> After reading the initial part of your reply I was thinking about calling
>> mult_frac() directly from the watchdog code as well.
>>
>> Here's some debug output I cobbled together to get an idea of how quickly the
>> overflow was happening.
>>
>> [    5.435323] clocksource_watchdog: {0} cs tsc csfirst 227349443638728 mask
>> 0xFFFFFFFFFFFFFFFF mult 797281036 shift 31
>> [    5.444930] clocksource_watchdog: {0} wd hpet wdfirst 78332535 mask
>> 0xFFFFFFFF mult 292935555 shift 22
>>
>> These, of course, are just the basic data from the clocksources tsc and hpet.
> 
> If I'm doing the math right, these are ~2.7 Ghz cpus?

Yes.

> 
> So what kernel version are you using?

I was on an earlier version of Fedora (F16) ... but I'll jump forward and see if
I can still hit it.

> 
> In trying to reproduce this locally against Linus' HEAD on a much smaller system
> (single core + HT 1.6Ghz), I got:
> [    6.611366] clocksource_watchdog: {0} cs tsc csfirst 36177888648 mask
> ffffffffffffffff mult 10485747 shift 24
> [    6.611596] clocksource_watchdog: {0} wd hpet wdfirst 169168400 mask ffffffff
> mult 2684354560 shift 26
> 
> Note the smaller shift values. Not too long ago the shift calculation was
> adjusted to allow for longer periods between interrupts,  so I suspect you're on
> an older kernel.
> 
> Further, using your debug patch on my system, it was well beyond 10 minutes
> before the debug overflow occurred.  And similarly I couldn't trip the watchdog
> trigger using sysrq-t (but again, only two threads here, so not nearly as much
> data to print as you have).

I'm going to try this on a 32-cpu system (running the previously mentioned test)
with linux.git HEAD.

> 
> Could you verify that the issue you're seeing is still is present w/ current
> mainline?  Please don't take this as me dismissing your problem!  As I mentioned

Absolutely :)  I didn't take it that way at all. .... when I get in this AM I'll
bang out a test and see if I can cause this to happen with sysrq-t.  Keep in
mind that 10000 threads is the *minimum* I was able to cause this with, which is
only ~315 threads/cpu, which isn't a lot :/.  At that number of threads the dump
takes about 6 mins.  Doubling it, IIRC, exceeded 10 mins.

> earlier there are some known issues w/ the clocksource watchdog code. But I want
> to narrow down if you're  problem  is currently present in mainline or only in
> older kernels, as that will help us find the proper fix.

Thanks John,

P.

> 
> thanks
> -john
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns
  2012-04-05  1:08     ` John Stultz
  2012-04-05 11:00       ` Prarit Bhargava
@ 2012-04-05 12:27       ` Prarit Bhargava
  2012-04-05 16:45         ` John Stultz
                           ` (2 more replies)
  1 sibling, 3 replies; 22+ messages in thread
From: Prarit Bhargava @ 2012-04-05 12:27 UTC (permalink / raw)
  To: John Stultz; +Cc: linux-kernel, Thomas Gleixner, Salman Qazi, stable

> 
> So what kernel version are you using?

I retested using top of the linux.git tree, running

echo 1 > /proc/sys/kernel/sysrq
for i in `seq 10000`; do sleep 1000 & done
echo t > /proc/sysrq-trigger

and I no longer see a problem.  However, if I increase the number of threads to
1000/cpu I get

Clocksource %s unstable (delta = -429565427)
Clocksource switching to hpet

> to narrow down if you're  problem  is currently present in mainline or only in
> older kernels, as that will help us find the proper fix.

If I hack in (sorry for the cut-and-paste)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index c958338..f38b8d0 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -279,11 +279,16 @@ static void clocksource_watchdog(unsigned long data)
                        continue;
                }

-               wd_nsec = clocksource_cyc2ns((wdnow - cs->wd_last) & watchdog->m
-                                            watchdog->mult, watchdog->shift);
+               /*wd_nsec = clocksource_cyc2ns((wdnow - cs->wd_last) & watchdog-
+                                            watchdog->mult, watchdog->shift);*/
+               wd_nsec = mult_frac(((wdnow - cs->wd_last), watchdog->mult,
+                                   1UL << watchdog->shift);
+
+               /*cs_nsec = clocksource_cyc2ns((csnow - cs->cs_last) &
+                                            cs->mask, cs->mult, cs->shift);*/
+               cs_nsec = mult_frac(((csnow - cs->cs_last), cs->mult,
+                                   1UL << cs->shift);

-               cs_nsec = clocksource_cyc2ns((csnow - cs->cs_last) &
-                                            cs->mask, cs->mult, cs->shift);
                cs->cs_last = csnow;
                cs->wd_last = wdnow;


then I don't see unstable messages.

I think the problem is still here but it only happens in extreme cases.

P.

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns
  2012-04-05 11:00       ` Prarit Bhargava
@ 2012-04-05 16:23         ` John Stultz
  0 siblings, 0 replies; 22+ messages in thread
From: John Stultz @ 2012-04-05 16:23 UTC (permalink / raw)
  To: Prarit Bhargava; +Cc: linux-kernel, Thomas Gleixner, Salman Qazi, stable

On 04/05/2012 04:00 AM, Prarit Bhargava wrote:
>
> On 04/04/2012 09:08 PM, John Stultz wrote:
>> So what kernel version are you using?
> I was on an earlier version of Fedora (F16) ... but I'll jump forward and see if
> I can still hit it.

Ok, but if you have the specific kernel version, it would help, since 
even if the issue is resolved upstream, we can see if a backport to 
stable is appropriate.

> Keep in mind that 10000 threads is the *minimum* I was able to cause 
> this with, which is only ~315 threads/cpu, which isn't a lot :/. At 
> that number of threads the dump takes about 6 mins. Doubling it, IIRC, 
> exceeded 10 mins.

At that point, if we're starving the system of interrupts for over 10 
minutes, a number of other non-timekeeping issues can crop up. I've seen 
scsci controllers fall over if their heart-beat checks don't come in, etc.

I suspect if printk dumps taking 10 minutes or more is going to be 
considered "acceptable" behavior, we may need to see about adding some 
breaks in the printk code, so that the system can take a few necessary 
irqs.

thanks
-john



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns
  2012-04-05 12:27       ` Prarit Bhargava
@ 2012-04-05 16:45         ` John Stultz
  2012-04-06 23:29         ` Thomas Gleixner
  2012-04-18 23:20         ` John Stultz
  2 siblings, 0 replies; 22+ messages in thread
From: John Stultz @ 2012-04-05 16:45 UTC (permalink / raw)
  To: Prarit Bhargava; +Cc: linux-kernel, Thomas Gleixner, Salman Qazi, stable

On 04/05/2012 05:27 AM, Prarit Bhargava wrote:
>> So what kernel version are you using?
> I retested using top of the linux.git tree, running
>
> echo 1>  /proc/sys/kernel/sysrq
> for i in `seq 10000`; do sleep 1000&  done
> echo t>  /proc/sysrq-trigger
>
> and I no longer see a problem.  However, if I increase the number of threads to
> 1000/cpu I get
>
> Clocksource %s unstable (delta = -429565427)
> Clocksource switching to hpet

Yea, so this is with the 12minute + stall starving the system from irqs? 
Again, its all a matter of degree.

>
>> to narrow down if you're  problem  is currently present in mainline or only in
>> older kernels, as that will help us find the proper fix.
> If I hack in (sorry for the cut-and-paste)
>
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index c958338..f38b8d0 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -279,11 +279,16 @@ static void clocksource_watchdog(unsigned long data)
>                          continue;
>                  }
>
> -               wd_nsec = clocksource_cyc2ns((wdnow - cs->wd_last)&  watchdog->m
> -                                            watchdog->mult, watchdog->shift);
> +               /*wd_nsec = clocksource_cyc2ns((wdnow - cs->wd_last)&  watchdog-
> +                                            watchdog->mult, watchdog->shift);*/
> +               wd_nsec = mult_frac(((wdnow - cs->wd_last), watchdog->mult,
> +                                   1UL<<  watchdog->shift);
> +
> +               /*cs_nsec = clocksource_cyc2ns((csnow - cs->cs_last)&
> +                                            cs->mask, cs->mult, cs->shift);*/
> +               cs_nsec = mult_frac(((csnow - cs->cs_last), cs->mult,
> +                                   1UL<<  cs->shift);
>
> -               cs_nsec = clocksource_cyc2ns((csnow - cs->cs_last)&
> -                                            cs->mask, cs->mult, cs->shift);
>                  cs->cs_last = csnow;
>                  cs->wd_last = wdnow;
>
>
> then I don't see unstable messages.
>
> I think the problem is still here but it only happens in extreme cases.

I suspect if you were closely watching the clock  you'd probably see 
some odd time inconsistencies as well. Its all due to running outside 
the expected bounds of the system.  If you were to use the acpi_pm 
clocksource and stall the system for 10+ minutes, when you returned the 
system clock would be 10 minutes behind.

I think if 10 minutes printk dumps is going to be expected behavior 
(which given the size of these machines, isn't unreasonable), we may 
need to find a way to allow printk to take a break every so often and 
allow irqs to happen. Disabling irqs for 10+ minutes can cause all sorts 
of system trouble.  I've seen scsi adapters hang if their heart beat 
code didn't get an irq every 10 minutes or so.

I had proposed something like this awhile back, but the issue cropping 
up there was lost time due to long printks holding off irqs while we're 
using the jiffies clocks in early boot, so it was at a much smaller scale.

Could you give the following a try and let me know if it helps?

thanks
-john


Only flush 1k chunks max from console_unlock to allow IRQ starvation 
from happening.

NOT FOR INCLUSION

Signed-off-by: John Stultz <john.stultz@linaro.org>

diff --git a/kernel/printk.c b/kernel/printk.c
index b663c2c..b699cc7 100644
--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -1272,6 +1272,7 @@ void console_unlock(void)
  	unsigned long flags;
  	unsigned _con_start, _log_end;
  	unsigned wake_klogd = 0, retry = 0;
+	unsigned chunk_size, length;

  	if (console_suspended) {
  		up(&console_sem);
@@ -1280,15 +1281,18 @@ void console_unlock(void)

  	console_may_schedule = 0;

+	chunk_size = min(LOG_BUF_MASK, 1024); /* 1k chunks */
  again:
  	for ( ; ; ) {
  		raw_spin_lock_irqsave(&logbuf_lock, flags);
  		wake_klogd |= log_start - log_end;
  		if (con_start == log_end)
  			break;			/* Nothing to print */
+		length = (log_end - con_start)&  LOG_BUF_MASK;
+		length = min(length , chunk_size);
  		_con_start = con_start;
-		_log_end = log_end;
-		con_start = log_end;		/* Flush */
+		_log_end = (con_start + length)&  LOG_BUF_MASK;
+		con_start = _log_end;		/* Flush */
  		raw_spin_unlock(&logbuf_lock);
  		stop_critical_timings();	/* don't trace print latency */
  		call_console_drivers(_con_start, _log_end);


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns
  2012-04-05 12:27       ` Prarit Bhargava
  2012-04-05 16:45         ` John Stultz
@ 2012-04-06 23:29         ` Thomas Gleixner
  2012-04-07 13:47           ` Prarit Bhargava
  2012-04-18 23:20         ` John Stultz
  2 siblings, 1 reply; 22+ messages in thread
From: Thomas Gleixner @ 2012-04-06 23:29 UTC (permalink / raw)
  To: Prarit Bhargava; +Cc: John Stultz, linux-kernel, Salman Qazi, stable

On Thu, 5 Apr 2012, Prarit Bhargava wrote:

> > 
> > So what kernel version are you using?
> 
> I retested using top of the linux.git tree, running
> 
> echo 1 > /proc/sys/kernel/sysrq
> for i in `seq 10000`; do sleep 1000 & done
> echo t > /proc/sysrq-trigger
> 
> and I no longer see a problem.  However, if I increase the number of threads to
> 1000/cpu I get
> 
> Clocksource %s unstable (delta = -429565427)
> Clocksource switching to hpet

You are issuing a command which puts the kernel into a state where is
dumps data for several seconds with interrupts disabled. And you expect that
everything can cope with that?
 
> If I hack in (sorry for the cut-and-paste)
> ....
> +               cs_nsec = mult_frac(((csnow - cs->cs_last), cs->mult,
> +                                   1UL << cs->shift);
> 
> -               cs_nsec = clocksource_cyc2ns((csnow - cs->cs_last) &
> -                                            cs->mask, cs->mult, cs->shift);
> then I don't see unstable messages.

That does not make your approach more correct. The HPET wraparound
time is ~3 seconds, so you screwed everything already, when your dump
lasts longer than that. And there are clocksources which wrap way
faster.

No, you can't fix that by hacking the timer code. A wraparound CANNOT
be fixed by hacks.

So instead of fiddling in the victims, please fix the root cause,
i.e. that stupid sysrq-t code which should not need to have interrupts
disabled just to dump all that state. If that's not possible, send a
patch to the sysrq documentation and warn about the consequences.

But stay away from code which is correct already. You CANNOT fix a
problem which is caused by abnormal system state by hacking the code
which is exposing the problem.

All you do is making hot pathes more expensive with a very dubious
value. The time related calls are hotpath functions and optimized.

Aside of that you are breaking all architectures which do not have a
native 64/32 instruction.

This mult_frac stuff is not going to happen, period.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns
  2012-04-06 23:29         ` Thomas Gleixner
@ 2012-04-07 13:47           ` Prarit Bhargava
  0 siblings, 0 replies; 22+ messages in thread
From: Prarit Bhargava @ 2012-04-07 13:47 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: John Stultz, linux-kernel, Salman Qazi, stable



On 04/06/2012 07:29 PM, Thomas Gleixner wrote:
> On Thu, 5 Apr 2012, Prarit Bhargava wrote:
> 
>>>
>>> So what kernel version are you using?
>>
>> I retested using top of the linux.git tree, running
>>
>> echo 1 > /proc/sys/kernel/sysrq
>> for i in `seq 10000`; do sleep 1000 & done
>> echo t > /proc/sysrq-trigger
>>
>> and I no longer see a problem.  However, if I increase the number of threads to
>> 1000/cpu I get
>>
>> Clocksource %s unstable (delta = -429565427)
>> Clocksource switching to hpet
> 
> You are issuing a command which puts the kernel into a state where is
> dumps data for several seconds with interrupts disabled. And you expect that
> everything can cope with that?

Yes actually.  I do expect that everything "copes" with it.  I don't find it
unreasonable with system sizes increasing that functionality that has been
around for years works.

However, I also understand that no one expected or saw this problem -- I'm not
blaming anyone or screaming "Hey!  This is broken!!!".

>  
>> If I hack in (sorry for the cut-and-paste)
>> ....
>> +               cs_nsec = mult_frac(((csnow - cs->cs_last), cs->mult,
>> +                                   1UL << cs->shift);
>>
>> -               cs_nsec = clocksource_cyc2ns((csnow - cs->cs_last) &
>> -                                            cs->mask, cs->mult, cs->shift);
>> then I don't see unstable messages.
> 
> That does not make your approach more correct. The HPET wraparound
> time is ~3 seconds, so you screwed everything already, when your dump
> lasts longer than that. And there are clocksources which wrap way
> faster.
> 
> No, you can't fix that by hacking the timer code. A wraparound CANNOT
> be fixed by hacks.
> 
> So instead of fiddling in the victims, please fix the root cause,
> i.e. that stupid sysrq-t code which should not need to have interrupts
> disabled just to dump all that state. If that's not possible, send a
> patch to the sysrq documentation and warn about the consequences.
> 
> But stay away from code which is correct already. You CANNOT fix a
> problem which is caused by abnormal system state by hacking the code
> which is exposing the problem.
> 
> All you do is making hot pathes more expensive with a very dubious
> value. The time related calls are hotpath functions and optimized.
> 
> Aside of that you are breaking all architectures which do not have a
> native 64/32 instruction.
> 
> This mult_frac stuff is not going to happen, period.

Okay -- thanks for the info.  Much appreciated.

P.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns
  2012-04-05 12:27       ` Prarit Bhargava
  2012-04-05 16:45         ` John Stultz
  2012-04-06 23:29         ` Thomas Gleixner
@ 2012-04-18 23:20         ` John Stultz
  2012-04-18 23:59           ` Prarit Bhargava
  2 siblings, 1 reply; 22+ messages in thread
From: John Stultz @ 2012-04-18 23:20 UTC (permalink / raw)
  To: Prarit Bhargava; +Cc: linux-kernel, Thomas Gleixner, Salman Qazi, stable

On 04/05/2012 05:27 AM, Prarit Bhargava wrote:
>> So what kernel version are you using?
> I retested using top of the linux.git tree, running
>
> echo 1>  /proc/sys/kernel/sysrq
> for i in `seq 10000`; do sleep 1000&  done
> echo t>  /proc/sysrq-trigger
>
> and I no longer see a problem.  However, if I increase the number of threads to
> 1000/cpu I get
>
> Clocksource %s unstable (delta = -429565427)
> Clocksource switching to hpet
>
>> to narrow down if you're  problem  is currently present in mainline or only in
>> older kernels, as that will help us find the proper fix.
> If I hack in (sorry for the cut-and-paste)
>
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index c958338..f38b8d0 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -279,11 +279,16 @@ static void clocksource_watchdog(unsigned long data)
>                          continue;
>                  }
>
> -               wd_nsec = clocksource_cyc2ns((wdnow - cs->wd_last)&  watchdog->m
> -                                            watchdog->mult, watchdog->shift);
> +               /*wd_nsec = clocksource_cyc2ns((wdnow - cs->wd_last)&  watchdog-
> +                                            watchdog->mult, watchdog->shift);*/
> +               wd_nsec = mult_frac(((wdnow - cs->wd_last), watchdog->mult,
> +                                   1UL<<  watchdog->shift);
> +
> +               /*cs_nsec = clocksource_cyc2ns((csnow - cs->cs_last)&
> +                                            cs->mask, cs->mult, cs->shift);*/
> +               cs_nsec = mult_frac(((csnow - cs->cs_last), cs->mult,
> +                                   1UL<<  cs->shift);
>
> -               cs_nsec = clocksource_cyc2ns((csnow - cs->cs_last)&
> -                                            cs->mask, cs->mult, cs->shift);
>                  cs->cs_last = csnow;
>                  cs->wd_last = wdnow;
>
>
> then I don't see unstable messages.
>
> I think the problem is still here but it only happens in extreme cases.
>

Hey Prarit,
     So at tglx's prodding I took a look at the sysrq code, and the 
problem is the entire sysrq path runs with irqs disabled. As you 
note,with many cores and many processes, it can take a while to spit all 
that data out.

Instead of the earlier hack I suggested, would you try the following 
simpler one? I suspect we just need to touch the clocksource watchdog 
before returning.  This should avoid the TSC disqualification you're 
seeing. On systems using clocksources that wrap, we'll still lose time, 
since no time accumulation occurred during the long irq off period, but 
I think that's acceptable given this is not normal operation.

Let me know if this helps.

thanks
-john

As irqs may be disabled for quite some time in the sysrq path, touch 
clocksource
watchdog before re-enabling interrupts.

Signed-off-by: John Stultz <john.stultz@linaro.org>

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index 05728894..28fe2cb 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -41,6 +41,7 @@
  #include<linux/slab.h>
  #include<linux/input.h>
  #include<linux/uaccess.h>
+#include<linux/clocksource.h>

  #include<asm/ptrace.h>
  #include<asm/irq_regs.h>
@@ -544,6 +545,7 @@ void __handle_sysrq(int key, bool check_mask)
  		printk("\n");
  		console_loglevel = orig_log_level;
  	}
+	clocksource_touch_watchdog();
  	spin_unlock_irqrestore(&sysrq_key_table_lock, flags);
  }



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns
  2012-04-18 23:20         ` John Stultz
@ 2012-04-18 23:59           ` Prarit Bhargava
  2012-04-19  0:18             ` John Stultz
  2012-04-19 12:37             ` Thomas Gleixner
  0 siblings, 2 replies; 22+ messages in thread
From: Prarit Bhargava @ 2012-04-18 23:59 UTC (permalink / raw)
  To: John Stultz; +Cc: linux-kernel, Thomas Gleixner, Salman Qazi, stable


> 
> Hey Prarit,
>     So at tglx's prodding I took a look at the sysrq code, and the problem is
> the entire sysrq path runs with irqs disabled. As you note,with many cores and
> many processes, it can take a while to spit all that data out.
> 
> Instead of the earlier hack I suggested, would you try the following simpler
> one? I suspect we just need to touch the clocksource watchdog before returning. 
> This should avoid the TSC disqualification you're seeing. On systems using
> clocksources that wrap, we'll still lose time, since no time accumulation
> occurred during the long irq off period, but I think that's acceptable given
> this is not normal operation.
> 
> Let me know if this helps.

Hey John,

Thanks for continuing to work on this.  Coincidentally that exact patch was my
first attempt at resolving the problem as well.  The problem is that even after
touching the clocksource watchdog and restoring irqs the printk buffer can take
a LONG time to flush -- and that still will cause an overflow comparison.  So
fixing it with just a touch_clocksource_watchdog() isn't the right thing to do
IMO.  Maybe a combination of the printk() patch you suggested earlier and the
touch_clocksource_watchdog() is the right way to go but I'll leave that up to
tglx and yourself to decide on a correct fix.

There's also some additional information that I've been gathering on this issue;
I have seen *idle* systems switch to the hpet because the clocksource watchdog
hits the overflow comparison.  As expected it happens much less frequently on
newer kernels (linux.git top of tree) than older stable kernels (2.6.32 based)
due to the difference in shift values but it is happening in both cases.

The odd thing about this behaviour is that I would expect it to occur with the
same frequency on small systems as it does on large systems with linux.git as
the watchdog fires once/second.  AFAICT I do not see this on small systems but
see it only on systems with greater than 24 cpus (both Intel and AMD).

Using debug code similar to the dump code I previously provided, I can see that
every so often these large systems can hit a case where the tsc wraps and the
hpet is still monotonically increasing.  When the unstable calculation is
performed the result is obviously affected by the overflow.  Sometimes this
comparison overflow happens within 18 minutes, other times it can take hours or
days.

The other part of this puzzle is that if switch between the tsc and hpet every
10 seconds, and run a gettimeofday() comparison program, the gettimeofday()
program will return a backwards time[1] event usually within half-an-hour.  [I'm
just including this info here to point out that switching between clocksources
seems to cause some momentary instability.  Before anyone points this out I will
say that this not a "real world" bug.  I'm trying to find out if anyone actually
does switch from the tsc to hpet (and back) on multi-purposed systems.  I'm
hoping the answer to that is "no" :) ].

P.

[1] is not really a backwards time event.  The program does a comparison of
gettimeofday() values and returns an error if the new value is less than the
older value.  The gettimeofday() value jumps *forward* slightly for one returned
value when the clocksource switches from the tsc to the hpet (I have not see a
failure the other way around), and that causes the comparison to fail on the
next comparison of gettimeofday().  If I remove the code that returns an error
gettimeofday() returns to returning incrementally increasing values.  Typical
jumps are in the tens of microseconds range.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns
  2012-04-18 23:59           ` Prarit Bhargava
@ 2012-04-19  0:18             ` John Stultz
  2012-04-19 11:56               ` Prarit Bhargava
  2012-04-19 12:50               ` Thomas Gleixner
  2012-04-19 12:37             ` Thomas Gleixner
  1 sibling, 2 replies; 22+ messages in thread
From: John Stultz @ 2012-04-19  0:18 UTC (permalink / raw)
  To: Prarit Bhargava; +Cc: linux-kernel, Thomas Gleixner, Salman Qazi, stable

On 04/18/2012 04:59 PM, Prarit Bhargava wrote:
>
> Hey John,
>
> Thanks for continuing to work on this.  Coincidentally that exact patch was my
> first attempt at resolving the problem as well.  The problem is that even after
> touching the clocksource watchdog and restoring irqs the printk buffer can take
> a LONG time to flush -- and that still will cause an overflow comparison.  So
> fixing it with just a touch_clocksource_watchdog() isn't the right thing to do
> IMO.  Maybe a combination of the printk() patch you suggested earlier and the
> touch_clocksource_watchdog() is the right way to go but I'll leave that up to
> tglx and yourself to decide on a correct fix.
:( That's a bummer. Something similar may be useful on the printk side 
as well.


> There's also some additional information that I've been gathering on this issue;
> I have seen *idle* systems switch to the hpet because the clocksource watchdog
> hits the overflow comparison.  As expected it happens much less frequently on
> newer kernels (linux.git top of tree) than older stable kernels (2.6.32 based)
> due to the difference in shift values but it is happening in both cases.

Some of the recent adjustments for more robust shift calculations may 
partially be responsible for the improvement. Although I'm not sure why 
idle systems (that don't halt the TSC in idle) would trip this.  Do let 
me know if you find any particular way of reproducing this.

> The odd thing about this behaviour is that I would expect it to occur with the
> same frequency on small systems as it does on large systems with linux.git as
> the watchdog fires once/second.  AFAICT I do not see this on small systems but
> see it only on systems with greater than 24 cpus (both Intel and AMD).
Hrm.

> Using debug code similar to the dump code I previously provided, I can see that
> every so often these large systems can hit a case where the tsc wraps and the
> hpet is still monotonically increasing.  When the unstable calculation is
> performed the result is obviously affected by the overflow.  Sometimes this
> comparison overflow happens within 18 minutes, other times it can take hours or
> days.
TSC wraps? Are you sure that's what you see? Or do you have that 
switched? With the HPET wrapping?


> The other part of this puzzle is that if switch between the tsc and hpet every
> 10 seconds, and run a gettimeofday() comparison program, the gettimeofday()
> program will return a backwards time[1] event usually within half-an-hour.  [I'm
> just including this info here to point out that switching between clocksources
> seems to cause some momentary instability.  Before anyone points this out I will
> say that this not a "real world" bug.  I'm trying to find out if anyone actually
> does switch from the tsc to hpet (and back) on multi-purposed systems.  I'm
> hoping the answer to that is "no" :) ].
So, there were some recent fixes for 3.4 to address an issue 
specifically around inconsistencies at clocksource switch time:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a939e817aa7e199d2fff05a67cb745be32dd5c2d
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=f695cf94837de53864180400cbac42cfa370426f

I definitely want to make sure any sort of inconsistencies like that are 
resolved. So let me know if you can still trigger anything like that 
with the latest 3.4 kernel.

thanks
-john


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns
  2012-04-19  0:18             ` John Stultz
@ 2012-04-19 11:56               ` Prarit Bhargava
  2012-04-19 12:50               ` Thomas Gleixner
  1 sibling, 0 replies; 22+ messages in thread
From: Prarit Bhargava @ 2012-04-19 11:56 UTC (permalink / raw)
  To: John Stultz; +Cc: linux-kernel, Thomas Gleixner, Salman Qazi, stable



On 04/18/2012 08:18 PM, John Stultz wrote:
> On 04/18/2012 04:59 PM, Prarit Bhargava wrote:
>>
>> Hey John,
>>
>> Thanks for continuing to work on this.  Coincidentally that exact patch was my
>> first attempt at resolving the problem as well.  The problem is that even after
>> touching the clocksource watchdog and restoring irqs the printk buffer can take
>> a LONG time to flush -- and that still will cause an overflow comparison.  So
>> fixing it with just a touch_clocksource_watchdog() isn't the right thing to do
>> IMO.  Maybe a combination of the printk() patch you suggested earlier and the
>> touch_clocksource_watchdog() is the right way to go but I'll leave that up to
>> tglx and yourself to decide on a correct fix.
> :( That's a bummer. Something similar may be useful on the printk side as well.

Hmm ... I'll give that a shot.

> 
> 
>> There's also some additional information that I've been gathering on this issue;
>> I have seen *idle* systems switch to the hpet because the clocksource watchdog
>> hits the overflow comparison.  As expected it happens much less frequently on
>> newer kernels (linux.git top of tree) than older stable kernels (2.6.32 based)
>> due to the difference in shift values but it is happening in both cases.
> 
> Some of the recent adjustments for more robust shift calculations may partially
> be responsible for the improvement. Although I'm not sure why idle systems (that
> don't halt the TSC in idle) would trip this.  Do let me know if you find any
> particular way of reproducing this.
> 
>> The odd thing about this behaviour is that I would expect it to occur with the
>> same frequency on small systems as it does on large systems with linux.git as
>> the watchdog fires once/second.  AFAICT I do not see this on small systems but
>> see it only on systems with greater than 24 cpus (both Intel and AMD).
> Hrm.

Yeah, it's odd.  I have no idea why more cpus makes any difference :/

> 
>> Using debug code similar to the dump code I previously provided, I can see that
>> every so often these large systems can hit a case where the tsc wraps and the
>> hpet is still monotonically increasing.  When the unstable calculation is
>> performed the result is obviously affected by the overflow.  Sometimes this
>> comparison overflow happens within 18 minutes, other times it can take hours or
>> days.
> TSC wraps? Are you sure that's what you see? Or do you have that switched? With
> the HPET wrapping?

Sorry, you're right -- the HPET wraps.  I mistyped that.

> 
> 
>> The other part of this puzzle is that if switch between the tsc and hpet every
>> 10 seconds, and run a gettimeofday() comparison program, the gettimeofday()
>> program will return a backwards time[1] event usually within half-an-hour.  [I'm
>> just including this info here to point out that switching between clocksources
>> seems to cause some momentary instability.  Before anyone points this out I will
>> say that this not a "real world" bug.  I'm trying to find out if anyone actually
>> does switch from the tsc to hpet (and back) on multi-purposed systems.  I'm
>> hoping the answer to that is "no" :) ].
> So, there were some recent fixes for 3.4 to address an issue specifically around
> inconsistencies at clocksource switch time:
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a939e817aa7e199d2fff05a67cb745be32dd5c2d
> 
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=f695cf94837de53864180400cbac42cfa370426f
> 

AFAICT I have both of these in my tree.  It is linux-2.6.git as of
592fe8980688e7cba46897685d014c7fb3018a67.

I am doing

while (true)
do
        val=`ps aux | egrep $1 | wc -l`
        if [ $val -ne 2 ]; then
                exit 1
        fi
        echo "switching to tsc"
        echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource

        sleep 10
        val=`ps aux | egrep $1 | wc -l`
        if [ $val -ne 2 ]; then
                exit 1
        fi
        echo "switching to hpet"
        echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource
        sleep 10
done

where $1 is the pid of my gettimeofday() comparison test.  As I said, the test
exists when a "backwards" time event occurs so the script above also bails.

> 
> I definitely want to make sure any sort of inconsistencies like that are
> resolved. So let me know if you can still trigger anything like that with the
> latest 3.4 kernel.

I'll dig into this a bit more then -- I have a few things I want to investigate.
 I'll also try the touch_clocksource_watchdog() in the printk() code and get
back to in a few days.

P.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns
  2012-04-18 23:59           ` Prarit Bhargava
  2012-04-19  0:18             ` John Stultz
@ 2012-04-19 12:37             ` Thomas Gleixner
  2012-04-19 12:51               ` Thomas Gleixner
  1 sibling, 1 reply; 22+ messages in thread
From: Thomas Gleixner @ 2012-04-19 12:37 UTC (permalink / raw)
  To: Prarit Bhargava; +Cc: John Stultz, linux-kernel, Salman Qazi, stable

On Wed, 18 Apr 2012, Prarit Bhargava wrote:
> There's also some additional information that I've been gathering on this issue;
> I have seen *idle* systems switch to the hpet because the clocksource watchdog
> hits the overflow comparison.  As expected it happens much less frequently on
> newer kernels (linux.git top of tree) than older stable kernels (2.6.32 based)
> due to the difference in shift values but it is happening in both cases.
> 
> The odd thing about this behaviour is that I would expect it to occur with the
> same frequency on small systems as it does on large systems with linux.git as
> the watchdog fires once/second.  AFAICT I do not see this on small systems but
> see it only on systems with greater than 24 cpus (both Intel and AMD).
>
> Using debug code similar to the dump code I previously provided, I can see that
> every so often these large systems can hit a case where the tsc wraps and the
> hpet is still monotonically increasing.  When the unstable calculation is
> performed the result is obviously affected by the overflow.  Sometimes this
> comparison overflow happens within 18 minutes, other times it can take hours or
> days.

You are describing symptoms, but the root cause is obviously that the
watchdog does not get invoked in time. The question is why.

Can you please add the patch below and enable scheduler, timer and irq
events in the tracer. Tracing will stop once the watchdog triggers.

Please provide the traces. We need to understand the root cause of
this idle wreckage.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns
  2012-04-19  0:18             ` John Stultz
  2012-04-19 11:56               ` Prarit Bhargava
@ 2012-04-19 12:50               ` Thomas Gleixner
  2012-04-19 12:52                 ` Thomas Gleixner
  1 sibling, 1 reply; 22+ messages in thread
From: Thomas Gleixner @ 2012-04-19 12:50 UTC (permalink / raw)
  To: John Stultz; +Cc: Prarit Bhargava, linux-kernel, Salman Qazi, stable

On Wed, 18 Apr 2012, John Stultz wrote:
> On 04/18/2012 04:59 PM, Prarit Bhargava wrote:
> > 
> > Hey John,
> > 
> > Thanks for continuing to work on this.  Coincidentally that exact patch was
> > my
> > first attempt at resolving the problem as well.  The problem is that even
> > after
> > touching the clocksource watchdog and restoring irqs the printk buffer can
> > take
> > a LONG time to flush -- and that still will cause an overflow comparison.
> > So
> > fixing it with just a touch_clocksource_watchdog() isn't the right thing to
> > do
> > IMO.  Maybe a combination of the printk() patch you suggested earlier and
> > the
> > touch_clocksource_watchdog() is the right way to go but I'll leave that up
> > to
> > tglx and yourself to decide on a correct fix.
> :( That's a bummer. Something similar may be useful on the printk side as
> well.

No. The show_state() part prints into the buffer. But it's not
guaranteed that the buffer is flushed right away. It could be flushed
later as well in a different context. And of course the flush code
runs with interrupts disabled and dumping out a gazillion of lines
over serial will cause the same hickup. Just planting random
touch_watchdog() calls into the code is not the right approach,
really.

We should think about the reasons why we have interrupts disabled for
so much time. Is that really, really necessary ?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns
  2012-04-19 12:37             ` Thomas Gleixner
@ 2012-04-19 12:51               ` Thomas Gleixner
  0 siblings, 0 replies; 22+ messages in thread
From: Thomas Gleixner @ 2012-04-19 12:51 UTC (permalink / raw)
  To: Prarit Bhargava; +Cc: John Stultz, linux-kernel, Salman Qazi, stable

On Thu, 19 Apr 2012, Thomas Gleixner wrote:

> On Wed, 18 Apr 2012, Prarit Bhargava wrote:
> > There's also some additional information that I've been gathering on this issue;
> > I have seen *idle* systems switch to the hpet because the clocksource watchdog
> > hits the overflow comparison.  As expected it happens much less frequently on
> > newer kernels (linux.git top of tree) than older stable kernels (2.6.32 based)
> > due to the difference in shift values but it is happening in both cases.
> > 
> > The odd thing about this behaviour is that I would expect it to occur with the
> > same frequency on small systems as it does on large systems with linux.git as
> > the watchdog fires once/second.  AFAICT I do not see this on small systems but
> > see it only on systems with greater than 24 cpus (both Intel and AMD).
> >
> > Using debug code similar to the dump code I previously provided, I can see that
> > every so often these large systems can hit a case where the tsc wraps and the
> > hpet is still monotonically increasing.  When the unstable calculation is
> > performed the result is obviously affected by the overflow.  Sometimes this
> > comparison overflow happens within 18 minutes, other times it can take hours or
> > days.
> 
> You are describing symptoms, but the root cause is obviously that the
> watchdog does not get invoked in time. The question is why.
> 
> Can you please add the patch below and enable scheduler, timer and irq
> events in the tracer. Tracing will stop once the watchdog triggers.
> 
> Please provide the traces. We need to understand the root cause of
> this idle wreckage.
> 
> Thanks,
> 
> 	tglx

-ENOPATCH :) 

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index c958338..2214323 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -287,11 +287,15 @@ static void clocksource_watchdog(unsigned long data)
 		cs->cs_last = csnow;
 		cs->wd_last = wdnow;
 
+		trace_printk("wd %lld %lld cs %lld %lld\n" , wdnow, wd_nsec,
+			     csnow, cs_nsec);
+
 		if (atomic_read(&watchdog_reset_pending))
 			continue;
 
 		/* Check the deviation from the watchdog clocksource. */
 		if ((abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD)) {
+			tracing_off();
 			clocksource_unstable(cs, cs_nsec - wd_nsec);
 			continue;
 		}

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns
  2012-04-19 12:50               ` Thomas Gleixner
@ 2012-04-19 12:52                 ` Thomas Gleixner
  2012-04-19 13:06                   ` Prarit Bhargava
  2012-04-19 18:12                   ` John Stultz
  0 siblings, 2 replies; 22+ messages in thread
From: Thomas Gleixner @ 2012-04-19 12:52 UTC (permalink / raw)
  To: John Stultz; +Cc: Prarit Bhargava, linux-kernel, Salman Qazi, stable

On Thu, 19 Apr 2012, Thomas Gleixner wrote:

> On Wed, 18 Apr 2012, John Stultz wrote:
> > On 04/18/2012 04:59 PM, Prarit Bhargava wrote:
> > > 
> > > Hey John,
> > > 
> > > Thanks for continuing to work on this.  Coincidentally that exact patch was
> > > my
> > > first attempt at resolving the problem as well.  The problem is that even
> > > after
> > > touching the clocksource watchdog and restoring irqs the printk buffer can
> > > take
> > > a LONG time to flush -- and that still will cause an overflow comparison.
> > > So
> > > fixing it with just a touch_clocksource_watchdog() isn't the right thing to
> > > do
> > > IMO.  Maybe a combination of the printk() patch you suggested earlier and
> > > the
> > > touch_clocksource_watchdog() is the right way to go but I'll leave that up
> > > to
> > > tglx and yourself to decide on a correct fix.
> > :( That's a bummer. Something similar may be useful on the printk side as
> > well.
> 
> No. The show_state() part prints into the buffer. But it's not
> guaranteed that the buffer is flushed right away. It could be flushed
> later as well in a different context. And of course the flush code
> runs with interrupts disabled and dumping out a gazillion of lines
> over serial will cause the same hickup. Just planting random
> touch_watchdog() calls into the code is not the right approach,
> really.
> 
> We should think about the reasons why we have interrupts disabled for
> so much time. Is that really, really necessary ?

I'm not against making the clocksource code more robust, but I don't
want to add crap there just to cope with complete madness elsewhere.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns
  2012-04-19 12:52                 ` Thomas Gleixner
@ 2012-04-19 13:06                   ` Prarit Bhargava
  2012-04-19 13:18                     ` Thomas Gleixner
  2012-04-19 18:12                   ` John Stultz
  1 sibling, 1 reply; 22+ messages in thread
From: Prarit Bhargava @ 2012-04-19 13:06 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: John Stultz, linux-kernel, Salman Qazi, stable



On 04/19/2012 08:52 AM, Thomas Gleixner wrote:
> On Thu, 19 Apr 2012, Thomas Gleixner wrote:
> 
>> On Wed, 18 Apr 2012, John Stultz wrote:
>>> On 04/18/2012 04:59 PM, Prarit Bhargava wrote:
>>>>

>> No. The show_state() part prints into the buffer. But it's not
>> guaranteed that the buffer is flushed right away. It could be flushed
>> later as well in a different context. And of course the flush code
>> runs with interrupts disabled and dumping out a gazillion of lines
>> over serial will cause the same hickup. Just planting random
>> touch_watchdog() calls into the code is not the right approach,
>> really.
>>
>> We should think about the reasons why we have interrupts disabled for
>> so much time. Is that really, really necessary ?

In the case of the sysrq-t, I would argue that it is.  The whole point behind
the sysrq-t is that we're capturing the *current* state of the system.  Having
that output effected by interrupts seems like a bad idea.

> 
> I'm not against making the clocksource code more robust, but I don't
> want to add crap there just to cope with complete madness elsewhere.

Maybe I came off the wrong way but I completely agree with that sentiment.  Like
yourself, I'm looking for a correct fix rather than a fast fix.

Sorry that I haven't provided any debug info but I'm still in the gathering data
stage atm.  It was just John's ping that made me "brain dump" the current info I
had.

P.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns
  2012-04-19 13:06                   ` Prarit Bhargava
@ 2012-04-19 13:18                     ` Thomas Gleixner
  0 siblings, 0 replies; 22+ messages in thread
From: Thomas Gleixner @ 2012-04-19 13:18 UTC (permalink / raw)
  To: Prarit Bhargava; +Cc: John Stultz, linux-kernel, Salman Qazi, stable

On Thu, 19 Apr 2012, Prarit Bhargava wrote:
> On 04/19/2012 08:52 AM, Thomas Gleixner wrote:
> > On Thu, 19 Apr 2012, Thomas Gleixner wrote:
> > 
> >> On Wed, 18 Apr 2012, John Stultz wrote:
> >>> On 04/18/2012 04:59 PM, Prarit Bhargava wrote:
> >>>>
> 
> >> No. The show_state() part prints into the buffer. But it's not
> >> guaranteed that the buffer is flushed right away. It could be flushed
> >> later as well in a different context. And of course the flush code
> >> runs with interrupts disabled and dumping out a gazillion of lines
> >> over serial will cause the same hickup. Just planting random
> >> touch_watchdog() calls into the code is not the right approach,
> >> really.
> >>
> >> We should think about the reasons why we have interrupts disabled for
> >> so much time. Is that really, really necessary ?
> 
> In the case of the sysrq-t, I would argue that it is.  The whole point behind
> the sysrq-t is that we're capturing the *current* state of the system.  Having
> that output effected by interrupts seems like a bad idea.

Nonsense. The system state can change, while we are dumping it unless
you run on an UP system. irq disable only affects the current CPU
nothing else, and nothing can prevent the other cpus to wake up, run,
exit, fork .....

Thanks,

	tglx



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns
  2012-04-19 12:52                 ` Thomas Gleixner
  2012-04-19 13:06                   ` Prarit Bhargava
@ 2012-04-19 18:12                   ` John Stultz
  2012-04-25 12:29                     ` Prarit Bhargava
  1 sibling, 1 reply; 22+ messages in thread
From: John Stultz @ 2012-04-19 18:12 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Prarit Bhargava, linux-kernel, Salman Qazi, stable

On 04/19/2012 05:52 AM, Thomas Gleixner wrote:
> On Thu, 19 Apr 2012, Thomas Gleixner wrote:
>
>> We should think about the reasons why we have interrupts disabled for
>> so much time. Is that really, really necessary ?
> I'm not against making the clocksource code more robust, but I don't
> want to add crap there just to cope with complete madness elsewhere.
>

Very much agreed.
-john


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns
  2012-04-19 18:12                   ` John Stultz
@ 2012-04-25 12:29                     ` Prarit Bhargava
  0 siblings, 0 replies; 22+ messages in thread
From: Prarit Bhargava @ 2012-04-25 12:29 UTC (permalink / raw)
  To: John Stultz; +Cc: Thomas Gleixner, linux-kernel, Salman Qazi, stable



On 04/19/2012 02:12 PM, John Stultz wrote:
> On 04/19/2012 05:52 AM, Thomas Gleixner wrote:
>> On Thu, 19 Apr 2012, Thomas Gleixner wrote:
>>
>>> We should think about the reasons why we have interrupts disabled for
>>> so much time. Is that really, really necessary ?
>> I'm not against making the clocksource code more robust, but I don't
>> want to add crap there just to cope with complete madness elsewhere.
>>
> 
> Very much agreed.

Hi John and Thomas,

After much analysis I have good news to report.  The good news is that the
problem with the random tsc failures was chased down to a script left running in
which sysrq-t's were executed over ping packets.  This, as I've previously
pointed out, can cause the tsc to be erroneously marked unstable.

[Aside: I hit myself with a big cluebat when I realized that all the failures
were occurring at the same wall-clock time, 3:00AM.  That couldn't be a
coincidence.]

I'm working with lwoodman to figure out a way to get rid of the locking (as
suggested by you Thomas) around the sysrq code.

P.

> -john
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2012-04-25 12:30 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-04 15:11 [PATCH] clocksource, prevent overflow in clocksource_cyc2ns Prarit Bhargava
2012-04-04 18:00 ` John Stultz
2012-04-04 18:33   ` Prarit Bhargava
2012-04-05  1:08     ` John Stultz
2012-04-05 11:00       ` Prarit Bhargava
2012-04-05 16:23         ` John Stultz
2012-04-05 12:27       ` Prarit Bhargava
2012-04-05 16:45         ` John Stultz
2012-04-06 23:29         ` Thomas Gleixner
2012-04-07 13:47           ` Prarit Bhargava
2012-04-18 23:20         ` John Stultz
2012-04-18 23:59           ` Prarit Bhargava
2012-04-19  0:18             ` John Stultz
2012-04-19 11:56               ` Prarit Bhargava
2012-04-19 12:50               ` Thomas Gleixner
2012-04-19 12:52                 ` Thomas Gleixner
2012-04-19 13:06                   ` Prarit Bhargava
2012-04-19 13:18                     ` Thomas Gleixner
2012-04-19 18:12                   ` John Stultz
2012-04-25 12:29                     ` Prarit Bhargava
2012-04-19 12:37             ` Thomas Gleixner
2012-04-19 12:51               ` Thomas Gleixner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).