All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] hangcheck-timer is broken on x86
@ 2010-03-24  3:36 Yury Polyanskiy
  2010-03-26 21:24 ` Andrew Morton
  2010-03-26 21:46 ` Joel Becker
  0 siblings, 2 replies; 21+ messages in thread
From: Yury Polyanskiy @ 2010-03-24  3:36 UTC (permalink / raw)
  To: Joel Becker; +Cc: linux-kernel, Andrew Morton, john stultz, Jan Glauber

[-- Attachment #1: Type: text/plain, Size: 2529 bytes --]

Dear Joel,

The drivers/char/hangcheck-timer.c is doubly broken. First, the
following line overflows unsigned long:
# define TIMER_FREQ (HZ*loops_per_jiffy)

Second, and more importantly, loops_per_jiffy has little to do with the conversion from the
the time scale of get_cycles() (aka rdtsc) to the time scale of jiffies.

The attached patch resolves both of the problems.

Best,
Yury

PS. I have CC-ed a few people who touched hangcheck-timer.c in 2006. Please CC me for follow-ups.

diff --git a/drivers/char/hangcheck-timer.c b/drivers/char/hangcheck-timer.c
index 712d9f2..c57a508 100644
--- a/drivers/char/hangcheck-timer.c
+++ b/drivers/char/hangcheck-timer.c
@@ -49,8 +49,9 @@
 #include <asm/uaccess.h>
 #include <linux/sysrq.h>
 #include <linux/timer.h>
+#include <linux/time.h>
 
-#define VERSION_STR "0.9.0"
+#define VERSION_STR "0.9.1"
 
 #define DEFAULT_IOFENCE_MARGIN 60	/* Default fudge factor, in seconds */
 #define DEFAULT_IOFENCE_TICK 180	/* Default timer timeout, in seconds */
@@ -119,10 +120,8 @@ __setup("hcheck_dump_tasks", hangcheck_parse_dump_tasks);
 #if defined(CONFIG_S390)
 # define HAVE_MONOTONIC
 # define TIMER_FREQ 1000000000ULL
-#elif defined(CONFIG_IA64)
-# define TIMER_FREQ ((unsigned long long)local_cpu_data->itc_freq)
 #else
-# define TIMER_FREQ (HZ*loops_per_jiffy)
+# define TIMER_FREQ 1000000000ULL
 #endif
 
 #ifdef HAVE_MONOTONIC
@@ -130,7 +129,9 @@ extern unsigned long long monotonic_clock(void);
 #else
 static inline unsigned long long monotonic_clock(void)
 {
-	return get_cycles();
+	struct timespec ts;
+	getrawmonotonic(&ts);
+	return timespec_to_ns(&ts);
 }
 #endif  /* HAVE_MONOTONIC */
 
@@ -168,6 +169,13 @@ static void hangcheck_fire(unsigned long data)
 			printk(KERN_CRIT "Hangcheck: hangcheck value past margin!\n");
 		}
 	}
+#if 0
+	/* 
+	 * Enable to investigate delays in detail
+	 */
+	printk("Hangcheck: called %Ld ns since last time (%Ld ns overshoot)\n", 
+			tsc_diff, tsc_diff - hangcheck_tick*TIMER_FREQ);
+#endif
 	mod_timer(&hangcheck_ticktock, jiffies + (hangcheck_tick*HZ));
 	hangcheck_tsc = monotonic_clock();
 }
@@ -180,7 +188,7 @@ static int __init hangcheck_init(void)
 #if defined (HAVE_MONOTONIC)
 	printk("Hangcheck: Using monotonic_clock().\n");
 #else
-	printk("Hangcheck: Using get_cycles().\n");
+	printk("Hangcheck: Using getrawmonotonic().\n");
 #endif  /* HAVE_MONOTONIC */
 	hangcheck_tsc_margin =
 		(unsigned long long)(hangcheck_margin + hangcheck_tick);

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] hangcheck-timer is broken on x86
  2010-03-24  3:36 [PATCH] hangcheck-timer is broken on x86 Yury Polyanskiy
@ 2010-03-26 21:24 ` Andrew Morton
  2010-03-26 21:52   ` Yury Polyanskiy
  2010-03-26 21:46 ` Joel Becker
  1 sibling, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2010-03-26 21:24 UTC (permalink / raw)
  To: Yury Polyanskiy; +Cc: Joel Becker, linux-kernel, john stultz, Jan Glauber

On Tue, 23 Mar 2010 23:36:11 -0400
Yury Polyanskiy <ypolyans@princeton.edu> wrote:

> The drivers/char/hangcheck-timer.c is doubly broken. First, the
> following line overflows unsigned long:
> # define TIMER_FREQ (HZ*loops_per_jiffy)
> 
> Second, and more importantly, loops_per_jiffy has little to do with the conversion from the
> the time scale of get_cycles() (aka rdtsc) to the time scale of jiffies.

It's a bit odd to have a driver be this broken on x86_32 for five years
without anyone noticing.  What are the user-visible effects of these
shortcomings?

Also, please do send us a Signed-off-by: for this patch, as explained
in Documentation/SubmittingPatches, thanks.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] hangcheck-timer is broken on x86
  2010-03-24  3:36 [PATCH] hangcheck-timer is broken on x86 Yury Polyanskiy
  2010-03-26 21:24 ` Andrew Morton
@ 2010-03-26 21:46 ` Joel Becker
  2010-03-26 22:00   ` Yury Polyanskiy
  2010-03-29  1:00   ` john stultz
  1 sibling, 2 replies; 21+ messages in thread
From: Joel Becker @ 2010-03-26 21:46 UTC (permalink / raw)
  To: Yury Polyanskiy; +Cc: linux-kernel, Andrew Morton, john stultz, Jan Glauber

On Tue, Mar 23, 2010 at 11:36:11PM -0400, Yury Polyanskiy wrote:
> Second, and more importantly, loops_per_jiffy has little to do with the conversion from the
> the time scale of get_cycles() (aka rdtsc) to the time scale of jiffies.

	It used to!  Fundamentally, of course, we didn't have a
monotonic clock everywhere that satisfied hangcheck-timer's needs.  So
we had to use different approaches on different architectures.

> @@ -130,7 +129,9 @@ extern unsigned long long monotonic_clock(void);
>  #else
>  static inline unsigned long long monotonic_clock(void)
>  {
> -	return get_cycles();
> +	struct timespec ts;
> +	getrawmonotonic(&ts);
> +	return timespec_to_ns(&ts);
>  }
>  #endif  /* HAVE_MONOTONIC */

	I have two questions:

1) Does getrawmonotonic() satisfy hangcheck-timer?  What I mean is, will
it always return the wallclock nanoseconds even in the face of CPU speed
changes, suspend, udelay, or any other suspension of kernel operation?
Yes, I know this is a tougher standard than rdtsc(), but this is what
hangcheck-timer wants.  rdtsc() at least satisfied udelay and PCI hangs.

2) If it does satisfy, why not use it for all hangcheck usage instead of
any ifdefs?

Joel

-- 

"The cynics are right nine times out of ten."  
        - H. L. Mencken

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] hangcheck-timer is broken on x86
  2010-03-26 21:24 ` Andrew Morton
@ 2010-03-26 21:52   ` Yury Polyanskiy
  0 siblings, 0 replies; 21+ messages in thread
From: Yury Polyanskiy @ 2010-03-26 21:52 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Joel Becker, linux-kernel, john stultz, Jan Glauber

[-- Attachment #1: Type: text/plain, Size: 1203 bytes --]

On Fri, 26 Mar 2010 14:24:23 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Tue, 23 Mar 2010 23:36:11 -0400
> Yury Polyanskiy <ypolyans@princeton.edu> wrote:
> 
> > The drivers/char/hangcheck-timer.c is doubly broken. First, the
> > following line overflows unsigned long:
> > # define TIMER_FREQ (HZ*loops_per_jiffy)
> > 
> > Second, and more importantly, loops_per_jiffy has little to do with the conversion from the
> > the time scale of get_cycles() (aka rdtsc) to the time scale of jiffies.
> 
> It's a bit odd to have a driver be this broken on x86_32 for five years
> without anyone noticing.  What are the user-visible effects of these
> shortcomings?

When the overflown value of TIMER_FREQ is abnormally low, it spams the
syslog with KERN_CRIT messages "Hangcheck: hangcheck value past margin!"
But whether it happens or not depends on HZ and lpj in a complex way.
People have hit it occasionally as far as google search can tell. 

> 
> Also, please do send us a Signed-off-by: for this patch, as explained
> in Documentation/SubmittingPatches, thanks.
> 

Sorry.

Signed-off-by: Yury Polyanskiy <polyanskiy@gmail.com>

Thank you Andrew!

Yury

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] hangcheck-timer is broken on x86
  2010-03-26 21:46 ` Joel Becker
@ 2010-03-26 22:00   ` Yury Polyanskiy
  2010-03-27  0:57     ` Joel Becker
  2010-03-29  1:00   ` john stultz
  1 sibling, 1 reply; 21+ messages in thread
From: Yury Polyanskiy @ 2010-03-26 22:00 UTC (permalink / raw)
  To: Joel Becker; +Cc: linux-kernel, Andrew Morton, john stultz, Jan Glauber

[-- Attachment #1: Type: text/plain, Size: 782 bytes --]

On Fri, 26 Mar 2010 14:46:49 -0700
Joel Becker <Joel.Becker@oracle.com> wrote:

> On Tue, Mar 23, 2010 at 11:36:11PM -0400, Yury Polyanskiy wrote:

> 1) Does getrawmonotonic() satisfy hangcheck-timer?  What I mean is, will
> it always return the wallclock nanoseconds even in the face of CPU speed
> changes, suspend, udelay, or any other suspension of kernel operation?
> Yes, I know this is a tougher standard than rdtsc(), but this is what
> hangcheck-timer wants.  rdtsc() at least satisfied udelay and PCI hangs.

Yes, as far as I can tell. Note that rdtsc is hosed on suspend-resume.

> 
> 2) If it does satisfy, why not use it for all hangcheck usage instead of
> any ifdefs?

On my part, I didn't want to touch the S390 code since I can't test it.

Yury

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] hangcheck-timer is broken on x86
  2010-03-26 22:00   ` Yury Polyanskiy
@ 2010-03-27  0:57     ` Joel Becker
  2010-03-27  2:02       ` Yury Polyanskiy
  0 siblings, 1 reply; 21+ messages in thread
From: Joel Becker @ 2010-03-27  0:57 UTC (permalink / raw)
  To: Yury Polyanskiy; +Cc: linux-kernel, Andrew Morton, john stultz, Jan Glauber

On Fri, Mar 26, 2010 at 06:00:25PM -0400, Yury Polyanskiy wrote:
> On Fri, 26 Mar 2010 14:46:49 -0700
> Joel Becker <Joel.Becker@oracle.com> wrote:
> 
> > On Tue, Mar 23, 2010 at 11:36:11PM -0400, Yury Polyanskiy wrote:
> 
> > 1) Does getrawmonotonic() satisfy hangcheck-timer?  What I mean is, will
> > it always return the wallclock nanoseconds even in the face of CPU speed
> > changes, suspend, udelay, or any other suspension of kernel operation?
> > Yes, I know this is a tougher standard than rdtsc(), but this is what
> > hangcheck-timer wants.  rdtsc() at least satisfied udelay and PCI hangs.
> 
> Yes, as far as I can tell. Note that rdtsc is hosed on suspend-resume.

	Yeah, I know.  rdtsc hangcheck-timer really required no suspend
or cpufreq.  Since it is only really used by servers, this wasn't a
terrible restriction.  Then virtualization came along...

> > 2) If it does satisfy, why not use it for all hangcheck usage instead of
> > any ifdefs?
> 
> On my part, I didn't want to touch the S390 code since I can't test it.

	Makes sense.  Andrew, would you mind pushing this through your
tree?

Acked-by: Joel Becker <joel.becker@oracle.com>

Joel

-- 

Dort wo man Bücher verbrennt, verbrennt man am Ende auch Mensch.
(Wherever they burn books, they will also end up burning people.)
	- Heinrich Heine

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] hangcheck-timer is broken on x86
  2010-03-27  0:57     ` Joel Becker
@ 2010-03-27  2:02       ` Yury Polyanskiy
  2010-03-27 22:03         ` Joel Becker
  0 siblings, 1 reply; 21+ messages in thread
From: Yury Polyanskiy @ 2010-03-27  2:02 UTC (permalink / raw)
  To: Joel Becker; +Cc: linux-kernel, Andrew Morton, john stultz, Jan Glauber

[-- Attachment #1: Type: text/plain, Size: 1544 bytes --]

On Fri, 26 Mar 2010 17:57:11 -0700
Joel Becker <Joel.Becker@oracle.com> wrote:

> On Fri, Mar 26, 2010 at 06:00:25PM -0400, Yury Polyanskiy wrote:
> > On Fri, 26 Mar 2010 14:46:49 -0700
> > Joel Becker <Joel.Becker@oracle.com> wrote:
> > 
> > > On Tue, Mar 23, 2010 at 11:36:11PM -0400, Yury Polyanskiy wrote:
> > 
> > > 1) Does getrawmonotonic() satisfy hangcheck-timer?  What I mean is, will
> > > it always return the wallclock nanoseconds even in the face of CPU speed
> > > changes, suspend, udelay, or any other suspension of kernel operation?
> > > Yes, I know this is a tougher standard than rdtsc(), but this is what
> > > hangcheck-timer wants.  rdtsc() at least satisfied udelay and PCI hangs.
> > 
> > Yes, as far as I can tell. Note that rdtsc is hosed on suspend-resume.
> 
> 	Yeah, I know.  rdtsc hangcheck-timer really required no suspend
> or cpufreq.  Since it is only really used by servers, this wasn't a
> terrible restriction.  Then virtualization came along...

Joel, just realized there is a slight mistake in what I said before.
getrawmonotonic() is a refined jiffies (and actually resolves to
get_cycles() on my system in the end). Thus it doesn't count while in
suspend. However, jiffies-based timers (aka timer-wheel) are also
stopped while in suspend. So getrawmonotonic() is the right call to
check the precision of the jiffies-based timer (i.e. you dont need to
make a correction by calling monotonic_to_bootbased()). 

So my comment about rdtsc and suspend-resume is wrong.

Yury

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] hangcheck-timer is broken on x86
  2010-03-27  2:02       ` Yury Polyanskiy
@ 2010-03-27 22:03         ` Joel Becker
  2010-03-27 22:51           ` Yury Polyanskiy
  0 siblings, 1 reply; 21+ messages in thread
From: Joel Becker @ 2010-03-27 22:03 UTC (permalink / raw)
  To: Yury Polyanskiy; +Cc: linux-kernel, Andrew Morton, john stultz, Jan Glauber

On Fri, Mar 26, 2010 at 10:02:59PM -0400, Yury Polyanskiy wrote:
> Joel, just realized there is a slight mistake in what I said before.
> getrawmonotonic() is a refined jiffies (and actually resolves to
> get_cycles() on my system in the end). Thus it doesn't count while in
> suspend. However, jiffies-based timers (aka timer-wheel) are also
> stopped while in suspend. So getrawmonotonic() is the right call to
> check the precision of the jiffies-based timer (i.e. you dont need to
> make a correction by calling monotonic_to_bootbased()). 

	It's OK to tell hangcheck-timer users that suspend is not
allowed.  After all, you're running something that you don't want to see
hang.
	Is there a clock in the system that is a true wallclock?  I'm
guessing, since getrawmonotonic() is get_cycles() based, that it doesn't
provide accurate time in the face of cpufreq changes.  Is that true?

Joel



-- 

Life's Little Instruction Book #497

	"Go down swinging."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] hangcheck-timer is broken on x86
  2010-03-27 22:03         ` Joel Becker
@ 2010-03-27 22:51           ` Yury Polyanskiy
  2010-03-27 23:36             ` Joel Becker
  0 siblings, 1 reply; 21+ messages in thread
From: Yury Polyanskiy @ 2010-03-27 22:51 UTC (permalink / raw)
  To: Yury Polyanskiy, linux-kernel, Andrew Morton, john stultz, Jan Glauber

On Sat, Mar 27, 2010 at 6:03 PM, Joel Becker <Joel.Becker@oracle.com> wrote:
> On Fri, Mar 26, 2010 at 10:02:59PM -0400, Yury Polyanskiy wrote:
>> Joel, just realized there is a slight mistake in what I said before.
>> getrawmonotonic() is a refined jiffies (and actually resolves to
>> get_cycles() on my system in the end). Thus it doesn't count while in
>> suspend. However, jiffies-based timers (aka timer-wheel) are also
>> stopped while in suspend. So getrawmonotonic() is the right call to
>> check the precision of the jiffies-based timer (i.e. you dont need to
>> make a correction by calling monotonic_to_bootbased()).
>
>        It's OK to tell hangcheck-timer users that suspend is not
> allowed.  After all, you're running something that you don't want to see
> hang.

Joel, what I am saying is exactly the opposite: it is totally ok to
suspend-resume with hangcheck-timer (jiffies are stopped and so is
getrawmonotonic() when system suspended).

>        Is there a clock in the system that is a true wallclock?  I'm
> guessing, since getrawmonotonic() is get_cycles() based, that it doesn't
> provide accurate time in the face of cpufreq changes.  Is that true?

Of course, getrawmonotonic accounts for cpufreq changes (see
arch/x86/kernel/tsc.c:time_cpufreq_notifier()).


Yury

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] hangcheck-timer is broken on x86
  2010-03-27 22:51           ` Yury Polyanskiy
@ 2010-03-27 23:36             ` Joel Becker
  2010-03-28  2:08               ` Yury Polyanskiy
  0 siblings, 1 reply; 21+ messages in thread
From: Joel Becker @ 2010-03-27 23:36 UTC (permalink / raw)
  To: Yury Polyanskiy; +Cc: linux-kernel, Andrew Morton, john stultz, Jan Glauber

On Sat, Mar 27, 2010 at 06:51:01PM -0400, Yury Polyanskiy wrote:
> >        It's OK to tell hangcheck-timer users that suspend is not
> > allowed.  After all, you're running something that you don't want to see
> > hang.
> 
> Joel, what I am saying is exactly the opposite: it is totally ok to
> suspend-resume with hangcheck-timer (jiffies are stopped and so is
> getrawmonotonic() when system suspended).

	Nope.  The point of hangcheck-timer is that it reboots should
the system not be running for a certain amountof time.  If
suspend-resume is allowed, a system can resume after days and think it
wasn't more than a second.  hangcheck-timer will not know to reboot.

> >        Is there a clock in the system that is a true wallclock?  I'm
> > guessing, since getrawmonotonic() is get_cycles() based, that it doesn't
> > provide accurate time in the face of cpufreq changes.  Is that true?
> 
> Of course, getrawmonotonic accounts for cpufreq changes (see
> arch/x86/kernel/tsc.c:time_cpufreq_notifier()).

	Excellent!  That's a definite improvement over raw get_cycles().

Joel

-- 

Life's Little Instruction Book #182

	"Be romantic."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] hangcheck-timer is broken on x86
  2010-03-27 23:36             ` Joel Becker
@ 2010-03-28  2:08               ` Yury Polyanskiy
  0 siblings, 0 replies; 21+ messages in thread
From: Yury Polyanskiy @ 2010-03-28  2:08 UTC (permalink / raw)
  To: Joel Becker; +Cc: linux-kernel, Andrew Morton, john stultz, Jan Glauber

[-- Attachment #1: Type: text/plain, Size: 999 bytes --]

On Sat, 27 Mar 2010 16:36:30 -0700
Joel Becker <Joel.Becker@oracle.com> wrote:

> On Sat, Mar 27, 2010 at 06:51:01PM -0400, Yury Polyanskiy wrote:
> > >        It's OK to tell hangcheck-timer users that suspend is not
> > > allowed.  After all, you're running something that you don't want to see
> > > hang.
> > 
> > Joel, what I am saying is exactly the opposite: it is totally ok to
> > suspend-resume with hangcheck-timer (jiffies are stopped and so is
> > getrawmonotonic() when system suspended).
> 
> 	Nope.  The point of hangcheck-timer is that it reboots should
> the system not be running for a certain amountof time.  If
> suspend-resume is allowed, a system can resume after days and think it
> wasn't more than a second.  hangcheck-timer will not know to reboot.

But what is the reason for rebooting? Hangcheck is supposed to reboot
the machine only if the timer handler was run too late. However,
jiffies-based timers DO NOT count time spent in suspend.

Y

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] hangcheck-timer is broken on x86
  2010-03-26 21:46 ` Joel Becker
  2010-03-26 22:00   ` Yury Polyanskiy
@ 2010-03-29  1:00   ` john stultz
  2010-03-29 14:11     ` Yury Polyanskiy
  1 sibling, 1 reply; 21+ messages in thread
From: john stultz @ 2010-03-29  1:00 UTC (permalink / raw)
  To: Joel Becker; +Cc: Yury Polyanskiy, linux-kernel, Andrew Morton, Jan Glauber

On Fri, 2010-03-26 at 14:46 -0700, Joel Becker wrote:
> On Tue, Mar 23, 2010 at 11:36:11PM -0400, Yury Polyanskiy wrote:
> > Second, and more importantly, loops_per_jiffy has little to do with the conversion from the
> > the time scale of get_cycles() (aka rdtsc) to the time scale of jiffies.
> 
> 	It used to!  Fundamentally, of course, we didn't have a
> monotonic clock everywhere that satisfied hangcheck-timer's needs.  So
> we had to use different approaches on different architectures.
> 
> > @@ -130,7 +129,9 @@ extern unsigned long long monotonic_clock(void);
> >  #else
> >  static inline unsigned long long monotonic_clock(void)
> >  {
> > -	return get_cycles();
> > +	struct timespec ts;
> > +	getrawmonotonic(&ts);
> > +	return timespec_to_ns(&ts);
> >  }
> >  #endif  /* HAVE_MONOTONIC */
> 
> 	I have two questions:
> 
> 1) Does getrawmonotonic() satisfy hangcheck-timer?  What I mean is, will
> it always return the wallclock nanoseconds even in the face of CPU speed
> changes, suspend, udelay, or any other suspension of kernel operation?
> Yes, I know this is a tougher standard than rdtsc(), but this is what
> hangcheck-timer wants.  rdtsc() at least satisfied udelay and PCI hangs.

getrawmonotonic() can be stalled and will wrap on some hardware (acpi pm
timer wraps every 5 seconds).

I think the read_persistent_clock() is really what you want to use here.
On arches that do not support it, it returns 0 every time. 

thanks
-john




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] hangcheck-timer is broken on x86
  2010-03-29  1:00   ` john stultz
@ 2010-03-29 14:11     ` Yury Polyanskiy
  2010-03-29 16:43       ` john stultz
  0 siblings, 1 reply; 21+ messages in thread
From: Yury Polyanskiy @ 2010-03-29 14:11 UTC (permalink / raw)
  To: john stultz; +Cc: Joel Becker, linux-kernel, Andrew Morton, Jan Glauber

[-- Attachment #1: Type: text/plain, Size: 930 bytes --]

On Sun, 28 Mar 2010 18:00:36 -0700
john stultz <johnstul@us.ibm.com> wrote:

> > 1) Does getrawmonotonic() satisfy hangcheck-timer?  What I mean is, will
> > it always return the wallclock nanoseconds even in the face of CPU speed
> > changes, suspend, udelay, or any other suspension of kernel operation?
> > Yes, I know this is a tougher standard than rdtsc(), but this is what
> > hangcheck-timer wants.  rdtsc() at least satisfied udelay and PCI hangs.
> 
> getrawmonotonic() can be stalled and will wrap on some hardware (acpi pm
> timer wraps every 5 seconds).
> 

I am not sure which archs do you mean. But in any case,
getrawmonotonic() is not just a wrap around a call to rdtsc() (or acpi
pm timer read). It is based on the clock->raw_time, which is updated
every timer interrupt by the update_wall_time(). So even if underlying
timer wraps, it doesn't lead to getrawmonotonic() returning 0 sec.


Y

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] hangcheck-timer is broken on x86
  2010-03-29 14:11     ` Yury Polyanskiy
@ 2010-03-29 16:43       ` john stultz
  2010-03-29 17:04         ` Yury Polyanskiy
  0 siblings, 1 reply; 21+ messages in thread
From: john stultz @ 2010-03-29 16:43 UTC (permalink / raw)
  To: Yury Polyanskiy; +Cc: Joel Becker, linux-kernel, Andrew Morton, Jan Glauber

On Mon, 2010-03-29 at 10:11 -0400, Yury Polyanskiy wrote:
> On Sun, 28 Mar 2010 18:00:36 -0700
> john stultz <johnstul@us.ibm.com> wrote:
> 
> > > 1) Does getrawmonotonic() satisfy hangcheck-timer?  What I mean is, will
> > > it always return the wallclock nanoseconds even in the face of CPU speed
> > > changes, suspend, udelay, or any other suspension of kernel operation?
> > > Yes, I know this is a tougher standard than rdtsc(), but this is what
> > > hangcheck-timer wants.  rdtsc() at least satisfied udelay and PCI hangs.
> > 
> > getrawmonotonic() can be stalled and will wrap on some hardware (acpi pm
> > timer wraps every 5 seconds).
> > 
> 
> I am not sure which archs do you mean. But in any case,
> getrawmonotonic() is not just a wrap around a call to rdtsc() (or acpi
> pm timer read). It is based on the clock->raw_time, which is updated
> every timer interrupt by the update_wall_time(). So even if underlying
> timer wraps, it doesn't lead to getrawmonotonic() returning 0 sec.

What I'm saying is that if you're using getrawmonotonic() to detect
hangs, you might miss them, as getrawmonotonic may wrap (and thus stop
continually increasing) if the timer interrupt is delayed. This does not
apply to systems using the TSC clocksource, but does apply to systems
using the acpi_pm. 

read_persistent_clock() is likely a better interface to use, because it
returns the seconds usually from the CMOS clock which runs continually
without any OS maintenance. 

The only complication there would likely be hwclock syncing (either by
hand or via NTP), but that could be handled by a
touch_hangcheck_watchdog() style notifier. 

thanks
-john



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] hangcheck-timer is broken on x86
  2010-03-29 16:43       ` john stultz
@ 2010-03-29 17:04         ` Yury Polyanskiy
  2010-03-29 18:44           ` john stultz
  0 siblings, 1 reply; 21+ messages in thread
From: Yury Polyanskiy @ 2010-03-29 17:04 UTC (permalink / raw)
  To: john stultz; +Cc: Joel Becker, linux-kernel, Andrew Morton, Jan Glauber

[-- Attachment #1: Type: text/plain, Size: 941 bytes --]

On Mon, 29 Mar 2010 09:43:27 -0700
john stultz <johnstul@us.ibm.com> wrote:

> > I am not sure which archs do you mean. But in any case,
> > getrawmonotonic() is not just a wrap around a call to rdtsc() (or acpi
> > pm timer read). It is based on the clock->raw_time, which is updated
> > every timer interrupt by the update_wall_time(). So even if underlying
> > timer wraps, it doesn't lead to getrawmonotonic() returning 0 sec.  
> 
> What I'm saying is that if you're using getrawmonotonic() to detect
> hangs, you might miss them, as getrawmonotonic may wrap (and thus stop
> continually increasing) if the timer interrupt is delayed. This does not
> apply to systems using the TSC clocksource, but does apply to systems
> using the acpi_pm. 

But if timer interrupt is delayed by more than acpi_pm wrap-around
time, then the update_wall_time() is also screwed. Since it is not, we
can rely on getrawmonotonic().

Y

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] hangcheck-timer is broken on x86
  2010-03-29 17:04         ` Yury Polyanskiy
@ 2010-03-29 18:44           ` john stultz
  2010-03-29 19:53             ` Joel Becker
  2010-03-29 21:08             ` Yury Polyanskiy
  0 siblings, 2 replies; 21+ messages in thread
From: john stultz @ 2010-03-29 18:44 UTC (permalink / raw)
  To: Yury Polyanskiy; +Cc: Joel Becker, linux-kernel, Andrew Morton, Jan Glauber

On Mon, 2010-03-29 at 13:04 -0400, Yury Polyanskiy wrote:
> On Mon, 29 Mar 2010 09:43:27 -0700
> john stultz <johnstul@us.ibm.com> wrote:
> 
> > > I am not sure which archs do you mean. But in any case,
> > > getrawmonotonic() is not just a wrap around a call to rdtsc() (or acpi
> > > pm timer read). It is based on the clock->raw_time, which is updated
> > > every timer interrupt by the update_wall_time(). So even if underlying
> > > timer wraps, it doesn't lead to getrawmonotonic() returning 0 sec.  
> > 
> > What I'm saying is that if you're using getrawmonotonic() to detect
> > hangs, you might miss them, as getrawmonotonic may wrap (and thus stop
> > continually increasing) if the timer interrupt is delayed. This does not
> > apply to systems using the TSC clocksource, but does apply to systems
> > using the acpi_pm. 
> 
> But if timer interrupt is delayed by more than acpi_pm wrap-around
> time, then the update_wall_time() is also screwed. Since it is not, we
> can rely on getrawmonotonic().

Right, if the box hangs for longer then the clocksource can count for,
the timekeeping subsystem will be off by some multiple of that length.

And That's exactly why I'm advising against using
gettimeofday/getrawmonotonic or any other software managed sense of time
for the hangcheck timer, as you won't be able to correctly detect hangs.

I'm also suggesting using something like read_persistent_clock() is
better, because there is no OS/software management involved (other then
the minor syncing issue I mentioned before) so if the system hangs for a
long period of time, then returns, you'll still be able to detect the
hang.

But maybe what folks are using the hangcheck timer for is shifting, so
its possible that I'm not quite understanding what you're trying to do
here.

thanks
-john


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] hangcheck-timer is broken on x86
  2010-03-29 18:44           ` john stultz
@ 2010-03-29 19:53             ` Joel Becker
  2010-03-29 21:08             ` Yury Polyanskiy
  1 sibling, 0 replies; 21+ messages in thread
From: Joel Becker @ 2010-03-29 19:53 UTC (permalink / raw)
  To: john stultz; +Cc: Yury Polyanskiy, linux-kernel, Andrew Morton, Jan Glauber

On Mon, Mar 29, 2010 at 11:44:51AM -0700, john stultz wrote:
> > But if timer interrupt is delayed by more than acpi_pm wrap-around
> > time, then the update_wall_time() is also screwed. Since it is not, we
> > can rely on getrawmonotonic().
> 
> Right, if the box hangs for longer then the clocksource can count for,
> the timekeeping subsystem will be off by some multiple of that length.
> 
> And That's exactly why I'm advising against using
> gettimeofday/getrawmonotonic or any other software managed sense of time
> for the hangcheck timer, as you won't be able to correctly detect hangs.
> 
> I'm also suggesting using something like read_persistent_clock() is
> better, because there is no OS/software management involved (other then
> the minor syncing issue I mentioned before) so if the system hangs for a
> long period of time, then returns, you'll still be able to detect the
> hang.
> 
> But maybe what folks are using the hangcheck timer for is shifting, so
> its possible that I'm not quite understanding what you're trying to do
> here.

	The people who use hangcheck-timer for the reasons I originally
wrote it absolutely want any hang, including long ones, detected.

Joel

-- 

"For every complex problem there exists a solution that is brief,
     concise, and totally wrong."
                                        -Unknown

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] hangcheck-timer is broken on x86
  2010-03-29 18:44           ` john stultz
  2010-03-29 19:53             ` Joel Becker
@ 2010-03-29 21:08             ` Yury Polyanskiy
  2010-03-29 21:43               ` john stultz
  1 sibling, 1 reply; 21+ messages in thread
From: Yury Polyanskiy @ 2010-03-29 21:08 UTC (permalink / raw)
  To: john stultz; +Cc: Joel Becker, linux-kernel, Andrew Morton, Jan Glauber

>> > What I'm saying is that if you're using getrawmonotonic() to detect
>> > hangs, you might miss them, as getrawmonotonic may wrap (and thus stop
>> > continually increasing) if the timer interrupt is delayed. This does not
>> > apply to systems using the TSC clocksource, but does apply to systems
>> > using the acpi_pm.
>>
>> But if timer interrupt is delayed by more than acpi_pm wrap-around
>> time, then the update_wall_time() is also screwed. Since it is not, we
>> can rely on getrawmonotonic().
>
> Right, if the box hangs for longer then the clocksource can count for,
> the timekeeping subsystem will be off by some multiple of that length.
>

Oh, I see. You mean that getrawmonotonic() wouldn't work under
abnormal conditions. I understand now, sorry for the confusion. You
are correct, of course.

I personally don't like the idea of relying on read_persistent_clock()
not only because of hwclock and ntp. In fact, my core interest in
hangcheck-timer is to set a very low margin (1 to 3 jiffies for
example) so that I would get a log message upon any kernel slow down
or a tick-miss (as a hardware integrity check). I don't think
read_persistent_clock() is precise enough for this purpose, is it?

Also, hooking to ntp update code complicates an otherwise simple
driver. I propose to simply check on non-S390 if the clock source
resolves to something other than TSC and dump a warning message on
driver load (something like "Hangcheck: kernel using clocksource %s,
which is not reliable for hang detection").

What do you think about it?

Thanks,
Yury

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] hangcheck-timer is broken on x86
  2010-03-29 21:08             ` Yury Polyanskiy
@ 2010-03-29 21:43               ` john stultz
  2010-03-29 22:34                 ` Yury Polyanskiy
  0 siblings, 1 reply; 21+ messages in thread
From: john stultz @ 2010-03-29 21:43 UTC (permalink / raw)
  To: Yury Polyanskiy; +Cc: Joel Becker, linux-kernel, Andrew Morton, Jan Glauber

On Mon, 2010-03-29 at 17:08 -0400, Yury Polyanskiy wrote:
> >> > What I'm saying is that if you're using getrawmonotonic() to detect
> >> > hangs, you might miss them, as getrawmonotonic may wrap (and thus stop
> >> > continually increasing) if the timer interrupt is delayed. This does not
> >> > apply to systems using the TSC clocksource, but does apply to systems
> >> > using the acpi_pm.
> >>
> >> But if timer interrupt is delayed by more than acpi_pm wrap-around
> >> time, then the update_wall_time() is also screwed. Since it is not, we
> >> can rely on getrawmonotonic().
> >
> > Right, if the box hangs for longer then the clocksource can count for,
> > the timekeeping subsystem will be off by some multiple of that length.
> >
> 
> Oh, I see. You mean that getrawmonotonic() wouldn't work under
> abnormal conditions. I understand now, sorry for the confusion. You
> are correct, of course.

And something else I thought of, while the TSC won't wrap, the
multiplication done to convert to nanoseconds will overflow when you hit
a large enough cycle delta. So even TSC systems are not guaranteed to
have timekeeping (and thus getrawmonotonic) work over infinite time
without accumulation.

We try to establish this length via timekeeping_max_deferment(), so that
we make sure we don't go into tickless mode for longer then the
clocksource can handle.


> I personally don't like the idea of relying on read_persistent_clock()
> not only because of hwclock and ntp. In fact, my core interest in
> hangcheck-timer is to set a very low margin (1 to 3 jiffies for
> example) so that I would get a log message upon any kernel slow down
> or a tick-miss (as a hardware integrity check). I don't think
> read_persistent_clock() is precise enough for this purpose, is it?

read_persistent_clock is a bit coarse, so for small intervals it would
not do. However, the current timeout range for the hangcheck timer is in
seconds, which should be fine for read_persistent_clock().

You might also have some trouble with small intervals. Since things like
tickless systems or other advanced power-savings systems might try to
collate or push timers together to save battery. So ticks may be delayed
a small amount (timers are only guaranteed to fire AFTER the time
specified, there really is no promised bound on how late they may be).

Additionally, on -rt systems, you might have higher priority FIFO tasks
blocking the hangcheck timer from executing for a smallish amount of
time.


> Also, hooking to ntp update code complicates an otherwise simple
> driver. I propose to simply check on non-S390 if the clock source
> resolves to something other than TSC and dump a warning message on
> driver load (something like "Hangcheck: kernel using clocksource %s,
> which is not reliable for hang detection").

That requires the hangcheck code to parse the current clocksource, which
might change as the system runs, so it also has to track the clocksource
over time. So I'm not sure its that much easier of a solution.

Something to also consider might also be to look at the softlockup
watchdog, which is fairly similar but somewhat more deeply integrated
into the kernel. Maybe some of this could be merged?

thanks
-john


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] hangcheck-timer is broken on x86
  2010-03-29 21:43               ` john stultz
@ 2010-03-29 22:34                 ` Yury Polyanskiy
  2010-04-08  0:52                   ` Joel Becker
  0 siblings, 1 reply; 21+ messages in thread
From: Yury Polyanskiy @ 2010-03-29 22:34 UTC (permalink / raw)
  To: john stultz; +Cc: Joel Becker, linux-kernel, Andrew Morton, Jan Glauber

[-- Attachment #1: Type: text/plain, Size: 2825 bytes --]

On Mon, 29 Mar 2010 14:43:44 -0700
john stultz <johnstul@us.ibm.com> wrote:

> On Mon, 2010-03-29 at 17:08 -0400, Yury Polyanskiy wrote:
> > >> > What I'm saying is that if you're using getrawmonotonic() to detect
> > >> > hangs, you might miss them, as getrawmonotonic may wrap (and thus stop
> > >> > continually increasing) if the timer interrupt is delayed. This does not
> > >> > apply to systems using the TSC clocksource, but does apply to systems
> > >> > using the acpi_pm.
> And something else I thought of, while the TSC won't wrap, the
> multiplication done to convert to nanoseconds will overflow when you hit
> a large enough cycle delta. So even TSC systems are not guaranteed to
> have timekeeping (and thus getrawmonotonic) work over infinite time
> without accumulation.

Agreed (large clock->shift, right?), but for hangcheck-timer this
would hardly be a problem, since such a large overflow very unlikely to
land inside allowed interval around the pre-planned timer fire instant.

> 
> You might also have some trouble with small intervals. Since things like
> tickless systems or other advanced power-savings systems might try to
> collate or push timers together to save battery. So ticks may be delayed
> a small amount (timers are only guaranteed to fire AFTER the time
> specified, there really is no promised bound on how late they may be).
> 
> Additionally, on -rt systems, you might have higher priority FIFO tasks
> blocking the hangcheck timer from executing for a smallish amount of
> time.

Yes, these are the events I want to see logged. Essentially I use
hangcheck timer to check stability of kernel's heartbeat.

> > Also, hooking to ntp update code complicates an otherwise simple
> > driver. I propose to simply check on non-S390 if the clock source
> > resolves to something other than TSC and dump a warning message on
> > driver load (something like "Hangcheck: kernel using clocksource %s,
> > which is not reliable for hang detection").
> 
> That requires the hangcheck code to parse the current clocksource, which
> might change as the system runs, so it also has to track the clocksource
> over time. So I'm not sure its that much easier of a solution.

Oh, shoot, you are right. So if compiled-in it would always complain.

> Something to also consider might also be to look at the softlockup
> watchdog, which is fairly similar but somewhat more deeply integrated
> into the kernel. Maybe some of this could be merged?

Yeah, for softlockup detection, I don't understand why one would
prefer hangcheck-timer to watchdog. I am sure Joel has some reasons
though. For me read_persistent_clock() is not a solution, and others
perhaps are indeed would be using softlockup watchdog, which leaves the
decision to Joel.

Best,
Y

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] hangcheck-timer is broken on x86
  2010-03-29 22:34                 ` Yury Polyanskiy
@ 2010-04-08  0:52                   ` Joel Becker
  0 siblings, 0 replies; 21+ messages in thread
From: Joel Becker @ 2010-04-08  0:52 UTC (permalink / raw)
  To: Yury Polyanskiy; +Cc: john stultz, linux-kernel, Andrew Morton, Jan Glauber

On Mon, Mar 29, 2010 at 06:34:14PM -0400, Yury Polyanskiy wrote:
> On Mon, 29 Mar 2010 14:43:44 -0700
> john stultz <johnstul@us.ibm.com> wrote:
> > On Mon, 2010-03-29 at 17:08 -0400, Yury Polyanskiy wrote:
> > > >> > What I'm saying is that if you're using getrawmonotonic() to detect
> > > >> > hangs, you might miss them, as getrawmonotonic may wrap (and thus stop
> > > >> > continually increasing) if the timer interrupt is delayed. This does not
> > > >> > apply to systems using the TSC clocksource, but does apply to systems
> > > >> > using the acpi_pm.
> > And something else I thought of, while the TSC won't wrap, the
> > multiplication done to convert to nanoseconds will overflow when you hit
> > a large enough cycle delta. So even TSC systems are not guaranteed to
> > have timekeeping (and thus getrawmonotonic) work over infinite time
> > without accumulation.

	Ugh.

> Agreed (large clock->shift, right?), but for hangcheck-timer this
> would hardly be a problem, since such a large overflow very unlikely to
> land inside allowed interval around the pre-planned timer fire instant.

	But if you go beyond that interval...

> > You might also have some trouble with small intervals. Since things like
> > tickless systems or other advanced power-savings systems might try to
> > collate or push timers together to save battery. So ticks may be delayed
> > a small amount (timers are only guaranteed to fire AFTER the time
> > specified, there really is no promised bound on how late they may be).
> > 
> > Additionally, on -rt systems, you might have higher priority FIFO tasks
> > blocking the hangcheck timer from executing for a smallish amount of
> > time.
> 
> Yes, these are the events I want to see logged. Essentially I use
> hangcheck timer to check stability of kernel's heartbeat.

	Which is neat, but not the original reason for hangcheck.

> > Something to also consider might also be to look at the softlockup
> > watchdog, which is fairly similar but somewhat more deeply integrated
> > into the kernel. Maybe some of this could be merged?
> 
> Yeah, for softlockup detection, I don't understand why one would
> prefer hangcheck-timer to watchdog. I am sure Joel has some reasons
> though. For me read_persistent_clock() is not a solution, and others
> perhaps are indeed would be using softlockup watchdog, which leaves the
> decision to Joel.

	hangcheck originally was designed to kill a box as fast as
possible.  It comes out of the cluster environment.  Imagine you have
two machines, node1 and node2, working against a shared data store.
They coordinate their access via a lock manager.
	Then node2 goes out to lunch.  Maybe qla2xxx decides to
udelay() while waiting for an FC device.  Something like that.  After a
time period, node1 decides that node2 must have crashed.  It recovers
any intermediate state, then proceeds as if node2 is gone.
	Now the udelay() finally finishes and node2 starts working
again.  node2 does not know that node1 has continued without it.  It
will write old data to the shared storage, corrupting it.
	hangcheck-timer reduces this exposure significantly, because the
timer interrupt will fire reliably and quickly.  hangcheck-timer - if
using the right clock source - will notice the time discrepancy and
immediately trigger the reset.  Note that the reset is the only valid
solution here.  We can't wait for node2 to try to figure anything out;
old data might be already queued in the I/O layer.
	This is why hangcheck-timer must rely on wallclock time.
softdog was originally tried, but after a true hang (udelay(), PCI,
something with timer interrupts off) the system clock doesn't actually
notice the time change.  So the system might have been hung for 30
seconds, but the system clock thinks it has only been gone for 10.
Softdog won't fire, but hangcheck-timer will.  This is also why
suspend/resume has to be treated as a hang.

Joel

-- 

"The lawgiver, of all beings, most owes the law allegiance.  He of all
 men should behave as though the law compelled him.  But it is the
 universal weakness of mankind that what we are given to administer we
 presently imagine we own."
        - H.G. Wells

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2010-04-08  0:54 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-03-24  3:36 [PATCH] hangcheck-timer is broken on x86 Yury Polyanskiy
2010-03-26 21:24 ` Andrew Morton
2010-03-26 21:52   ` Yury Polyanskiy
2010-03-26 21:46 ` Joel Becker
2010-03-26 22:00   ` Yury Polyanskiy
2010-03-27  0:57     ` Joel Becker
2010-03-27  2:02       ` Yury Polyanskiy
2010-03-27 22:03         ` Joel Becker
2010-03-27 22:51           ` Yury Polyanskiy
2010-03-27 23:36             ` Joel Becker
2010-03-28  2:08               ` Yury Polyanskiy
2010-03-29  1:00   ` john stultz
2010-03-29 14:11     ` Yury Polyanskiy
2010-03-29 16:43       ` john stultz
2010-03-29 17:04         ` Yury Polyanskiy
2010-03-29 18:44           ` john stultz
2010-03-29 19:53             ` Joel Becker
2010-03-29 21:08             ` Yury Polyanskiy
2010-03-29 21:43               ` john stultz
2010-03-29 22:34                 ` Yury Polyanskiy
2010-04-08  0:52                   ` Joel Becker

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.