From: Joel Becker <Joel.Becker@oracle.com>
To: Yury Polyanskiy <ypolyans@princeton.edu>
Cc: john stultz <johnstul@us.ibm.com>,
	linux-kernel@vger.kernel.org, Andrew Morton <akpm@osdl.org>,
	Jan Glauber <jan.glauber@de.ibm.com>
Subject: Re: [PATCH] hangcheck-timer is broken on x86
Date: Wed, 7 Apr 2010 17:52:28 -0700
Message-ID: <20100408005226.GD4573@mail.oracle.com>
In-Reply-To: <20100329183414.2f9e3966@penta.localdomain>

On Mon, Mar 29, 2010 at 06:34:14PM -0400, Yury Polyanskiy wrote:
> On Mon, 29 Mar 2010 14:43:44 -0700
> john stultz <johnstul@us.ibm.com> wrote:
> > On Mon, 2010-03-29 at 17:08 -0400, Yury Polyanskiy wrote:
> > > >> > What I'm saying is that if you're using getrawmonotonic() to detect
> > > >> > hangs, you might miss them, as getrawmonotonic may wrap (and thus stop
> > > >> > continually increasing) if the timer interrupt is delayed. This does not
> > > >> > apply to systems using the TSC clocksource, but does apply to systems
> > > >> > using the acpi_pm.
> > And something else I thought of, while the TSC won't wrap, the
> > multiplication done to convert to nanoseconds will overflow when you hit
> > a large enough cycle delta. So even TSC systems are not guaranteed to
> > have timekeeping (and thus getrawmonotonic) work over infinite time
> > without accumulation.

	Ugh.

> Agreed (large clock->shift, right?), but for hangcheck-timer this
> would hardly be a problem, since such a large overflow is very unlikely
> to land inside the allowed interval around the pre-planned timer fire
> instant.

	But if you go beyond that interval...
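
	For reference, a minimal sketch of the overflow John describes,
assuming the usual mult/shift conversion ns = (cycles * mult) >> shift.
The helper names and the mult/shift values in the note below are made up
for illustration; they are not taken from any particular clocksource.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a cycle delta to nanoseconds with a mult/shift pair:
 *   ns = (cycles * mult) >> shift
 * The 64-bit product wraps silently once cycles * mult exceeds 2^64 - 1.
 * (Hypothetical helper, for illustration only.)
 */
static uint64_t cyc_to_ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
	assert(shift < 64);
	return (cycles * mult) >> shift;
}

/* Largest cycle delta whose product with mult still fits in 64 bits. */
static uint64_t max_safe_cycles(uint32_t mult)
{
	assert(mult != 0);
	return UINT64_MAX / mult;
}
```

	With an assumed 1 GHz counter, shift = 22 and mult = 1 << 22 (one
nanosecond per cycle), max_safe_cycles() returns 2^42 - 1, which is
roughly 73 minutes of cycles.  A delta of exactly 2^42 cycles makes the
product wrap to zero, so cyc_to_ns() reports 0 ns.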

> > You might also have some trouble with small intervals. Since things like
> > tickless systems or other advanced power-savings systems might try to
> > collate or push timers together to save battery. So ticks may be delayed
> > a small amount (timers are only guaranteed to fire AFTER the time
> > specified, there really is no promised bound on how late they may be).
> > 
> > Additionally, on -rt systems, you might have higher priority FIFO tasks
> > blocking the hangcheck timer from executing for a smallish amount of
> > time.
> 
> Yes, these are the events I want to see logged. Essentially I use
> hangcheck timer to check stability of kernel's heartbeat.

	Which is neat, but not the original reason for hangcheck.

> > Something else to consider might be to look at the softlockup
> > watchdog, which is fairly similar but somewhat more deeply integrated
> > into the kernel. Maybe some of this could be merged?
> 
> Yeah, for softlockup detection, I don't understand why one would
> prefer hangcheck-timer to watchdog. I am sure Joel has some reasons
> though. For me read_persistent_clock() is not a solution, and others
> perhaps would indeed be using the softlockup watchdog, which leaves
> the decision to Joel.

	hangcheck originally was designed to kill a box as fast as
possible.  It comes out of the cluster environment.  Imagine you have
two machines, node1 and node2, working against a shared data store.
They coordinate their access via a lock manager.
	Then node2 goes out to lunch.  Maybe qla2xxx decides to
udelay() while waiting for an FC device.  Something like that.  After a
time period, node1 decides that node2 must have crashed.  It recovers
any intermediate state, then proceeds as if node2 is gone.
	Now the udelay() finally finishes and node2 starts working
again.  node2 does not know that node1 has continued without it.  It
will write old data to the shared storage, corrupting it.
	hangcheck-timer reduces this exposure significantly, because the
timer interrupt will fire reliably and quickly.  hangcheck-timer - if
using the right clock source - will notice the time discrepancy and
immediately trigger the reset.  Note that the reset is the only valid
solution here.  We can't wait for node2 to try to figure anything out;
old data might be already queued in the I/O layer.
	This is why hangcheck-timer must rely on wallclock time.
softdog was originally tried, but after a true hang (udelay(), PCI,
something with timer interrupts off) the system clock doesn't actually
notice the time change.  So the system might have been hung for 30
seconds, but the system clock thinks it has only been gone for 10.
Softdog won't fire, but hangcheck-timer will.  This is also why
suspend/resume has to be treated as a hang.
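
	A rough sketch of that check, not the driver's actual code.  It
assumes a timer callback that is handed a reading from a clock that
keeps counting while the box is hung.  The struct, field, and function
names here are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state for the sketch; all times are in nanoseconds. */
struct hangcheck {
	uint64_t tick_ns;   /* interval at which the timer is scheduled */
	uint64_t margin_ns; /* extra lateness tolerated before a reset */
	uint64_t last_ns;   /* clock reading at the previous firing */
};

/*
 * Called each time the timer fires, with now_ns read from the chosen
 * clock.  Returns true when more time has passed than tick + margin,
 * i.e. when the machine should be reset.
 */
static bool hangcheck_fire(struct hangcheck *hc, uint64_t now_ns)
{
	uint64_t elapsed = now_ns - hc->last_ns;

	hc->last_ns = now_ns;
	return elapsed > hc->tick_ns + hc->margin_ns;
}
```

	Take a 10 second tick and a 10 second margin.  After a 30 second
hang, a clock that kept counting reports 30 seconds elapsed, which is
past the 20 second limit, so the check asks for a reset.  A tick-driven
clock that only saw 10 seconds go by stays under the limit and the hang
is missed.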

Joel

-- 

"The lawgiver, of all beings, most owes the law allegiance.  He of all
 men should behave as though the law compelled him.  But it is the
 universal weakness of mankind that what we are given to administer we
 presently imagine we own."
        - H.G. Wells

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
