* recursive fault in 2.6.35.5
From: Whit Blauvelt @ 2011-05-29 16:27 UTC
  To: linux-kernel

Hi,

This isn't a most-recent kernel, so we should upgrade the systems with it,
but it could also be useful to know why the fault occurred. If someone here
can easily decode the final messages when the system froze....

This is vanilla 2.6.35.5, built from source, running with Ubuntu Server
10.04.2. Two similar systems have been running stably for months, then
yesterday and today both froze up - one twice. On the one where I was able
to get a remote console before rebooting the final messages are in a screen
capture at

http://www.transpect.com/jpg/sb2crash.jpg

The final lines are

[3521437.065988] RIP  [<ffffffff81054ddc>] set_next_entity+0xc/0xa0
[3521437.065993]  RSP <ffff8801b60b1748>
[3521437.065994] CR2: 0000000000000038
[3521437.065997] ---[ end trace 5a40c5f226029029 ]---
[3521437.065999] Fixing recursive fault but reboot is needed!
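
For anyone who wants to decode such lines mechanically, here is a minimal sketch in Python. It only pulls the faulting symbol, the offset into it, and the CR2 value out of the text quoted above. The helper name `decode_oops_tail` and the "below one page" check are assumptions made for illustration, not anything the kernel provides. On x86, CR2 holds the address that triggered the page fault, so a value as small as 0x38 usually suggests a read through a NULL pointer at structure offset 0x38, but that is a heuristic and says nothing about the root cause.

```python
import re

# The three relevant lines, copied from the message above.
OOPS_TAIL = """\
[3521437.065988] RIP  [<ffffffff81054ddc>] set_next_entity+0xc/0xa0
[3521437.065993]  RSP <ffff8801b60b1748>
[3521437.065994] CR2: 0000000000000038
"""

# symbol+0xOFFSET/0xSIZE: offset of the faulting instruction within the
# function, and the total size of the function.
RIP_RE = re.compile(r"RIP\s+\[<([0-9a-f]+)>\]\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)")
CR2_RE = re.compile(r"CR2:\s+([0-9a-f]+)")


def decode_oops_tail(text, page_size=4096):
    """Hypothetical helper: extract RIP symbol info and the CR2 address."""
    rip = RIP_RE.search(text)
    cr2 = CR2_RE.search(text)
    if rip is None or cr2 is None:
        raise ValueError("no RIP or CR2 line found")
    fault_addr = int(cr2.group(1), 16)
    return {
        "rip": int(rip.group(1), 16),
        "symbol": rip.group(2),
        "offset": int(rip.group(3), 16),
        "size": int(rip.group(4), 16),
        "fault_addr": fault_addr,
        # Heuristic only: an address inside the first page is typically
        # NULL plus a small structure-member offset.
        "looks_like_null_deref": fault_addr < page_size,
    }


info = decode_oops_tail(OOPS_TAIL)
print(info["symbol"], hex(info["offset"]), hex(info["fault_addr"]))
```

With a kernel built with debug info, the symbol and offset can then be looked up in the matching vmlinux to find the source line; that step is not shown here because it depends on the exact build.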

These are basically file servers running NFS, samba, and some Python. I know
there are recent improvements to the kernel's NFS functions. Does this point
in that direction as the cause of the recursive fault?

TIA,
Whit

* Re: recursive fault in 2.6.35.5
From: Mike Galbraith @ 2011-05-30  2:48 UTC
  To: Whit Blauvelt; +Cc: linux-kernel

On Sun, 2011-05-29 at 12:27 -0400, Whit Blauvelt wrote:
> Hi,
> 
> This isn't a most-recent kernel, so we should upgrade the systems with it,
> but it could also be useful to know why the fault occurred. If someone here
> can easily decode the final messages when the system froze....
> 
> This is vanilla 2.6.35.5, built from source, running with Ubuntu Server
> 10.04.2. Two similar systems have been running stably for months, then
> yesterday and today both froze up - one twice. On the one where I was able
> to get a remote console before rebooting the final messages are in a screen
> capture at
> 
> http://www.transpect.com/jpg/sb2crash.jpg
> 
> The final lines are
> 
> [3521437.065988] RIP  [<ffffffff81054ddc>] set_next_entity+0xc/0xa0
> [3521437.065993]  RSP <ffff8801b60b1748>
> [3521437.065994] CR2: 0000000000000038
> [3521437.065997] ---[ end trace 5a40c5f226029029 ]---
> [3521437.065999] Fixing recursive fault but reboot is needed!
> 
> These are basically file servers running NFS, samba, and some Python. I know
> there are recent improvements to the kernel's NFS functions. Does this point
> in that direction as the cause of the recursive fault?

No, you've been bitten by an annoyingly elusive load balancing bug.

	-Mike


* Re: recursive fault in 2.6.35.5
From: Whit Blauvelt @ 2011-05-31 14:24 UTC
  To: Mike Galbraith; +Cc: linux-kernel

On Mon, May 30, 2011 at 04:48:29AM +0200, Mike Galbraith wrote:

> No, you've been bitten by an annoyingly elusive load balancing bug.

Thanks Mike. Can that bug be avoided by leaving out some kernel option? The
system that happened on had its identical twin fail the day before. For
both, it was a time of relatively more load (although not excessive). On the
twin we didn't look at the console before rebooting though.

On the other hand, we'd run for months with no problem up until this.

Regards,
Whit


* Re: recursive fault in 2.6.35.5
From: Mike Galbraith @ 2011-06-01  2:01 UTC
  To: Whit Blauvelt; +Cc: linux-kernel

On Tue, 2011-05-31 at 10:24 -0400, Whit Blauvelt wrote:
> On Mon, May 30, 2011 at 04:48:29AM +0200, Mike Galbraith wrote:
> 
> > No, you've been bitten by an annoyingly elusive load balancing bug.
> 
> Thanks Mike. Can that bug be avoided by leaving out some kernel option? The
> system that happened on had its identical twin fail the day before. For
> both, it was a time of relatively more load (although not excessive). On the
> twin we didn't look at the console before rebooting though.
> 
> On the other hand, we'd run for months with no problem up until this.

No earthly notion.  I never figured out exactly how it happens.  Setting
traps for the critter didn't work out.  I did receive some diagnostic
info from a group of ppc64 boxen that indicated that the clock went
backward, but when I zeroed in on it, they went silent.  All other
machines with traps set have been totally silent for months (that's a
lot of machines too).

Bug seems to be dead upstream, at least I haven't noticed any reports
with a recent kernel.

	-Mike

