* [parisc-linux] Re: Expect defunct, kill -9 panics kernel?
       [not found] <119aab440702100916q504101b1xe99f65ff5945e712@mail.gmail.com>
@ 2007-02-10 18:10 ` John David Anglin
  2007-02-10 18:35 ` [parisc-linux] " James Bottomley
  1 sibling, 0 replies; 7+ messages in thread
From: John David Anglin @ 2007-02-10 18:10 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: dave.anglin, parisc-linux

> Is this the usual behaviour you see?
> 
> 1. I run the gcc testsuite.
> 2. expect dies, leaving a defunct process.
> 3. Killing another expect panics the kernel.

It is similar to the behavior that I see.  I don't usually see this
with expect though.  Possibly, this is because I use my own build
of expect linked against tcl8.3.

I see this behavior quite consistently on my c3750 if I

1.  Run the gcc libjava testsuite.
2.  Usually, there is a set of processes (e.g., Process_3) left running
    after the testsuite ends.  These processes are not defunct and
    load the processor.  I can kill all but the oldest thread.
3.  Killing the oldest thread panics the kernel.  Sometimes the system
    reboots.  However, the system often hangs doing endless panics.

I suspect a timing issue as the c3750 is the fastest processor that
I test on.  I don't see as many problems with the libjava testsuite
on slower hardware.  At one time, I thought this might be a 32 versus
64-bit issue, but I see the same problems running a 64-bit kernel.

Dave
-- 
J. David Anglin                                  dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada              (613) 990-0752 (FAX: 952-6602)
_______________________________________________
parisc-linux mailing list
parisc-linux@lists.parisc-linux.org
http://lists.parisc-linux.org/mailman/listinfo/parisc-linux


* Re: [parisc-linux] Expect defunct, kill -9 panics kernel?
       [not found] <119aab440702100916q504101b1xe99f65ff5945e712@mail.gmail.com>
  2007-02-10 18:10 ` [parisc-linux] Re: Expect defunct, kill -9 panics kernel? John David Anglin
@ 2007-02-10 18:35 ` James Bottomley
  1 sibling, 0 replies; 7+ messages in thread
From: James Bottomley @ 2007-02-10 18:35 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: John David Anglin, parisc-linux

On Sat, 2007-02-10 at 12:16 -0500, Carlos O'Donell wrote:
> At what point in the process life are we in __wake_up and
> __wake_up_common?
> An address of 0x10 is very suspicious.

Almost every internal kernel event or semaphore uses these.

Because of the empty backtrace, I'd be inclined to say it was the
scheduler, possibly.

0x10 looks to be curr->func implying curr is NULL and thus the queue
task_list is corrupt.
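
The offset arithmetic behind this guess can be sketched in user space.
The struct below is a hypothetical mirror of a 2.6-era wait queue entry
(the names wait_queue_model, list_head_model and func_fault_address are
made up for illustration, and the field order is an assumption, not
taken from the kernel in question).  On an LP64 target, func would sit
at offset 0x10, so reading curr->func through a NULL curr would touch
address 0x10:

```c
#include <stddef.h>

/* Hypothetical stand-in for struct list_head. */
struct list_head_model {
    struct list_head_model *next, *prev;
};

/* Assumed field order of a 2.6-era wait queue entry; illustration only. */
struct wait_queue_model {
    unsigned int flags;          /* 4 bytes, then 4 bytes of padding on LP64 */
    void *private_data;          /* offset 0x08 on LP64 */
    int (*func)(struct wait_queue_model *wait, unsigned int mode,
                int sync, void *key);  /* offset 0x10 on LP64 */
    struct list_head_model task_list;
};

/* Address that a load of curr->func would touch if curr were NULL. */
size_t func_fault_address(void)
{
    return offsetof(struct wait_queue_model, func);
}
```

This only shows that the guess is consistent with the reported address
under the assumed layout; it does not rule out other explanations for a
fault at 0x10.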

That's the best I can do without the kernel to pull apart.

James



* Re: [parisc-linux] Expect defunct, kill -9 panics kernel?
       [not found] <1171226106.3406.47.camel@mulgrave.il.steeleye.com>
@ 2007-02-11 20:59 ` John David Anglin
  0 siblings, 0 replies; 7+ messages in thread
From: John David Anglin @ 2007-02-11 20:59 UTC (permalink / raw)
  To: James Bottomley; +Cc: dave.anglin, parisc-linux

> Right, now here's a bit of really useful detective work:
> 
> In the same piece of disassembly can you see what happens to %r26 ...
> the first argument to __wake_up_common() which is the wait queue?  It
> may be clobbered, but if it isn't by the time we fault we know that
> 0x45f10250 is the address of the wait queue.  If we're incredibly lucky,
> it's a symbol in the vmlinux, can you see if it is (and if it's valid)?

In the code I'm looking at, r26 is copied to r7 near the beginning of
__wake_up_common().  r7 is 0 in the register dump.  Of course, Carlos'
kernel may differ.

Dave
-- 
J. David Anglin                                  dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada              (613) 990-0752 (FAX: 952-6602)

* Re: [parisc-linux] Expect defunct, kill -9 panics kernel?
       [not found]         ` <119aab440702111222v3562f308v9808b4dea7b73d59@mail.gmail.com>
@ 2007-02-11 20:35           ` James Bottomley
  0 siblings, 0 replies; 7+ messages in thread
From: James Bottomley @ 2007-02-11 20:35 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: John David Anglin, dave.anglin, parisc-linux

On Sun, 2007-02-11 at 15:22 -0500, Carlos O'Donell wrote:
> On 2/11/07, Carlos O'Donell <carlos@systemhalted.org> wrote:
> > The faulting instruction is:
> >   74:   52 82 00 20     ldd 10(r20),rp
> >
> > Which is just before the curr->func call.
> >   78:   e8 40 f0 00     bve,l (rp),rp
> >   7c:   52 9b 00 30     ldd 18(r20),dp
> >
> > So your assumption was correct. The value of curr->func is null.
> > How did the list get corrupted?
> 
> ... to be precise, the faulting instruction is the break at 0x10 that
> we use for null pointer dereferences.
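
One reading that fits the two loads at offsets 0x10 and 0x18 from r20,
assuming this is a 64-bit kernel where an indirect call goes through a
function descriptor: r20 holds the descriptor pointer (curr->func), the
entry point is loaded into rp and the global pointer into dp.  The
struct below is a user-space model of that descriptor layout; the names
fdesc_model, fdesc_addr_offset and fdesc_gp_offset are made up for
illustration, and the layout is an assumption about the 64-bit PA-RISC
ABI rather than something shown in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout of a 64-bit PA-RISC function descriptor; illustration only. */
struct fdesc_model {
    uint64_t dummy[2];  /* offsets 0x00 and 0x08 */
    uint64_t addr;      /* entry point: ldd 10(r20),rp would read this */
    uint64_t gp;        /* global pointer: ldd 18(r20),dp would read this */
};

/* Offset read by the first load if r20 is the descriptor pointer. */
size_t fdesc_addr_offset(void)
{
    return offsetof(struct fdesc_model, addr);
}

/* Offset read by the load in the branch delay slot. */
size_t fdesc_gp_offset(void)
{
    return offsetof(struct fdesc_model, gp);
}
```

Under that assumption, a NULL curr->func makes the first load touch
address 0x10, which matches the reported fault address.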

Right, now here's a bit of really useful detective work:

In the same piece of disassembly can you see what happens to %r26 ...
the first argument to __wake_up_common() which is the wait queue?  It
may be clobbered, but if it isn't by the time we fault we know that
0x45f10250 is the address of the wait queue.  If we're incredibly lucky,
it's a symbol in the vmlinux, can you see if it is (and if it's valid)?

Knowing what the wait queue is will tell us (hopefully) with precision
where the fault lies.

James



* Re: [parisc-linux] Expect defunct, kill -9 panics kernel?
       [not found]   ` <119aab440702110909r2018a297k98b4f1baed54821a@mail.gmail.com>
  2007-02-11 17:17     ` John David Anglin
@ 2007-02-11 19:19     ` James Bottomley
       [not found]     ` <1171221592.3406.32.camel@mulgrave.il.steeleye.com>
  2 siblings, 0 replies; 7+ messages in thread
From: James Bottomley @ 2007-02-11 19:19 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: John David Anglin, dave.anglin, parisc-linux

On Sun, 2007-02-11 at 12:09 -0500, Carlos O'Donell wrote:
> How do I validate your guess? Look for a null or bogus curr->func when
> scheduling?

Disassemble the piece in vmlinux for __wake_up_common and check that the
instruction that faulted is where the code gets the curr->func.

James



* Re: [parisc-linux] Expect defunct, kill -9 panics kernel?
       [not found]   ` <119aab440702110909r2018a297k98b4f1baed54821a@mail.gmail.com>
@ 2007-02-11 17:17     ` John David Anglin
  2007-02-11 19:19     ` James Bottomley
       [not found]     ` <1171221592.3406.32.camel@mulgrave.il.steeleye.com>
  2 siblings, 0 replies; 7+ messages in thread
From: John David Anglin @ 2007-02-11 17:17 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: James.Bottomley, dave.anglin, parisc-linux

> On 2/10/07, James Bottomley <James.Bottomley@steeleye.com> wrote:
> > On Sat, 2007-02-10 at 14:37 -0500, John David Anglin wrote:
> > > > 0x10 looks to be curr->func implying curr is NULL and thus the queue
> > > > task_list is corrupt.
> > >
> > > Do you think it would help to add a check in __wake_up for a NULL pointer?
> >
> > I suppose so ... I'd really like someone to validate my guess though,
> > although an additional BUG_ON() can't hurt.
> 
> How do I validate your guess? Look for a null or bogus curr->func when
> scheduling?

I'm trying the change below.  Hasn't triggered yet.

Dave
-- 
J. David Anglin                                  dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada              (613) 990-0752 (FAX: 952-6602)

diff --git a/kernel/sched.c b/kernel/sched.c
index cca93cc..277e426 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3703,6 +3703,7 @@ void fastcall __wake_up(wait_queue_head_t *q, unsigned int mode,
 {
 	unsigned long flags;
 
+	BUG_ON(!q);
 	spin_lock_irqsave(&q->lock, flags);
 	__wake_up_common(q, mode, nr_exclusive, 0, key);
 	spin_unlock_irqrestore(&q->lock, flags);
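
The BUG_ON() above checks the queue head q.  The fault discussed in the
thread is at the load of curr->func, so a check inside the list walk is
another place a test could go.  The following is a user-space sketch of
that idea only; wq_entry_model, wake_up_checked and always_wake are
hypothetical names, and the singly linked list is a simplification of
the kernel's task_list, not the real __wake_up_common().

```c
#include <stddef.h>

/* Simplified, hypothetical wait queue entry: a callback and a next pointer. */
struct wq_entry_model {
    int (*func)(struct wq_entry_model *entry);
    struct wq_entry_model *next;
};

/*
 * Walk the list and call each entry's wake function.
 * Returns the number of entries whose callback reported a wake-up,
 * or -1 as soon as an entry with a NULL callback is found.
 */
int wake_up_checked(struct wq_entry_model *head)
{
    int woken = 0;

    for (struct wq_entry_model *curr = head; curr != NULL; curr = curr->next) {
        if (curr->func == NULL)
            return -1;  /* roughly where a BUG_ON(!curr->func) would fire */
        if (curr->func(curr))
            woken++;
    }
    return woken;
}

/* Trivial callback used for demonstration: always reports a wake-up. */
int always_wake(struct wq_entry_model *entry)
{
    (void)entry;
    return 1;
}
```

With two entries that both use always_wake, wake_up_checked() returns 2;
clearing func on either entry makes it return -1 instead of calling
through a NULL pointer.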

* Re: [parisc-linux] Expect defunct, kill -9 panics kernel?
       [not found] <200702101937.l1AJb7Uo014941@hiauly1.hia.nrc.ca>
@ 2007-02-11  1:50 ` James Bottomley
       [not found] ` <1171158607.3373.54.camel@mulgrave.il.steeleye.com>
  1 sibling, 0 replies; 7+ messages in thread
From: James Bottomley @ 2007-02-11  1:50 UTC (permalink / raw)
  To: John David Anglin; +Cc: dave.anglin, parisc-linux

On Sat, 2007-02-10 at 14:37 -0500, John David Anglin wrote:
> > 0x10 looks to be curr->func implying curr is NULL and thus the queue
> > task_list is corrupt.
> 
> Do you think it would help to add a check in __wake_up for a NULL pointer?

I suppose so ... I'd really like someone to validate my guess though,
although an additional BUG_ON() can't hurt.

James



