linux-kernel.vger.kernel.org archive mirror
* Multithreaded core dumps
@ 2001-08-30  5:21 Elan Feingold
  2001-08-30  5:50 ` Kip Macy
  2001-08-30  8:16 ` Multithreaded core dumps (CLONE_THREAD and elf) Terje Eggestad
  0 siblings, 2 replies; 9+ messages in thread
From: Elan Feingold @ 2001-08-30  5:21 UTC (permalink / raw)
  To: linux-kernel

Hi,

First post (although lurking on and off since 0.99pl14 or so), so please
be gentle.

We recently convinced my company to move from VxWorks to (embedded)
Linux.  For better or worse, our application is heavily multithreaded.
It seems that current versions of Linux dump single-threaded core.  I've
done a bit of research and it seems that:

- GDB is more than happy to load multiple register sets per core file.

- There are patches to make the kernel dump multiple core files, one per LWP.
This is not really helpful in my case, since we're on an embedded
platform with little Flash to store cores.  Besides, loading up gdb on
10-20 core files looking for the one that had the problem doesn't sound
very fun, as opposed to saying "info threads".
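
On the first point: in an ELF core file, per-thread register state is
carried as notes in the PT_NOTE segment, one NT_PRSTATUS note per thread,
so a single file can hold several register sets.  A minimal sketch of
counting them in a raw note buffer follows.  It assumes 4-byte padding of
the name and payload and does not check the note name; struct note_hdr,
pad4(), count_prstatus_notes() and put_note() are names made up for this
illustration, not kernel or GDB code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NOTE_PRSTATUS 1u   /* same value as NT_PRSTATUS in <elf.h> */

/* Same layout as Elf32_Nhdr / Elf64_Nhdr: three 32-bit words. */
struct note_hdr {
    uint32_t namesz;   /* length of the name, including its NUL */
    uint32_t descsz;   /* length of the descriptor (payload) */
    uint32_t type;     /* note type, e.g. NT_PRSTATUS */
};

/* Round up to the 4-byte alignment assumed for name and payload. */
static size_t pad4(size_t n) { return (n + 3u) & ~(size_t)3u; }

/*
 * Walk a raw PT_NOTE buffer and count the NT_PRSTATUS entries.
 * Returns -1 if the buffer is truncated.
 */
static int count_prstatus_notes(const unsigned char *buf, size_t len)
{
    size_t off = 0;
    int count = 0;

    while (off < len) {
        struct note_hdr h;
        size_t body;

        if (len - off < sizeof h)
            return -1;
        memcpy(&h, buf + off, sizeof h);   /* avoids unaligned reads */
        off += sizeof h;
        body = pad4(h.namesz) + pad4(h.descsz);
        if (body > len - off)
            return -1;
        if (h.type == NOTE_PRSTATUS)
            count++;
        off += body;
    }
    return count;
}

/* Helper for experiments: append one zero-filled note named "CORE". */
static size_t put_note(unsigned char *buf, size_t off,
                       uint32_t type, uint32_t descsz)
{
    struct note_hdr h = { 5, descsz, type };   /* "CORE" plus NUL */

    memcpy(buf + off, &h, sizeof h);
    off += sizeof h;
    memset(buf + off, 0, pad4(5) + pad4(descsz));
    memcpy(buf + off, "CORE", 5);
    return off + pad4(5) + pad4(descsz);
}
```

A buffer built with put_note() holding two notes of type 1 and one of
another type should give a count of 2.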

Because I am (not so) young and (very) foolish, it doesn't sound that
hard, at first (and uneducated) glance, to dump register sets for all
related LWPs to a single core file, even in a portable way across x86
and PPC architectures.  Under SMP, it might be a bit trickier, but
Solaris manages to do it, so it can't be impossible, and capturing any
state for each thread would seem better than nothing at all, since all
the LWPs are about to die anyways.  I'm considering using some of my
(copious) spare time and trying to patch the kernel to do this.  I have
a few questions:

0. Am I wrong or confused about the state of postmortem multithreaded
debugging under Linux?

1. Is anyone else working on this?  If not, why not?  Am I missing
something?

2. If this is simply something that nobody is working on because other
things are more interesting, can anybody give me a few pointers on where
to start?

Best regards,

-elan
Aspiring Kernel Hacker



* Re: Multithreaded core dumps
  2001-08-30  5:21 Multithreaded core dumps Elan Feingold
@ 2001-08-30  5:50 ` Kip Macy
  2001-08-30  8:24   ` Alan Cox
  2001-08-30  8:16 ` Multithreaded core dumps (CLONE_THREAD and elf) Terje Eggestad
  1 sibling, 1 reply; 9+ messages in thread
From: Kip Macy @ 2001-08-30  5:50 UTC (permalink / raw)
  To: Elan Feingold; +Cc: linux-kernel

> 
> 0. Am I wrong or confused about the state of postmortem multithreaded
> debugging under Linux?

At least as of mid-2.2 series this was certainly my experience. It was
very frustrating that the thread/process that dumped core was not the one
that dereferenced a bad pointer/failed an assert but the process group
leader.

> 
> 2. If this is simply something that nobody is working on because other
> things are more interesting, can anybody give me a few pointers on where
> to start?
> 

I am inclined to believe that that is the case. Unfortunately, I have no
advice to give - but I am writing because I think it would be neat if,
given the time and the inclination, you documented your findings as you
progress and put them on the web.

			-Kip



* Re: Multithreaded core dumps (CLONE_THREAD and elf)
  2001-08-30  5:21 Multithreaded core dumps Elan Feingold
  2001-08-30  5:50 ` Kip Macy
@ 2001-08-30  8:16 ` Terje Eggestad
  2001-08-30  9:19   ` Andreas Dilger
  1 sibling, 1 reply; 9+ messages in thread
From: Terje Eggestad @ 2001-08-30  8:16 UTC (permalink / raw)
  To: Elan Feingold; +Cc: linux-kernel

Annoying, isn't it....

I just browsed through the code; I got curious myself, since Linux has a
very flexible way of specifying the degree of sharing. If two procs share
VM, it does not necessarily follow that both should exit if one does.

There is a CLONE_THREAD flag to clone() that sets up a linked list through
all the procs (shared VM or not). In 2.4.3, which I currently run, it
doesn't seem to be ready for general use; I managed to get this:
te       31504 31504  0 10:03 pts/0    00:00:00 [clone2 <defunct>]
te       31505 31504  0 10:03 pts/0    00:00:00 [clone2 <defunct>]

Where a zombie is waiting for the parent to receive its SIGCHLD, but
it's its own parent. Kinda cute, guess it's time to reboot....

I don't know what kind of behaviour it's supposed to have, but if it's
intended to mark all the clones to die with the coredumping proc, the
data structures are properly in place, and all do_coredump needs to do
is follow the linked list (thread_group), preempt if SMP, and read off
the regs (plus all the pitfalls I don't see, of course).


A COMPLETELY different matter is whether the ELF format supports multiple
register sets.... look it up, (I'm not...)

TJ

-- 
_________________________________________________________________________

Terje Eggestad                  terje.eggestad@scali.no
Scali Scalable Linux Systems    http://www.scali.com

Olaf Helsets Vei 6              tel:    +47 22 62 89 61 (OFFICE)
P.O.Box 70 Bogerud                      +47 975 31 574  (MOBILE)
N-0621 Oslo                     fax:    +47 22 62 89 51
NORWAY            
_________________________________________________________________________



* Re: Multithreaded core dumps
  2001-08-30  5:50 ` Kip Macy
@ 2001-08-30  8:24   ` Alan Cox
  2001-08-30  9:03     ` Alexander Viro
  0 siblings, 1 reply; 9+ messages in thread
From: Alan Cox @ 2001-08-30  8:24 UTC (permalink / raw)
  To: Kip Macy; +Cc: Elan Feingold, linux-kernel

> I am inclined to believe that that is the case. Unfortunately, I have no
> advice to give - but I am writing because I think it would be neat if,
> given the time and the inclination, you documented your findings as you
> progress and put them on the web.

The 2.4-ac tree supports dumping core.$pid for the threads that actually
died


* Re: Multithreaded core dumps
  2001-08-30  8:24   ` Alan Cox
@ 2001-08-30  9:03     ` Alexander Viro
  2001-08-30  9:12       ` Alan Cox
  2001-08-30  9:33       ` David S. Miller
  0 siblings, 2 replies; 9+ messages in thread
From: Alexander Viro @ 2001-08-30  9:03 UTC (permalink / raw)
  To: Alan Cox; +Cc: Kip Macy, Elan Feingold, linux-kernel



On Thu, 30 Aug 2001, Alan Cox wrote:

> > I am inclined to believe that that is the case. Unfortunately, I have no
> > advice to give - but I am writing because I think it would be neat if,
> > given the time and the inclination, you documented your findings as you
> > progress and put them on the web.
> 
> The 2.4-ac tree supports dumping core.$pid for the threads that actually
> died

... and these dumps are not reliable.  Living thread may modify the
contents of dump as it's being written out.  I.e. you are getting
false alarms - inconsistent data that was never there.

Think of a linked list protected by a mutex.  Half of its entries are
already written out.  Surviving thread removes an element.  It updates
the in-core structures correctly.  The problem being, in the dump
we get part of memory from before that change and part - after.  If
you notice that when you are looking at the dump - welcome to a nice
chase after the bug that never existed.
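
The scenario can be reproduced deterministically without any threads.
The sketch below is only an illustration under made-up assumptions: the
"process memory" is an array of four nodes, the "dump" copies it in two
halves, and the list update is forced to happen in between.  All names
(init_list, remove_after, torn_dump, walk_hits_poison) are hypothetical.

```c
#include <string.h>

#define NODES  4
#define END    (-1)     /* end-of-list marker */
#define POISON (-99)    /* value written into a freed node */

struct node { int value; int next; };   /* next is an index into the array */

/* Build the list 0 -> 1 -> 2 -> 3 in "process memory". */
static void init_list(struct node mem[NODES])
{
    for (int i = 0; i < NODES; i++) {
        mem[i].value = 10 * (i + 1);
        mem[i].next  = (i + 1 < NODES) ? i + 1 : END;
    }
}

/* What the surviving thread does: unlink and poison the node after prev. */
static void remove_after(struct node mem[NODES], int prev)
{
    int victim = mem[prev].next;

    mem[prev].next    = mem[victim].next;
    mem[victim].value = POISON;
    mem[victim].next  = END;
}

/* Dump written in two halves, with a list update in between. */
static void torn_dump(struct node mem[NODES], struct node dump[NODES])
{
    memcpy(dump, mem, 2 * sizeof mem[0]);           /* nodes 0..1, old state */
    remove_after(mem, 1);                           /* survivor removes node 2 */
    memcpy(dump + 2, mem + 2, 2 * sizeof mem[0]);   /* nodes 2..3, new state */
}

/* Returns 1 if a walk from node 0 reaches a poisoned (freed) node. */
static int walk_hits_poison(const struct node *mem)
{
    for (int i = 0; i != END; i = mem[i].next)
        if (mem[i].value == POISON)
            return 1;
    return 0;
}
```

After torn_dump(), the live list is 0 -> 1 -> 3 and never touches the
freed node, while the dump still has node 1 pointing at node 2, which is
already poisoned: a state the process was never in.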



* Re: Multithreaded core dumps
  2001-08-30  9:03     ` Alexander Viro
@ 2001-08-30  9:12       ` Alan Cox
  2001-08-30 10:08         ` Alexander Viro
  2001-08-30  9:33       ` David S. Miller
  1 sibling, 1 reply; 9+ messages in thread
From: Alan Cox @ 2001-08-30  9:12 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Alan Cox, Kip Macy, Elan Feingold, linux-kernel

> > The 2.4-ac tree supports dumping core.$pid for the threads that actually
> > died
> 
> ... and these dumps are not reliable.  Living thread may modify the
> contents of dump as it's being written out.  I.e. you are getting
> false alarms - inconsistent data that was never there.

That is mathematically insoluble. Think about an SMP system, you cannot stop
the other thread in instantaneously small time

Alan


* Re: Multithreaded core dumps (CLONE_THREAD and elf)
  2001-08-30  8:16 ` Multithreaded core dumps (CLONE_THREAD and elf) Terje Eggestad
@ 2001-08-30  9:19   ` Andreas Dilger
  0 siblings, 0 replies; 9+ messages in thread
From: Andreas Dilger @ 2001-08-30  9:19 UTC (permalink / raw)
  To: Terje Eggestad; +Cc: Elan Feingold, linux-kernel

On Aug 30, 2001  10:16 +0200, Terje Eggestad wrote:
> There is a CLONE_THREAD flag to clone() that sets up a linked list through
> all the procs (shared VM or not). In 2.4.3, which I currently run, it
> doesn't seem to be ready for general use; I managed to get this:
> te       31504 31504  0 10:03 pts/0    00:00:00 [clone2 <defunct>]
> te       31505 31504  0 10:03 pts/0    00:00:00 [clone2 <defunct>]
> 
> Where a zombie is waiting for the parent to receive its SIGCHLD, but
> it's its own parent. Kinda cute, guess it's time to reboot....

I'm pretty sure Linus had a patch for this in 2.4.7 or so, which
reparented the thread to init, so it would be reaped on exit.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert



* Re: Multithreaded core dumps
  2001-08-30  9:03     ` Alexander Viro
  2001-08-30  9:12       ` Alan Cox
@ 2001-08-30  9:33       ` David S. Miller
  1 sibling, 0 replies; 9+ messages in thread
From: David S. Miller @ 2001-08-30  9:33 UTC (permalink / raw)
  To: alan; +Cc: viro, kmacy, efeingold, linux-kernel

   From: Alan Cox <alan@lxorguk.ukuu.org.uk>
   Date: Thu, 30 Aug 2001 10:12:49 +0100 (BST)

   > ... and these dumps are not reliable.  Living thread may modify the
   > contents of dump as it's being written out.  I.e. you are getting
   > false alarms - inconsistent data that was never there.
   
   That is mathematically insoluble. Think about an SMP system, you cannot stop
   the other thread in instantaneously small time

If you mean "at the time the user process trapped", pretty much this
is true.

But it _IS_ possible to capture the other threads with an SMP cross
call at the time the bad condition leading to the dump is detected.

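	thread_dont_schedule(t);	/* keep t from being scheduled again */
	smp_reschedule();		/* SMP cross call: other CPUs reschedule */
	t->dump_core(t);		/* t is no longer running; dump its state */
	thread_re_schedule(t);		/* make t schedulable again */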
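
A rough userspace analogue of that ordering (stop the threads, dump, let
them go) can be sketched with pthreads.  This is not the kernel
mechanism: there is no cross call here, the workers park themselves
cooperatively at a safe point when asked.  Every name below is
hypothetical and chosen for the illustration only.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define WORKERS 3

struct pair { long a, b; };            /* invariant at safe points: a == b */

static pthread_mutex_t lock    = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  changed = PTHREAD_COND_INITIALIZER;
static bool freeze, quit;              /* both protected by lock */
static int  parked;                    /* workers currently stopped */
static struct pair live[WORKERS];      /* each worker owns one slot */

static void *worker(void *arg)
{
    struct pair *p = arg;

    for (;;) {
        bool stop;

        pthread_mutex_lock(&lock);
        if (freeze) {                  /* safe point: park until released */
            parked++;
            pthread_cond_broadcast(&changed);
            while (freeze)
                pthread_cond_wait(&changed, &lock);
            parked--;
        }
        stop = quit;
        pthread_mutex_unlock(&lock);
        if (stop)
            return NULL;
        p->a++;                        /* invariant briefly broken here */
        p->b++;
    }
}

/* Returns 1 if every pair in the snapshot has a == b, -1 on setup failure. */
static int frozen_snapshot_is_consistent(void)
{
    pthread_t tid[WORKERS];
    struct pair snap[WORKERS];
    int n, i, ok = 1;

    quit = false;                      /* no workers exist yet */
    parked = 0;
    for (n = 0; n < WORKERS; n++)
        if (pthread_create(&tid[n], NULL, worker, &live[n]) != 0)
            break;

    pthread_mutex_lock(&lock);
    freeze = true;                     /* "don't schedule" */
    while (parked < n)                 /* wait until all workers are parked */
        pthread_cond_wait(&changed, &lock);
    for (i = 0; i < n; i++)            /* "dump": nothing can change now */
        snap[i] = live[i];
    freeze = false;                    /* "re-schedule", then ask them to exit */
    quit = true;
    pthread_cond_broadcast(&changed);
    pthread_mutex_unlock(&lock);

    for (i = 0; i < n; i++)
        pthread_join(tid[i], NULL);
    for (i = 0; i < n; i++)
        if (snap[i].a != snap[i].b)
            ok = 0;
    return n == WORKERS ? ok : -1;
}
```

Because the copy is taken only while every worker is parked between
updates, each pair in the snapshot has a == b.  The kernel case is
harder, since the other threads cannot be asked to cooperate.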

Later,
David S. Miller
davem@redhat.com


* Re: Multithreaded core dumps
  2001-08-30  9:12       ` Alan Cox
@ 2001-08-30 10:08         ` Alexander Viro
  0 siblings, 0 replies; 9+ messages in thread
From: Alexander Viro @ 2001-08-30 10:08 UTC (permalink / raw)
  To: Alan Cox; +Cc: Kip Macy, Elan Feingold, linux-kernel



On Thu, 30 Aug 2001, Alan Cox wrote:

> > > The 2.4-ac tree supports dumping core.$pid for the threads that actually
> > > died
> > 
> > ... and these dumps are not reliable.  Living thread may modify the
> > contents of dump as it's being written out.  I.e. you are getting
> > false alarms - inconsistent data that was never there.
> 
> That is mathematically insoluble. Think about an SMP system, you cannot stop
> the other thread in instantaneously small time

That's a separate problem - we can't catch the exact state of process
at the moment of death, indeed, but in theory we could make sure that
no changes happen while we write the thing out.  As it is, coredump is not
only taken at some point after death, it doesn't correspond to state of
the thing at _any_ moment.

