RE: PATCH Multithreaded core dump support for the 2.5.14 (and 15) kernel.


* RE: PATCH Multithreaded core dump support for the 2.5.14 (and 15) kernel.
@ 2002-05-14 16:38 Gross, Mark
  2002-05-15  6:37 ` Vamsi Krishna S .
  0 siblings, 1 reply; 17+ messages in thread
From: Gross, Mark @ 2002-05-14 16:38 UTC (permalink / raw)
  To: 'Erich Focht', Gross, Mark
  Cc: Linus Torvalds, linux-kernel, Vamsi Krishna S ., 'Bharata B Rao'

[-- Attachment #1: Type: text/plain, Size: 2116 bytes --]

See attached unit test code.  It's not very pretty...

These are NOT exhaustive tests, yet they are a reasonable attempt at unit
testing / exercising the feature to check for stability issues.  My stress
test was to induce core dumps in these test programs while running the IBM
chat room benchmark.  The XMM.c program was written by Rao Bharata as part
of the 2.4.17 tcore testing.  I don't remember who wrote test.c, but ptest.c
is my fault.

I know that the i386 ELF core file note sections only contain the class of
register data that's restored by __switch_to.  So I suppose a kernel thread,
like the migration_thread or ksoftirq, "could" dump core, and GDB could do a
bt on such a dump.  However, these note sections only contain data that can
be accessed from non-privileged processor modes, so your mileage will vary.

--mgross

> -----Original Message-----
> From: Erich Focht [mailto:efocht@ess.nec.de]
> Sent: Tuesday, May 14, 2002 8:36 AM
> To: mark.gross@intel.com
> Cc: Linus Torvalds; linux-kernel@vger.kernel.org; Vamsi Krishna S .
> Subject: Re: PATCH Multithreaded core dump support for the 2.5.14 (and
> 15) kernel.
> 
> 
> Hi Mark!
> 
> Thanks for sending the new patch, I'd be interested in the 
> testprograms :-)
> 
> BTW: any idea what happens when a thread which is suspended 
> happens to be in 
> kernel mode? Guess this could be possible with 2.5.X... Does 
> gdb handle that?
> 
> Regards,
> Erich
> 
> On Monday 13 May 2002 21:17, you wrote:
> > The following patch for 2.5.14 kernel, applies cleanly to the 2.5.15
> > kernel.
> >
> > This work has been tested on the 2.5.14 kernel using a few pthread
> > applications to dump core, from SIGQUIT and SIGSEV. This 
> unit test has been
> > done on both 2 and 4 way systems.  Further, some stress 
> testing has been
> > done where, the core files have been created while the 
> system is under
> > schedule stress from the chat room benchmark running while 
> creating the
> > core files.  This implementation seems to be quit stable 
> under a busy
> > scheduler, YMMV.  These test programs are available uppon request ;)
> 
> 
> 


[-- Attachment #2: chatroom.tar.gz --]
[-- Type: application/octet-stream, Size: 15253 bytes --]

[-- Attachment #3: mpdbg.tar.gz --]
[-- Type: application/octet-stream, Size: 2998 bytes --]


* Re: PATCH Multithreaded core dump support for the 2.5.14 (and 15) kernel.
  2002-05-14 16:38 PATCH Multithreaded core dump support for the 2.5.14 (and 15) kernel Gross, Mark
@ 2002-05-15  6:37 ` Vamsi Krishna S .
  2002-05-15 14:04   ` Pavel Machek
  0 siblings, 1 reply; 17+ messages in thread
From: Vamsi Krishna S . @ 2002-05-15  6:37 UTC (permalink / raw)
  To: Gross, Mark
  Cc: 'Erich Focht',
	Linus Torvalds, linux-kernel, 'Bharata B Rao'

Erich,

To respond to your specific question: if a thread happens to be in
kernel mode when some other thread is dumping core (capturing the
register state of the other threads, to be more accurate), then
we capture the _user mode_ registers of that thread from the
bottom of its kernel stack. GDB will show the backtrace up to the point
where the thread entered the kernel (int 0x80); eip will point to the
instruction after the system call (the return address).
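
A minimal user-space sketch of that idea, assuming an 8 KiB per-task stack
area with the saved user frame at its high-address end. It mirrors the
THREAD_SIZE arithmetic used by dump_task_regs in the patch, but the struct
layout and every sketch_ name are hypothetical stand-ins, not kernel
definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_THREAD_SIZE 8192 /* assumption: 8 KiB kernel stack area per task */

/* Illustrative stand-in for the saved user frame (NOT the real struct pt_regs). */
struct sketch_regs {
    long eip;  /* user return address: the instruction after int 0x80 */
    long esp;  /* user stack pointer */
    int  xcs;  /* segment selectors: only the low 16 bits are meaningful */
    int  xss;
};

/* Copy the user-mode frame stored at the high end of the stack area into *out. */
static int sketch_dump_task_regs(const unsigned char *stack_area,
                                 struct sketch_regs *out)
{
    assert(stack_area != NULL && out != NULL);
    memcpy(out, stack_area + SKETCH_THREAD_SIZE - sizeof(*out), sizeof(*out));
    out->xcs &= 0xffff; /* drop the undefined upper halves, as the patch does */
    out->xss &= 0xffff;
    return 1;
}

/* Helper for experiments: place a frame where kernel entry would have saved it. */
static void sketch_save_user_frame(unsigned char *stack_area,
                                   const struct sketch_regs *frame)
{
    memcpy(stack_area + SKETCH_THREAD_SIZE - sizeof(*frame), frame, sizeof(*frame));
}
```

Whatever the thread was doing deeper in its kernel stack is ignored here; only
the frame saved at kernel entry is read back, which is why a debugger would see
the thread as stopped at its system call.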

-- 
Vamsi Krishna S.
Linux Technology Center,
IBM Software Lab, Bangalore.
Ph: +91 80 5262355 Extn: 3959
Internet: vamsi@in.ibm.com



* Re: PATCH Multithreaded core dump support for the 2.5.14 (and 15) kernel.
  2002-05-15  6:37 ` Vamsi Krishna S .
@ 2002-05-15 14:04   ` Pavel Machek
  2002-05-15 20:53     ` Mark Gross
  0 siblings, 1 reply; 17+ messages in thread
From: Pavel Machek @ 2002-05-15 14:04 UTC (permalink / raw)
  To: Vamsi Krishna S .
  Cc: Gross, Mark, 'Erich Focht',
	Linus Torvalds, linux-kernel, 'Bharata B Rao'

Hi!

> To respond to your specific question, if a thread happens to be in 
> kernel mode when some other thread is dumping core (capturing
> register state of other threads, to be more accurate) then
> we would capture the _user mode_ register of that thread from the
> bottom of it's kernel stack. GDB will show back trace untill the
> thread entered kernel (int 0x80), eip will be pointing to the
> instruction after the system call (return address).

Okay, what about:

Thread 1 is in kernel and holds lock A. You need lock A to dump state.
When you move 1 to the phantom runqueue, you lose the ability to get A
and deadlock.

What prevents that?
								Pavel
-- 
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.



* Re: PATCH Multithreaded core dump support for the 2.5.14 (and 15) kernel.
  2002-05-15 14:04   ` Pavel Machek
@ 2002-05-15 20:53     ` Mark Gross
  2002-05-16 10:11       ` Pavel Machek
  0 siblings, 1 reply; 17+ messages in thread
From: Mark Gross @ 2002-05-15 20:53 UTC (permalink / raw)
  To: Pavel Machek, Vamsi Krishna S .
  Cc: Gross, Mark, 'Erich Focht',
	Linus Torvalds, linux-kernel, 'Bharata B Rao'

On Wednesday 15 May 2002 10:04 am, Pavel Machek wrote:
> Okay, what about:
>
> Thread 1 is in kernel and holds lock A. You need lock A to dump state.
> When you move 1 to phantom runqueue, you loose ability to get A and
> deadlock.
>
> What prevents that?

Any pending tasklet / bottom half + top half gets processed by the real CPUs 
even though the I/O bound process may have been moved to the phantom run 
queue.  It's just that for the suspended processes sitting on the phantom 
queue this processing stops with the call to try_to_wake_up, until the 
process is moved back onto a run queue with a CPU.

The only way I can see what you're talking about happening is for some kernel 
code (or driver) to grab a lock and then hold it across a call to one of the 
sleep_on functions pending some I/O.

Any driver that holds a lock across any sleep_on call I think is abusing 
locks and needs adjusting.

Nothing prevents someone writing a driver that abuses locks.

If you know of such a case I need to worry about or there is another way for 
this design to get into trouble please let me know.  I'll look into it as 
quickly as I can.

--mgross


* Re: PATCH Multithreaded core dump support for the 2.5.14 (and 15) kernel.
  2002-05-15 20:53     ` Mark Gross
@ 2002-05-16 10:11       ` Pavel Machek
  0 siblings, 0 replies; 17+ messages in thread
From: Pavel Machek @ 2002-05-16 10:11 UTC (permalink / raw)
  To: Mark Gross
  Cc: Pavel Machek, Vamsi Krishna S .,
	Gross, Mark, 'Erich Focht',
	Linus Torvalds, linux-kernel, 'Bharata B Rao'

Hi!

> > Thread 1 is in kernel and holds lock A. You need lock A to dump state.
> > When you move 1 to phantom runqueue, you loose ability to get A and
> > deadlock.
> >
> > What prevents that?
> 
> Any pending tasklet / bottom half + top half get processes by the real CPU's 
> even thought the I/O bound process may have been moved to the phantom run 
> queue.  Its just that for the suspended processes sitting on the phantom 
> queue this processing stops with the call to try_to_wake_up, until the 
> process is moved back onto a run queue with a CPU.
> 
> The only way I can see what your talking about happening is for some kernel 
> code (or driver) to grab a lock and then hold it across a call to one of the 
> sleep_on functions pending some I/O.
> 
> Any driver that holds a lock across any sleep_on call I think is abusing 
> locks and needs adjusting.

I do not think so. It is okay to grab a lock then sleep.
									Pavel
-- 
Casualities in World Trade Center: ~3k dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.


* RE: PATCH Multithreaded core dump support for the 2.5.14 (and 15) kernel.
@ 2002-05-20 15:44 Gross, Mark
  0 siblings, 0 replies; 17+ messages in thread
From: Gross, Mark @ 2002-05-20 15:44 UTC (permalink / raw)
  To: Erich Focht, mark.gross
  Cc: linux-kernel, Robert Love, Daniel Jacobowitz, Alan Cox

The first thing I would like to get right is whether, besides the mmap_sem,
there are any other semaphores that need looking out for.  I'm working on
this now.  Are there any thoughts on this issue?

Is simply not grabbing the mmap_sem inside of elf_core_dump for
multithreaded dumps a viable option?

I like your second suggestion more than the first.  I think it isolates the
changes needed to make TCore work more than tweaking the task struct and
_down_write (to write to the proposed new task data member).

(W) 503-712-8218
MS: JF1-05
2111 N.E. 25th Ave.
Hillsboro, OR 97124


> -----Original Message-----
> From: Erich Focht [mailto:efocht@ess.nec.de]
> Sent: Friday, May 17, 2002 5:26 AM
> To: mark.gross@intel.com
> Cc: linux-kernel; Robert Love; Daniel Jacobowitz; Alan Cox
> Subject: Re: PATCH Multithreaded core dump support for the 2.5.14 (and
> 15) kernel.
> 
> 
> > The original question was:
> > Couldn't the TCore patch deadlock in elf_core_dump on a semaphore held
> > by a sleeping process that gets placed onto the phantom runqueue?
> > 
> > So far I can't tell whether the problem is real or not, but I'm worried :(
> > 
> > I haven't hit any such deadlocks in my stress testing, such as it is. In
> > my review of the code I don't see any obvious problems, despite the fact
> > that the mmap_sem is explicitly grabbed by elf_core_dump.
> > 
> > --mgross
> 
> Here are two different examples:
>  - some ps [1] does __down_read(mm->mmap_sem).
>  - meanwhile one of the soon crashing threads [2] does sys_mmap(),
>    calls __down_write(current->mmap_sem), gets on the wait queue
>    because the semaphore is currently used by ps.
>  - another thread [3] crashes and wants to dump core, sends [2] to
>    the phantom rq, calls __down_read(current->mmap_sem) and waits.
>  - [1] finishes the job, calls __up_read(mm->mmap_sem), activates
>    [2] on the phantom rq, exits.
> deadlock
> 
> Or:
>  - thread [2] does sys_mmap(), calls __down_write(current->mmap_sem),
>    gets the semaphore.
>  - thread [2] is preempted, taken off the cpu
>  - meanwhile thread [3] crashes, etc...
> 
> I think the problem only occurs if one of the related threads calls
> __down_write() for one of the semaphores we need to get inside
> elf_core_dump (which are these?). So maybe we could do two things:
> 
>  - remember the task which _has_ the write lock (add a "task_t sem_writer;"
> variable to the semaphore structure)
> 
>  - inside elf_core_dump use a special version of __down_read() which
> checks whether any related thread is enqueued and waiting for this
> semaphore or whether sem_writer points to a member of its own thread
> group. The phantom rq lock should be held. This new __down_read()
> could wait until only related threads are enqueued and waiting, then
> proceed as if the semaphore were free (temporarily set the value to
> zero), and add its original value back at the end, when calling
> __up_read().
> 
> Just some thoughts... any opinions?
> 
> Regards,
> Erich
> 
> -- 
> Dr. Erich Focht                                <efocht@ess.nec.de>
> NEC European Supercomputer Systems, European HPC Technology Center
> 


* Re: PATCH Multithreaded core dump support for the 2.5.14 (and 15) kernel.
@ 2002-05-17 12:26 Erich Focht
  0 siblings, 0 replies; 17+ messages in thread
From: Erich Focht @ 2002-05-17 12:26 UTC (permalink / raw)
  To: mark.gross; +Cc: linux-kernel, Robert Love, Daniel Jacobowitz, Alan Cox

> The original question was:
> Couldn't the TCore patch deadlock in elf_core_dump on a semaphore held by a
> sleeping process that gets placed onto the phantom runqueue?
> 
> So far I can't tell whether the problem is real or not, but I'm worried :(
> 
> I haven't hit any such deadlocks in my stress testing, such as it is. In my
> review of the code I don't see any obvious problems, despite the fact that
> the mmap_sem is explicitly grabbed by elf_core_dump.
> 
> --mgross

Here are two different examples:
 - some ps [1] does __down_read(mm->mmap_sem).
 - meanwhile one of the soon crashing threads [2] does sys_mmap(),
   calls __down_write(current->mmap_sem), gets on the wait queue
   because the semaphore is currently used by ps.
 - another thread [3] crashes and wants to dump core, sends [2] to
   the phantom rq, calls __down_read(current->mmap_sem) and waits.
 - [1] finishes the job, calls __up_read(mm->mmap_sem), activates
   [2] on the phantom rq, exits.
deadlock

Or:
 - thread [2] does sys_mmap(), calls __down_write(current->mmap_sem),
   gets the semaphore.
 - thread [2] is preempted, taken off the cpu
 - meanwhile thread [3] crashes, etc...
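
The first scenario can be replayed with a toy model. The sketch below assumes
a strictly FIFO read/write semaphore that hands ownership to the head of the
wait queue on release; all sketch_ names are hypothetical and this is not the
kernel's rwsem code, only an illustration of the ordering problem:

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_WAITERS 8
#define SKETCH_NO_TASK (-1)

struct sketch_waiter { int task; bool wants_write; };

struct sketch_rwsem {
    int readers;                                    /* number of active readers */
    int writer;                                     /* active writer or SKETCH_NO_TASK */
    struct sketch_waiter queue[SKETCH_MAX_WAITERS]; /* strict FIFO wait queue */
    int nwaiters;
};

static void sketch_init(struct sketch_rwsem *s)
{
    s->readers = 0;
    s->writer = SKETCH_NO_TASK;
    s->nwaiters = 0;
}

static void sketch_enqueue(struct sketch_rwsem *s, int task, bool wants_write)
{
    assert(s->nwaiters < SKETCH_MAX_WAITERS);
    s->queue[s->nwaiters].task = task;
    s->queue[s->nwaiters].wants_write = wants_write;
    s->nwaiters++;
}

/* Returns true if the lock was taken at once, false if the task had to queue. */
static bool sketch_down_read(struct sketch_rwsem *s, int task)
{
    if (s->writer == SKETCH_NO_TASK && s->nwaiters == 0) {
        s->readers++;
        return true;
    }
    sketch_enqueue(s, task, false); /* readers queue behind earlier waiters */
    return false;
}

static bool sketch_down_write(struct sketch_rwsem *s, int task)
{
    if (s->writer == SKETCH_NO_TASK && s->readers == 0 && s->nwaiters == 0) {
        s->writer = task;
        return true;
    }
    sketch_enqueue(s, task, true);
    return false;
}

/* Drop one read hold; when the last reader leaves, the queue head gets the lock. */
static void sketch_up_read(struct sketch_rwsem *s)
{
    assert(s->readers > 0);
    if (--s->readers > 0 || s->nwaiters == 0)
        return;
    /* With only these operations, the head waiter is always a writer here. */
    assert(s->queue[0].wants_write);
    s->writer = s->queue[0].task;
    for (int i = 1; i < s->nwaiters; i++)
        s->queue[i - 1] = s->queue[i];
    s->nwaiters--;
}

/* The dumper is stuck if it is still queued while a suspended task owns the lock. */
static bool sketch_dumper_stuck(const struct sketch_rwsem *s, int dumper,
                                int suspended_task)
{
    for (int i = 0; i < s->nwaiters; i++)
        if (s->queue[i].task == dumper)
            return s->writer == suspended_task;
    return false;
}
```

Replaying [1] down_read, [2] down_write, [3] down_read, [1] up_read leaves [2]
as the owner and [3] still queued; since [2] sits on the phantom rq and never
runs, nobody releases the lock.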

I think the problem only occurs if one of the related threads calls
__down_write() for one of the semaphores we need to get inside
elf_core_dump (which are these?). So maybe we could do two things:

 - remember the task which _has_ the write lock (add a "task_t sem_writer;"
variable to the semaphore structure)

 - inside elf_core_dump use a special version of __down_read() which
checks whether any related thread is enqueued and waiting for this
semaphore or whether sem_writer points to a member of its own thread
group. The phantom rq lock should be held. This new __down_read()
could wait until only related threads are enqueued and waiting, then
proceed as if the semaphore were free (temporarily set the value to
zero), and add its original value back at the end, when calling
__up_read().
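
The decision such a special __down_read() would have to make might look
roughly like the sketch below. It is only an illustration of the check, under
the assumption that "related" means "same thread group id"; the structs, the
tgid comparison, and all sketch_ names are hypothetical, and the counter
adjustment and phantom rq locking are left out:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_WAITERS 4

struct sketch_task { int pid; int tgid; };

struct sketch_sem {
    const struct sketch_task *sem_writer;                  /* write-lock holder or NULL */
    const struct sketch_task *waiters[SKETCH_MAX_WAITERS]; /* tasks queued on the sem */
    size_t nwaiters;
};

/* Assumption: two tasks are "related" when they share a thread group id. */
static bool sketch_related(const struct sketch_task *a, const struct sketch_task *b)
{
    return a->tgid == b->tgid;
}

/*
 * True when the dumping task may treat the semaphore as free: any writer and
 * every queued waiter belongs to the dumper's own (suspended) thread group.
 */
static bool sketch_may_bypass(const struct sketch_sem *sem,
                              const struct sketch_task *dumper)
{
    assert(sem != NULL && dumper != NULL);
    if (sem->sem_writer != NULL && !sketch_related(sem->sem_writer, dumper))
        return false; /* an unrelated writer is still running: wait normally */
    for (size_t i = 0; i < sem->nwaiters; i++)
        if (!sketch_related(sem->waiters[i], dumper))
            return false; /* an unrelated waiter would be overtaken */
    return true;
}
```

When the check fails, the dumper would keep waiting as usual, since unrelated
tasks are not suspended and will eventually release or leave the queue.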

Just some thoughts... any opinions?

Regards,
Erich

-- 
Dr. Erich Focht                                <efocht@ess.nec.de>
NEC European Supercomputer Systems, European HPC Technology Center




* Re: PATCH Multithreaded core dump support for the 2.5.14 (and 15) kernel.
       [not found]     ` <200205152353.g4FNrew30146@unix-os.sc.intel.com.suse.lists.linux.kernel>
@ 2002-05-16 12:54       ` Andi Kleen
  2002-05-16 14:13         ` Mark Gross
  0 siblings, 1 reply; 17+ messages in thread
From: Andi Kleen @ 2002-05-16 12:54 UTC (permalink / raw)
  To: Mark Gross; +Cc: linux-kernel

Mark Gross <mgross@unix-os.sc.intel.com> writes:
> 
> Any driver that holds a lock across any sleep_on call I think is abusing 
> locks and needs adjusting.

That's true for spinlocks, but not for semaphores. The mm layer and the 
vfs layer both use semaphores extensively and sleep with them held; 
some other subsystems (like networking) also use sleeping locks.

-Andi



* Re: PATCH Multithreaded core dump support for the 2.5.14 (and 15) kernel.
  2002-05-16 12:54       ` Andi Kleen
@ 2002-05-16 14:13         ` Mark Gross
  2002-05-16 17:27           ` Andi Kleen
  0 siblings, 1 reply; 17+ messages in thread
From: Mark Gross @ 2002-05-16 14:13 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

On Thursday 16 May 2002 08:54 am, Andi Kleen wrote:
> Mark Gross <mgross@unix-os.sc.intel.com> writes:
> > Any driver that holds a lock across any sleep_on call I think is abusing
> > locks and needs adjusting.
>
> That's true for spinlocks, but not for semaphores. The mm layer and the
> vfs layer both use semaphores extensively and sleep with them hold,
> also some other subsystems (like networking) use sleeping locks.
>
> -Andi

Hmmm, then the only nasty bit I see is the down_write(&current->mm->mmap_sem) 
in elf_core_dump.  

If, during a core dump, one of the suspended thread processes held the 
mmap_sem for the currently dumping process AND was sleeping, then I could 
get into trouble.  This could happen with thread processes created using the 
CLONE_VM flag (pthread applications use this flag).

I've just spent a bit of time grepping around looking for places a process 
could grab the mmap_sem and then sleep but didn't find anything.   I know 
this doesn't prove anything, but I had to look ;)

Does anyone knowledgeable about the use / role of the mmap_sem in the kernel 
have any insight into its use that would help me?  I would really like to make 
the TCORE patch work well.

Also, does anyone know WHY the mmap_sem is needed in the elf_core_dump code, 
and is this need still valid if I've suspended all the other processes that 
could even touch that mm?  I.e. can I fix this by removing the down_write / 
up_write in elf_core_dump?

--mgross


* Re: PATCH Multithreaded core dump support for the 2.5.14 (and 15) kernel.
  2002-05-16 14:13         ` Mark Gross
@ 2002-05-16 17:27           ` Andi Kleen
  2002-05-16 17:36             ` Daniel Jacobowitz
  0 siblings, 1 reply; 17+ messages in thread
From: Andi Kleen @ 2002-05-16 17:27 UTC (permalink / raw)
  To: Mark Gross; +Cc: Andi Kleen, linux-kernel

On Thu, May 16, 2002 at 10:13:40AM -0400, Mark Gross wrote:
> Also, does anyone know WHY the mmap_sem is needed in the elf_core_dump code, 
> and is this need still valid if I've suspended all the other processes that 
> could even touch that mm?  I.e. can I fix this by removing the down_write / 
> up_write in elf_core_dump?

The mmap_sem is needed to access current->mm (especially the vma list)
safely. Otherwise someone else sharing the mm_struct could modify it. 
If you make sure all others sharing the mm_struct are killed first 
(including no way for them to start new clones in between) then
the only loophole left would be remote access using /proc/pid/mem or ptrace. 
If you handle that too then it is probably safe to drop it. Unfortunately
I don't see a way to handle these remote users without at least 
taking it temporarily.

Of course there are other semaphores involved in dumping too (e.g. the
VFS ->write code may take the i_sem or other private ones). I guess they 
won't be a big problem if you first kill and then dump later.

-Andi


* Re: PATCH Multithreaded core dump support for the 2.5.14 (and 15) kernel.
  2002-05-16 17:27           ` Andi Kleen
@ 2002-05-16 17:36             ` Daniel Jacobowitz
  2002-05-16 18:08               ` Mark Gross
  0 siblings, 1 reply; 17+ messages in thread
From: Daniel Jacobowitz @ 2002-05-16 17:36 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Mark Gross, linux-kernel

On Thu, May 16, 2002 at 07:27:59PM +0200, Andi Kleen wrote:
> On Thu, May 16, 2002 at 10:13:40AM -0400, Mark Gross wrote:
> > Also, does anyone know WHY the mmap_sem is needed in the elf_core_dump code, 
> > and is this need still valid if I've suspended all the other processes that 
> > could even touch that mm?  I.e. can I fix this by removing the down_write / 
> > up_write in elf_core_dump?
> 
> The mmap_sem is needed to access current->mm (especially the vma list)
> safely. Otherwise someone else sharing the mm_struct could modify it. 
> If you make sure all others sharing the mm_struct are killed first 
> (including now way for them to start new clones inbetween) then
> the only loophole left would be remote access using /proc/pid/mem or ptrace. 
> If you handle that too then it is probably safe to drop it. Unfortunately
> I don't see a way to handle these remote users without at least 
> taking it temporarily.
> 
> Of course there are other semaphores in involved in dumping too (e.g. the
> VFS ->write code may take the i_sem or other private ones). I guess they 
> won't be a big problem if you first kill and then dump later.

Except unfortunately we don't kill; the other threads are resumed
afterwards for cleanup.  They're just suspended.

-- 
Daniel Jacobowitz                           Carnegie Mellon University
MontaVista Software                         Debian GNU/Linux Developer


* Re: PATCH Multithreaded core dump support for the 2.5.14 (and 15) kernel.
  2002-05-16 17:36             ` Daniel Jacobowitz
@ 2002-05-16 18:08               ` Mark Gross
  2002-05-16 21:32                 ` Alan Cox
  0 siblings, 1 reply; 17+ messages in thread
From: Mark Gross @ 2002-05-16 18:08 UTC (permalink / raw)
  To: Daniel Jacobowitz, Andi Kleen; +Cc: linux-kernel

On Thursday 16 May 2002 01:36 pm, Daniel Jacobowitz wrote:
> On Thu, May 16, 2002 at 07:27:59PM +0200, Andi Kleen wrote:
> > On Thu, May 16, 2002 at 10:13:40AM -0400, Mark Gross wrote:
> > > Also, does anyone know WHY the mmap_sem is needed in the elf_core_dump
> > > code, and is this need still valid if I've suspended all the other
> > > processes that could even touch that mm?  I.e. can I fix this by
> > > removing the down_write / up_write in elf_core_dump?
> >
> > The mmap_sem is needed to access current->mm (especially the vma list)
> > safely. Otherwise someone else sharing the mm_struct could modify it.
> > If you make sure all others sharing the mm_struct are killed first
> > (including now way for them to start new clones inbetween) then
> > the only loophole left would be remote access using /proc/pid/mem or
> > ptrace. If you handle that too then it is probably safe to drop it.
> > Unfortunately I don't see a way to handle these remote users without at
> > least
> > taking it temporarily.
> >
> > Of course there are other semaphores in involved in dumping too (e.g. the
> > VFS ->write code may take the i_sem or other private ones). I guess they
> > won't be a big problem if you first kill and then dump later.
>
> Except unfortunately we don't kill; the other threads are resumed
> afterwards for cleanup.  They're just suspended.

Yes, they start back up after the dump.  

It certainly seems that, with the processes paused, the use of the 
current->mm->mmap_sem could be obsolete for core dumps.  I'm not so sure 
protecting the core file data from ptrace or /proc/pid/mem is important in 
the case of core dumping.

I just don't want the kernel to lock up dumping the multithreaded core file.

I'm still not sure we have a problem yet (wishful thinking, I suppose).
Also, I've seen zero lock-ups from a semaphore being held by one of the 
temporarily paused processes in my testing of the patch I posted.

To restate: the only way I see that my design gets into trouble is when a 
semaphore is HELD, not getting waited on, by one of the processes that gets 
put onto the phantom runqueue, AND that semaphore is needed in the processing 
of elf_core_dump(...).

For this to happen, that semaphore would have to be held across schedule()'s.  
The ONLY place I've seen that in the kernel is set_cpus_allowed + 
migration_thread.  

Can someone point me at other critical sections that have non-deterministic 
life times as a function of when the process holding the semaphore gets 
scheduled onto a CPU?  That type of code seems very risky to me.  This is the 
only type of code that could get my design into trouble.

--mgross


* Re: PATCH Multithreaded core dump support for the 2.5.14 (and 15) kernel.
  2002-05-16 18:08               ` Mark Gross
@ 2002-05-16 21:32                 ` Alan Cox
  2002-05-16 21:24                   ` Robert Love
  0 siblings, 1 reply; 17+ messages in thread
From: Alan Cox @ 2002-05-16 21:32 UTC (permalink / raw)
  To: mgross; +Cc: Daniel Jacobowitz, Andi Kleen, linux-kernel

> For this to happen that semaphore would have to held across schedule()'s.  
> The ONLY place I've seen that in the kernel is set_CPUs_allowed + 
> migration_thread.  

The 2.5 kernel is pre-emptible.



* Re: PATCH Multithreaded core dump support for the 2.5.14 (and 15) kernel.
  2002-05-16 21:32                 ` Alan Cox
@ 2002-05-16 21:24                   ` Robert Love
  2002-05-16 18:40                     ` Mark Gross
  0 siblings, 1 reply; 17+ messages in thread
From: Robert Love @ 2002-05-16 21:24 UTC (permalink / raw)
  To: Alan Cox; +Cc: mgross, Daniel Jacobowitz, Andi Kleen, linux-kernel

On Thu, 2002-05-16 at 14:32, Alan Cox wrote:
> > For this to happen that semaphore would have to held across schedule()'s.  
> > The ONLY place I've seen that in the kernel is set_CPUs_allowed + 
> > migration_thread.  
>
> The 2.5 kernel is pre-emptible.

Indeed :)

But there are plenty of places in the kernel - sans preemption - where we
sleep while holding a semaphore.  Was that the original question?  If
so, set_cpus_allowed may be one of the few _explicit_ places, but we
implicitly sleep holding a semaphore all over.  Heck, we use them for
user-space synchronization.

	Robert Love



* Re: PATCH Multithreaded core dump support for the 2.5.14 (and 15) kernel.
  2002-05-16 21:24                   ` Robert Love
@ 2002-05-16 18:40                     ` Mark Gross
  0 siblings, 0 replies; 17+ messages in thread
From: Mark Gross @ 2002-05-16 18:40 UTC (permalink / raw)
  To: Robert Love, Alan Cox; +Cc: Daniel Jacobowitz, Andi Kleen, linux-kernel

On Thursday 16 May 2002 05:24 pm, Robert Love wrote:
> On Thu, 2002-05-16 at 14:32, Alan Cox wrote:
> > > For this to happen that semaphore would have to held across
> > > schedule()'s. The ONLY place I've seen that in the kernel is
> > > set_CPUs_allowed + migration_thread.
> >
> > The 2.5 kernel is pre-emptible.
>
> Indeed :)
>
> But there is plenty of places in the kernel - sans preemption - where we
> sleep while holding a semaphore.  Was that the original question?  If
> so, set_cpus_allowed by be one of the few _explicit_ places but we
> implicitly sleep holding a semaphore all over.  Heck, we use them for
> user-space synchronization.
>
> 	Robert Love
>

The original question was:
Couldn't the TCore patch deadlock in elf_core_dump on a semaphore held by a 
sleeping process that gets placed onto the phantom runqueue?

So far I can't tell whether the problem is real or not, but I'm worried :(

I haven't hit any such deadlocks in my stress testing, such as it is.  In my 
review of the code I don't see any obvious problems, despite the fact that 
the mmap_sem is explicitly grabbed by elf_core_dump.

--mgross


* PATCH Multithreaded core dump support for the 2.5.14 (and 15) kernel.
@ 2002-05-13 19:17 Mark Gross
  2002-05-14 15:35 ` Erich Focht
  0 siblings, 1 reply; 17+ messages in thread
From: Mark Gross @ 2002-05-13 19:17 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel; +Cc: Vamsi Krishna S ., efocht, mark, mark.gross

The following patch for the 2.5.14 kernel applies cleanly to the 2.5.15 kernel.

This work has been tested on the 2.5.14 kernel using a few pthread applications to dump core, from SIGQUIT and SIGSEGV.
This unit test has been done on both 2 and 4 way systems.  Further, some stress testing has been done where the core
files were created while the system was under scheduler stress from the chat room benchmark.  This implementation seems
to be quite stable under a busy scheduler, YMMV.  These test programs are available upon request ;)

This version of the patch cleans up all the issues that have been raised with it to date.
1) down_write(p->mm_sem) bug in my last patch, FIXED.
2) suspend/resume_thread function names too generic, made tcore specific, FIXED.
3) Too many locks grabbed in resume_threads function, FIXED.

Usage:  echo 1 > /proc/sys/kernel/core_dumps_threads
enables the multithreaded core file creation.
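
For test programs that want to flip the switch themselves, the same thing can
be done from C.  This is only a sketch: the sketch_ names are hypothetical, and
the control file exists only on a kernel carrying this patch, so the helper
takes the path as a parameter and reports failure instead of assuming it is
there.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Path from the usage line above; present only on kernels with this patch. */
#define SKETCH_TCORE_SYSCTL "/proc/sys/kernel/core_dumps_threads"

/* Write "1" or "0" to the given control file. Returns 0 on success, -1 on failure. */
static int sketch_set_tcore(const char *path, int enable)
{
    assert(path != NULL);
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1; /* e.g. unpatched kernel or insufficient privileges */
    int ok = fputs(enable ? "1\n" : "0\n", f) >= 0;
    if (fclose(f) != 0) /* procfs write errors can surface at close time */
        ok = 0;
    return ok ? 0 : -1;
}
```

A test would call sketch_set_tcore(SKETCH_TCORE_SYSCTL, 1) before provoking the
dump and check the return value, since the write needs root.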

Check your version of gdb.  I hear 5.2 will work without a problem.  If you have 5.1 you may need to "strip libpthread*" 
to work around some issues that version has with loading pthread symbols with these core files.

Most of the patch is the same as that posted on 3/21/02, by Vamsi, with some minor fixes and the 
rebasing to the 2.5.14 kernel.  The interesting bits are in the additions to sched.c to pause and resume 
the thread processes under the O(1) scheduler.

--mgross


diff -urN -X dontdiff linux-2.5.14.vannilla/arch/i386/kernel/i387.c linux2.5.14.tcore/arch/i386/kernel/i387.c
--- linux-2.5.14.vannilla/arch/i386/kernel/i387.c	Sun May  5 23:38:06 2002
+++ linux2.5.14.tcore/arch/i386/kernel/i387.c	Tue May  7 14:59:10 2002
@@ -528,3 +528,36 @@
 
 	return fpvalid;
 }
+
+int dump_task_fpu( struct task_struct *tsk, struct user_i387_struct *fpu )
+{
+	int fpvalid;
+
+	fpvalid = tsk->used_math;
+	if ( fpvalid ) {
+		if (tsk == current) unlazy_fpu( tsk );
+		if ( cpu_has_fxsr ) {
+			copy_fpu_fxsave( tsk, fpu );
+		} else {
+			copy_fpu_fsave( tsk, fpu );
+		}
+	}
+
+	return fpvalid;
+}
+
+int dump_task_extended_fpu( struct task_struct *tsk, struct user_fxsr_struct *fpu )
+{
+	int fpvalid;
+	
+	fpvalid = tsk->used_math && cpu_has_fxsr;
+	if ( fpvalid ) {
+		if (tsk == current) unlazy_fpu( tsk );
+		memcpy( fpu, &tsk->thread.i387.fxsave,
+		sizeof(struct user_fxsr_struct) );
+	}
+	
+	return fpvalid;
+}
+
+
diff -urN -X dontdiff linux-2.5.14.vannilla/arch/i386/kernel/process.c linux2.5.14.tcore/arch/i386/kernel/process.c
--- linux-2.5.14.vannilla/arch/i386/kernel/process.c	Sun May  5 23:37:52 2002
+++ linux2.5.14.tcore/arch/i386/kernel/process.c	Wed May  8 13:39:19 2002
@@ -610,6 +610,18 @@
 
 	dump->u_fpvalid = dump_fpu (regs, &dump->i387);
 }
+/* 
+ * Capture the user space registers if the task is not running (in user space)
+ */
+int dump_task_regs(struct task_struct *tsk, struct pt_regs *regs)
+{
+	*regs = *(struct pt_regs *)((unsigned long)tsk->thread_info + THREAD_SIZE - sizeof(struct pt_regs));
+	regs->xcs &= 0xffff;
+	regs->xds &= 0xffff;
+	regs->xes &= 0xffff;
+	regs->xss &= 0xffff;
+	return 1;
+}
 
 /*
  * This special macro can be used to load a debugging register
diff -urN -X dontdiff linux-2.5.14.vannilla/fs/binfmt_elf.c linux2.5.14.tcore/fs/binfmt_elf.c
--- linux-2.5.14.vannilla/fs/binfmt_elf.c	Sun May  5 23:38:01 2002
+++ linux2.5.14.tcore/fs/binfmt_elf.c	Mon May 13 12:45:59 2002
@@ -13,6 +13,7 @@
 
 #include <linux/fs.h>
 #include <linux/stat.h>
+#include <linux/sched.h>
 #include <linux/time.h>
 #include <linux/mm.h>
 #include <linux/mman.h>
@@ -30,6 +31,7 @@
 #include <linux/elfcore.h>
 #include <linux/init.h>
 #include <linux/highuid.h>
+#include <linux/smp.h>
 #include <linux/smp_lock.h>
 #include <linux/compiler.h>
 #include <linux/highmem.h>
@@ -958,7 +960,7 @@
 /* #define DEBUG */
 
 #ifdef DEBUG
-static void dump_regs(const char *str, elf_greg_t *r)
+static void dump_regs(const char *str, elf_gregset_t *r)
 {
 	int i;
 	static const char *regs[] = { "ebx", "ecx", "edx", "esi", "edi", "ebp",
@@ -1006,6 +1008,163 @@
 #define DUMP_SEEK(off)	\
 	if (!dump_seek(file, (off))) \
 		goto end_coredump;
+
+static inline void fill_elf_header(struct elfhdr *elf, int segs)
+{
+	memcpy(elf->e_ident, ELFMAG, SELFMAG);
+	elf->e_ident[EI_CLASS] = ELF_CLASS;
+	elf->e_ident[EI_DATA] = ELF_DATA;
+	elf->e_ident[EI_VERSION] = EV_CURRENT;
+	memset(elf->e_ident+EI_PAD, 0, EI_NIDENT-EI_PAD);
+
+	elf->e_type = ET_CORE;
+	elf->e_machine = ELF_ARCH;
+	elf->e_version = EV_CURRENT;
+	elf->e_entry = 0;
+	elf->e_phoff = sizeof(struct elfhdr);
+	elf->e_shoff = 0;
+	elf->e_flags = 0;
+	elf->e_ehsize = sizeof(struct elfhdr);
+	elf->e_phentsize = sizeof(struct elf_phdr);
+	elf->e_phnum = segs;
+	elf->e_shentsize = 0;
+	elf->e_shnum = 0;
+	elf->e_shstrndx = 0;
+	return;
+}
+
+static inline void fill_elf_note_phdr(struct elf_phdr *phdr, int sz, off_t offset)
+{
+	phdr->p_type = PT_NOTE;
+	phdr->p_offset = offset;
+	phdr->p_vaddr = 0;
+	phdr->p_paddr = 0;
+	phdr->p_filesz = sz;
+	phdr->p_memsz = 0;
+	phdr->p_flags = 0;
+	phdr->p_align = 0;
+	return;
+}
+
+static inline void fill_note(struct memelfnote *note, const char *name, int type, 
+		unsigned int sz, void *data)
+{
+	note->name = name;
+	note->type = type;
+	note->datasz = sz;
+	note->data = data;
+	return;
+}
+
+/*
+ * fill up all the fields in prstatus from the given task struct, except registers
+ * which need to be filled up separately.
+ */
+static inline void fill_prstatus(struct elf_prstatus *prstatus, struct task_struct *p, long signr) 
+{
+	prstatus->pr_info.si_signo = prstatus->pr_cursig = signr;
+	prstatus->pr_sigpend = p->pending.signal.sig[0];
+	prstatus->pr_sighold = p->blocked.sig[0];
+	prstatus->pr_pid = p->pid;
+	prstatus->pr_ppid = p->parent->pid;
+	prstatus->pr_pgrp = p->pgrp;
+	prstatus->pr_sid = p->session;
+	prstatus->pr_utime.tv_sec = CT_TO_SECS(p->times.tms_utime);
+	prstatus->pr_utime.tv_usec = CT_TO_USECS(p->times.tms_utime);
+	prstatus->pr_stime.tv_sec = CT_TO_SECS(p->times.tms_stime);
+	prstatus->pr_stime.tv_usec = CT_TO_USECS(p->times.tms_stime);
+	prstatus->pr_cutime.tv_sec = CT_TO_SECS(p->times.tms_cutime);
+	prstatus->pr_cutime.tv_usec = CT_TO_USECS(p->times.tms_cutime);
+	prstatus->pr_cstime.tv_sec = CT_TO_SECS(p->times.tms_cstime);
+	prstatus->pr_cstime.tv_usec = CT_TO_USECS(p->times.tms_cstime);
+	return;
+}
+
+static inline void fill_psinfo(struct elf_prpsinfo *psinfo, struct task_struct *p)
+{
+	int i;
+	
+	psinfo->pr_pid = p->pid;
+	psinfo->pr_ppid = p->parent->pid;
+	psinfo->pr_pgrp = p->pgrp;
+	psinfo->pr_sid = p->session;
+
+	i = p->state ? ffz(~p->state) + 1 : 0;
+	psinfo->pr_state = i;
+	psinfo->pr_sname = (i < 0 || i > 5) ? '.' : "RSDZTD"[i];
+	psinfo->pr_zomb = psinfo->pr_sname == 'Z';
+	psinfo->pr_nice =  task_nice(p);
+	psinfo->pr_flag = p->flags;
+	psinfo->pr_uid = NEW_TO_OLD_UID(p->uid);
+	psinfo->pr_gid = NEW_TO_OLD_GID(p->gid);
+	strncpy(psinfo->pr_fname, p->comm, sizeof(psinfo->pr_fname));
+	return;
+}
+
+/*
+ * This is the variable that can be set in proc to determine if we want to
+ * dump a multithreaded core or not. A value of 1 means yes while any
+ * other value means no.
+ *
+ * It is located at /proc/sys/kernel/core_dumps_threads
+ */
+extern int core_dumps_threads;
+
+/* Here is the structure in which status of each thread is captured. */
+struct elf_thread_status
+{
+	struct list_head list;
+	struct elf_prstatus prstatus;	/* NT_PRSTATUS */
+	elf_fpregset_t fpu;		/* NT_PRFPREG */
+	elf_fpxregset_t xfpu;		/* NT_PRXFPREG */
+	struct memelfnote notes[3];
+	int num_notes;
+};
+
+/*
+ * In order to add the specific thread information for the elf file format,
+ * we need to keep a linked list of every thread's pr_status and then
+ * create a single section for them in the final core file.
+ */
+static int elf_dump_thread_status(long signr, struct task_struct * p, struct list_head * thread_list)
+{
+
+	struct elf_thread_status *t;
+	int sz = 0;
+
+	t = kmalloc(sizeof(*t), GFP_KERNEL);
+	if (!t) {
+		printk(KERN_WARNING "Cannot allocate memory for thread status.\n");
+		return 0;
+	}
+
+	INIT_LIST_HEAD(&t->list);
+	t->num_notes = 0;
+
+	fill_prstatus(&t->prstatus, p, signr);
+	elf_core_copy_task_regs(p, &t->prstatus.pr_reg);	
+	fill_note(&t->notes[0], "CORE", NT_PRSTATUS, sizeof(t->prstatus), &(t->prstatus));
+	t->num_notes++;
+	sz += notesize(&t->notes[0]);
+
+	if ((t->prstatus.pr_fpvalid = elf_core_copy_task_fpregs(p, &t->fpu))) {
+		fill_note(&t->notes[1], "CORE", NT_PRFPREG, sizeof(t->fpu), &(t->fpu));
+		t->num_notes++;
+		sz += notesize(&t->notes[1]);
+	}
+
+	if (elf_core_copy_task_xfpregs(p, &t->xfpu)) {
+		fill_note(&t->notes[2], "LINUX", NT_PRXFPREG, sizeof(t->xfpu), &(t->xfpu));
+		t->num_notes++;
+		sz += notesize(&t->notes[2]);
+	}
+	
+	list_add(&t->list, thread_list);
+	return sz;
+}
+
+
+
 /*
  * Actual dumper
  *
@@ -1024,12 +1183,25 @@
 	struct elfhdr elf;
 	off_t offset = 0, dataoff;
 	unsigned long limit = current->rlim[RLIMIT_CORE].rlim_cur;
-	int numnote = 4;
-	struct memelfnote notes[4];
+	int numnote = 5;
+	struct memelfnote notes[5];
 	struct elf_prstatus prstatus;	/* NT_PRSTATUS */
-	elf_fpregset_t fpu;		/* NT_PRFPREG */
 	struct elf_prpsinfo psinfo;	/* NT_PRPSINFO */
+ 	struct task_struct *p;
+ 	LIST_HEAD(thread_list);
+ 	struct list_head *t;
+	elf_fpregset_t fpu;
+	elf_fpxregset_t xfpu;
+	int dump_threads = core_dumps_threads; /* this value should not change once the */
+					/* dumping starts */
+	int thread_status_size = 0;
+	
 
+	/* First pause all related threaded processes */
+	if (dump_threads)	{
+		tcore_suspend_threads();
+	}
+	
 	/* first copy the parameters from user space */
 	memset(&psinfo, 0, sizeof(psinfo));
 	{
@@ -1047,7 +1219,6 @@
 
 	}
 
-	memset(&prstatus, 0, sizeof(prstatus));
 	/*
 	 * This transfers the registers from regs into the standard
 	 * coredump arrangement, whatever that is.
@@ -1063,7 +1234,29 @@
 	else
 		*(struct pt_regs *)&prstatus.pr_reg = *regs;
 #endif
-
+ 
+	if (dump_threads) {
+		/* capture the status of all other threads */
+		if (signr) {
+			read_lock(&tasklist_lock);
+			for_each_task(p)
+				if (current->mm == p->mm && current != p) {
+					int sz = elf_dump_thread_status(signr, p, &thread_list);
+					if (!sz) {
+						read_unlock(&tasklist_lock);
+						goto cleanup;
+					}
+					else
+						thread_status_size += sz;
+				}
+			read_unlock(&tasklist_lock);
+		}
+	} /* End if(dump_threads) */
+	
+	memset(&prstatus, 0, sizeof(prstatus));
+	fill_prstatus(&prstatus, current, signr);
+	elf_core_copy_regs(&prstatus.pr_reg, regs);
+	
 	/* now stop all vm operations */
 	down_write(&current->mm->mmap_sem);
 	segs = current->mm->map_count;
@@ -1073,25 +1266,7 @@
 #endif
 
 	/* Set up header */
-	memcpy(elf.e_ident, ELFMAG, SELFMAG);
-	elf.e_ident[EI_CLASS] = ELF_CLASS;
-	elf.e_ident[EI_DATA] = ELF_DATA;
-	elf.e_ident[EI_VERSION] = EV_CURRENT;
-	memset(elf.e_ident+EI_PAD, 0, EI_NIDENT-EI_PAD);
-
-	elf.e_type = ET_CORE;
-	elf.e_machine = ELF_ARCH;
-	elf.e_version = EV_CURRENT;
-	elf.e_entry = 0;
-	elf.e_phoff = sizeof(elf);
-	elf.e_shoff = 0;
-	elf.e_flags = 0;
-	elf.e_ehsize = sizeof(elf);
-	elf.e_phentsize = sizeof(struct elf_phdr);
-	elf.e_phnum = segs+1;		/* Include notes */
-	elf.e_shentsize = 0;
-	elf.e_shnum = 0;
-	elf.e_shstrndx = 0;
+	fill_elf_header(&elf, segs+1); /* including notes section*/
 
 	fs = get_fs();
 	set_fs(KERNEL_DS);
@@ -1108,64 +1283,31 @@
 	 * with info from their /proc.
 	 */
 
-	notes[0].name = "CORE";
-	notes[0].type = NT_PRSTATUS;
-	notes[0].datasz = sizeof(prstatus);
-	notes[0].data = &prstatus;
-	prstatus.pr_info.si_signo = prstatus.pr_cursig = signr;
-	prstatus.pr_sigpend = current->pending.signal.sig[0];
-	prstatus.pr_sighold = current->blocked.sig[0];
-	psinfo.pr_pid = prstatus.pr_pid = current->pid;
-	psinfo.pr_ppid = prstatus.pr_ppid = current->parent->pid;
-	psinfo.pr_pgrp = prstatus.pr_pgrp = current->pgrp;
-	psinfo.pr_sid = prstatus.pr_sid = current->session;
-	prstatus.pr_utime.tv_sec = CT_TO_SECS(current->times.tms_utime);
-	prstatus.pr_utime.tv_usec = CT_TO_USECS(current->times.tms_utime);
-	prstatus.pr_stime.tv_sec = CT_TO_SECS(current->times.tms_stime);
-	prstatus.pr_stime.tv_usec = CT_TO_USECS(current->times.tms_stime);
-	prstatus.pr_cutime.tv_sec = CT_TO_SECS(current->times.tms_cutime);
-	prstatus.pr_cutime.tv_usec = CT_TO_USECS(current->times.tms_cutime);
-	prstatus.pr_cstime.tv_sec = CT_TO_SECS(current->times.tms_cstime);
-	prstatus.pr_cstime.tv_usec = CT_TO_USECS(current->times.tms_cstime);
+	fill_note(&notes[0], "CORE", NT_PRSTATUS, sizeof(prstatus), &prstatus);
+ 	
+	fill_psinfo(&psinfo, current);
+	fill_note(&notes[1], "CORE", NT_PRPSINFO, sizeof(psinfo), &psinfo);
+	
+	fill_note(&notes[2], "CORE", NT_TASKSTRUCT, sizeof(*current), current);
+  
+  	/* Try to dump the FPU. */
+	if ((prstatus.pr_fpvalid = elf_core_copy_task_fpregs(current, &fpu))) {
+		fill_note(&notes[3], "CORE", NT_PRFPREG, sizeof(fpu), &fpu);
+	} else {
+		--numnote;
+ 	}
+	if (elf_core_copy_task_xfpregs(current, &xfpu)) {
+		fill_note(&notes[4], "LINUX", NT_PRXFPREG, sizeof(xfpu), &xfpu);
+	} else {
+		--numnote;
+	}
+  	
 
 #ifdef DEBUG
 	dump_regs("Passed in regs", (elf_greg_t *)regs);
 	dump_regs("prstatus regs", (elf_greg_t *)&prstatus.pr_reg);
 #endif
 
-	notes[1].name = "CORE";
-	notes[1].type = NT_PRPSINFO;
-	notes[1].datasz = sizeof(psinfo);
-	notes[1].data = &psinfo;
-	i = current->state ? ffz(~current->state) + 1 : 0;
-	psinfo.pr_state = i;
-	psinfo.pr_sname = (i < 0 || i > 5) ? '.' : "RSDZTD"[i];
-	psinfo.pr_zomb = psinfo.pr_sname == 'Z';
-	psinfo.pr_nice = task_nice(current);
-	psinfo.pr_flag = current->flags;
-	psinfo.pr_uid = NEW_TO_OLD_UID(current->uid);
-	psinfo.pr_gid = NEW_TO_OLD_GID(current->gid);
-	strncpy(psinfo.pr_fname, current->comm, sizeof(psinfo.pr_fname));
-
-	notes[2].name = "CORE";
-	notes[2].type = NT_TASKSTRUCT;
-	notes[2].datasz = sizeof(*current);
-	notes[2].data = current;
-
-	/* Try to dump the FPU. */
-	prstatus.pr_fpvalid = dump_fpu (regs, &fpu);
-	if (!prstatus.pr_fpvalid)
-	{
-		numnote--;
-	}
-	else
-	{
-		notes[3].name = "CORE";
-		notes[3].type = NT_PRFPREG;
-		notes[3].datasz = sizeof(fpu);
-		notes[3].data = &fpu;
-	}
-	
 	/* Write notes phdr entry */
 	{
 		struct elf_phdr phdr;
@@ -1173,17 +1315,12 @@
 
 		for(i = 0; i < numnote; i++)
 			sz += notesize(&notes[i]);
+		
+		if (dump_threads)
+			sz += thread_status_size;
 
-		phdr.p_type = PT_NOTE;
-		phdr.p_offset = offset;
-		phdr.p_vaddr = 0;
-		phdr.p_paddr = 0;
-		phdr.p_filesz = sz;
-		phdr.p_memsz = 0;
-		phdr.p_flags = 0;
-		phdr.p_align = 0;
-
-		offset += phdr.p_filesz;
+		fill_elf_note_phdr(&phdr, sz, offset);
+		offset += sz;
 		DUMP_WRITE(&phdr, sizeof(phdr));
 	}
 
@@ -1212,10 +1349,21 @@
 		DUMP_WRITE(&phdr, sizeof(phdr));
 	}
 
+ 	/* write out the notes section */
 	for(i = 0; i < numnote; i++)
 		if (!writenote(&notes[i], file))
 			goto end_coredump;
 
+	/* write out the thread status notes section */
+ 	if (dump_threads)  {
+		list_for_each(t, &thread_list) {
+			struct elf_thread_status *tmp = list_entry(t, struct elf_thread_status, list);
+			for (i = 0; i < tmp->num_notes; i++)
+				if (!writenote(&tmp->notes[i], file))
+					goto end_coredump;
+		}
+ 	}
+ 
 	DUMP_SEEK(dataoff);
 
 	for(vma = current->mm->mmap; vma != NULL; vma = vma->vm_next) {
@@ -1259,11 +1407,24 @@
 		       (off_t) file->f_pos, offset);
 	}
 
- end_coredump:
+end_coredump:
 	set_fs(fs);
+
+cleanup:
+	if (dump_threads)  {
+		while(!list_empty(&thread_list)) {
+			struct list_head *tmp = thread_list.next;
+			list_del(tmp);
+			kfree(list_entry(tmp, struct elf_thread_status, list));
+		}
+
+		tcore_resume_threads();
+	}
+
 	up_write(&current->mm->mmap_sem);
 	return has_dumped;
 }
+
 #endif		/* USE_ELF_CORE_DUMP */
 
 static int __init init_elf_binfmt(void)
diff -urN -X dontdiff linux-2.5.14.vannilla/include/asm-i386/elf.h linux2.5.14.tcore/include/asm-i386/elf.h
--- linux-2.5.14.vannilla/include/asm-i386/elf.h	Mon May  6 16:27:38 2002
+++ linux2.5.14.tcore/include/asm-i386/elf.h	Tue May  7 15:01:21 2002
@@ -99,6 +99,16 @@
 
 #ifdef __KERNEL__
 #define SET_PERSONALITY(ex, ibcs2) set_personality((ibcs2)?PER_SVR4:PER_LINUX)
+
+
+extern int dump_task_regs (struct task_struct *, struct pt_regs *);
+extern int dump_task_fpu (struct task_struct *, struct user_i387_struct *);
+extern int dump_task_extended_fpu (struct task_struct *, struct user_fxsr_struct *);
+
+#define ELF_CORE_COPY_TASK_REGS(tsk, pt_regs) dump_task_regs(tsk, pt_regs)
+#define ELF_CORE_COPY_FPREGS(tsk, elf_fpregs) dump_task_fpu(tsk, elf_fpregs)
+#define ELF_CORE_COPY_XFPREGS(tsk, elf_xfpregs) dump_task_extended_fpu(tsk, elf_xfpregs)
+
 #endif
 
 #endif
diff -urN -X dontdiff linux-2.5.14.vannilla/include/linux/elf.h linux2.5.14.tcore/include/linux/elf.h
--- linux-2.5.14.vannilla/include/linux/elf.h	Mon May  6 16:27:38 2002
+++ linux2.5.14.tcore/include/linux/elf.h	Tue May  7 15:22:55 2002
@@ -576,6 +576,9 @@
 #define NT_PRPSINFO	3
 #define NT_TASKSTRUCT	4
 #define NT_PRFPXREG	20
+#define NT_PRXFPREG     0x46e62b7f	/* note name must be "LINUX" as per GDB */
+					/* from gdb5.1/include/elf/common.h */
+
 
 /* Note header in a PT_NOTE section */
 typedef struct elf32_note {
diff -urN -X dontdiff linux-2.5.14.vannilla/include/linux/elfcore.h linux2.5.14.tcore/include/linux/elfcore.h
--- linux-2.5.14.vannilla/include/linux/elfcore.h	Mon May  6 16:27:38 2002
+++ linux2.5.14.tcore/include/linux/elfcore.h	Tue May  7 15:05:01 2002
@@ -86,4 +86,55 @@
 #define PRARGSZ ELF_PRARGSZ 
 #endif
 
+#ifdef __KERNEL__
+static inline void elf_core_copy_regs(elf_gregset_t *elfregs, struct pt_regs *regs)
+{
+#ifdef ELF_CORE_COPY_REGS
+	ELF_CORE_COPY_REGS((*elfregs), regs)
+#else
+	if (sizeof(elf_gregset_t) != sizeof(struct pt_regs)) {
+		printk("sizeof(elf_gregset_t) (%ld) != sizeof(struct pt_regs) (%ld)\n",
+			(long)sizeof(elf_gregset_t), (long)sizeof(struct pt_regs));
+	} else
+		*(struct pt_regs *)elfregs = *regs;
+#endif
+}
+
+static inline int elf_core_copy_task_regs(struct task_struct *t, elf_gregset_t *elfregs)
+{
+#ifdef ELF_CORE_COPY_TASK_REGS
+	struct pt_regs regs;
+	
+	if (ELF_CORE_COPY_TASK_REGS(t, &regs)) {
+		elf_core_copy_regs(elfregs, &regs);
+		return 1;
+	}
+#endif
+	return 0;
+}
+
+extern int dump_fpu (struct pt_regs *, elf_fpregset_t *);
+
+static inline int elf_core_copy_task_fpregs(struct task_struct *t, elf_fpregset_t *fpu)
+{
+#ifdef ELF_CORE_COPY_FPREGS
+	return ELF_CORE_COPY_FPREGS(t, fpu);
+#else
+	return dump_fpu(NULL, fpu);
+#endif
+}
+
+static inline int elf_core_copy_task_xfpregs(struct task_struct *t, elf_fpxregset_t *xfpu)
+{
+#ifdef ELF_CORE_COPY_XFPREGS
+	return ELF_CORE_COPY_XFPREGS(t, xfpu);
+#else
+	return 0;
+#endif
+}
+
+
+#endif /* __KERNEL__ */
+
+
 #endif /* _LINUX_ELFCORE_H */
diff -urN -X dontdiff linux-2.5.14.vannilla/include/linux/sched.h linux2.5.14.tcore/include/linux/sched.h
--- linux-2.5.14.vannilla/include/linux/sched.h	Mon May 13 09:21:07 2002
+++ linux2.5.14.tcore/include/linux/sched.h	Mon May 13 12:12:32 2002
@@ -130,6 +130,14 @@
 
 #include <linux/spinlock.h>
 
+
+/* functions for pausing and resuming threads that share a
+ * common mm without using signals.  These are used
+ * by the multithreaded elf core dump code in binfmt_elf.c */
+extern void tcore_suspend_threads( void );
+extern void tcore_resume_threads( void );
+
+
 /*
  * This serializes "schedule()" and also protects
  * the run-queue from deletions/modifications (but
diff -urN -X dontdiff linux-2.5.14.vannilla/include/linux/sysctl.h linux2.5.14.tcore/include/linux/sysctl.h
--- linux-2.5.14.vannilla/include/linux/sysctl.h	Mon May 13 09:21:06 2002
+++ linux2.5.14.tcore/include/linux/sysctl.h	Tue May  7 14:44:11 2002
@@ -87,6 +87,7 @@
 	KERN_CAP_BSET=14,	/* int: capability bounding set */
 	KERN_PANIC=15,		/* int: panic timeout */
 	KERN_REALROOTDEV=16,	/* real root device to mount after initrd */
+	KERN_CORE_DUMPS_THREADS=17, /* int: include status of other threads in dump */
 
 	KERN_SPARC_REBOOT=21,	/* reboot command on Sparc */
 	KERN_CTLALTDEL=22,	/* int: allow ctl-alt-del to reboot */
diff -urN -X dontdiff linux-2.5.14.vannilla/kernel/sched.c linux2.5.14.tcore/kernel/sched.c
--- linux-2.5.14.vannilla/kernel/sched.c	Sun May  5 23:37:57 2002
+++ linux2.5.14.tcore/kernel/sched.c	Mon May 13 12:24:25 2002
@@ -154,7 +154,8 @@
 	list_t migration_queue;
 } ____cacheline_aligned;
 
-static struct runqueue runqueues[NR_CPUS] __cacheline_aligned;
+static struct runqueue runqueues[NR_CPUS + 1] __cacheline_aligned;
+#define PHANTOM_CPU NR_CPUS
 
 #define cpu_rq(cpu)		(runqueues + (cpu))
 #define this_rq()		cpu_rq(smp_processor_id())
@@ -263,6 +264,9 @@
 #ifdef CONFIG_SMP
 	int need_resched, nrpolling;
 
+	if( unlikely(!p->cpus_allowed) )
+			return;
+			
 	preempt_disable();
 	/* minimise the chance of sending an interrupt to poll_idle() */
 	nrpolling = test_tsk_thread_flag(p,TIF_POLLING_NRFLAG);
@@ -273,6 +277,9 @@
 		smp_send_reschedule(p->thread_info->cpu);
 	preempt_enable();
 #else
+	// do we need the cpus_allowed test here for core_dumps_threads?
+	//if( unlikely(!p->cpus_allowed) return; // ?
+	
 	set_tsk_need_resched(p);
 #endif
 }
@@ -339,7 +346,7 @@
 	p->state = TASK_RUNNING;
 	if (!p->array) {
 		activate_task(p, rq);
-		if (p->prio < rq->curr->prio)
+		if (p->cpus_allowed && (p->prio < rq->curr->prio) )
 			resched_task(rq->curr);
 		success = 1;
 	}
@@ -996,6 +1003,131 @@
 
 void scheduling_functions_end_here(void) { }
 
+/*
+ * needed for accurate core dumps of multi-threaded applications.
+ * see binfmt_elf.c for more information.
+ */
+static void reschedule_other_cpus(void)
+{
+#ifdef CONFIG_SMP
+	int i, cpu;
+	struct task_struct *p;
+
+	for(i=0; i< smp_num_cpus; i++) {
+		cpu = cpu_logical_map(i);
+		p = cpu_curr(cpu);
+		if (p->thread_info->cpu!= smp_processor_id()) {
+			set_tsk_need_resched(p);
+			smp_send_reschedule(p->thread_info->cpu);
+		}
+	}
+#endif	
+	return;
+}
+
+
+/* functions for pausing and resuming threads without using signals */
+void tcore_suspend_threads(void)
+{
+	unsigned long flags;
+	runqueue_t *phantomQ;
+	task_t *threads[NR_CPUS], *p;
+	int i, OnCPUCount = 0;
+
+//
+// grab all the rq_locks.
+// current is the process dumping core
+//  
+
+	preempt_disable();
+	
+	local_irq_save(flags);
+
+	for(i=0; i< smp_num_cpus; i++) {
+		spin_lock(&cpu_rq(i)->lock);
+	}
+
+	current->cpus_allowed = 1UL << current->thread_info->cpu;
+	// prevent migration of the dumping process, which would complicate things.
+
+	phantomQ = cpu_rq(PHANTOM_CPU); 
+	spin_lock(&phantomQ->lock);
+	
+	reschedule_other_cpus();
+	// this is an optional IPI, but it makes for the most accurate core files possible.
+	
+	read_lock(&tasklist_lock);
+
+	for_each_task(p) {
+		if (current->mm == p->mm && current != p) {
+			if( p == task_rq(p)->curr ) {
+				// then remember it for later use of set_cpus_allowed
+				threads[OnCPUCount] = p;
+				p->cpus_allowed = 0;//prevent load balance from moving these guys.
+				OnCPUCount ++;
+			} else {
+				// we manually move the process to the phantom run queue.
+
+				if (p->array) {
+					deactivate_task(p, task_rq(p));
+					activate_task(p, phantomQ);
+				}
+				p->thread_info->cpu = PHANTOM_CPU;
+				p->cpus_allowed = 0;//prevent load balance from moving these guys.
+			}
+		}
+	}
+	read_unlock(&tasklist_lock);
+
+	spin_unlock(&phantomQ->lock);
+	for(i=smp_num_cpus-1; 0<= i; i--) {
+		spin_unlock(&cpu_rq(i)->lock);
+	}
+
+	local_irq_restore(flags);
+
+	for( i = 0; i<OnCPUCount; i++) {
+		set_cpus_allowed(threads[i], 0);
+	}
+	
+}
+
+void tcore_resume_threads(void)
+{
+	unsigned long flags;
+	runqueue_t *phantomQ;
+	task_t *p;
+	int i;
+
+	local_irq_save(flags);
+	phantomQ = cpu_rq(PHANTOM_CPU);
+
+	spin_lock(&task_rq(current)->lock);
+	spin_lock(&phantomQ->lock);
+	
+	read_lock(&tasklist_lock);
+	for_each_task(p) {
+		if (current->mm == p->mm && current != p) {
+			p->cpus_allowed = 1UL << current->thread_info->cpu;
+			if (p->array) {
+				deactivate_task(p,phantomQ);
+				activate_task(p, task_rq(current));
+			}
+			p->thread_info->cpu = current->thread_info->cpu;
+		}
+	}
+
+	read_unlock(&tasklist_lock);
+
+	spin_unlock(&phantomQ->lock);
+	spin_unlock(&task_rq(current)->lock);
+
+	local_irq_restore(flags);
+	preempt_enable_no_resched();
+}
+
+
+
 void set_user_nice(task_t *p, long nice)
 {
 	unsigned long flags;
@@ -1582,11 +1714,11 @@
 {
 	runqueue_t *rq;
 	int i, j, k;
+	prio_array_t *array;
 
-	for (i = 0; i < NR_CPUS; i++) {
-		runqueue_t *rq = cpu_rq(i);
-		prio_array_t *array;
 
+	for (i = 0; i < NR_CPUS; i++) {
+		rq = cpu_rq(i);
 		rq->active = rq->arrays;
 		rq->expired = rq->arrays + 1;
 		spin_lock_init(&rq->lock);
@@ -1603,6 +1735,28 @@
 			__set_bit(MAX_PRIO, array->bitmap);
 		}
 	}
+
+ 
+	i = PHANTOM_CPU;
+	rq = cpu_rq(i);
+	rq->active = rq->arrays;
+	rq->expired = rq->arrays + 1;
+	rq->curr = NULL;
+	spin_lock_init(&rq->lock);
+	spin_lock_init(&rq->frozen);
+	INIT_LIST_HEAD(&rq->migration_queue);
+
+	for (j = 0; j < 2; j++) {
+		array = rq->arrays + j;
+		for (k = 0; k < MAX_PRIO; k++) {
+			INIT_LIST_HEAD(array->queue + k);
+			__clear_bit(k, array->bitmap);
+		}
+		// delimiter for bitsearch
+		__set_bit(MAX_PRIO, array->bitmap);
+	}
+
+
 	/*
 	 * We have to do a little magic to get the first
 	 * process right in SMP mode.
@@ -1662,9 +1816,11 @@
 	migration_req_t req;
 	runqueue_t *rq;
 
-	new_mask &= cpu_online_map;
-	if (!new_mask)
-		BUG();
+	if(new_mask){ // can be 0 for TCore process suspends
+		new_mask &= cpu_online_map;
+		if (!new_mask)
+			BUG();
+	}
 
 	preempt_disable();
 	rq = task_rq_lock(p, &flags);
@@ -1737,7 +1893,12 @@
 		spin_unlock_irqrestore(&rq->lock, flags);
 
 		p = req->task;
-		cpu_dest = __ffs(p->cpus_allowed);
+
+		if( p->cpus_allowed)
+			cpu_dest = __ffs(p->cpus_allowed);
+		else
+			cpu_dest = PHANTOM_CPU;
+
 		rq_dest = cpu_rq(cpu_dest);
 repeat:
 		cpu_src = p->thread_info->cpu;
diff -urN -X dontdiff linux-2.5.14.vannilla/kernel/sysctl.c linux2.5.14.tcore/kernel/sysctl.c
--- linux-2.5.14.vannilla/kernel/sysctl.c	Sun May  5 23:37:54 2002
+++ linux2.5.14.tcore/kernel/sysctl.c	Tue May  7 14:39:37 2002
@@ -38,6 +38,8 @@
 #include <linux/nfs_fs.h>
 #endif
 
+int core_dumps_threads = 0;
+
 #if defined(CONFIG_SYSCTL)
 
 /* External variables not in a header file. */
@@ -171,7 +173,9 @@
 	 0644, NULL, &proc_dointvec},
 	{KERN_TAINTED, "tainted", &tainted, sizeof(int),
 	 0644, NULL, &proc_dointvec},
-	{KERN_CAP_BSET, "cap-bound", &cap_bset, sizeof(kernel_cap_t),
+	{KERN_CORE_DUMPS_THREADS, "core_dumps_threads", &core_dumps_threads, sizeof(int),
+	 0644, NULL, &proc_dointvec},
+	 {KERN_CAP_BSET, "cap-bound", &cap_bset, sizeof(kernel_cap_t),
 	 0600, NULL, &proc_dointvec_bset},
 #ifdef CONFIG_BLK_DEV_INITRD
 	{KERN_REALROOTDEV, "real-root-dev", &real_root_dev, sizeof(int),

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: PATCH Multithreaded core dump support for the 2.5.14 (and 15) kernel.
  2002-05-13 19:17 Mark Gross
@ 2002-05-14 15:35 ` Erich Focht
  0 siblings, 0 replies; 17+ messages in thread
From: Erich Focht @ 2002-05-14 15:35 UTC (permalink / raw)
  To: mark.gross; +Cc: Linus Torvalds, linux-kernel, Vamsi Krishna S .

Hi Mark!

Thanks for sending the new patch, I'd be interested in the testprograms :-)

BTW: any idea what happens when a thread which is suspended happens to be in 
kernel mode? Guess this could be possible with 2.5.X... Does gdb handle that?

Regards,
Erich

On Monday 13 May 2002 21:17, you wrote:
> The following patch for 2.5.14 kernel, applies cleanly to the 2.5.15
> kernel.
>
> This work has been tested on the 2.5.14 kernel using a few pthread
> applications to dump core, from SIGQUIT and SIGSEGV. This unit test has been
> done on both 2-way and 4-way systems.  Further, some stress testing has been
> done in which core files were created while the system was under
> scheduler stress from the chat room benchmark.  This implementation seems
> to be quite stable under a busy scheduler, YMMV.  These test programs are
> available upon request ;)




^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2002-05-20 15:44 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-05-14 16:38 PATCH Multithreaded core dump support for the 2.5.14 (and 15) kernel Gross, Mark
2002-05-15  6:37 ` Vamsi Krishna S .
2002-05-15 14:04   ` Pavel Machek
2002-05-15 20:53     ` Mark Gross
2002-05-16 10:11       ` Pavel Machek
  -- strict thread matches above, loose matches on Subject: below --
2002-05-20 15:44 Gross, Mark
2002-05-17 12:26 Erich Focht
     [not found] <59885C5E3098D511AD690002A5072D3C057B485B@orsmsx111.jf.intel.com.suse.lists.linux.kernel>
     [not found] ` <20020515120722.A17644@in.ibm.com.suse.lists.linux.kernel>
     [not found]   ` <20020515140448.C37@toy.ucw.cz.suse.lists.linux.kernel>
     [not found]     ` <200205152353.g4FNrew30146@unix-os.sc.intel.com.suse.lists.linux.kernel>
2002-05-16 12:54       ` Andi Kleen
2002-05-16 14:13         ` Mark Gross
2002-05-16 17:27           ` Andi Kleen
2002-05-16 17:36             ` Daniel Jacobowitz
2002-05-16 18:08               ` Mark Gross
2002-05-16 21:32                 ` Alan Cox
2002-05-16 21:24                   ` Robert Love
2002-05-16 18:40                     ` Mark Gross
2002-05-13 19:17 Mark Gross
2002-05-14 15:35 ` Erich Focht
