linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] ptrace/coredump/exit_group deadlock
@ 2005-10-03 11:29 Andrea Arcangeli
  2005-10-07  6:24 ` quick (software versus hardware raid) question (cpu) Dan C Marinescu
  2006-04-07 10:15 ` [PATCH] ptrace/coredump/exit_group deadlock Roland McGrath
  0 siblings, 2 replies; 4+ messages in thread
From: Andrea Arcangeli @ 2005-10-03 11:29 UTC (permalink / raw)
  To: Roland McGrath; +Cc: linux-kernel, Andrew Morton, Gerd Knorr

Hello,

I could seldom reproduce a deadlock with a task not killable in T state
(TASK_STOPPED, not TASK_TRACED) by attaching a NPTL threaded program to
gdb, by segfaulting the task and triggering a core dump while some other
task is executing exit_group and while one task is in ptrace_attached
TASK_STOPPED state (not TASK_TRACED yet). This originated from a gdb
bugreport (the fact gdb was segfaulting the task wasn't a kernel bug),
but I just incidentally noticed the gdb bug triggered a real kernel bug
as well.

Most threads hangs in exit_mm because the core_dumping is still going,
the core dumping hangs because the stopped task doesn't exit, the
stopped task can't wakeup because it has SIGNAL_GROUP_EXIT set, hence
the deadlock.

To me it seems that the problem is that the force_sig_specific(SIGKILL)
in zap_threads is a noop if the task has PF_PTRACED set (like in this
case because gdb is attached). The __ptrace_unlink does nothing because
the signal->flags is set to SIGNAL_GROUP_EXIT|SIGNAL_STOP_DEQUEUED
(verified).

The above info also shows that the stopped task hit a race and got the
stop signal (presumably by the ptrace_attach, only the attach, state is
still TASK_STOPPED and gdb hangs waiting the core before it can set it
to TASK_TRACED) after one of the thread invoked the core dump (it's the
core dump that sets signal->flags to SIGNAL_GROUP_EXIT).

So beside the fact nobody would wakeup the task in __ptrace_unlink (the
state is _not_ TASK_TRACED), there's a secondary problem in the signal
handling code, where a task should ignore the ptrace-sigstops as long as
SIGNAL_GROUP_EXIT is set (or the wakeup in __ptrace_unlink path wouldn't
be enough).

So I attempted to make this patch that seems to fix the problem. There
were various ways to fix it, perhaps you prefer a different one, I just
opted to the one that looked safer to me.

I also removed the clearing of the stopped bits from the
zap_other_threads (zap_other_threads was safe unlike zap_threads). I
don't like useless code, this whole NPTL signal/ptrace thing is already
unreadable enough and full of corner cases without confusing useless
code into it to make it even less readable. And if this code is really
needed, then you may want to explain why it's not being done in the
other paths that sets SIGNAL_GROUP_EXIT at least.

Even after this patch I still wonder who serializes the read of
p->ptrace in zap_threads.

Patch is called ptrace-core_dump-exit_group-deadlock-1.

This was the trace I've got:

test          T ffff81003e8118c0     0 14305      1         14311 14309 (NOTLB)
ffff810058ccdde8 0000000000000082 000001f4000037e1 ffff810000000013
       00000000000000f8 ffff81003e811b00 ffff81003e8118c0 ffff810011362100
       0000000000000012 ffff810017ca4180
Call Trace:<ffffffff801317ed>{try_to_wake_up+893} <ffffffff80141677>{finish_stop+87}
       <ffffffff8014367f>{get_signal_to_deliver+1359} <ffffffff8010d3ad>{do_signal+157}
       <ffffffff8013deee>{ptrace_check_attach+222} <ffffffff80111575>{sys_ptrace+2293}
       <ffffffff80131810>{default_wake_function+0} <ffffffff80196399>{sys_ioctl+73}
       <ffffffff8010dd27>{sysret_signal+28} <ffffffff8010e00f>{ptregscall_common+103}

test          D ffff810011362100     0 14309      1         14305 14312 (NOTLB)
ffff810053c81cf8 0000000000000082 0000000000000286 0000000000000001
       0000000000000195 ffff810011362340 ffff810011362100 ffff81002e338040
       ffff810001e0ca80 0000000000000001
Call Trace:<ffffffff801317ed>{try_to_wake_up+893} <ffffffff8044677d>{wait_for_completion+173}
       <ffffffff80131810>{default_wake_function+0} <ffffffff80137435>{exit_mm+149}
       <ffffffff801381af>{do_exit+479} <ffffffff80138d0c>{do_group_exit+252}
       <ffffffff801436db>{get_signal_to_deliver+1451} <ffffffff8010d3ad>{do_signal+157}
       <ffffffff8013deee>{ptrace_check_attach+222} <ffffffff80140850>{specific_send_sig_info+2

       <ffffffff8014208a>{force_sig_info+186} <ffffffff804479a0>{do_int3+112}
       <ffffffff8010e308>{retint_signal+61}
test          D ffff81002e338040     0 14311      1         14716 14305 (NOTLB)
ffff81005ca8dcf8 0000000000000082 0000000000000286 0000000000000001
       0000000000000120 ffff81002e338280 ffff81002e338040 ffff8100481cb740
       ffff810001e0ca80 0000000000000001
Call Trace:<ffffffff801317ed>{try_to_wake_up+893} <ffffffff8044677d>{wait_for_completion+173}
       <ffffffff80131810>{default_wake_function+0} <ffffffff80137435>{exit_mm+149}
       <ffffffff801381af>{do_exit+479} <ffffffff80142d0e>{__dequeue_signal+558}
       <ffffffff80138d0c>{do_group_exit+252} <ffffffff801436db>{get_signal_to_deliver+1451}
       <ffffffff8010d3ad>{do_signal+157} <ffffffff8013deee>{ptrace_check_attach+222}
       <ffffffff80140850>{specific_send_sig_info+208} <ffffffff8014208a>{force_sig_info+186}
       <ffffffff804479a0>{do_int3+112} <ffffffff8010e308>{retint_signal+61}

test          D ffff810017ca4180     0 14312      1         14309 13882 (NOTLB)
ffff81005d15fcb8 0000000000000082 ffff81005d15fc58 ffffffff80130816
       0000000000000897 ffff810017ca43c0 ffff810017ca4180 ffff81003e8118c0
       0000000000000082 ffffffff801317ed
Call Trace:<ffffffff80130816>{activate_task+150} <ffffffff801317ed>{try_to_wake_up+893}
       <ffffffff8044677d>{wait_for_completion+173} <ffffffff80131810>{default_wake_function+0}
       <ffffffff8018cdc3>{do_coredump+819} <ffffffff80445f52>{thread_return+82}
       <ffffffff801436d4>{get_signal_to_deliver+1444} <ffffffff8010d3ad>{do_signal+157}
       <ffffffff8013deee>{ptrace_check_attach+222} <ffffffff80140850>{specific_send_sig_info+2

       <ffffffff804472e5>{_spin_unlock_irqrestore+5} <ffffffff8014208a>{force_sig_info+186}
       <ffffffff804476ff>{do_general_protection+159} <ffffffff8010e308>{retint_signal+61}



Comments welcome, thanks!

Signed-off-by: Andrea Arcangeli <andrea@suse.de>

diff -r 7c862f664e58 kernel/ptrace.c
--- a/kernel/ptrace.c	Sun Oct  2 00:02:13 2005
+++ b/kernel/ptrace.c	Mon Oct  3 03:32:50 2005
@@ -56,6 +56,10 @@
 			signal_wake_up(child, 1);
 		}
 	}
+	if (child->signal->flags & SIGNAL_GROUP_EXIT) {
+		sigaddset(&child->pending.signal, SIGKILL);
+		signal_wake_up(child, 1);
+	}
 	spin_unlock(&child->sighand->siglock);
 }
 
@@ -77,8 +81,7 @@
 		SET_LINKS(child);
 	}
 
-	if (child->state == TASK_TRACED)
-		ptrace_untrace(child);
+	ptrace_untrace(child);
 }
 
 /*
diff -r 7c862f664e58 kernel/signal.c
--- a/kernel/signal.c	Sun Oct  2 00:02:13 2005
+++ b/kernel/signal.c	Mon Oct  3 03:32:50 2005
@@ -1118,8 +1118,8 @@
 		if (t != p->group_leader)
 			t->exit_signal = -1;
 
+		/* SIGKILL will be handled before any pending SIGSTOP */
 		sigaddset(&t->pending.signal, SIGKILL);
-		rm_from_queue(SIG_KERNEL_STOP_MASK, &t->pending);
 		signal_wake_up(t, 1);
 	}
 }
@@ -1856,9 +1856,9 @@
 			/* Let the debugger run.  */
 			ptrace_stop(signr, signr, info);
 
-			/* We're back.  Did the debugger cancel the sig?  */
+			/* We're back.  Did the debugger cancel the sig or group_exit? */
 			signr = current->exit_code;
-			if (signr == 0)
+			if (signr == 0 || current->signal->flags & SIGNAL_GROUP_EXIT)
 				continue;
 
 			current->exit_code = 0;

^ permalink raw reply	[flat|nested] 4+ messages in thread

* quick (software versus hardware raid) question (cpu)
  2005-10-03 11:29 [PATCH] ptrace/coredump/exit_group deadlock Andrea Arcangeli
@ 2005-10-07  6:24 ` Dan C Marinescu
  2005-10-07  6:47   ` Dan C Marinescu
  2006-04-07 10:15 ` [PATCH] ptrace/coredump/exit_group deadlock Roland McGrath
  1 sibling, 1 reply; 4+ messages in thread
From: Dan C Marinescu @ 2005-10-07  6:24 UTC (permalink / raw)
  To: linux-kernel

hi there,

i have a mission critical system with an unsuported
raid controller (4 sata channels, sil 3114). so, i set
up a software raid... now, the problem is at pick
hours, when nics hardware interrupts compete with the
software raid kernel thingie... i know you avoid
spawning threads but regarless the scheduller's
preeptivenes, the two kernel tasks are scheduller-wise
competing and this is killing my server (slowing it
down, cpu is at 100% user and 40-60% kernel). another
issue is that not being able to "see" the "bios"
partition (controller is totally unsuported) i cannot
boot (unless i go like raid 1 on /boot). so what about
performance? what about scalability? if case of 1
array (hardware raid) versus n distinct sata
connectors, you big o notation goes like O(n) instead
of O(1), no matter how smart the scheduller _is_
implemented, for the simple reason that hardware
interrupts are hardware interrupts... preemptive or
not, you have to serve them sooner or later and it's
one thing to sever 1 instead of n... another issue is
reliability. if you use software raid, and the kernel
goes down (for unrelated reasons) your parity
calculations and the raid cache go down as well, huh?
and the other way around... the raid goes down, taking
the sata driver with him... that takes the whole
system down, huh? well, maybe i am too pesimistic but
still, the scalability concern remains! nics are huge
scheduller enamies... they have to do so much with
cpu, and transfers to system memory, in order to
actually server thier purpose (protocols) it seams to
me that there isn't much left for user-land and
software raid... (especially when many nics and many
satas are involved...)

in short, could you recommend me a peformant (sata
connectors) raid controller, which is fully supported?
please don't go like "make gconfig" cause i've been
there... thanks!

regards,
  daniel


		
__________________________________ 
Yahoo! Mail - PC Magazine Editors' Choice 2005 
http://mail.yahoo.com

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: quick (software versus hardware raid) question (cpu)
  2005-10-07  6:24 ` quick (software versus hardware raid) question (cpu) Dan C Marinescu
@ 2005-10-07  6:47   ` Dan C Marinescu
  0 siblings, 0 replies; 4+ messages in thread
From: Dan C Marinescu @ 2005-10-07  6:47 UTC (permalink / raw)
  To: Dan C Marinescu, linux-kernel

it seams to me tha SCSI_SATA_SIL works with Sil 3114!

// happy now :-)

   daniel

--- Dan C Marinescu <dan_c_marinescu@yahoo.com> wrote:

> hi there,
> 
> i have a mission critical system with an unsuported
> raid controller (4 sata channels, sil 3114). so, i
> set
> up a software raid... now, the problem is at pick
> hours, when nics hardware interrupts compete with
> the
> software raid kernel thingie... i know you avoid
> spawning threads but regarless the scheduller's
> preeptivenes, the two kernel tasks are
> scheduller-wise
> competing and this is killing my server (slowing it
> down, cpu is at 100% user and 40-60% kernel).
> another
> issue is that not being able to "see" the "bios"
> partition (controller is totally unsuported) i
> cannot
> boot (unless i go like raid 1 on /boot). so what
> about
> performance? what about scalability? if case of 1
> array (hardware raid) versus n distinct sata
> connectors, you big o notation goes like O(n)
> instead
> of O(1), no matter how smart the scheduller _is_
> implemented, for the simple reason that hardware
> interrupts are hardware interrupts... preemptive or
> not, you have to serve them sooner or later and it's
> one thing to sever 1 instead of n... another issue
> is
> reliability. if you use software raid, and the
> kernel
> goes down (for unrelated reasons) your parity
> calculations and the raid cache go down as well,
> huh?
> and the other way around... the raid goes down,
> taking
> the sata driver with him... that takes the whole
> system down, huh? well, maybe i am too pesimistic
> but
> still, the scalability concern remains! nics are
> huge
> scheduller enamies... they have to do so much with
> cpu, and transfers to system memory, in order to
> actually server thier purpose (protocols) it seams
> to
> me that there isn't much left for user-land and
> software raid... (especially when many nics and many
> satas are involved...)
> 
> in short, could you recommend me a peformant (sata
> connectors) raid controller, which is fully
> supported?
> please don't go like "make gconfig" cause i've been
> there... thanks!
> 
> regards,
>   daniel
> 
> 
> 		
> __________________________________ 
> Yahoo! Mail - PC Magazine Editors' Choice 2005 
> http://mail.yahoo.com
> -
> To unsubscribe from this list: send the line
> "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at 
> http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 



		
__________________________________ 
Yahoo! Mail - PC Magazine Editors' Choice 2005 
http://mail.yahoo.com

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] ptrace/coredump/exit_group deadlock
  2005-10-03 11:29 [PATCH] ptrace/coredump/exit_group deadlock Andrea Arcangeli
  2005-10-07  6:24 ` quick (software versus hardware raid) question (cpu) Dan C Marinescu
@ 2006-04-07 10:15 ` Roland McGrath
  1 sibling, 0 replies; 4+ messages in thread
From: Roland McGrath @ 2006-04-07 10:15 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Gerd Knorr, linux-kernel, Andrew Morton, Linus Torvalds,
	Oleg Nesterov, Wu Zhou

Sorry I have been absent from the discussion for such a long time.
I'm trying to catch up on old issues that I should have reviewed earlier.

This is about Andrea's change:

	commit 30e0fca6c1d7d26f3f2daa4dd2b12c51dadc778a
	Author: Andrea Arcangeli <andrea@suse.de>
	Date:   Sun Oct 30 15:02:38 2005 -0800

	    [PATCH] ptrace/coredump/exit_group deadlock

I am quite dubious about this change, and a little confused about the bug
it's addressing.  The change broke the non-leader MT exec case when
ptraced, and this is still broken in 2.6.16.1; the execing thread does
ptrace_unlink in de_thread and winds up SIGKILL'ing itself so it dies as
soon as it finishes the exec.  (Wu Zhou posted about this problem, but I
didn't see that he got any response.  To reproduce the problem, see
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=177240.)  With the
aforementioned commit reverted, the exec problem goes away.  (2.6.17-rc1
no longer suffers this symptom because of other cleanups in de_thread.
But that only further masks the real issues.)

Do you have a self-contained test program to exercise the original bug
you were dealing with?  It looks like maybe you had one, but I didn't see
where you posted it.  I was not really able to follow the description of
the scenario precisely enough to understand the situation.  (I think you
are saying that all the threads you mention are in the same thread group,
with PTRACE_ATTACH from one thread to another in the group, but I am not
entirely sure.)

> I could seldom reproduce a deadlock with a task not killable in T state
> (TASK_STOPPED, not TASK_TRACED) by attaching a NPTL threaded program to
> gdb, by segfaulting the task and triggering a core dump while some other
> task is executing exit_group and while one task is in ptrace_attached
> TASK_STOPPED state (not TASK_TRACED yet). 

I don't understand this part: "TASK_STOPPED state (not TASK_TRACED yet)".
What does that mean?  The only threads that should have a chance to enter
TASK_STOPPED are those that are not ptrace'd, but "not TASK_TRACED yet"
would obviously not apply to those, so I am confused as to what you mean.

> Most threads hangs in exit_mm because the core_dumping is still going,
> the core dumping hangs because the stopped task doesn't exit, the
> stopped task can't wakeup because it has SIGNAL_GROUP_EXIT set, hence
> the deadlock.

I also don't understand this: "the stopped task can't wakeup because it
has SIGNAL_GROUP_EXIT set".  

> To me it seems that the problem is that the force_sig_specific(SIGKILL)
> in zap_threads is a noop if the task has PF_PTRACED set (like in this
> case because gdb is attached). 

This should not be so.  I agree that this is a problem if it's happening.
The SIGKILL ought to wake up any thread that is stopped in whichever
state.  I suspect that what you saw is the scenario that Oleg described
(and fixed), in which a SIGKILL is left pending before the thread stops.
Later SIGKILLs are then no-ops because there is already one pending, but
that is not because of PT_PTRACED (and indeed it need not have it set for
this failure mode, and it's more likely if it doesn't).

> The __ptrace_unlink does nothing because the signal->flags is set to
> SIGNAL_GROUP_EXIT|SIGNAL_STOP_DEQUEUED (verified).

I think here you must be talking about the "if (unlikely(traced))" case
in zap_threads.  Is that right?  Sorry, I'm really having a hard time
understanding your description of the situation.  I'm sure I'm being dense.

This combination of bits should never be possible.  Since you verified
it, I wonder if that might have been on a kernel before this change,
which fixes a clearly related bug:

	commit 788e05a67c343fa22f2ae1d3ca264e7f15c25eaf
	Author: Oleg Nesterov <oleg@tv-sign.ru>
	Date:   Fri Oct 7 17:46:19 2005 +0400

	    [PATCH] fix do_coredump() vs SIGSTOP race

	diff --git a/kernel/signal.c b/kernel/signal.c
	index 619b027..c135f5a 100644
	--- a/kernel/signal.c
	+++ b/kernel/signal.c
	@@ -578,7 +578,8 @@ int dequeue_signal(struct task_struct *t
			 * is to alert stop-signal processing code when another
			 * processor has come along and cleared the flag.
			 */
	- 		tsk->signal->flags |= SIGNAL_STOP_DEQUEUED;
	+ 		if (!(tsk->signal->flags & SIGNAL_GROUP_EXIT))
	+ 			tsk->signal->flags |= SIGNAL_STOP_DEQUEUED;
		}
		if ( signr &&
		     ((info->si_code & __SI_MASK) == __SI_TIMER) &&

> The above info also shows that the stopped task hit a race and got the
> stop signal (presumably by the ptrace_attach, only the attach, state is
> still TASK_STOPPED and gdb hangs waiting the core before it can set it
> to TASK_TRACED) after one of the thread invoked the core dump (it's the
> core dump that sets signal->flags to SIGNAL_GROUP_EXIT).

Sorry again, I'm having a lot of trouble understanding some of this
paragraph, mostly the first parenthetical part.  Could you restate what
you think is going on here?

> So beside the fact nobody would wakeup the task in __ptrace_unlink (the
> state is _not_ TASK_TRACED), there's a secondary problem in the signal
> handling code, where a task should ignore the ptrace-sigstops as long as
> SIGNAL_GROUP_EXIT is set (or the wakeup in __ptrace_unlink path wouldn't
> be enough).

I don't think this is actually a problem.  After Oleg's fix, I believe it
is impossible to get a SIGNAL_GROUP_EXIT|SIGNAL_STOP_DEQUEUED combination.
That being the case, the check in do_signal_stop should prevent the thread
that dequeued a SIGSTOP from acting on it.  In the case of some other
signal passed through by ptrace, it will have its normal action before
getting the the group exit, which is fine.  The thread should not stop for
any reason here and will then go back and see the SIGKILL next.  I don't
think your change in get_signal_to_deliver does any real harm, but I don't
see any reason to put an extra check here just to short-circuit the code
path in this rare race condition.

> So I attempted to make this patch that seems to fix the problem. There
> were various ways to fix it, perhaps you prefer a different one, I just
> opted to the one that looked safer to me.

I am still sufficiently unclear on what your analysis of the situation is
that I cannot be sure I am addressing all of the problems.  I do
understand the scenario Oleg described (see the full log message for his
change).  If what you observed was not different from that, then I think
that your changes to ptrace_untrace and get_signal_to_deliver should be
reverted.  If there is more going on in your case than what I understand,
then please help me understand it.

> I also removed the clearing of the stopped bits from the
> zap_other_threads (zap_other_threads was safe unlike zap_threads). 

This change in zap_other_threads seems fine.  I think that it was just
written trying to be safe without the assumption that SIGKILL will always
be selected from the queue in preference to any stop signals.


Thanks,
Roland

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2006-04-07 10:15 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-10-03 11:29 [PATCH] ptrace/coredump/exit_group deadlock Andrea Arcangeli
2005-10-07  6:24 ` quick (software versus hardware raid) question (cpu) Dan C Marinescu
2005-10-07  6:47   ` Dan C Marinescu
2006-04-07 10:15 ` [PATCH] ptrace/coredump/exit_group deadlock Roland McGrath

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).