linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* kernel BUG at kernel/exit.c:792!
@ 2003-12-03 10:08 Srivatsa Vaddagiri
       [not found] ` <3FCDCEA3.1020209@mailandnews.com>
  2003-12-03 15:51 ` Linus Torvalds
  0 siblings, 2 replies; 17+ messages in thread
From: Srivatsa Vaddagiri @ 2003-12-03 10:08 UTC (permalink / raw)
  To: linux-kernel; +Cc: lhcs-devel

Hi,
	I hit a kernel BUG while running some stress tests
on a SMP machine. Details are below:

Kernel	:  2.6.0-test9-bk23  + CPU Hotplug Patch
Machine	:  Intel 4-Way SMP box 

I don't think this problem is related in any way to
the CPU Hotplug patch I had applied. It could be hit
w/o that patch applied also(?)


------------[ cut here ]------------
kernel BUG at kernel/exit.c:792!
invalid operand: 0000 [#1]
CPU:    1
EIP:    0060:[<c0124026>]    Not tainted
EFLAGS: 00010246
EIP is at next_thread+0x16/0x50
eax: 00000000   ebx: f68726b0   ecx: f68727ac   edx: f6872794
esi: 00006a4e   edi: 00000001   ebp: 00000000   esp: d6b35ed0
ds: 007b   es: 007b   ss: 0068
Process find (pid: 27213, threadinfo=d6b34000 task=ee26e080)
Stack: c0180328 f68726b0 e26a7a80 ed5a0390 00000003 00000000 c0180524 00000003 
       d6b35f14 ed5a0390 d6b35f04 00000000 00000001 00000000 32373200 6a4e3431 
       0000416d 00006a4e 00000000 00000000 00000000 00000000 00000000 00000000 
Call Trace:
 [<c0180328>] get_tid_list+0x58/0x70
 [<c0180524>] proc_task_readdir+0xc4/0x17c
 [<c01658dc>] vfs_readdir+0x5c/0x70
 [<c0165be0>] filldir64+0x0/0x120
 [<c0165d64>] sys_getdents64+0x64/0xa3
 [<c0165be0>] filldir64+0x0/0x120
 [<c0109291>] sysenter_past_esp+0x52/0x71

Code: 0f 0b 18 03 68 49 38 c0 0f b6 80 04 05 00 00 84 c0 7e 14 a1 

I suspect this is because when read_lock call in 'get_tid_list'
returns, the leader_task had exited already. This
causes the NULL sighand check to fail in the subsequent call
to 'next_thread' ?

Does it make sense to check for leader_task being alive
after the tasklist lock is grabbed and return immediately
if it is not alive (as the patch below does)?



 fs/proc/base.c |    3 +++
 1 files changed, 3 insertions(+)

diff -puN fs/proc/base.c~proc-get_tid_list-fix fs/proc/base.c
--- linux-2.6.0-test11/fs/proc/base.c~proc-get_tid_list-fix	2003-12-03 14:55:53.000000000 +0530
+++ linux-2.6.0-test11-vatsa/fs/proc/base.c	2003-12-03 14:56:20.000000000 +0530
@@ -1666,6 +1666,8 @@ static int get_tid_list(int index, unsig
 
 	index -= 2;
 	read_lock(&tasklist_lock);
+	if (!pid_alive(task))
+		goto exit;
 	do {
 		int tid = task->pid;
 		if (!pid_alive(task))
@@ -1677,6 +1679,7 @@ static int get_tid_list(int index, unsig
 		if (nr_tids >= PROC_MAXPIDS)
 			break;
 	} while ((task = next_thread(task)) != leader_task);
+exit:
 	read_unlock(&tasklist_lock);
 	return nr_tids;
 }

-- 


Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560033

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kernel BUG at kernel/exit.c:792!
       [not found] ` <3FCDCEA3.1020209@mailandnews.com>
@ 2003-12-03 12:53   ` Srivatsa Vaddagiri
  2003-12-03 17:46     ` Manfred Spraul
  2003-12-03 19:59     ` Ingo Molnar
  0 siblings, 2 replies; 17+ messages in thread
From: Srivatsa Vaddagiri @ 2003-12-03 12:53 UTC (permalink / raw)
  To: Raj; +Cc: linux-kernel, lhcs-devel, manfred


On Wed, Dec 03, 2003 at 05:23:07PM +0530, Raj wrote:
> 
> >Does it make sense to check for leader_task being alive
> >after the tasklist lock is grabbed and return immediately
> >if it is not alive (as the patch below does)?
> >
> >  
> >
> maybe i am wrong, but wouldnt a 'break' in the do-while suffice rather 
> than a goto ?
> 

I was not sure if the pid_alive check inside the do-while loop
is for leader_task only or for non-leader tasks also. 
If that check is for non-leader tasks also, then we would
like to retain it still ..

> /Raj

> --- base.c	2003-10-26 00:13:57.000000000 +0530
> +++ base.c.fix	2003-12-03 17:20:18.877679360 +0530
> @@ -1669,7 +1669,7 @@
>  	do {
>  		int tid = task->pid;
>  		if (!pid_alive(task))
> -			continue;
> +			break;
>  		if (--index >= 0)
>  			continue;
>  		tids[nr_tids] = tid;


-- 


Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560033

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kernel BUG at kernel/exit.c:792!
  2003-12-03 10:08 kernel BUG at kernel/exit.c:792! Srivatsa Vaddagiri
       [not found] ` <3FCDCEA3.1020209@mailandnews.com>
@ 2003-12-03 15:51 ` Linus Torvalds
  2003-12-03 20:03   ` Ingo Molnar
  1 sibling, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2003-12-03 15:51 UTC (permalink / raw)
  To: Srivatsa Vaddagiri; +Cc: Kernel Mailing List, lhcs-devel, Ingo Molnar



On Wed, 3 Dec 2003, Srivatsa Vaddagiri wrote:
>
> 	I hit a kernel BUG while running some stress tests
> on a SMP machine. Details are below:
>
> Kernel	:  2.6.0-test9-bk23  + CPU Hotplug Patch
> Machine	:  Intel 4-Way SMP box
>
> kernel BUG at kernel/exit.c:792!
> EIP is at next_thread+0x16/0x50
> Call Trace:
>  [<c0180328>] get_tid_list+0x58/0x70
>  [<c0180524>] proc_task_readdir+0xc4/0x17c
>  [<c01658dc>] vfs_readdir+0x5c/0x70
>  [<c0165be0>] filldir64+0x0/0x120
>  [<c0165d64>] sys_getdents64+0x64/0xa3
>  [<c0165be0>] filldir64+0x0/0x120
>  [<c0109291>] sysenter_past_esp+0x52/0x71
>
> I suspect this is because when read_lock call in 'get_tid_list'
> returns, the leader_task had exited already. This
> causes the NULL sighand check to fail in the subsequent call
> to 'next_thread' ?

Yup, looks right.

I think the problem is the BUG() itself, not really the caller. So I'd
prefer the fix for this to be to just entirely remove the debug tests
withing that "#ifdef CONFIG_SMP", rather than hide the threads from /proc
when this happens.

Ingo, comments?

			Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kernel BUG at kernel/exit.c:792!
  2003-12-03 12:53   ` Srivatsa Vaddagiri
@ 2003-12-03 17:46     ` Manfred Spraul
  2003-12-03 17:52       ` Manfred Spraul
  2003-12-03 18:19       ` Linus Torvalds
  2003-12-03 19:59     ` Ingo Molnar
  1 sibling, 2 replies; 17+ messages in thread
From: Manfred Spraul @ 2003-12-03 17:46 UTC (permalink / raw)
  To: vatsa; +Cc: Raj, linux-kernel, lhcs-devel

Srivatsa Vaddagiri wrote:

> --- base.c 2003-10-26 00:13:57.000000000 +0530
>
>>+++ base.c.fix	2003-12-03 17:20:18.877679360 +0530
>>@@ -1669,7 +1669,7 @@
>> 	do {
>> 		int tid = task->pid;
>> 		if (!pid_alive(task))
>>-			continue;
>>+			break;
>>
No, a break would be wrong: The test detects already dead tasks that are 
still listed in the task list. If such a task is found, then it 
shouldn't be listed in /proc/, but the readdir call should continue to 
scan the task list.

But I don't understand the oops:
__exit_sighand clears current->sighand, and then in the next line 
__unhash_process removes the thread from the task list. But that's under 
write_lock_irq(&tasklist_lock), and get_tid_list runs under 
read_lock(&tasklist_lock). It should be impossible that ->sighand is 
NULL and the task is still listed in the task list.

--
    Manfred


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kernel BUG at kernel/exit.c:792!
  2003-12-03 17:46     ` Manfred Spraul
@ 2003-12-03 17:52       ` Manfred Spraul
  2003-12-03 18:19       ` Linus Torvalds
  1 sibling, 0 replies; 17+ messages in thread
From: Manfred Spraul @ 2003-12-03 17:52 UTC (permalink / raw)
  Cc: vatsa, Raj, linux-kernel, lhcs-devel

Manfred Spraul wrote:

> No, a break would be wrong.

You are right, the break is correct. tid is thread id, not thread group id.

Sorry for the noise.
--
    Manfred


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kernel BUG at kernel/exit.c:792!
  2003-12-03 17:46     ` Manfred Spraul
  2003-12-03 17:52       ` Manfred Spraul
@ 2003-12-03 18:19       ` Linus Torvalds
  2003-12-03 19:58         ` Ingo Molnar
  1 sibling, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2003-12-03 18:19 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: vatsa, Raj, linux-kernel, lhcs-devel



On Wed, 3 Dec 2003, Manfred Spraul wrote:
>
> But I don't understand the oops:
> __exit_sighand clears current->sighand, and then in the next line
> __unhash_process removes the thread from the task list. But that's under
> write_lock_irq(&tasklist_lock), and get_tid_list runs under
> read_lock(&tasklist_lock). It should be impossible that ->sighand is
> NULL and the task is still listed in the task list.

The /proc filesystem will keep pointers to processes alive, and can reach
them even if the process is otherwise gone.

This is why /proc ends up doing tests like "if (tsk->mm)" etc - because it
literally can see processes after they are dead.

		Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kernel BUG at kernel/exit.c:792!
  2003-12-03 18:19       ` Linus Torvalds
@ 2003-12-03 19:58         ` Ingo Molnar
  0 siblings, 0 replies; 17+ messages in thread
From: Ingo Molnar @ 2003-12-03 19:58 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Manfred Spraul, vatsa, Raj, linux-kernel, lhcs-devel


On Wed, 3 Dec 2003, Linus Torvalds wrote:

> > But I don't understand the oops:
> > __exit_sighand clears current->sighand, and then in the next line
> > __unhash_process removes the thread from the task list. But that's under
> > write_lock_irq(&tasklist_lock), and get_tid_list runs under
> > read_lock(&tasklist_lock). It should be impossible that ->sighand is
> > NULL and the task is still listed in the task list.
> 
> The /proc filesystem will keep pointers to processes alive, and can
> reach them even if the process is otherwise gone.
> 
> This is why /proc ends up doing tests like "if (tsk->mm)" etc - because
> it literally can see processes after they are dead.

yes, and we start the 'task list search' at leader_task, which might be an
already unlinked task. So i'd suggest to use a variant of Srivatsa's fix
(attached below) - the extra explanation at this place cannot hurt i
think.

	Ingo

--- linux/fs/proc/base.c.orig	
+++ linux/fs/proc/base.c	
@@ -1666,7 +1666,12 @@ static int get_tid_list(int index, unsig
 
 	index -= 2;
 	read_lock(&tasklist_lock);
-	do {
+	/*
+	 * The starting point task (leader_task) might be an already
+	 * unlinked task, which cannot be used to access the task-list
+	 * via next_thread().
+	 */
+	if (pid_alive(task)) do {
 		int tid = task->pid;
 		if (!pid_alive(task))
 			continue;

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kernel BUG at kernel/exit.c:792!
  2003-12-03 12:53   ` Srivatsa Vaddagiri
  2003-12-03 17:46     ` Manfred Spraul
@ 2003-12-03 19:59     ` Ingo Molnar
  2003-12-03 20:08       ` Linus Torvalds
  1 sibling, 1 reply; 17+ messages in thread
From: Ingo Molnar @ 2003-12-03 19:59 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Raj, linux-kernel, lhcs-devel, Manfred Spraul, Linus Torvalds


On Wed, 3 Dec 2003, Srivatsa Vaddagiri wrote:

> > maybe i am wrong, but wouldnt a 'break' in the do-while suffice rather 
> > than a goto ?
> 
> I was not sure if the pid_alive check inside the do-while loop is for
> leader_task only or for non-leader tasks also.  If that check is for
> non-leader tasks also, then we would like to retain it still ..

only the starting point should be checked. If the starting point is wrong
then we have no access to the 'thread list' anymore. If the starting point
is alive then all the thread-list walking within the tasklist_lock is
safe.

	Ingo

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kernel BUG at kernel/exit.c:792!
  2003-12-03 15:51 ` Linus Torvalds
@ 2003-12-03 20:03   ` Ingo Molnar
  2003-12-03 20:09     ` Linus Torvalds
  0 siblings, 1 reply; 17+ messages in thread
From: Ingo Molnar @ 2003-12-03 20:03 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Srivatsa Vaddagiri, Kernel Mailing List, lhcs-devel


On Wed, 3 Dec 2003, Linus Torvalds wrote:

> I think the problem is the BUG() itself, not really the caller. So I'd
> prefer the fix for this to be to just entirely remove the debug tests
> withing that "#ifdef CONFIG_SMP", rather than hide the threads from
> /proc when this happens.
> 
> Ingo, comments?

i'd like to keep the BUG() - it has caught this bug, and it has caught
other bugs too in the past. The next_thread() use of procfs is clearly
illegal if it happens after a thread has been removed from the tasklist.
procfs is really special in this regard (nothing else that accesses the
thread list is supposed to have a stale reference to the task).

removing the debug tests will cause further crashes i think - the thread
pointers are not valid anymore, they could point to truly freed tasks.  
(this particular task is not freed yet because it has a procfs reference -
but the thread list pointers are not valid anymore.)

	Ingo

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kernel BUG at kernel/exit.c:792!
  2003-12-03 19:59     ` Ingo Molnar
@ 2003-12-03 20:08       ` Linus Torvalds
  2003-12-03 20:18         ` Ingo Molnar
                           ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Linus Torvalds @ 2003-12-03 20:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Srivatsa Vaddagiri, Raj, linux-kernel, lhcs-devel, Manfred Spraul



On Wed, 3 Dec 2003, Ingo Molnar wrote:
>
> only the starting point should be checked. If the starting point is wrong
> then we have no access to the 'thread list' anymore. If the starting point
> is alive then all the thread-list walking within the tasklist_lock is
> safe.

I'm not sure that's true. The starting point is not necessarily the thread
group leader, and then following the chain can see a zombie thread group
leader in the _middle_ without a sighandler pointer - at which point we
BUG() out again.

Guys, that BUG() is _incorrect_.

This is one reason I hate some assert() with a passion. I've seen this way
too often: somebody added an assert for something he thought was a bug,
and then people are too damn afraid to just admit that it wasn't a bug at
all, and just get rid of the f-ing assert. So instead, you add code to
avoid the assert. And that code itself is non-obvious and broken.

So as far as I can tell, the patch from Ingo and Srivatsa just paper over
the _real_ bug. And the real fix is to get rid of the debugging helper
completely, since it no longer serves any purpose, and it is WRONG!

So tell me why it isn't wrong?

		Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kernel BUG at kernel/exit.c:792!
  2003-12-03 20:03   ` Ingo Molnar
@ 2003-12-03 20:09     ` Linus Torvalds
  0 siblings, 0 replies; 17+ messages in thread
From: Linus Torvalds @ 2003-12-03 20:09 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Srivatsa Vaddagiri, Kernel Mailing List, lhcs-devel


On Wed, 3 Dec 2003, Ingo Molnar wrote:
>
> i'd like to keep the BUG() - it has caught this bug

It did nothing of the sort. It didn't catch a bug at all, it _caused_ a
bug.

> The next_thread() use of procfs is clearly illegal if it happens after a
> thread has been removed from the tasklist.

Why? I disagree with the "clearly". Even if it were illegal, it sure as
hell is not "clear".

		Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kernel BUG at kernel/exit.c:792!
  2003-12-03 20:08       ` Linus Torvalds
@ 2003-12-03 20:18         ` Ingo Molnar
  2003-12-03 20:19         ` Manfred Spraul
  2003-12-03 20:35         ` Ingo Molnar
  2 siblings, 0 replies; 17+ messages in thread
From: Ingo Molnar @ 2003-12-03 20:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Srivatsa Vaddagiri, Raj, linux-kernel, lhcs-devel, Manfred Spraul


On Wed, 3 Dec 2003, Linus Torvalds wrote:

> > only the starting point should be checked. If the starting point is wrong
> > then we have no access to the 'thread list' anymore. If the starting point
> > is alive then all the thread-list walking within the tasklist_lock is
> > safe.
> 
> I'm not sure that's true. The starting point is not necessarily the
> thread group leader, and then following the chain can see a zombie
> thread group leader in the _middle_ without a sighandler pointer - at
> which point we BUG() out again.

hm, get_tid_list() is used in proc_task_readdir(), which is
/proc/<TGID>/task/ directory - so proc_task(dir) is a thread group leader
by definition, right?

also, a zombie thread group leader does not mean it's removed from the
thread list - we keep it around hashed just to enable the 'process' to
live along and only release it once the last thread has gone too (by
delaying parent notification up to the last thread has exited).

But this should be irrelevant in this case, i think the crash here is
stale access to the thread leader via /proc/<TGID>, and trying to follow
that non-existent thread list pointer.

	Ingo

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kernel BUG at kernel/exit.c:792!
  2003-12-03 20:08       ` Linus Torvalds
  2003-12-03 20:18         ` Ingo Molnar
@ 2003-12-03 20:19         ` Manfred Spraul
  2003-12-03 20:26           ` Ingo Molnar
  2003-12-03 20:35         ` Ingo Molnar
  2 siblings, 1 reply; 17+ messages in thread
From: Manfred Spraul @ 2003-12-03 20:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Srivatsa Vaddagiri, Raj, linux-kernel, lhcs-devel

Linus Torvalds wrote:

>So as far as I can tell, the patch from Ingo and Srivatsa just paper over
>the _real_ bug. And the real fix is to get rid of the debugging helper
>completely, since it no longer serves any purpose, and it is WRONG!
>
>So tell me why it isn't wrong?
>  
>
It's wrong, because next_thread() relies on

    task->pids[PIDTYPE_TGID].pid_chain.next

That pointer is not valid after detach_pid(task, PIDTYPE_TGID), and 
that's called within __unhash_process.  Thus next_thread() fails if it's 
called on a dead task. Srivatsa's second patch is the right change: If 
pid_alive() is wrong, then break from the loop without calling 
next_thread().

--
    Manfred


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kernel BUG at kernel/exit.c:792!
  2003-12-03 20:19         ` Manfred Spraul
@ 2003-12-03 20:26           ` Ingo Molnar
  2003-12-03 20:31             ` Ingo Molnar
  2003-12-03 20:55             ` Linus Torvalds
  0 siblings, 2 replies; 17+ messages in thread
From: Ingo Molnar @ 2003-12-03 20:26 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Linus Torvalds, Srivatsa Vaddagiri, Raj, linux-kernel, lhcs-devel


On Wed, 3 Dec 2003, Manfred Spraul wrote:

> It's wrong, because next_thread() relies on
> 
>     task->pids[PIDTYPE_TGID].pid_chain.next
> 
> That pointer is not valid after detach_pid(task, PIDTYPE_TGID), and
> that's called within __unhash_process.  Thus next_thread() fails if it's
> called on a dead task. Srivatsa's second patch is the right change: If
> pid_alive() is wrong, then break from the loop without calling
> next_thread().

yes. And for thread groups this can only happen for the thread group
leader if all 'child' threads have exited. So it can never happen that we
somehow get to a 'middle' thread, walk the chain and get to a task that is
invalid. The only possibility is that the starting point is stale itself -
and the pid_alive() test checks that.

the thread group leader being 'zombie' does not mean it's unhashed. Thread
group leaders are never detached threads, they have a parent that waits
for them. So these zombies just hang around until the last thread goes
away, and then the leader is released, unhashed from the PID space (and
thus next_thread() stops being valid) and the parent is notified.

	Ingo

--- linux/fs/proc/base.c.orig	
+++ linux/fs/proc/base.c	
@@ -1666,7 +1666,12 @@ static int get_tid_list(int index, unsig
 
 	index -= 2;
 	read_lock(&tasklist_lock);
-	do {
+	/*
+	 * The starting point task (leader_task) might be an already
+	 * unlinked task, which cannot be used to access the task-list
+	 * via next_thread().
+	 */
+	if (pid_alive(task)) do {
 		int tid = task->pid;
 		if (!pid_alive(task))
 			continue;

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kernel BUG at kernel/exit.c:792!
  2003-12-03 20:26           ` Ingo Molnar
@ 2003-12-03 20:31             ` Ingo Molnar
  2003-12-03 20:55             ` Linus Torvalds
  1 sibling, 0 replies; 17+ messages in thread
From: Ingo Molnar @ 2003-12-03 20:31 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Linus Torvalds, Srivatsa Vaddagiri, Raj, linux-kernel, lhcs-devel


i didnt notice the pid_alive() check within the loop - it's incorrect. If
we are within the tasklist lock and the thread group leader is valid then
the thread chain should be fully intact. The patch that i believe fixes
the bug the right way is below.

	Ingo

--- linux/fs/proc/base.c.orig	
+++ linux/fs/proc/base.c	
@@ -1666,10 +1666,14 @@ static int get_tid_list(int index, unsig
 
 	index -= 2;
 	read_lock(&tasklist_lock);
-	do {
+	/*
+	 * The starting point task (leader_task) might be an already
+	 * unlinked task, which cannot be used to access the task-list
+	 * via next_thread().
+	 */
+	if (pid_alive(task)) do {
 		int tid = task->pid;
-		if (!pid_alive(task))
-			continue;
+
 		if (--index >= 0)
 			continue;
 		tids[nr_tids] = tid;
	

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kernel BUG at kernel/exit.c:792!
  2003-12-03 20:08       ` Linus Torvalds
  2003-12-03 20:18         ` Ingo Molnar
  2003-12-03 20:19         ` Manfred Spraul
@ 2003-12-03 20:35         ` Ingo Molnar
  2 siblings, 0 replies; 17+ messages in thread
From: Ingo Molnar @ 2003-12-03 20:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Srivatsa Vaddagiri, Raj, linux-kernel, lhcs-devel, Manfred Spraul


On Wed, 3 Dec 2003, Linus Torvalds wrote:

> This is one reason I hate some assert() with a passion. I've seen this
> way too often: somebody added an assert for something he thought was a
> bug, and then people are too damn afraid to just admit that it wasn't a
> bug at all, and just get rid of the f-ing assert. So instead, you add
> code to avoid the assert. And that code itself is non-obvious and
> broken.

i share your concern wrt. the abuse of asserts, but a few strategic
BUG_ON()s and WARN_ON()s have done wonders to 2.6 quality already. There
are certain types of bugs that are very hairy to detect if not caught
early enough. Stale thread pointers are one such thing.

in this specific case i doubt we'd be in a better position by not having
this assert. get_tid_list() would not crash the way it asserts now,
because task structs are not cleared after freeing. It might even lead to
incorrect thread lists being generated, if the task struct got reused for
another process. Or if it's reused for something else then we'd get much
rarer crashes in get_tid_list().

i think i'd rather like to see a few bad rounds of attempted fixes than
(much-) harder to reproduce bugs.

	Ingo

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kernel BUG at kernel/exit.c:792!
  2003-12-03 20:26           ` Ingo Molnar
  2003-12-03 20:31             ` Ingo Molnar
@ 2003-12-03 20:55             ` Linus Torvalds
  1 sibling, 0 replies; 17+ messages in thread
From: Linus Torvalds @ 2003-12-03 20:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Manfred Spraul, Srivatsa Vaddagiri, Raj, linux-kernel, lhcs-devel



On Wed, 3 Dec 2003, Ingo Molnar wrote:
>
> On Wed, 3 Dec 2003, Manfred Spraul wrote:
>
> > It's wrong, because next_thread() relies on
> >
> >     task->pids[PIDTYPE_TGID].pid_chain.next
> >
> > That pointer is not valid after detach_pid(task, PIDTYPE_TGID), and
> > that's called within __unhash_process.  Thus next_thread() fails if it's
> > called on a dead task. Srivatsa's second patch is the right change: If
> > pid_alive() is wrong, then break from the loop without calling
> > next_thread().
>
> yes. And for thread groups this can only happen for the thread group
> leader if all 'child' threads have exited.

Ok, color me convinced. Will apply,

		Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2003-12-03 20:55 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-12-03 10:08 kernel BUG at kernel/exit.c:792! Srivatsa Vaddagiri
     [not found] ` <3FCDCEA3.1020209@mailandnews.com>
2003-12-03 12:53   ` Srivatsa Vaddagiri
2003-12-03 17:46     ` Manfred Spraul
2003-12-03 17:52       ` Manfred Spraul
2003-12-03 18:19       ` Linus Torvalds
2003-12-03 19:58         ` Ingo Molnar
2003-12-03 19:59     ` Ingo Molnar
2003-12-03 20:08       ` Linus Torvalds
2003-12-03 20:18         ` Ingo Molnar
2003-12-03 20:19         ` Manfred Spraul
2003-12-03 20:26           ` Ingo Molnar
2003-12-03 20:31             ` Ingo Molnar
2003-12-03 20:55             ` Linus Torvalds
2003-12-03 20:35         ` Ingo Molnar
2003-12-03 15:51 ` Linus Torvalds
2003-12-03 20:03   ` Ingo Molnar
2003-12-03 20:09     ` Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).