* is killing zombies possible w/o a reboot?
@ 2004-11-03 12:51 Gene Heskett
  2004-11-03 14:33 ` bert hubert
  2004-11-03 20:48 ` Tom Felker
  0 siblings, 2 replies; 99+ messages in thread
From: Gene Heskett @ 2004-11-03 12:51 UTC (permalink / raw)
  To: linux-kernel

Greetings;

I thought I'd get caught up on -bkx kernels and made a -bk8 just now.

But I'd tried to run gnomeradio earlier to listen to the elections, 
and it failed to run, as did tvtime, which claimed it couldn't get a 
lock on /dev/video0. gnomeradio apparently left a lock on alsasound 
that prevented the normal graceful shutdown: the shutdown hung on the 
"stopping alsasound" line.  So I had to use the hardware reset.

I'd tried to kill the zombie earlier but couldn't.

Isn't there some way to clean up a &^$#^#@)_ zombie?

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
99.28% setiathome rank, not too shabby for a WV hillbilly
Yahoo.com attorneys please note, additions to this message
by Gene Heskett are:
Copyright 2004 by Maurice Eugene Heskett, all rights reserved.


* Re: is killing zombies possible w/o a reboot?
  2004-11-03 12:51 is killing zombies possible w/o a reboot? Gene Heskett
@ 2004-11-03 14:33 ` bert hubert
  2004-11-03 14:49   ` Måns Rullgård
  2004-11-03 16:24   ` Gene Heskett
  2004-11-03 20:48 ` Tom Felker
  1 sibling, 2 replies; 99+ messages in thread
From: bert hubert @ 2004-11-03 14:33 UTC (permalink / raw)
  To: Gene Heskett; +Cc: linux-kernel

On Wed, Nov 03, 2004 at 07:51:39AM -0500, Gene Heskett wrote:

> But I'd tried to run gnomeradio earlier to listen to the elections, 

Depressing enough.

> I'd tried to kill the zombie earlier but couldn't.
> Isn't there some way to clean up a &^$#^#@)_ zombie?

Kill the parent, is the only (portable) way.

-- 
http://www.PowerDNS.com      Open source, database driven DNS Software 
http://lartc.org           Linux Advanced Routing & Traffic Control HOWTO


* Re: is killing zombies possible w/o a reboot?
  2004-11-03 14:33 ` bert hubert
@ 2004-11-03 14:49   ` Måns Rullgård
  2004-11-03 15:25     ` DervishD
  2004-11-03 16:38     ` Gene Heskett
  2004-11-03 16:24   ` Gene Heskett
  1 sibling, 2 replies; 99+ messages in thread
From: Måns Rullgård @ 2004-11-03 14:49 UTC (permalink / raw)
  To: linux-kernel

bert hubert <ahu@ds9a.nl> writes:

> On Wed, Nov 03, 2004 at 07:51:39AM -0500, Gene Heskett wrote:
>
>> But I'd tried to run gnomeradio earlier to listen to the elections, 
>
> Depressing enough.
>
>> I'd tried to kill the zombie earlier but couldn't.
>> Isn't there some way to clean up a &^$#^#@)_ zombie?
>
> Kill the parent, is the only (portable) way.

Perhaps not as portable, but another possible, though slightly
complicated, way is to ptrace the parent and force it to wait().

-- 
Måns Rullgård
mru@inprovide.com



* Re: is killing zombies possible w/o a reboot?
  2004-11-03 15:25     ` DervishD
@ 2004-11-03 15:25       ` Måns Rullgård
  2004-11-03 17:49         ` DervishD
  2004-11-03 16:47       ` Gene Heskett
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 99+ messages in thread
From: Måns Rullgård @ 2004-11-03 15:25 UTC (permalink / raw)
  To: linux-kernel

DervishD <lkml@dervishd.net> writes:

>     Hi all :)
>
>  * Måns Rullgård <mru@inprovide.com> dixit:
>> >> I'd tried to kill the zombie earlier but couldn't.
>> >> Isn't there some way to clean up a &^$#^#@)_ zombie?
>> > Kill the parent, is the only (portable) way.
>> Perhaps not as portable, but another possible, though slightly
>> complicated, way is to ptrace the parent and force it to wait().
>
>     Or write a little program that just 'wait()'s for the specified
> PID's. That is perfectly portable IMHO. But I must admit that the
> preferred way should be killing the parent. 'init' will reap the
> children after that.

You can only wait() for your own children.
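
A minimal sketch of that restriction in C, assuming a Linux or other POSIX system. The function name wait_on_non_child is made up for this illustration; waitpid() and ECHILD are the standard interfaces.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Try to reap a process that is not one of our children.
 * Returns the errno value set by waitpid(), or 0 if the call
 * unexpectedly did not fail. */
int wait_on_non_child(pid_t pid)
{
    int status;

    /* WNOHANG so the call can never block, whatever pid refers to. */
    if (waitpid(pid, &status, WNOHANG) == -1)
        return errno;
    return 0;
}
```

Calling wait_on_non_child(getppid()) from a process with no children of its own is one easy way to try it: our parent is never our child, so the call should fail with ECHILD.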

-- 
Måns Rullgård
mru@inprovide.com


* Re: is killing zombies possible w/o a reboot?
  2004-11-03 14:49   ` Måns Rullgård
@ 2004-11-03 15:25     ` DervishD
  2004-11-03 15:25       ` Måns Rullgård
                         ` (3 more replies)
  2004-11-03 16:38     ` Gene Heskett
  1 sibling, 4 replies; 99+ messages in thread
From: DervishD @ 2004-11-03 15:25 UTC (permalink / raw)
  To: Måns Rullgård; +Cc: linux-kernel

    Hi all :)

 * Måns Rullgård <mru@inprovide.com> dixit:
> >> I'd tried to kill the zombie earlier but couldn't.
> >> Isn't there some way to clean up a &^$#^#@)_ zombie?
> > Kill the parent, is the only (portable) way.
> Perhaps not as portable, but another possible, though slightly
> complicated, way is to ptrace the parent and force it to wait().

    Or write a little program that just 'wait()'s for the specified
PID's. That is perfectly portable IMHO. But I must admit that the
preferred way should be killing the parent. 'init' will reap the
children after that.

    Raúl Núñez de Arenas Coronado

-- 
Linux Registered User 88736
http://www.dervishd.net & http://www.pleyades.net/


* Re: is killing zombies possible w/o a reboot?
  2004-11-03 14:33 ` bert hubert
  2004-11-03 14:49   ` Måns Rullgård
@ 2004-11-03 16:24   ` Gene Heskett
  2004-11-03 16:46     ` linux-os
  2004-11-03 20:13     ` Helge Hafting
  1 sibling, 2 replies; 99+ messages in thread
From: Gene Heskett @ 2004-11-03 16:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: bert hubert

On Wednesday 03 November 2004 09:33, bert hubert wrote:
>On Wed, Nov 03, 2004 at 07:51:39AM -0500, Gene Heskett wrote:
>> But I'd tried to run gnomeradio earlier to listen to the
>> elections,
>
>Depressing enough.
>
>> I'd tried to kill the zombie earlier but couldn't.
>> Isn't there some way to clean up a &^$#^#@)_ zombie?
>
>Kill the parent, is the only (portable) way.

The parent would have been the icon.  It opened its usual small 
window, but never did anything with it. I clicked to close the 
window, but 10 seconds later the system asked me if I wanted to kill 
it as it wasn't responding. I said yes, the window disappeared, but 
kpm said gnomeradio was still present as process 8162, and that wasn't 
killable.  Funny thing is, after the reboot it automatically 
restored itself and ran just fine.

I consider this one of Linux's Achilles' heels.  Such a hung and 
dead process can be properly disposed of by a primitive OS called os9 
because it keeps track of all resources in tables in the kernel 
memory space.  Issuing a kill procnumber removes the process from 
the exec queue, returns all its memory to the system free memory 
pool, and removes it from the IRQ service tables if an entry exists 
there.  Near-instant, total cleanup, nothing left, in about 250 
microseconds max. 1.79 MHz CPUs aren't quite instant :)

Let's just say that I think having to reboot because of a zombie that 
has resources locked up, and having the reboot fubared by it too, 
aren't exactly friendly actions.

I fully realise that linux has a much more complex method of 
allocating resources, but doesn't it *know* exactly what resources 
have been passed out to each process?

And why is there no entry from the kill function into that resource 
management portion of the kernel so that this could also be done by 
the linux kernel, say with a "kill --total procnumber"?

Seems like a heck of a good question to me, since an OS written to run 
on a 64K machine in 1981, and expanded to run on a 128K to 2 megabyte 
machine in 1986, can do it just fine.  Even if that process is still 
running and spitting out data to its parent window/shell!  Or if it's 
crashed and scribbled over all its memory, it makes no difference to 
os9.  You (root) want it gone, fine, it's gone.

-- 
Cheers, Gene


* Re: is killing zombies possible w/o a reboot?
  2004-11-03 14:49   ` Måns Rullgård
  2004-11-03 15:25     ` DervishD
@ 2004-11-03 16:38     ` Gene Heskett
  1 sibling, 0 replies; 99+ messages in thread
From: Gene Heskett @ 2004-11-03 16:38 UTC (permalink / raw)
  To: linux-kernel; +Cc: Måns Rullgård

On Wednesday 03 November 2004 09:49, Måns Rullgård wrote:
>bert hubert <ahu@ds9a.nl> writes:
>> On Wed, Nov 03, 2004 at 07:51:39AM -0500, Gene Heskett wrote:
>>> But I'd tried to run gnomeradio earlier to listen to the
>>> elections,
>>
>> Depressing enough.
>>
>>> I'd tried to kill the zombie earlier but couldn't.
>>> Isn't there some way to clean up a &^$#^#@)_ zombie?
>>
>> Kill the parent, is the only (portable) way.
>
>Perhaps not as portable, but another possible, though slightly
>complicated, way is to ptrace the parent and force it to wait().

No deal.  No way.  The user needs something to clean up when he clicks 
on an icon, and things go to hell in a handbasket.  He has no advance 
warning available to him to tell him he had better ptrace this one 
that I'm aware of.

-- 
Cheers, Gene


* Re: is killing zombies possible w/o a reboot?
  2004-11-03 16:24   ` Gene Heskett
@ 2004-11-03 16:46     ` linux-os
  2004-11-03 19:12       ` Gene Heskett
  2004-11-03 19:56       ` Måns Rullgård
  2004-11-03 20:13     ` Helge Hafting
  1 sibling, 2 replies; 99+ messages in thread
From: linux-os @ 2004-11-03 16:46 UTC (permalink / raw)
  To: Gene Heskett; +Cc: linux-kernel, bert hubert

On Wed, 3 Nov 2004, Gene Heskett wrote:

> On Wednesday 03 November 2004 09:33, bert hubert wrote:
>> On Wed, Nov 03, 2004 at 07:51:39AM -0500, Gene Heskett wrote:
>>> But I'd tried to run gnomeradio earlier to listen to the
>>> elections,
>>
>> Depressing enough.
>>
>>> I'd tried to kill the zombie earlier but couldn't.
>>> Isn't there some way to clean up a &^$#^#@)_ zombie?
>>
>> Kill the parent, is the only (portable) way.
>
> The parent would have been the icon.  It opened its usual sized small
> window, but never did anything to it. I clicked on closing the
> window, but 10 seconds later the system asked me if I wanted to kill
> it as it wasn't responding. I said yes, the window disappeared, but
> kpm said gomeradio was still present as process 8162, and that wasn't
> killable.  Funny thing is, on the reboot, it automaticly self
> restored and ran just fine.
>
> I consider this as one of linux's achilles heels.  Such a hung and
> dead process can be properly disposed of by a primitive os called os9
> because it keeps track of all resources in tables in the kernel
> memory space.  Issueing a kill procnumber removes the process from
> the exec queue, reclaims all its memory to the system free memory
> pool, and removes it from the IRQ service tables if an entry exists
> there.  Near instant, total cleanup, nothing left, in about 250
> microseconds max. 1.79 mhz cpu's aren't quite instant :)
>
> Lets just say that I think having to reboot because of a zombie that
> has resources locked up, and have the reboot fubared by it too,
> aren't exactly friendly actions.

[SNIPPED....]

There is no problem killing a task and freeing its resources.
The problem is that Linux and other Unix variations need to
do this in a specific manner. That manner being that some
parent (or ultimately init) needs to receive the terminating
status. A task that has been otherwise killed, but is awaiting
its status to be obtained is in the 'Z' or zombie state. If
the code for either the child task or its parent was improperly
written, the death of a parent could allow a child to wait
forever (zombie).

The fix is to fix the code. Your temporary fix is to use
Ctrl-Alt-backspace to kill the X11 server (the parent).
If it doesn't restart (it's not a kernel problem, it's
a distribution problem), you can log in as root and
execute:

 	/etc/X11/prefdm &

All these little windows and icons are the 'children' of
the X server. The above is a temporary work-around for
a non-kernel problem.


Cheers,
Dick Johnson
Penguin : Linux version 2.6.9 on an i686 machine (5537.79 BogoMips).
  Notice : All mail here is now cached for review by John Ashcroft.
                  98.36% of all statistics are fiction.


* Re: is killing zombies possible w/o a reboot?
  2004-11-03 15:25     ` DervishD
  2004-11-03 15:25       ` Måns Rullgård
@ 2004-11-03 16:47       ` Gene Heskett
  2004-11-03 17:44         ` DervishD
  2004-11-04 16:01         ` kernel
  2004-11-03 22:58       ` Bill Davidsen
  2004-11-03 23:18       ` Adam Heath
  3 siblings, 2 replies; 99+ messages in thread
From: Gene Heskett @ 2004-11-03 16:47 UTC (permalink / raw)
  To: linux-kernel; +Cc: DervishD, Måns Rullgård

On Wednesday 03 November 2004 10:25, DervishD wrote:
>    Hi all :)
>
> * Måns Rullgård <mru@inprovide.com> dixit:
>> >> I'd tried to kill the zombie earlier but couldn't.
>> >> Isn't there some way to clean up a &^$#^#@)_ zombie?
>> >
>> > Kill the parent, is the only (portable) way.
>>
>> Perhaps not as portable, but another possible, though slightly
>> complicated, way is to ptrace the parent and force it to wait().
>
>    Or write a little program that just 'wait()'s for the specified
>PID's. That is perfectly portable IMHO. But I must admit that the
>preferred way should be killing the parent. 'init' will reap the
>children after that.

But what if there is no parent, since the system has already disposed 
of it?

There was no parent visible to kpm.  Unforch kpm also doesn't 
specifically mark zombies as such either, so it's a bit clueless in 
that regard.  Finding them is usually an exercise in stretching the 
top window out till it's about 20 screens high, as the zombie is 
always going to be at the bottom of the list.

If init can indeed do the cleanup, then how hard is it to have a "kill 
--total procnumber" pass that info into init and let it do its thing?

Or better yet, when X asks me if I want it gone because it's not 
responding to the close button, have X do it all in one swell foop.

>    Raúl Núñez de Arenas Coronado

-- 
Cheers, Gene


* Re: is killing zombies possible w/o a reboot?
  2004-11-03 16:47       ` Gene Heskett
@ 2004-11-03 17:44         ` DervishD
  2004-11-03 18:53           ` Gene Heskett
  2004-11-04 16:01         ` kernel
  1 sibling, 1 reply; 99+ messages in thread
From: DervishD @ 2004-11-03 17:44 UTC (permalink / raw)
  To: Gene Heskett; +Cc: linux-kernel, Måns Rullgård

    Hi Gene :)

 * Gene Heskett <gene.heskett@verizon.net> dixit:
> >    Or write a little program that just 'wait()'s for the specified
> >PID's. That is perfectly portable IMHO. But I must admit that the
> >preferred way should be killing the parent. 'init' will reap the
> >children after that.
> But what if there is no parent, since the system has already disposed 
> of it?

    Then the children are reparented to 'init' and 'init' gets rid of
them. That's the way UNIX behaves.
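
The reparenting can be observed with a short C sketch, assuming a Linux or other POSIX system. The function name orphan_new_parent is hypothetical. On a classic setup the reported PID is 1 ('init'), but on systems that use a "subreaper" or a PID namespace it may be some other process, so the code only reports what getppid() returns.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Fork a "middle" process that forks a grandchild and exits at once.
 * The orphaned grandchild reports its new parent PID through a pipe.
 * Returns that PID, or -1 on error. */
pid_t orphan_new_parent(void)
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    pid_t child = fork();
    if (child == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (child == 0) {                       /* middle process */
        close(fds[0]);
        pid_t middle = getpid();
        pid_t grandchild = fork();
        if (grandchild == 0) {              /* grandchild */
            struct timespec ts = { 0, 1000000 };    /* 1 ms */
            /* Poll (for at most about 5 s) until we have been reparented. */
            for (int i = 0; i < 5000 && getppid() == middle; i++)
                nanosleep(&ts, NULL);
            pid_t new_parent = getppid();
            if (write(fds[1], &new_parent, sizeof new_parent)
                    != (ssize_t)sizeof new_parent)
                _exit(1);
            _exit(0);
        }
        _exit(grandchild == -1 ? 1 : 0);    /* orphan the grandchild */
    }

    close(fds[1]);
    waitpid(child, NULL, 0);                /* reap the middle process */

    pid_t new_parent = -1;
    if (read(fds[0], &new_parent, sizeof new_parent)
            != (ssize_t)sizeof new_parent)
        new_parent = -1;
    close(fds[0]);
    return new_parent;
}
```

Whatever process the grandchild ends up under is then responsible for calling wait() on it when it exits.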

    Raúl Núñez de Arenas Coronado

-- 
Linux Registered User 88736
http://www.dervishd.net & http://www.pleyades.net/


* Re: is killing zombies possible w/o a reboot?
  2004-11-03 15:25       ` Måns Rullgård
@ 2004-11-03 17:49         ` DervishD
  0 siblings, 0 replies; 99+ messages in thread
From: DervishD @ 2004-11-03 17:49 UTC (permalink / raw)
  To: Måns Rullgård; +Cc: linux-kernel

    Hi Måns :)

 * Måns Rullgård <mru@inprovide.com> dixit:
> >> >> I'd tried to kill the zombie earlier but couldn't.
> >> >> Isn't there some way to clean up a &^$#^#@)_ zombie?
> >> > Kill the parent, is the only (portable) way.
> >> Perhaps not as portable, but another possible, though slightly
> >> complicated, way is to ptrace the parent and force it to wait().
> >     Or write a little program that just 'wait()'s for the specified
> > PID's. That is perfectly portable IMHO. But I must admit that the
> > preferred way should be killing the parent. 'init' will reap the
> > children after that.
> You can only wait() for your own children.

    Yes, you will receive 'ECHILD'; I didn't remember that, sorry.
Anyway, you shouldn't need to do that, since those zombies should
have been reparented to 'init'.

    But, since SUSv3 doesn't specify which PID should be the parent
when doing the reparenting, PID 0 could be used when reparenting as a
way of telling the kernel "hey, reap those processes". Anyway, since
the kernel does the reparenting, the kernel could get rid of zombies.
I don't really know why 'init' (PID 1) is responsible for this.
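
For a process's own children, something close to this already exists: on Linux (and per POSIX.1-2001), a parent that sets SIGCHLD to SIG_IGN gets its children discarded by the kernel without a zombie stage. A minimal C sketch follows; the function name autoreap_wait_errno is made up for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Ignore SIGCHLD, fork a child that exits immediately, then try to
 * wait for it. Returns the errno from waitpid(), 0 if waitpid()
 * unexpectedly succeeded, or -1 if the setup failed. */
int autoreap_wait_errno(void)
{
    struct sigaction sa, old;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = SIG_IGN;        /* ask the kernel to discard children */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGCHLD, &sa, &old) == -1)
        return -1;

    pid_t child = fork();
    if (child == -1) {
        sigaction(SIGCHLD, &old, NULL);
        return -1;
    }
    if (child == 0)
        _exit(0);

    /* Blocks until the child is gone, then fails: nothing is left to reap. */
    int result = 0;
    if (waitpid(child, NULL, 0) == -1)
        result = errno;

    sigaction(SIGCHLD, &old, NULL); /* restore the previous disposition */
    return result;
}
```

The trade-off is that the parent can no longer collect the exit status, which is the reason the zombie entry exists in the first place.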

    Raúl Núñez de Arenas Coronado

-- 
Linux Registered User 88736
http://www.dervishd.net & http://www.pleyades.net/


* Re: is killing zombies possible w/o a reboot?
  2004-11-03 17:44         ` DervishD
@ 2004-11-03 18:53           ` Gene Heskett
  2004-11-03 19:01             ` Doug McNaught
                               ` (3 more replies)
  0 siblings, 4 replies; 99+ messages in thread
From: Gene Heskett @ 2004-11-03 18:53 UTC (permalink / raw)
  To: linux-kernel; +Cc: DervishD, Måns Rullgård

On Wednesday 03 November 2004 12:44, DervishD wrote:
>    Hi Gene :)
>
> * Gene Heskett <gene.heskett@verizon.net> dixit:
>> >    Or write a little program that just 'wait()'s for the
>> > specified PID's. That is perfectly portable IMHO. But I must
>> > admit that the preferred way should be killing the parent.
>> > 'init' will reap the children after that.
>>
>> But what if there is no parent, since the system has already
>> disposed of it?
>
>    Then the children are reparented to 'init' and 'init' gets rid
> of them. That's the way UNIX behaves.

Unforch, I've *never* had it work that way.  Any dead process I've 
ever had while running linux has only been disposable by a reboot.

>    Raúl Núñez de Arenas Coronado

-- 
Cheers, Gene


* Re: is killing zombies possible w/o a reboot?
  2004-11-03 18:53           ` Gene Heskett
@ 2004-11-03 19:01             ` Doug McNaught
  2004-11-03 19:03             ` Måns Rullgård
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 99+ messages in thread
From: Doug McNaught @ 2004-11-03 19:01 UTC (permalink / raw)
  To: gene.heskett; +Cc: linux-kernel, DervishD, Måns Rullgård

Gene Heskett <gene.heskett@verizon.net> writes:

> On Wednesday 03 November 2004 12:44, DervishD wrote:

>>    Then the children are reparented to 'init' and 'init' gets rid
>> of them. That's the way UNIX behaves.
>
> Unforch, I've *never* had it work that way.  Any dead process I've 
> ever had while running linux has only been disposable by a reboot.

Then it's either (a) not actually a zombie (perhaps stuck in D state),
or (b) its parent is still alive.

A zombie process is just an entry in the process table where the exit
status etc are stored until the parent reaps it--all other resources
(memory, FDs etc) have been released.  So if your "zombie" process is
actually taking up resources (which I think you said in an earlier
post), there's something else at work.
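
As a sketch of what "just an entry in the process table" means in practice, the C function below (the name zombie_exit_code is invented for this example) lets a child exit, leaves it as a zombie, sends it SIGKILL, and then reaps it. It assumes Linux or another system that supports waitid() with WNOWAIT.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits with `code`, keep it as a zombie for a
 * moment, send it SIGKILL, then reap it. Returns the exit code the
 * parent finally collects, or -1 on error. */
int zombie_exit_code(int code)
{
    pid_t child = fork();
    if (child == -1)
        return -1;
    if (child == 0)
        _exit(code);

    /* Wait until the child has exited, but leave it unreaped:
     * WNOWAIT keeps the zombie entry in the process table. */
    siginfo_t info;
    if (waitid(P_PID, (id_t)child, &info, WEXITED | WNOWAIT) == -1)
        return -1;

    /* The child is a zombie now. The signal is accepted, but there is
     * nothing left to kill, so the stored exit status is unchanged. */
    kill(child, SIGKILL);

    /* Only the parent's wait removes the entry. */
    int status;
    if (waitpid(child, &status, 0) != child)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The exit code survives the SIGKILL because the child had already terminated; the only thing that makes the entry go away is the waitpid() call in the parent.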

-Doug


* Re: is killing zombies possible w/o a reboot?
  2004-11-03 18:53           ` Gene Heskett
  2004-11-03 19:01             ` Doug McNaught
@ 2004-11-03 19:03             ` Måns Rullgård
  2004-11-03 19:24               ` Gene Heskett
  2004-11-03 19:06             ` Valdis.Kletnieks
  2004-11-03 19:26             ` DervishD
  3 siblings, 1 reply; 99+ messages in thread
From: Måns Rullgård @ 2004-11-03 19:03 UTC (permalink / raw)
  To: gene.heskett; +Cc: linux-kernel, DervishD

Gene Heskett <gene.heskett@verizon.net> writes:

> On Wednesday 03 November 2004 12:44, DervishD wrote:
>>    Hi Gene :)
>>
>> * Gene Heskett <gene.heskett@verizon.net> dixit:
>>> >    Or write a little program that just 'wait()'s for the
>>> > specified PID's. That is perfectly portable IMHO. But I must
>>> > admit that the preferred way should be killing the parent.
>>> > 'init' will reap the children after that.
>>>
>>> But what if there is no parent, since the system has already
>>> disposed of it?
>>
>>    Then the children are reparented to 'init' and 'init' gets rid
>> of them. That's the way UNIX behaves.
>
> Unforch, I've *never* had it work that way.  Any dead process I've 
> ever had while running linux has only been disposable by a reboot.

That's because its parent was still sitting around refusing to wait()
for them.

-- 
Måns Rullgård
mru@inprovide.com


* Re: is killing zombies possible w/o a reboot?
  2004-11-03 18:53           ` Gene Heskett
  2004-11-03 19:01             ` Doug McNaught
  2004-11-03 19:03             ` Måns Rullgård
@ 2004-11-03 19:06             ` Valdis.Kletnieks
  2004-11-03 19:26               ` Gene Heskett
  2004-11-03 19:26             ` DervishD
  3 siblings, 1 reply; 99+ messages in thread
From: Valdis.Kletnieks @ 2004-11-03 19:06 UTC (permalink / raw)
  To: gene.heskett; +Cc: linux-kernel, DervishD, Måns Rullgård


On Wed, 03 Nov 2004 13:53:39 EST, Gene Heskett said:
> On Wednesday 03 November 2004 12:44, DervishD wrote:

> >    Then the children are reparented to 'init' and 'init' gets rid
> > of them. That's the way UNIX behaves.
> 
> Unforch, I've *never* had it work that way.  Any dead process I've 
> ever had while running linux has only been disposable by a reboot.

The problem likely isn't the true "zombie" - the only thing that *those*
processes have left is a process table entry to save the exit code for a wait()
syscall that might not happen anytime soon.  And unless you have hundreds
of them sitting around causing pressure on the 32K process limit, they're
probably not a big problem.

More likely, what you're looking at is some process that has gone down into the
kernel on some syscall or other and gotten blocked.  Since signals aren't
delivered until it returns, it ends up "unkillable".

Traditionally, a common cause for such wedging was a lost/misplaced interrupt
from an I/O operation, so a read()/write()/ioctl() call wouldn't return because
the device hadn't reported it completed. (tape drives were notorious for this).
Often, power-cycling the I/O device would cause an unsolicited interrupt to be
generated, which would clear the "waiting for interrupt" issue and allow the
process to return....
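
To tell the two cases apart on Linux, the process state letter can be read from /proc/<pid>/stat: 'Z' is a zombie, 'D' is uninterruptible sleep inside the kernel. Here is a minimal C sketch, assuming /proc is mounted; the helper name proc_state is made up for this example.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Return the one-letter state of a process as reported in
 * /proc/<pid>/stat (e.g. 'R', 'S', 'D', 'Z'), or '?' on failure. */
char proc_state(pid_t pid)
{
    char path[64];
    char buf[512];

    snprintf(path, sizeof path, "/proc/%ld/stat", (long)pid);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return '?';

    size_t n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';

    /* The line looks like "pid (comm) S ...". The command name may
     * itself contain spaces or ')', so search for the last ')'. */
    char *p = strrchr(buf, ')');
    if (p == NULL || p[1] != ' ' || p[2] == '\0')
        return '?';
    return p[2];
}
```

Running it against the PID of the stuck process should settle the question: a 'Z' can only be cleared through its parent, while a 'D' means the process is still blocked in a system call.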




* Re: is killing zombies possible w/o a reboot?
  2004-11-03 16:46     ` linux-os
@ 2004-11-03 19:12       ` Gene Heskett
  2004-11-03 19:56       ` Måns Rullgård
  1 sibling, 0 replies; 99+ messages in thread
From: Gene Heskett @ 2004-11-03 19:12 UTC (permalink / raw)
  To: linux-kernel, linux-os; +Cc: bert hubert

On Wednesday 03 November 2004 11:46, linux-os wrote:
>On Wed, 3 Nov 2004, Gene Heskett wrote:
>> On Wednesday 03 November 2004 09:33, bert hubert wrote:
>>> On Wed, Nov 03, 2004 at 07:51:39AM -0500, Gene Heskett wrote:
>>>> But I'd tried to run gnomeradio earlier to listen to the
>>>> elections,
>>>
>>> Depressing enough.
>>>
>>>> I'd tried to kill the zombie earlier but couldn't.
>>>> Isn't there some way to clean up a &^$#^#@)_ zombie?
>>>
>>> Kill the parent, is the only (portable) way.
>>
>> The parent would have been the icon.  It opened its usual sized
>> small window, but never did anything to it. I clicked on closing
>> the window, but 10 seconds later the system asked me if I wanted
>> to kill it as it wasn't responding. I said yes, the window
>> disappeared, but kpm said gomeradio was still present as process
>> 8162, and that wasn't killable.  Funny thing is, on the reboot, it
>> automaticly self restored and ran just fine.
>>
>> I consider this as one of linux's achilles heels.  Such a hung and
>> dead process can be properly disposed of by a primitive os called
>> os9 because it keeps track of all resources in tables in the
>> kernel memory space.  Issueing a kill procnumber removes the
>> process from the exec queue, reclaims all its memory to the system
>> free memory pool, and removes it from the IRQ service tables if an
>> entry exists there.  Near instant, total cleanup, nothing left, in
>> about 250 microseconds max. 1.79 mhz cpu's aren't quite instant :)
>>
>> Lets just say that I think having to reboot because of a zombie
>> that has resources locked up, and have the reboot fubared by it
>> too, aren't exactly friendly actions.
>
>[SNIPPED....]
>
>There is no problem killing a task and freeing its resources.
>The problem is that Linux and other Unix variations need to
>do this in a specific manner. That manner being that some
>parent (or ultimately init) needs to receive the terminating
>status. A task that has been otherwise killed, but is awaiting
>its status to be obtained is in the 'Z' or zombie state. If
>the code for either the child task or its parent was improperly
>written, the death of a parent could allow a child to wait
>forever (zombie).
>
>The fix is to fix the code. 

In other words, it's gnomeradio that needs fixing then?

It's the best 'radio' proggy I've run across that works with my 
hardware, but I'm not sure it has a support person at this late date.  
It's probably not been touched in 2 years.  KDE doesn't appear to have 
a similar util that I've run across in the menus so far, and it's 
3.3.0 here.

All of which seems to be dancing around the real problem though.  
There seems to be no handy (to the user) path into the kernel to 
allow such an unconditional kill function.  root should have that 
ability.

>Your temporary fix is to use 
>Ctrl-Alt-backspace to kill the X11 server (the parent).

The logout took about 2 minutes because X couldn't clear itself 
either.

>If it doesn't restart (it's not a kernel problem, it's
>a distribution problem), you can log in as root and
>execute:
>
>  /etc/X11/prefdm &

I'll try that next time.

>All these little windows and icons are the 'children' of
>the X server. The above is a temporary work-around for
>a non-kernel problem.

But a problem the kernel really should be capable of handling 
transparently.
>
>Cheers,
>Dick Johnson
>Penguin : Linux version 2.6.9 on an i686 machine (5537.79 BogoMips).
>  Notice : All mail here is now cached for review by John Ashcroft.

What on earth for?  I don't issue anything he would be interested in 
except the first part of my sig.  And that's been in my sig for a year 
or so, and will stay there till the so-called Patriot Act is 
repealed.  John Ashcroft has done more damage to democracy 
single-handedly because of his paranoia than any other 20 men in our 
history.  G. Washington certainly wouldn't have tolerated such a 
person in his 1st term of government.

Depending on the mailing list, data here has a lifetime as short as 30 
days.

Sorry about spilling politics into the list folks.

-- 
Cheers, Gene


* Re: is killing zombies possible w/o a reboot?
  2004-11-03 19:03             ` Måns Rullgård
@ 2004-11-03 19:24               ` Gene Heskett
  2004-11-03 19:33                 ` Doug McNaught
  2004-11-03 19:34                 ` Måns Rullgård
  0 siblings, 2 replies; 99+ messages in thread
From: Gene Heskett @ 2004-11-03 19:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: Måns Rullgård, DervishD

On Wednesday 03 November 2004 14:03, Måns Rullgård wrote:
>Gene Heskett <gene.heskett@verizon.net> writes:
>> On Wednesday 03 November 2004 12:44, DervishD wrote:
>>>    Hi Gene :)
>>>
>>> * Gene Heskett <gene.heskett@verizon.net> dixit:
>>>> >    Or write a little program that just 'wait()'s for the
>>>> > specified PID's. That is perfectly portable IMHO. But I must
>>>> > admit that the preferred way should be killing the parent.
>>>> > 'init' will reap the children after that.
>>>>
>>>> But what if there is no parent, since the system has already
>>>> disposed of it?
>>>
>>>    Then the children are reparented to 'init' and 'init' gets rid
>>> of them. That's the way UNIX behaves.
>>
>> Unforch, I've *never* had it work that way.  Any dead process I've
>> ever had while running linux has only been disposable by a reboot.
>
>That's because its parent was still sitting around refusing to
> wait() for them.

Define 'parent' when it was a click on the apps icon on the xwindow 
screen that started it, please.

-- 
Cheers, gene
gheskett at wdtv dot com
99.28% setiathome rank, not too bad for a WV hillbilly


* Re: is killing zombies possible w/o a reboot?
  2004-11-03 19:06             ` Valdis.Kletnieks
@ 2004-11-03 19:26               ` Gene Heskett
  2004-11-03 19:33                 ` Valdis.Kletnieks
  2004-11-03 19:42                 ` DervishD
  0 siblings, 2 replies; 99+ messages in thread
From: Gene Heskett @ 2004-11-03 19:26 UTC (permalink / raw)
  To: linux-kernel; +Cc: Valdis.Kletnieks, DervishD, Måns Rullgård

On Wednesday 03 November 2004 14:06, Valdis.Kletnieks@vt.edu wrote:
>On Wed, 03 Nov 2004 13:53:39 EST, Gene Heskett said:
>> On Wednesday 03 November 2004 12:44, DervishD wrote:
>> >    Then the children are reparented to 'init' and 'init' gets
>> > rid of them. That's the way UNIX behaves.
>>
>> Unforch, I've *never* had it work that way.  Any dead process I've
>> ever had while running linux has only been disposable by a reboot.
>
>The problem  likely isn't the true "zombie" - the only thing that
> *those* processes have left is a process table entry to save the
> exit code for a wait() syscall that might not happen anytime soon. 
> And unless you have hundreds of them sitting around causing
> pressure on the 32K process limit, they're probably not a big
> problem.
>
>More likely, what you're looking at is some process that has gone
> down into the kernel on some syscall or other and gotten blocked. 
> Since signals aren't delivered until it returns, it ends up
> "unkillable".
>
>Traditionally, a common cause for such wedging was a lost/misplaced
> interrupt from an I/O operation, so a read()/write()/ioctl() call
> wouldn't return because the device hadn't reported it completed.
> (tape drives were notorious for this). Often, power-cycling the I/O
> device would cause an unsolicited interrupt to be generated, which
> would clear the "waiting for interrupt" issue and allow the process
> to return....

Well, since the "device", a bt878-based Hauppauge TV card, is sitting in 
a PCI socket, that's even more drastic than a reboot.

-- 
Cheers, gene
gheskett at wdtv dot com
99.28% setiathome rank, not too bad for a WV hillbilly

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 18:53           ` Gene Heskett
                               ` (2 preceding siblings ...)
  2004-11-03 19:06             ` Valdis.Kletnieks
@ 2004-11-03 19:26             ` DervishD
  2004-11-03 20:18               ` Gene Heskett
                                 ` (2 more replies)
  3 siblings, 3 replies; 99+ messages in thread
From: DervishD @ 2004-11-03 19:26 UTC (permalink / raw)
  To: Gene Heskett; +Cc: linux-kernel, Måns Rullgård

    Hi Gene :)

 * Gene Heskett <gene.heskett@verizon.net> dixit:
> >    Then the children are reparented to 'init' and 'init' gets rid
> > of them. That's the way UNIX behaves.
> Unforch, I've *never* had it work that way.  Any dead process I've 
> ever had while running linux has only been disposable by a reboot.

    Well, you know, shit happens... Anyway, could you define 'dead'?
Because if you're talking about zombies whose parent has died, they're
easy to get rid of: just wait until init reaps them (usually less
than 5 minutes after they die). If you are talking about zombies whose
parent is still alive, then it's a bug in the application, not the
kernel. In fact, I wouldn't like it if the kernel reaped my children
before I do, just in case I want to do something.
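
    The reparenting step can be watched directly. What follows is only
a rough sketch, assuming a Unix system with fork(); the function name
orphan_new_parent is made up for illustration. A child forks a
grandchild and exits without waiting for it, and the grandchild then
reports who its parent has become. On a classic setup that is init
(PID 1); on systems with a "subreaper" process it may be some other
PID, so the sketch only reports the value.

```python
import os
import time


def orphan_new_parent(timeout=5.0):
    """Return (intermediate_pid, parent_pid_seen_by_the_orphan)."""
    read_fd, write_fd = os.pipe()
    child = os.fork()
    if child == 0:
        # Intermediate parent: fork a grandchild, then exit at once
        # without ever calling wait() on it.
        os.close(read_fd)
        my_pid = os.getpid()
        if os.fork() == 0:
            # Grandchild: poll until the kernel has given us a new parent.
            deadline = time.monotonic() + timeout
            while os.getppid() == my_pid and time.monotonic() < deadline:
                time.sleep(0.01)
            os.write(write_fd, str(os.getppid()).encode())
            os._exit(0)
        os._exit(0)
    # Original process: reap the intermediate parent, then read the report.
    os.close(write_fd)
    os.waitpid(child, 0)
    with os.fdopen(read_fd) as pipe:
        new_parent = int(pipe.read())
    return child, new_parent
```

Calling orphan_new_parent() and printing the result shows the old and
new parent PIDs side by side. The orphan itself is later reaped by
whichever process adopted it.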

    If you're talking about unkillable processes (those stuck in
disk-sleep state), you're right: only rebooting can kill them
(although sometimes they go out of D state and die normally). Bad
luck for you if any dead process you've ever had while running linux
has been of this kind :(

    Raúl Núñez de Arenas Coronado

-- 
Linux Registered User 88736
http://www.dervishd.net & http://www.pleyades.net/

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 19:24               ` Gene Heskett
@ 2004-11-03 19:33                 ` Doug McNaught
  2004-11-03 19:34                 ` Måns Rullgård
  1 sibling, 0 replies; 99+ messages in thread
From: Doug McNaught @ 2004-11-03 19:33 UTC (permalink / raw)
  To: gheskett; +Cc: linux-kernel, Måns Rullgård, DervishD

Gene Heskett <gheskett@wdtv.com> writes:

> On Wednesday 03 November 2004 14:03, Måns Rullgård wrote:

>>
>>That's because its parent was still sitting around refusing to
>> wait() for them.
>
> Define 'parent' when it was a click on the apps icon on the xwindow 
> screen that started it, please.

Whichever process called fork() to create the app process is the
parent.  Sounds like it's some component of the desktop environment. 

-Doug

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 19:26               ` Gene Heskett
@ 2004-11-03 19:33                 ` Valdis.Kletnieks
  2004-11-03 20:09                   ` Gene Heskett
  2004-11-03 19:42                 ` DervishD
  1 sibling, 1 reply; 99+ messages in thread
From: Valdis.Kletnieks @ 2004-11-03 19:33 UTC (permalink / raw)
  To: gheskett; +Cc: linux-kernel, DervishD, Måns Rullgård

On Wed, 03 Nov 2004 14:26:23 EST, Gene Heskett said:

> Well, since the "device", a bt878 based Haupagge tv card is sitting in 
> a pci socket, thats even more drastic than a reboot.

Not if you have a good hot-swap PCI cage. ;)

Anyhow, that points even more at a driver issue for the bt878 -
if you can get Sysrq-T output, where does it say the hung process is
inside the kernel?
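
If SysRq is not usable, a narrower hint is sometimes available from
/proc. The sketch below assumes a Linux kernel that exposes
/proc/<pid>/wchan; whether that file carries a useful symbol name
depends on the kernel configuration, and it is not a replacement for
the full stack trace SysRq-T prints. The function name
kernel_wait_channel is made up for illustration.

```python
import os


def kernel_wait_channel(pid):
    """Return the kernel symbol the task is sleeping in, or None.

    None means the file is missing or unreadable, or the kernel
    reported "0" (the task is not blocked, or the symbol is hidden).
    """
    try:
        with open(f"/proc/{pid}/wchan") as f:
            name = f.read().strip()
    except OSError:
        return None
    return name if name and name != "0" else None
```

For example, kernel_wait_channel(8162) would be the call for the
gnomeradio PID reported earlier in the thread.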

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 19:24               ` Gene Heskett
  2004-11-03 19:33                 ` Doug McNaught
@ 2004-11-03 19:34                 ` Måns Rullgård
  1 sibling, 0 replies; 99+ messages in thread
From: Måns Rullgård @ 2004-11-03 19:34 UTC (permalink / raw)
  To: gheskett; +Cc: linux-kernel, DervishD

Gene Heskett <gheskett@wdtv.com> writes:

> On Wednesday 03 November 2004 14:03, Måns Rullgård wrote:
>>Gene Heskett <gene.heskett@verizon.net> writes:
>>> On Wednesday 03 November 2004 12:44, DervishD wrote:
>>>>    Hi Gene :)
>>>>
>>>> * Gene Heskett <gene.heskett@verizon.net> dixit:
>>>>> >    Or write a little program that just 'wait()'s for the
>>>>> > specified PID's. That is perfectly portable IMHO. But I must
>>>>> > admit that the preferred way should be killing the parent.
>>>>> > 'init' will reap the children after that.
>>>>>
>>>>> But what if there is no parent, since the system has already
>>>>> disposed of it?
>>>>
>>>>    Then the children are reparented to 'init' and 'init' gets rid
>>>> of them. That's the way UNIX behaves.
>>>
>>> Unforch, I've *never* had it work that way.  Any dead process I've
>>> ever had while running linux has only been disposable by a reboot.
>>
>>That's because its parent was still sitting around refusing to
>> wait() for them.
>
> Define 'parent' when it was a click on the apps icon on the xwindow 
> screen that started it, please.

Run "ps axf".
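
The same parent lookup can be scripted for a single PID. This is a
minimal sketch, assuming Linux with /proc mounted; parent_of and
ancestry are illustrative names, not part of ps or any other tool.
It follows the PPid field in /proc/<pid>/status upwards until it
runs out of parents.

```python
import os


def parent_of(pid):
    """Return (name, ppid) read from /proc/<pid>/status."""
    name, ppid = "?", 0
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            key, _, value = line.partition(":")
            if key == "Name":
                name = value.strip()
            elif key == "PPid":
                ppid = int(value)
    return name, ppid


def ancestry(pid):
    """Return [(pid, name), ...] from the given PID up to the top."""
    chain = []
    while pid > 0:
        try:
            name, ppid = parent_of(pid)
        except OSError:
            break  # process vanished, or /proc entry not readable
        chain.append((pid, name))
        pid = ppid
    return chain
```

The second entry of ancestry(pid) is the process that would have to
wait() for that PID.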

-- 
Måns Rullgård
mru@inprovide.com

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 19:26               ` Gene Heskett
  2004-11-03 19:33                 ` Valdis.Kletnieks
@ 2004-11-03 19:42                 ` DervishD
  2004-11-03 23:12                   ` Bill Davidsen
  1 sibling, 1 reply; 99+ messages in thread
From: DervishD @ 2004-11-03 19:42 UTC (permalink / raw)
  To: Gene Heskett; +Cc: linux-kernel, Valdis.Kletnieks, Måns Rullgård

    Hi Gene :)

 * Gene Heskett <gheskett@wdtv.com> dixit:
> >Traditionally, a common cause for such wedging was a lost/misplaced
> > interrupt from an I/O operation, so a read()/write()/ioctl() call
> > wouldn't return because the device hadn't reported it completed.
> > (tape drives were notorious for this). Often, power-cycling the I/O
> > device would cause an unsolicited interrupt to be generated, which
> > would clear the "waiting for interrupt" issue and allow the process
> > to return....
> Well, since the "device", a bt878 based Haupagge tv card is sitting in 
> a pci socket, thats even more drastic than a reboot.

    Do you mean your Hauppauge got stuck in disk-sleep state? Wow,
that sounds *weird*...

    I think that the parent (which is whatever process did the fork
when you clicked your mouse) is still alive and forgetting to do the
'wait()' for its children.
 
    Raúl Núñez de Arenas Coronado

-- 
Linux Registered User 88736
http://www.dervishd.net & http://www.pleyades.net/

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 16:46     ` linux-os
  2004-11-03 19:12       ` Gene Heskett
@ 2004-11-03 19:56       ` Måns Rullgård
  1 sibling, 0 replies; 99+ messages in thread
From: Måns Rullgård @ 2004-11-03 19:56 UTC (permalink / raw)
  To: linux-kernel

linux-os <linux-os@chaos.analogic.com> writes:

> The fix is to fix the code. Your temporary fix is to use
> Ctrl-Alt-backspace to kill the X11 server (the parent).

The X server is not the parent.  The desktop manager (or whatever
those beasts are called) is more likely to be.

> All these little windows and icons are the 'children' of the X
> server.

The X server manages a set of windows, arranged in a logical tree
structure, with all windows ultimately descending from the root
windows.  The parent-child relationships between windows should under
no circumstance be confused, or compared, with that between processes.
Any process, on any machine on the network, can, given enough
privileges, create subwindows of any window on the X server.  Windows
and processes belong to different worlds, the only connection between
which is that processes create windows, simply since anything that
happens in the computer is done by a process (or interrupt handler).

Am I really reading this on linux-kernel?

-- 
Måns Rullgård
mru@inprovide.com


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 19:33                 ` Valdis.Kletnieks
@ 2004-11-03 20:09                   ` Gene Heskett
  2004-11-04 19:24                     ` Bill Davidsen
  0 siblings, 1 reply; 99+ messages in thread
From: Gene Heskett @ 2004-11-03 20:09 UTC (permalink / raw)
  To: linux-kernel; +Cc: Valdis.Kletnieks, DervishD, Måns Rullgård

On Wednesday 03 November 2004 14:33, Valdis.Kletnieks@vt.edu wrote:
>On Wed, 03 Nov 2004 14:26:23 EST, Gene Heskett said:
>> Well, since the "device", a bt878 based Haupagge tv card is
>> sitting in a pci socket, thats even more drastic than a reboot.
>
>Not if you have a good hot-swap PCI cage. ;)
>
>Anyhow, that points even more at a driver issue for the bt878 -
>if you can get Sysrq-T output, where does it say the hung process is
>inside the kernel?

That's another thing I've had compiled in since forever, but it so 
seldom actually *works* that I've tended to forget about it.

-- 
Cheers, gene
gheskett at wdtv dot com
99.28% setiathome rank, not too bad for a WV hillbilly

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 16:24   ` Gene Heskett
  2004-11-03 16:46     ` linux-os
@ 2004-11-03 20:13     ` Helge Hafting
  2004-11-03 20:40       ` Gene Heskett
  1 sibling, 1 reply; 99+ messages in thread
From: Helge Hafting @ 2004-11-03 20:13 UTC (permalink / raw)
  To: Gene Heskett; +Cc: linux-kernel, bert hubert

On Wed, Nov 03, 2004 at 11:24:19AM -0500, Gene Heskett wrote:
> On Wednesday 03 November 2004 09:33, bert hubert wrote:
> >On Wed, Nov 03, 2004 at 07:51:39AM -0500, Gene Heskett wrote:
> >> But I'd tried to run gnomeradio earlier to listen to the
> >> elections,
> >
> >Depressing enough.
> >
> >> I'd tried to kill the zombie earlier but couldn't.
> >> Isn't there some way to clean up a &^$#^#@)_ zombie?
> >
> >Kill the parent, is the only (portable) way.
> 
> The parent would have been the icon.  It opened its usual sized small 
> window, but never did anything to it. I clicked on closing the 
> window, but 10 seconds later the system asked me if I wanted to kill 
> it as it wasn't responding. I said yes, the window disappeared, but 
> kpm said gomeradio was still present as process 8162, and that wasn't 
> killable.  Funny thing is, on the reboot, it automaticly self 
> restored and ran just fine.
> 
> I consider this as one of linux's achilles heels.  Such a hung and 
> dead process can be properly disposed of by a primitive os called os9 
> because it keeps track of all resources in tables in the kernel 
> memory space.  Issueing a kill procnumber removes the process from 
> the exec queue, reclaims all its memory to the system free memory 
> pool, and removes it from the IRQ service tables if an entry exists 
> there.  Near instant, total cleanup, nothing left, in about 250 
> microseconds max. 1.79 mhz cpu's aren't quite instant :)
> 
Killing a process in linux with "kill -9 pid" also releases all resources, 
such as memory and file descriptors.  The resource consumption of a 
"zombie" is measured in bytes, not kilobytes.

> Lets just say that I think having to reboot because of a zombie that 
> has resources locked up, and have the reboot fubared by it too, 
> aren't exactly friendly actions.
> 
Did you try logging out from the graphical user interface,
and then logging in again?
GUI programs are usually children of the window manager (or some
app launcher), and all of these quit when you log out.  A plain
zombie started from the GUI will disappear after that.  

Only something stuck in a device driver will need the reboot,
but that tends to be a bug in the driver.
You can try unloading the driver module, but linux has a
nasty tendency to answer that with an OOPS or worse.  When
something goes wrong - it does so properly and thoroughly. :-)

> I fully realise that linux has a much more complex method of 
> allocating resources, but doesn't it *know* exactly what resources 
> have been passed out to each process?
> 
Yes it does - the problem is that not all resources are managed
by processes.  Some allocations are managed by drivers, so a driver
bug can get the device into an unusable state _and_ tie up the
process(es) that were using the driver at the moment.

> And why is there no entry from the kill function into that resource 
> management portion of the kernel so that this could also be done by 
> the linux kernel, say with a "kill --total procnumber"?
> 
Interesting, but you might need a path from "kill" into
every device driver. :-/  And of course it still won't work 
if there is a bug in the driver. 

> Seems like a heck of a good question to me since an os written to run 
> on a 64k machine in 1981, and expanded to run on a 128K to 2 megabyte 
> machine in 1986 can do it just fine.  Even if that process is still 
> running and spitting out data to its parent window/shell!  Or if its 
> crashed and scribbled over all its memory, makes no difference to 
> os9.  You (root) wants it gone, fine, its gone.
> 
Can os9 do this if the process is busy calling into a buggy
device driver that simply doesn't return or perhaps believes
that some dma operation into process memory is taking forever?
Or perhaps os9 doesn't have lots and lots of drivers written by
different people with varying competence? 

Often, the real solution is to fix the driver to deal with
"unexpected" conditions.

Helge Hafting


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 19:26             ` DervishD
@ 2004-11-03 20:18               ` Gene Heskett
  2004-11-03 22:15               ` Jim Nelson
  2004-11-03 23:07               ` Bill Davidsen
  2 siblings, 0 replies; 99+ messages in thread
From: Gene Heskett @ 2004-11-03 20:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: DervishD, Måns Rullgård

On Wednesday 03 November 2004 14:26, DervishD wrote:
>    Hi Gene :)
>
> * Gene Heskett <gene.heskett@verizon.net> dixit:
>> >    Then the children are reparented to 'init' and 'init' gets
>> > rid of them. That's the way UNIX behaves.
>>
>> Unforch, I've *never* had it work that way.  Any dead process I've
>> ever had while running linux has only been disposable by a reboot.
>
>    Well, you know, shit happens... Anyway, could you define 'dead'?
>Because if you're talking about zombies whose parent dies, they're
>killable easily: just wait until init reaps them (usually in less
>than 5 minutes since they dead). If you are talking about zombies
> who has their parent alive, then it's a bug in the application, not
> the kernel. In fact I wouldn't like if the kernel reaps my children
> before I do, just in case I want to do something.
>
>    If you're talking about unkillable processes (those stuck in
>disk-sleep state), you're right: only rebooting can kill them
>(although sometimes they go out of D state and die normally). Bad
>luck for you if any dead process you've ever had while running linux
>has been of this kind :(
>
>    Raúl Núñez de Arenas Coronado

That seems to be the only kind of dead process I get, and that's not 
too often.  Booted to 2.6.10-rc1-bk11 now, and it's all working just fine 
except for one messydos patch that finally must have made it into the 
tree.

As it appears I do not have a prayer of convincing folks otherwise 
about this issue, I suggest we let this thread die a well-deserved 
death till it bites me or someone else again.  I'll summarize: 
os9/nitros9 handles this situation effortlessly and flawlessly, and I 
expected a 150x more sophisticated os to do likewise.  My mistake.  
OTOH, it's one hell of a versatile os IMNSHO.  I'm not going away just 
because it bites me occasionally.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
99.28% setiathome rank, not too shabby for a WV hillbilly
Yahoo.com attorneys please note, additions to this message
by Gene Heskett are:
Copyright 2004 by Maurice Eugene Heskett, all rights reserved.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 20:13     ` Helge Hafting
@ 2004-11-03 20:40       ` Gene Heskett
  2004-11-04  0:43         ` Kurt Wall
  2004-11-04 10:07         ` Matthias Andree
  0 siblings, 2 replies; 99+ messages in thread
From: Gene Heskett @ 2004-11-03 20:40 UTC (permalink / raw)
  To: linux-kernel; +Cc: Helge Hafting, bert hubert

On Wednesday 03 November 2004 15:13, Helge Hafting wrote:
>On Wed, Nov 03, 2004 at 11:24:19AM -0500, Gene Heskett wrote:
[...]
>> Lets just say that I think having to reboot because of a zombie
>> that has resources locked up, and have the reboot fubared by it
>> too, aren't exactly friendly actions.
>
>Did you try logging out from the graphical user interface,
>and then logging in again?

It took around 2 minutes for the logout of X to get back to a VC.
So obviously something slowed it down, as that's a 4 second operation 
here normally.  And it didn't surprise me when the "reboot" shutdown 
hung on "stopping alsasound" and I had to use the reset button.
[...]
>> I fully realise that linux has a much more complex method of
>> allocating resources, but doesn't it *know* exactly what resources
>> have been passed out to each process?
>
>Yes it does - the problem is that not all resources are managed
>by processes.  Some allocations are managed by drivers, so a driver
>bug can get the device into a unuseable state _and_ tie up the
>process(es) that were using the driver at the moment.

This, from my viewpoint, is wrong.  The kernel, and only the kernel, 
should be ultimately responsible for handing out resources, and 
reclaiming them at its convenience.

>> And why is there no entry from the kill function into that
>> resource management portion of the kernel so that this could also
>> be done by the linux kernel, say with a "kill --total procnumber"?
>
>Interesting, but you might need a path from "kill" into
>every device driver. :-/  And of course it wtill won't work
>if there is a bug in the driver.

That's the fault of the design IMO.

>> Seems like a heck of a good question to me since an os written to
>> run on a 64k machine in 1981, and expanded to run on a 128K to 2
>> megabyte machine in 1986 can do it just fine.  Even if that
>> process is still running and spitting out data to its parent
>> window/shell!  Or if its crashed and scribbled over all its
>> memory, makes no difference to os9.  You (root) wants it gone,
>> fine, its gone.
>
>Can os9 do this if the process is busy calling into a buggy
>device driver that simply doesn't return or perhaps believes
>that some dma operation into process memory is taking forever?
>Or perhaps os9 doesn't have lots and lots of drivers written by
>different people with varying competence?

It did have quite a few authors involved in it over the years, 
including me.  I did many of its utilities, and converted the rbf.mn 
from 6809 code to 6309 code, roughly doubling its speed without 
fiddling with the clock speed, which is married to the video on that 
machine.  I also did a couple of its clock modules, which are the 
heart of the multitasking it does.  And yes, it could kill, 
absolutely cleanly, any process you named on the command line at any 
time.  Any drivers involved got their scratch space from the caller's 
loading of a set of pointers, so if a driver was being accessed by 2 
or more processes, each instance had its own stack/process space.  
When the process disappeared, the recovery included that space in 
memory.  The driver proper had no long term history of that process's 
actions, even if a disk seek microsleep or similar was in progress 
when the caller disappeared.

>Often, the real solution is to fix the driver to deal with
>"unexpected" conditions.
>
>Helge Hafting

As I said earlier, let's let this horse be buried, "it's dead Jim", and 
my beating on it is only wasting bandwidth.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
99.28% setiathome rank, not too shabby for a WV hillbilly
Yahoo.com attorneys please note, additions to this message
by Gene Heskett are:
Copyright 2004 by Maurice Eugene Heskett, all rights reserved.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 12:51 is killing zombies possible w/o a reboot? Gene Heskett
  2004-11-03 14:33 ` bert hubert
@ 2004-11-03 20:48 ` Tom Felker
  2004-11-03 21:08   ` Gene Heskett
  2004-11-05  0:29   ` Gene Heskett
  1 sibling, 2 replies; 99+ messages in thread
From: Tom Felker @ 2004-11-03 20:48 UTC (permalink / raw)
  To: gene.heskett; +Cc: linux-kernel

On Wednesday 03 November 2004 06:51 am, Gene Heskett wrote:
> Greetings;
>
> I thought I'd get caught up on -bkx kernels and made a -bk8 just now.
>
> But I'd tried to run gnomeradio earlier to listen to the elections,
> but it failed leaving to run, as did tvtime then too, claiming it
> couldn't get a lock on /dev/video0, and gnomeradio apparently left a
> lock on alsasound that prevented the normal gracefull shutdown by
> locking up the shutdown on the "stopping alsasound" line.  So I had
> to use the hardware reset.
>
> I'd tried to kill the zombie earlier but couldn't.
>
> Isn't there some way to clean up a &^$#^#@)_ zombie?

Ok, let me try to explain what probably happened.

First, terminology.  When one process wants to become two processes, it 
fork()s.  One process is the parent, and one is the child.  The child usually 
exec()s to become a different program.  The parent sometimes wants to know 
when the child ends and whether it succeeded.  Thus, the wait() system calls.  
The parent can either check whether a child died, or go to sleep until one 
does.  When the parent is awakened, it's told which child died and what the 
child's exit status was (usually 0 for success).  But if the child dies 
before the parent wait()s, the kernel must keep a record of which child died 
and what its exit status was, and it can't reassign the late child's PID yet.  
This record is a "zombie," and shows up under top or ps with the 'Z' state.  
Zombies do _not_ hold open files, memory, or resources of any kind beyond 
that record.
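
That lifecycle can be reproduced in a few lines. This is a rough sketch, 
assuming Linux with /proc mounted; proc_state and make_and_reap_zombie are 
illustrative names.  The child exits immediately, the parent deliberately 
delays its wait() until the child shows up in state 'Z', and only then 
collects the exit status, which makes the zombie disappear.

```python
import os
import time


def proc_state(pid):
    """Return the one-letter state field from /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so take the first field after the last ')'.
    return data.rsplit(")", 1)[1].split()[0]


def make_and_reap_zombie(exit_code=7, timeout=5.0):
    """Return (pid, state_seen_before_wait, exit_status_from_wait)."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)  # child: die straight away
    # Parent: do not wait() yet, so the dead child lingers as a zombie.
    deadline = time.monotonic() + timeout
    state = proc_state(pid)
    while state != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
        state = proc_state(pid)
    # wait() collects the record the kernel kept; the zombie is now gone.
    _, status = os.waitpid(pid, 0)
    return pid, state, os.WEXITSTATUS(status)
```

The point of the exercise is that nothing had to be "killed": the zombie 
goes away the moment its parent calls wait().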

That's the technical definition of a zombie, which I'm telling you because 
that's probably not your situation:  I assume you used "zombie" as an 
informal term for a process that you can't kill.  Your problem is a process 
in uninterruptible sleep (the "D" state).

When a process executing in userspace wants information from a device, like a 
disk or TV capture card, it calls read(), and context switches into kernel 
space.  Usually, it will take a moment for the data to be available from the 
device, so the process gets put on a wait queue so other processes can run.  
Obviously nothing is deallocated, because everyone expects the process will 
get its data and proceed as normal.  When the device has the data, it 
interrupts the CPU, and the kernel figures out who wanted the data and puts 
them on the run queue.

When a process is on a wait queue waiting for data from a device (the D 
state), it's impossible to kill.  This is because otherwise, when the 
interrupt did come, the structures associated with the process would have 
been freed, and the kernel would crash.  It would require an incredible 
amount of inefficient bookkeeping to avoid this, and it's unnecessary 
because normally, the data request will finish (successfully or not), and the 
process will be woken up, or if it was sent SIGKILL, it will be killed.
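
To tell the two cases apart on a live system, the state letter can be read 
for every process.  A minimal sketch, assuming Linux with /proc mounted; 
parse_stat and processes_in_states are made-up names, and only the first 
few fields of /proc/<pid>/stat are decoded.

```python
import os


def parse_stat(text):
    """Decode pid, command name, state and parent pid from a stat line."""
    # The command name sits in parentheses and may contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    head, _, tail = text.rpartition(")")
    pid, _, comm = head.partition(" (")
    fields = tail.split()
    return {"pid": int(pid), "comm": comm,
            "state": fields[0], "ppid": int(fields[1])}


def processes_in_states(states=("D", "Z")):
    """List processes currently in uninterruptible sleep or zombie state."""
    found = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                info = parse_stat(f.read())
        except OSError:
            continue  # the process exited while we were scanning
        if info["state"] in states:
            found.append(info)
    return found
```

A 'Z' entry points at its ppid as the process that has not called wait(); 
a 'D' entry that never changes points at a driver or hardware problem.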

Long story short, what happened was, some faulty hardware or some buggy 
driver, probably associated with the capture card, had a problem and left the 
process in D state.  Thus, it couldn't be killed, and since it had /dev/video 
open, tvtime couldn't run and failed gracefully, and because it held /dev/dsp 
open, and couldn't be killed as the init scripts would normally do in that 
situation, the audio drivers couldn't be unloaded and the shutdown hung.

So give us a bunch of information about what hardware you're using, output of 
dmesg, and steps to reproduce the driver bug (if it is that).
-- 
Tom Felker, <tcfelker@mtco.com>
<http://vlevel.sourceforge.net> - Stop fiddling with the volume knob.

If you have to design something and control freaks are involved, give them 
plenty of knobs, but don't connect them to anything important.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 20:48 ` Tom Felker
@ 2004-11-03 21:08   ` Gene Heskett
  2004-11-04  7:19     ` Jan Knutar
  2004-11-05  0:29   ` Gene Heskett
  1 sibling, 1 reply; 99+ messages in thread
From: Gene Heskett @ 2004-11-03 21:08 UTC (permalink / raw)
  To: linux-kernel; +Cc: Tom Felker

On Wednesday 03 November 2004 15:48, Tom Felker wrote:
>On Wednesday 03 November 2004 06:51 am, Gene Heskett wrote:
>> Greetings;
>>
>> I thought I'd get caught up on -bkx kernels and made a -bk8 just
>> now.
>>
>> But I'd tried to run gnomeradio earlier to listen to the
>> elections, but it failed leaving to run, as did tvtime then too,
>> claiming it couldn't get a lock on /dev/video0, and gnomeradio
>> apparently left a lock on alsasound that prevented the normal
>> gracefull shutdown by locking up the shutdown on the "stopping
>> alsasound" line.  So I had to use the hardware reset.
>>
>> I'd tried to kill the zombie earlier but couldn't.
>>
>> Isn't there some way to clean up a &^$#^#@)_ zombie?
>
>Ok, let me try to explain what probably happened.
>
>First, terminology.  When one process wants to be come two
> processes, it fork()s.  One process is the parent, and one it the
> child.  The child usually exec()s to become a different program. 
> The parent sometimes wants to know when the child ends and whether
> it succeeded.  Thus, the wait() system calls. The parent can either
> check whether a child died, or go to sleep until one does.  When
> the parent is awaken, it's told which child died and what the
> child's exit status was (usually 0 for success).  But if the child
> dies before the parent wait()s, the kernel must keep a record of
> which child died and what its exit status was, and it can't
> reassign the late child's PID yet. This record is a "zombie," and
> shows up under top or ps with the 'Z' state. Zombies do _not_ hold
> open files, memory, or resources of any kind.
>
>That's the technical definition of a zombie, which I'm telling you
> because that's probably not your situation:  I assume you used
> "zombie" as an informal term for a process that you can't kill. 
> Your problem is a process in uninterruptible sleep (the "D" state).
>
>When a process executing in userspace wants information from a
> device, like a disk or TV capture card, it calls read(), and
> context switches into kernel space.  Usually, it will take a moment
> for the data to be available from the device, so the process gets
> put on a wait queue so other processes can run. Obviously nothing
> is deallocated, because everyone expects the process will get it's
> data and proceed as normal.  When the device has the data, it
> interrupts the CPU, and the kernel figures out who wanted the data
> and puts them on the run queue.
>
>When a process is on a wait queue waiting for data from a device
> (the D state), it's impossible to kill.  This is because otherwise,
> when the interrupt did come, the structures associated with the
> process would have been freed, and the kernel would crash.  It
> would require an incredible amount of innefficient bookkeeping to
> avoid this, and it's unnecessary because normally, the data request
> will finish (successfully or not), and the process will be woken
> up, or if it was sent SIGKILL, it will be killed.
>
>Long story short, what happened was, some faulty hardware or some
> buggy driver, probably associated with the capture card, had a
> problem and left the process in D state.  Thus, it couldn't be
> killed, and since it had /dev/video open, tvtime couldn't run and
> failed gracefully, and because it held /dev/dsp open, and couldn't
> be killed as the init scripts would normally do in that situation,
> the audio drivers couldn't be unloaded and the boot process hung.
>
>So give us a bunch of information about what hardware you're using,
> output of dmesg, and steps to reproduce the driver bug (if it is
> that).

It's a dead horse, Tom, let's bury it.  I've rebooted to 4 new kernels 
since that time as I march toward getting caught up with whatever 
bk(nn) is out today.  Other than that, which took place on bk7's 
watch, it's all working rather well.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
99.28% setiathome rank, not too shabby for a WV hillbilly
Yahoo.com attorneys please note, additions to this message
by Gene Heskett are:
Copyright 2004 by Maurice Eugene Heskett, all rights reserved.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 19:26             ` DervishD
  2004-11-03 20:18               ` Gene Heskett
@ 2004-11-03 22:15               ` Jim Nelson
  2004-11-03 22:44                 ` Russell Miller
  2004-11-04 16:30                 ` Pedro Venda (SYSADM)
  2004-11-03 23:07               ` Bill Davidsen
  2 siblings, 2 replies; 99+ messages in thread
From: Jim Nelson @ 2004-11-03 22:15 UTC (permalink / raw)
  To: DervishD; +Cc: Gene Heskett, linux-kernel, Måns Rullgård

DervishD wrote:
>     Hi Gene :)
> 
>  * Gene Heskett <gene.heskett@verizon.net> dixit:
> 
>>>   Then the children are reparented to 'init' and 'init' gets rid
>>>of them. That's the way UNIX behaves.
>>
>>Unforch, I've *never* had it work that way.  Any dead process I've 
>>ever had while running linux has only been disposable by a reboot.
> 
> 
>     Well, you know, shit happens... Anyway, could you define 'dead'?
> Because if you're talking about zombies whose parent dies, they're
> killable easily: just wait until init reaps them (usually in less
> than 5 minutes since they dead). If you are talking about zombies who
> has their parent alive, then it's a bug in the application, not the
> kernel. In fact I wouldn't like if the kernel reaps my children
> before I do, just in case I want to do something.
> 
>     If you're talking about unkillable processes (those stuck in
> disk-sleep state), you're right: only rebooting can kill them
> (although sometimes they go out of D state and die normally). Bad
> luck for you if any dead process you've ever had while running linux
> has been of this kind :(
> 

I did this to myself a number of times when I was first learning Samba - even an 
ls would become unkillable.  You couldn't rmmod smb, since it was in use, and you 
couldn't kill the process, since it was waiting on a syscall.  Ergh.

>     Raúl Núñez de Arenas Coronado
> 


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 22:15               ` Jim Nelson
@ 2004-11-03 22:44                 ` Russell Miller
  2004-11-03 23:03                   ` Doug McNaught
                                     ` (2 more replies)
  2004-11-04 16:30                 ` Pedro Venda (SYSADM)
  1 sibling, 3 replies; 99+ messages in thread
From: Russell Miller @ 2004-11-03 22:44 UTC (permalink / raw)
  To: Jim Nelson; +Cc: DervishD, Gene Heskett, linux-kernel, Måns Rullgård

On Wednesday 03 November 2004 16:15, Jim Nelson wrote:

> I did this to myself a number of times when I was first learning Samba -
> even an ls would become unkillable.  You couldn't rmmod smb, since it was
> in use, and you couldn't kill the process, since it was waiting on a
> syscall.  Ergh.
>

I'm not going to pretend to be a kernel expert, or really anything other than 
a newbie when it comes to kernel internals, so please take this with the 
merits it deserves - many, or none, depending.

Anyway, is there a way to simply signal a syscall that it is to be interrupted 
and forcibly cause the syscall to end?  Kicking the program execution out of 
kernel space would be sufficient to "unstick" the process - and coupling that 
with an automatic KILL signal may not be a bad idea.

I'm pretty sure that someone will think of a reason why this wouldn't work with 
very little effort.  Please enlighten me?

--Russell

-- 

Russell Miller - rmiller@duskglow.com - Le Mars, IA
Duskglow Consulting - Helping companies just like you to succeed for ~ 10 yrs.
http://www.duskglow.com - 712-546-5886

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 15:25     ` DervishD
  2004-11-03 15:25       ` Måns Rullgård
  2004-11-03 16:47       ` Gene Heskett
@ 2004-11-03 22:58       ` Bill Davidsen
  2004-11-04 10:23         ` DervishD
  2004-11-03 23:18       ` Adam Heath
  3 siblings, 1 reply; 99+ messages in thread
From: Bill Davidsen @ 2004-11-03 22:58 UTC (permalink / raw)
  To: linux-kernel, DervishD; +Cc: Måns Rullgård, linux-kernel

DervishD wrote:
>     Hi all :)
> 
>  * Måns Rullgård <mru@inprovide.com> dixit:
> 
>>>>I'd tried to kill the zombie earlier but couldn't.
>>>>Isn't there some way to clean up a &^$#^#@)_ zombie?
>>>
>>>Kill the parent, is the only (portable) way.
>>
>>Perhaps not as portable, but another possible, though slightly
>>complicated, way is to ptrace the parent and force it to wait().
> 
> 
>     Or write a little program that just 'wait()'s for the specified
> PID's. That is perfectly portable IMHO. But I must admit that the
> preferred way should be killing the parent. 'init' will reap the
> children after that.

You can't wait() for the process, you have to use waitpid(), and the 
last time I tried that it didn't work, although I don't remember the 
symptom beyond that.
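
One likely reason a standalone reaper does not work: wait() and waitpid() only collect children of the calling process, and for any other PID the call fails with ECHILD. A minimal sketch, assuming Python 3 on a POSIX system; try_reap() is a hypothetical helper, not an existing tool.

```python
import os

def try_reap(pid):
    """Attempt a non-blocking waitpid() on `pid` and describe the outcome."""
    try:
        reaped, _status = os.waitpid(pid, os.WNOHANG)
    except ChildProcessError:        # errno ECHILD: `pid` is not a child of ours
        return "not our child"
    return "reaped" if reaped == pid else "still running"
```

Calling try_reap(1) from an ordinary process reports "not our child", since init is nobody's child. The same happens for somebody else's zombie.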

-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 22:44                 ` Russell Miller
@ 2004-11-03 23:03                   ` Doug McNaught
  2004-11-03 23:33                     ` Russell Miller
  2004-11-03 23:06                   ` vlobanov
  2004-11-04 10:04                   ` Helge Hafting
  2 siblings, 1 reply; 99+ messages in thread
From: Doug McNaught @ 2004-11-03 23:03 UTC (permalink / raw)
  To: Russell Miller
  Cc: Jim Nelson, DervishD, Gene Heskett, linux-kernel,
	Måns Rullgård

Russell Miller <rmiller@duskglow.com> writes:

> Anyway, is there a way to simply signal a syscall that it is to be
> interrupted and forcibly cause the syscall to end?  Kicking the
> program execution out of kernel space would be sufficient to
> "unstick" the process - and coupling that with an automatic KILL
> signal may not be a bad idea.

It was already mentioned in this thread that the bookkeeping required
to clean up properly from such an abort would add a lot of overhead
and slow down the normal, non-buggy case.

-Doug

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 22:44                 ` Russell Miller
  2004-11-03 23:03                   ` Doug McNaught
@ 2004-11-03 23:06                   ` vlobanov
  2004-11-04 10:04                   ` Helge Hafting
  2 siblings, 0 replies; 99+ messages in thread
From: vlobanov @ 2004-11-03 23:06 UTC (permalink / raw)
  To: Russell Miller
  Cc: Jim Nelson, DervishD, Gene Heskett, linux-kernel,
	Måns Rullgård

Also a kernel newbie here, so apply appropriate amount of salt to
response. :)

One common scenario for why a program is blocked within a syscall is
that it is waiting for data to arrive. Consider, for example, a read()
on a file -- simplifying a lot, the data has to be fetched from disk,
which is slow. So, while the disk is doing its thing, the program is
blocked within the system call. Then, when an interrupt arrives
signalling that the data is ready, it is placed into the user-space
buffer, and the program is kicked out of the syscall so that it can
continue executing.

Consider what happens if the program suddenly dies within the read()
syscall above: when the data from disk comes back, the kernel needs to
figure out where to put it. This would make for a very confused kernel,
since the original requester "vanished" without a trace. Even worse,
another program might have taken the original program's place in the
meantime! Very bad things happen.

This is certainly not an _impossible_ problem to solve (as far as I
know), but solving it in the general case would involve a lot of
expensive and complex book-keeping code, so it's simply not done.

Am I right? Wrong? Please enlighten me as well. :)

-Vadim Lobanov

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 19:26             ` DervishD
  2004-11-03 20:18               ` Gene Heskett
  2004-11-03 22:15               ` Jim Nelson
@ 2004-11-03 23:07               ` Bill Davidsen
  2004-11-04  1:19                 ` Michael Clark
  2 siblings, 1 reply; 99+ messages in thread
From: Bill Davidsen @ 2004-11-03 23:07 UTC (permalink / raw)
  To: linux-kernel, DervishD
  Cc: Gene Heskett, linux-kernel, Måns Rullgård

DervishD wrote:
>     Hi Gene :)
> 
>  * Gene Heskett <gene.heskett@verizon.net> dixit:
> 
>>>   Then the children are reparented to 'init' and 'init' gets rid
>>>of them. That's the way UNIX behaves.
>>
>>Unforch, I've *never* had it work that way.  Any dead process I've 
>>ever had while running linux has only been disposable by a reboot.
> 
> 
>     Well, you know, shit happens... Anyway, could you define 'dead'?
> Because if you're talking about zombies whose parent dies, they're
> easy to get rid of: just wait until init reaps them (usually less
> than 5 minutes after they die). If you are talking about zombies
> whose parent is still alive, then it's a bug in the application, not
> the kernel. In fact I wouldn't like it if the kernel reaped my
> children before I did, just in case I want to do something.
> 
>     If you're talking about unkillable processes (those stuck in
> disk-sleep state), you're right: only rebooting can kill them
> (although sometimes they go out of D state and die normally). Bad
> luck for you if any dead process you've ever had while running linux
> has been of this kind :(

That often seems to be the case, the kernel thinks there's an i/o going 
on which isn't, and doesn't time it out. It would be nice if there were 
a way to get the kernel to abort all outstanding i/o on kill -9, but I'm 
sure if it were easy it would have happened. Timeouts in the application 
are useful, but in some cases I believe the process dies because it 
detects a long i/o time but has nothing to do but terminate, which 
creates the zombie.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 19:42                 ` DervishD
@ 2004-11-03 23:12                   ` Bill Davidsen
  2004-11-04 10:26                     ` DervishD
  0 siblings, 1 reply; 99+ messages in thread
From: Bill Davidsen @ 2004-11-03 23:12 UTC (permalink / raw)
  To: linux-kernel, DervishD
  Cc: Gene Heskett, linux-kernel, Valdis.Kletnieks, Måns Rullgård

DervishD wrote:
>     Hi Gene :)
> 
>  * Gene Heskett <gheskett@wdtv.com> dixit:
> 
>>>Traditionally, a common cause for such wedging was a lost/misplaced
>>>interrupt from an I/O operation, so a read()/write()/ioctl() call
>>>wouldn't return because the device hadn't reported it completed.
>>>(tape drives were notorious for this). Often, power-cycling the I/O
>>>device would cause an unsolicited interrupt to be generated, which
>>>would clear the "waiting for interrupt" issue and allow the process
>>>to return....
>>
>>Well, since the "device", a bt878 based Hauppauge tv card is sitting in 
>>a pci socket, that's even more drastic than a reboot.
> 
> 
>     Do you mean your Hauppauge got stuck in disk-sleep state? Wow,
> that sounds *weird*...
> 
>     I think that the parent (which is whatever process did the fork
> when you clicked your mouse) is still alive and forgetting to do the
> 'wait()' for its children.

It would be good to know what the PPID is, from ps or similar. Things 
from X are a pain, the parent is often something you don't want to kill. 
Sometimes you can reparent from command line, "bash -c foo&" or similar, 
so the parent can be killed without logging out.

I would swear that the parent *is* init in some cases, which is puzzling 
since they should be reaped.
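
If ps is not handy, the same information can be read straight from /proc. This is a rough sketch, assuming Linux with /proc mounted and Python 3; stat_fields() and zombies() are illustrative names only.

```python
import os

def stat_fields(pid):
    """Return (state, ppid) parsed from /proc/<pid>/stat, or None if the PID is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except OSError:
        return None
    rest = data.rsplit(")", 1)[1].split()   # fields after the "(comm)" entry
    return rest[0], int(rest[1])            # state, parent PID

def zombies():
    """Return a list of (pid, ppid) for every process currently in state Z."""
    found = []
    for name in os.listdir("/proc"):
        if name.isdigit():
            fields = stat_fields(int(name))
            if fields and fields[0] == "Z":
                found.append((int(name), fields[1]))
    return found
```

Each (pid, ppid) pair names the zombie and the parent that has not waited for it yet. A ppid of 1 would be the puzzling case mentioned above.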

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 15:25     ` DervishD
                         ` (2 preceding siblings ...)
  2004-11-03 22:58       ` Bill Davidsen
@ 2004-11-03 23:18       ` Adam Heath
  3 siblings, 0 replies; 99+ messages in thread
From: Adam Heath @ 2004-11-03 23:18 UTC (permalink / raw)
  To: DervishD; +Cc: Måns Rullgård, linux-kernel

On Wed, 3 Nov 2004, DervishD wrote:

>     Hi all :)
>
>  * Måns Rullgård <mru@inprovide.com> dixit:
> > >> I'd tried to kill the zombie earlier but couldn't.
> > >> Isn't there some way to clean up a &^$#^#@)_ zombie?
> > > Kill the parent, is the only (portable) way.
> > Perhaps not as portable, but another possible, though slightly
> > complicated, way is to ptrace the parent and force it to wait().
>
>     Or write a little program that just 'wait()'s for the specified
> PID's. That is perfectly portable IMHO. But I must admit that the
> preferred way should be killing the parent. 'init' will reap the
> children after that.

ptrace the parent, cause it to wait() for its children, then change IP, etc.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 23:03                   ` Doug McNaught
@ 2004-11-03 23:33                     ` Russell Miller
  2004-11-03 23:47                       ` Mathieu Segaud
                                         ` (2 more replies)
  0 siblings, 3 replies; 99+ messages in thread
From: Russell Miller @ 2004-11-03 23:33 UTC (permalink / raw)
  To: Doug McNaught
  Cc: Jim Nelson, DervishD, Gene Heskett, linux-kernel,
	Måns Rullgård

On Wednesday 03 November 2004 17:03, Doug McNaught wrote:

> It was already mentioned in this thread that the bookkeeping required
> to clean up properly from such an abort would add a lot of overhead
> and slow down the normal, non-buggy case.
>
I am going to continue pursuing this at the risk of making a bigger fool of 
myself than I already am, but I want to make sure that I understand the 
issues - and I did read the message you are referring to.

I think what you are saying is that there is kind of a race condition here.  
When something is on the wait queue, it has to be followed through to 
completion.  An interrupt could be received at any time, and if it's taken 
off of the wait queue prematurely, it'll crash the kernel, because the 
interrupt has no way of telling that.

That's fine as it goes, I understand that.  But I submit that this is a 
horrible design.  I've been bitten by this more than once - usually regarding 
broken NFS connections.

But what I don't understand is why the bookkeeping would be so inefficient.  
It seems to me that all that would be required is a bitfield of some sort.  
If that position in the wait queue becomes invalid, then when the interrupt is 
received to process it, the kernel notes that a flag is set invalidating that 
part of the wait queue, dumps the output to /dev/null, and goes on to the 
next.  This doesn't seem inefficient to me, unless I'm missing something.
A little more inefficient, yes, but nowhere near the cost that seems to be 
implied.

And I also have to ask this question:  what is more inefficient, slowing down 
processing of output waiting on the queue, or having to reboot when a process 
gets stuck due to faulty drivers?  At the very least, a compile option seems 
like it would be worthwhile for those that would like this behavior.

And I probably am.  Missing something, that is.
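
To make the flag idea concrete, here is a toy userspace model of it. This is emphatically not kernel code. PendingRead and complete() are invented for illustration, and a bytearray stands in for the user-space buffer.

```python
class PendingRead:
    """One outstanding request: where the data should go, and whether anyone still wants it."""

    def __init__(self, buffer):
        self.buffer = buffer        # stands in for the requester's buffer
        self.cancelled = False      # the proposed "this entry is invalid" flag

def complete(request, data):
    """What the completion ("interrupt") side would do when the device finally answers."""
    if request.cancelled:
        return False                # requester is gone: throw the data away
    request.buffer[:len(data)] = data
    return True
```

The check itself is cheap. The sketch only covers the bookkeeping for a single request, though, and says nothing about what else the sleeping process might be holding at the time.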

--Russell

> -Doug

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 23:33                     ` Russell Miller
@ 2004-11-03 23:47                       ` Mathieu Segaud
  2004-11-03 23:56                         ` Russell Miller
  2004-11-04  6:39                       ` Denis Vlasenko
  2004-11-04 20:06                       ` Bill Davidsen
  2 siblings, 1 reply; 99+ messages in thread
From: Mathieu Segaud @ 2004-11-03 23:47 UTC (permalink / raw)
  To: Russell Miller
  Cc: Doug McNaught, Jim Nelson, DervishD, Gene Heskett, linux-kernel,
	Måns Rullgård

Russell Miller <rmiller@duskglow.com> recently wrote:

> I am going to continue pursuing this at the risk of making a bigger fool of 
> myself than I already am, but I want to make sure that I understand the 
> issues - and I did read the message you are referring to.
>
> I think what you are saying is that there is kind of a race condition here.  
> When something is on the wait queue, it has to be followed through to 
> completion.  An interrupt could be received at any time, and if it's taken 
> off of the wait queue prematurely, it'll crash the kernel, because the 
> interrupt has no way of telling that.
>
> That's fine as it goes, I understand that.  But I submit that this is a 
> horrible design.  I've been bitten by this more than once - usually regarding 
> broken NFS connections.

this is because nfs related syscalls are not interruptible by default.
you can make them interruptible by mounting your nfs's with the 'intr' option.

-- 
I love people saying 'we' even though they never contributed a single
line of code to the project!

	- Jens Axboe turning a troll down on linux-kernel


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 23:47                       ` Mathieu Segaud
@ 2004-11-03 23:56                         ` Russell Miller
  2004-11-04  0:05                           ` Mathieu Segaud
  0 siblings, 1 reply; 99+ messages in thread
From: Russell Miller @ 2004-11-03 23:56 UTC (permalink / raw)
  To: Mathieu Segaud
  Cc: Doug McNaught, Jim Nelson, DervishD, Gene Heskett, linux-kernel,
	Måns Rullgård

On Wednesday 03 November 2004 17:47, Mathieu Segaud wrote:

> this is because nfs related syscalls are not interruptible by default.
> you can make them interruptible by mounting your nfs's with the 'intr'
> option.

That doesn't appear to work, then.  Because we do mount them with the intr 
option, and the behavior doesn't seem to be any different.

--Russell

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 23:56                         ` Russell Miller
@ 2004-11-04  0:05                           ` Mathieu Segaud
  0 siblings, 0 replies; 99+ messages in thread
From: Mathieu Segaud @ 2004-11-04  0:05 UTC (permalink / raw)
  To: Russell Miller
  Cc: Doug McNaught, Jim Nelson, DervishD, Gene Heskett, linux-kernel,
	Måns Rullgård

Russell Miller <rmiller@duskglow.com> recently wrote:

> On Wednesday 03 November 2004 17:47, Mathieu Segaud wrote:
>
>> this is because nfs related syscalls are not interruptible by default.
>> you can make them interruptible by mounting your nfs's with the 'intr'
>> option.
>
> That doesn't appear to work, then.  Because we do mount them with the intr 
> option, and the behavior doesn't seem to be any different.

weird, it works here.... I can even umount() lost shares....

NFS is quite an unknown beast to me, sorry...
But it is clearly a bug, if you do mount them with -o intr...

-- 
<ajh> I always viewed HURD development like the Special Olympics of free software.

	- Is Hurd a opponent to Linux?


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 20:40       ` Gene Heskett
@ 2004-11-04  0:43         ` Kurt Wall
  2004-11-04  1:01           ` Russell Miller
  2004-11-04 10:07         ` Matthias Andree
  1 sibling, 1 reply; 99+ messages in thread
From: Kurt Wall @ 2004-11-04  0:43 UTC (permalink / raw)
  To: linux-kernel

On Wed, Nov 03, 2004 at 03:40:03PM -0500, Gene Heskett took 89 lines to write:
> On Wednesday 03 November 2004 15:13, Helge Hafting wrote:
> >
> >Yes it does - the problem is that not all resources are managed
> >by processes.  Some allocations are managed by drivers, so a driver
> >bug can get the device into an unusable state _and_ tie up the
> >process(es) that were using the driver at the moment.
> 
> This from my viewpoint, is wrong.  The kernel, and only the kernel 
> should be ultimately responsible for handing out resources, and 
> reclaiming at its convenience.

This might just be semantics, but device drivers are part of the kernel.

Kurt
-- 
In 1750 Issac Newton became discouraged when he fell up a flight of
stairs.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-04  0:43         ` Kurt Wall
@ 2004-11-04  1:01           ` Russell Miller
  2004-11-04  1:38             ` Doug McNaught
  0 siblings, 1 reply; 99+ messages in thread
From: Russell Miller @ 2004-11-04  1:01 UTC (permalink / raw)
  To: Kurt Wall; +Cc: linux-kernel

On Wednesday 03 November 2004 18:43, Kurt Wall wrote:

> This might just be semantics, but device drivers are part of the kernel.
>
This brings up another question I've had since reading the documentation on 
later pentium-class chips:

why are only rings 0 and 3 used in linux?

--Russell

> Kurt

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 23:07               ` Bill Davidsen
@ 2004-11-04  1:19                 ` Michael Clark
  0 siblings, 0 replies; 99+ messages in thread
From: Michael Clark @ 2004-11-04  1:19 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: linux-kernel, DervishD, Gene Heskett, Måns Rullgård

On 11/04/04 07:07, Bill Davidsen wrote:
> DervishD wrote:
> 
>>     Hi Gene :)
>>
>>  * Gene Heskett <gene.heskett@verizon.net> dixit:
>>
>>>>   Then the children are reparented to 'init' and 'init' gets rid
>>>> of them. That's the way UNIX behaves.
>>>
>>>
>>> Unforch, I've *never* had it work that way.  Any dead process I've 
>>> ever had while running linux has only been disposable by a reboot.
>>
>>
>>
>>     Well, you know, shit happens... Anyway, could you define 'dead'?
>> Because if you're talking about zombies whose parent dies, they're
>> easy to get rid of: just wait until init reaps them (usually less
>> than 5 minutes after they die). If you are talking about zombies
>> whose parent is still alive, then it's a bug in the application, not
>> the kernel. In fact I wouldn't like it if the kernel reaped my
>> children before I did, just in case I want to do something.
>>
>>     If you're talking about unkillable processes (those stuck in
>> disk-sleep state), you're right: only rebooting can kill them
>> (although sometimes they go out of D state and die normally). Bad
>> luck for you if any dead process you've ever had while running linux
>> has been of this kind :(
> 
> 
> That often seems to be the case, the kernel thinks there's an i/o going 
> on which isn't, and doesn't time it out. It would be nice if there were 
> a way to get the kernel to abort all outstanding i/o on kill -9, but I'm 
> sure if it were easy it would have happened. Timeouts in the application 
> are useful, but in some cases I believe the process dies because it 
> detects a long i/o time but has nothing to do but terminate, which 
> creates the zombie.

It could be any driver code that uses uninterruptible sleeps rather
than interruptible sleeps, I believe. If a process is doing a read or
write to one of these devices and it stays stuck in kernel code in
TASK_UNINTERRUPTIBLE and never gets its expected wake-up, then the
signal will never be delivered and the process is stuck indefinitely.
The buggy driver code needs to be fixed (either to use interruptible
sleeps and handle the signals, or to implement some sort of timeout).
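
The interruptible case can be seen from userspace: a read() sleeping on an empty pipe is in interruptible sleep, and a signal gets it out. A sketch only, assuming Python 3 on a POSIX system and that it runs in the main thread; Interrupted and read_with_timeout() are illustrative names, not an existing API.

```python
import os
import signal

class Interrupted(Exception):
    """Raised from the SIGALRM handler to abandon the blocked call."""

def read_with_timeout(fd, seconds):
    """Read one byte from fd, or return None if nothing arrives within `seconds`."""
    def on_alarm(signum, frame):
        raise Interrupted
    old_handler = signal.signal(signal.SIGALRM, on_alarm)
    try:
        signal.setitimer(signal.ITIMER_REAL, seconds)
        return os.read(fd, 1)        # sleeps interruptibly while the pipe is empty
    except Interrupted:
        return None                  # the signal pulled us out of the syscall
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, old_handler)
```

A task sleeping in TASK_UNINTERRUPTIBLE gets no such exit: the signal just stays pending until the driver wakes the task, if it ever does.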

~mc

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-04  1:01           ` Russell Miller
@ 2004-11-04  1:38             ` Doug McNaught
  2004-11-04  1:45               ` Russell Miller
  0 siblings, 1 reply; 99+ messages in thread
From: Doug McNaught @ 2004-11-04  1:38 UTC (permalink / raw)
  To: Russell Miller; +Cc: Kurt Wall, linux-kernel

Russell Miller <rmiller@duskglow.com> writes:

> This brings up another question I've had since reading the documentation on 
> later pentium-class chips:
>
> why are only rings 0 and 3 used in linux?

Because the "traditional" Unix privilege model only has two levels,
and Linux runs on many architectures, most of which have only two
privilege levels (the 68000 called them "user" and "supervisor").
Special-casing x86 is possible but probably wouldn't be worth it.

-Doug

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-04  1:38             ` Doug McNaught
@ 2004-11-04  1:45               ` Russell Miller
  2004-11-04  1:56                 ` Doug McNaught
  2004-11-04  1:59                 ` Mitchell Blank Jr
  0 siblings, 2 replies; 99+ messages in thread
From: Russell Miller @ 2004-11-04  1:45 UTC (permalink / raw)
  To: Doug McNaught; +Cc: Kurt Wall, linux-kernel

On Wednesday 03 November 2004 19:38, Doug McNaught wrote:
> Russell Miller <rmiller@duskglow.com> writes:
> > This brings up another question I've had since reading the documentation
> > on later pentium-class chips:
> >
> > why are only rings 0 and 3 used in linux?
>
> Because the "traditional" Unix privilege model only has two levels,
> and Linux runs on many architectures, most of which have only two
> privilege levels (the 68000 called them "user" and "supervisor").
> Special-casing x86 is possible but probably wouldn't be worth it.
>
Wouldn't it help with device driver problems?  Couldn't ring 1 be used to make 
sure an errant driver doesn't drop the kernel, at least on x86 machines?

I remember the 68000 architecture.  Quite nice (but I was 10 when I studied 
it, so..).

--Russell

> -Doug

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-04  1:45               ` Russell Miller
@ 2004-11-04  1:56                 ` Doug McNaught
  2004-11-04  1:59                 ` Mitchell Blank Jr
  1 sibling, 0 replies; 99+ messages in thread
From: Doug McNaught @ 2004-11-04  1:56 UTC (permalink / raw)
  To: Russell Miller; +Cc: Kurt Wall, linux-kernel

Russell Miller <rmiller@duskglow.com> writes:

> Wouldn't it help with device driver problems?  Couldn't ring 1 be
> used to make sure an errant driver doesn't drop the kernel, at least
> on x86 machines?

As I understand it:

1) Ring transitions aren't free.
2) The API between drivers and kernel is always in flux; drivers
   expect to be able to access internal kernel data structures.
   Making drivers run in ring 1 on even one of the N architectures
   would be a major refactoring and would constrain API changes.
   Freezing the internal API is something the developers don't want to
   do.
3) There are probably plenty of ways for a buggy driver to crash the
   kernel even if it's running in ring 1 (turn off interrupts and
   leave them off, etc).

So the upshot is that it's probably not worth the work and portability
hassles.

-Doug

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-04  1:45               ` Russell Miller
  2004-11-04  1:56                 ` Doug McNaught
@ 2004-11-04  1:59                 ` Mitchell Blank Jr
  2004-11-04 20:10                   ` Bill Davidsen
  1 sibling, 1 reply; 99+ messages in thread
From: Mitchell Blank Jr @ 2004-11-04  1:59 UTC (permalink / raw)
  To: Russell Miller; +Cc: linux-kernel

Russell Miller wrote:
> Couldn't ring 1 be used to make 
> sure an errant driver doesn't drop the kernel, at least on x86 machines?

Not really -- drivers could still do things like mis-program their associated
hardware making it do DMA writes all over kernel memory (just as one example)

Basically it'd add a lot of complexity (and inefficiency) without adding
much real safety.

-Mitch

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 23:33                     ` Russell Miller
  2004-11-03 23:47                       ` Mathieu Segaud
@ 2004-11-04  6:39                       ` Denis Vlasenko
  2004-11-05  2:38                         ` Elladan
  2004-11-04 20:06                       ` Bill Davidsen
  2 siblings, 1 reply; 99+ messages in thread
From: Denis Vlasenko @ 2004-11-04  6:39 UTC (permalink / raw)
  To: Russell Miller, Doug McNaught
  Cc: Jim Nelson, DervishD, Gene Heskett, linux-kernel,
	Måns Rullgård

On Thursday 04 November 2004 01:33, Russell Miller wrote:
> On Wednesday 03 November 2004 17:03, Doug McNaught wrote:
> 
> > It was already mentioned in this thread that the bookkeeping required
> > to clean up properly from such an abort would add a lot of overhead
> > and slow down the normal, non-buggy case.
> >
> I am going to continue pursuing this at the risk of making a bigger fool of 
> myself than I already am, but I want to make sure that I understand the 
> issues - and I did read the message you are referring to.
> 
> I think what you are saying is that there is kind of a race condition here.  
> When something is on the wait queue, it has to be followed through to 
> completion.  An interrupt could be received at any time, and if it's taken 
> off of the wait queue prematurely, it'll crash the kernel, because the 
> interrupt has no way of telling that.

The problem is in locking. You must not kill a process while it is
in an uninterruptible state, because it is uninterruptible
for a reason - it has taken a semaphore, or called get_cpu(), etc.
You do want it to do put_cpu(), right?

Processes must never get stuck in D; if they do, it's a kernel bug.

Find out how the process ended up in D state forever,
and fix it - that's what I'm trying to do
in these cases.
--
vda


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 21:08   ` Gene Heskett
@ 2004-11-04  7:19     ` Jan Knutar
  2004-11-04 11:57       ` Gene Heskett
  0 siblings, 1 reply; 99+ messages in thread
From: Jan Knutar @ 2004-11-04  7:19 UTC (permalink / raw)
  To: gene.heskett; +Cc: linux-kernel, Tom Felker

On Wednesday 03 November 2004 23:08, Gene Heskett wrote:

> Its a dead horse Tom, lets bury it.  I've rebooted to 4 new kernels 
> since that time as I march toward getting caught up with whatever 
> bk(nn) is out today.  Other than that, which took place on bk7's 
> watch, its all working rather well.

Since nobody else seems to have said it, it would be a good idea
to enable sysrq and do a sysrq-T the next time (if) this happens,
so that there would be at least some information to go on.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 22:44                 ` Russell Miller
  2004-11-03 23:03                   ` Doug McNaught
  2004-11-03 23:06                   ` vlobanov
@ 2004-11-04 10:04                   ` Helge Hafting
  2004-11-04 17:16                     ` Alex Bennee
  2 siblings, 1 reply; 99+ messages in thread
From: Helge Hafting @ 2004-11-04 10:04 UTC (permalink / raw)
  To: Russell Miller
  Cc: Jim Nelson, DervishD, Gene Heskett, linux-kernel,
	Måns Rullgård

Russell Miller wrote:

>On Wednesday 03 November 2004 16:15, Jim Nelson wrote:
>
>  
>
>>I did this to myself a number of times when I was first learning Samba -
>>even an ls would become unkillable.  You couldn't rmmod smb, since it was
>>in use, and you couldn't kill the process, since it was waiting on a
>>syscall.  Ergh.
>>
>>    
>>
>
>I'm not going to pretend to be a kernel expert, or really anything other than 
>a newbie when it comes to kernel internals, so please take this with the 
>merits it deserves - many, or none, depending.
>
>Anyway, is there a way to simply signal a syscall that it is to be interrupted 
>and forcibly cause the syscall to end? 
>
There is a way.  Processes go into D state all the time
when waiting for disk io or similar.  The io then completes a few ms later,
and the fs or device driver tells the kernel to wake up the process
so it gets a chance at the next scheduling opportunity.  So the mechanism to
unstick a process exists, and is used by every device driver that
sleeps.  Which is most of them.
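
To spot such processes from userspace, here is a minimal Python sketch
(not part of the original mail).  It assumes the /proc/<pid>/stat layout
described in proc(5): pid, command name in parentheses, state letter,
parent pid.  The helper names are made up for illustration.

```python
import os


def parse_stat(line):
    """Split a /proc/<pid>/stat line into (pid, comm, state, ppid).

    The command name sits in parentheses and may itself contain spaces
    or ')' characters, so split on the LAST ')' instead of on whitespace.
    """
    head, _, tail = line.rpartition(")")
    pid_text, _, comm = head.partition("(")
    fields = tail.split()
    return int(pid_text), comm, fields[0], int(fields[1])


def tasks_in_state(wanted="D", proc_root="/proc"):
    """Return (pid, comm) for every task currently in the given state."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        stat_path = os.path.join(proc_root, entry, "stat")
        try:
            with open(stat_path, errors="replace") as handle:
                pid, comm, state, _ppid = parse_stat(handle.read())
        except OSError:
            continue  # the process exited while we were scanning
        if state == wanted:
            found.append((pid, comm))
    return found
```

Calling tasks_in_state("D") a few times in a row should show whether a
task is merely waiting for io or is stuck for good.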

Breakage happens when something never comes out of D-state.
One could write a trivial syscall (or addition to "kill") that "wakes"
processes waiting for io.  It isn't hard to do at all - just copy the
waking code from any device driver.  This would make it possible to kill
and fully remove any process that hangs around in D-state.  It might
also release other stuck resources as the syscall
continues, returns to userspace, and allows the process to die.

Unfortunately, this isn't enough.  In some cases the syscall
expects the io device interrupt handler to have done something
vital - but this hasn't happened when we forcibly wake a process.
We can hope for an io error, but might get a crash instead.  This
can be fixed with a lot of work - basically check at every wakeup
whether the process was woken by this new killing mechanism and
act accordingly.  It isn't hard, but it is _lots_ of work
inspecting every sleeping point, at least in every device driver.

Another problem exists if the long-waiting io wasn't lost - just
extremely slow.  If the io actually comes through after the process is
gone and the memory is used for something else - bang!  Dealing properly
with this case is harder - a new generic mechanism for cancelling
outstanding io requests is needed for this.
It might even be impossible in some cases.  If a memory address is handed
over to a bus-mastering device such as a scsi adapter, then the memory
must be pinned down until the operation completes.  It cannot be released.
The rest of the process can go, but the hw might not support any way
of cancelling the request.  A few may have a way, many won't.  Some devices
can be reset - but at a considerable cost.  A disk controller might be
unavailable for seconds during such a reset - instant DoS attack if a user
keeps starting lots of disk intensive processes and kills them off while
they are in a D-state that normally lasts way shorter than a reset.  PCI
devices can be turned off, but we might really want to use them again . . .

Fortunately, most cases of long-running D-state are just driver bugs and
can be fixed as such.  nfs has a forced umount option.  If samba can hang,
then it _can_ be fixed in similar ways.  (smbfs is software only - no
quirky hw to deal with.)
Hw drivers that put processes into everlasting D-state usually do so
because of a bug.  (Lost request or interrupt because of internal errors.)
Fix that, and the problem never happens.  So the hard problem of killing
stuff stuck in D-state doesn't need a solution - fix the real bug instead.
Having a way to kill such processes will only mean that hard-to-trigger
bugs won't get fixed because there is a workaround.  This is bad for
stability too, as broken hw drivers can hang the kernel even if a better
process killer comes into existence.

> Kicking the program execution out of 
>kernel space would be sufficient to "unstick" the process - and coupling that 
>with an automatic KILL signal may not be a bad idea.
>
>I'm pretty sure that someone will think of a way why this wouldn't work with 
>very little effort.  Please enlighten me?
>  
>
It is doable - but not with "very little effort".  I have outlined above
the trouble you get if you trivially wake up the sleeping process.  Another
trivial alternative is to remove the process while it is in-kernel.  The
downside is that it might be holding a lock or semaphore that won't ever
be released this way.  And no, locks aren't necessarily accounted for
anywhere.  (They are implicitly accounted for by the fact that a process
exists whose future execution path leads to the release of said lock.)
Explicit accounting that allows lock-breaking is deemed too slow, and what
should be done about the data structures the lock/semaphore was protecting?

The stuck process is a sign of another bug - better fix that one.

Helge Hafting





^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 20:40       ` Gene Heskett
  2004-11-04  0:43         ` Kurt Wall
@ 2004-11-04 10:07         ` Matthias Andree
  2004-11-04 22:31           ` Peter Chubb
  2004-11-04 23:33           ` Benno
  1 sibling, 2 replies; 99+ messages in thread
From: Matthias Andree @ 2004-11-04 10:07 UTC (permalink / raw)
  To: Gene Heskett; +Cc: linux-kernel

On Wed, 03 Nov 2004, Gene Heskett wrote:

> >Yes it does - the problem is that not all resources are managed
> >by processes.  Some allocations are managed by drivers, so a driver
> >bug can get the device into a unuseable state _and_ tie up the
> >process(es) that were using the driver at the moment.
> 
> This from my viewpoint, is wrong.  The kernel, and only the kernel 
> should be ultimately responsible for handing out resources, and 
> reclaiming at its convienience.

Linux's driver model is the way it is. If you want the kernel to clean
up after a driver has puked, you need something like a microkernel I
believe, where only a minimal core kernel is a real kernel and where all
the drivers are actually in user-space, but that's no longer Linux then.

I'm not weighing the downsides and upsides of this, as I have no
experience with microkernels (and have never used OS9 or GNU Hurd
either). I know there have been attempts to port Linux to a microkernel,
but I don't know what's come out of it.

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 22:58       ` Bill Davidsen
@ 2004-11-04 10:23         ` DervishD
  2004-11-04 19:32           ` Bill Davidsen
  0 siblings, 1 reply; 99+ messages in thread
From: DervishD @ 2004-11-04 10:23 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Måns Rullgård, linux-kernel

    Hi Bill :)

 * Bill Davidsen <davidsen@tmr.com> dixit:
> >    Or write a little program that just 'wait()'s for the specified
> >PID's. That is perfectly portable IMHO. But I must admit that the
> >preferred way should be killing the parent. 'init' will reap the
> >children after that.
> You can't wait() for the process, you have to use waitfor(), and the 
> last time I tried that it didn't work, although I don't remember the 
> symptom beyond that.

    You can't wait for others' children. OTOH, if we talk about your
children, you can do wait() or waitpid() (I assume that you referred
to waitpid(), since there isn't waitfor() AFAIK). The main difference
is that wait() suspends the process until information from any child is
available, while waitpid() can select a particular child and, with
WNOHANG, return at once instead of blocking.

    If you are talking about others' children, then your call to
waitpid() (or wait()) failed with ECHILD: not your child.
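
    The two cases can be tried out with a small Python sketch (added
for illustration, not from the original mail). It assumes a Unix system
where os.fork() works and that PID 1 is not a child of the caller; the
function names are made up.

```python
import errno
import os


def reap_own_child(exit_code=7):
    """Fork a child that exits at once, then reap it with waitpid()."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)  # child: exit now; it stays a zombie until reaped
    reaped_pid, status = os.waitpid(pid, 0)  # parent: collect the exit status
    return reaped_pid == pid, os.WEXITSTATUS(status)


def wait_for_non_child(pid=1):
    """Try to wait for a process that is not our child.

    Returns the errno of the failure, or None if the wait succeeded.
    """
    try:
        os.waitpid(pid, 0)
    except ChildProcessError as exc:  # Python's wrapper for ECHILD
        return exc.errno
    return None
```

    reap_own_child() succeeds because the caller is the parent;
wait_for_non_child() reports ECHILD, which is why a helper program
cannot reap somebody else's zombie.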

    Raúl Núñez de Arenas Coronado

-- 
Linux Registered User 88736
http://www.dervishd.net & http://www.pleyades.net/

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 23:12                   ` Bill Davidsen
@ 2004-11-04 10:26                     ` DervishD
  2004-11-04 14:23                       ` Paul Slootman
  2004-11-04 19:22                       ` Bill Davidsen
  0 siblings, 2 replies; 99+ messages in thread
From: DervishD @ 2004-11-04 10:26 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Gene Heskett, linux-kernel, Valdis.Kletnieks, Måns Rullgård

    Hi Bill :)

 * Bill Davidsen <davidsen@tmr.com> dixit:
> >    I think that the parent (which is whatever process did the fork
> >when you clicked your mouse) is still alive and forgetting to do the
> >'wait()' for its children.
> It would be good to know what the PPID is, from ps or similar. Things 
> from X are a pain, the parent is often something you don't want to kill. 
> Sometimes you can reparent from command line, "bash -c foo&" or similar, 
> so the parent can be killed without logging out.

    Just use ps to reveal the family tree. It's not that hard ;)
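
    As a rough illustration (not from the original mail), this Python
sketch pairs every zombie with its parent, given the text printed by
"ps -e -o pid,ppid,stat,comm". The function name and the sample process
names used to try it are hypothetical.

```python
def zombies_with_parents(ps_output):
    """Map each zombie PID to (comm, ppid, parent_comm).

    Expects the output of: ps -e -o pid,ppid,stat,comm
    """
    rows = {}
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 3)  # comm may contain spaces, keep it whole
        if len(parts) < 4:
            continue
        rows[int(parts[0])] = (int(parts[1]), parts[2], parts[3])

    result = {}
    for pid, (ppid, stat, comm) in rows.items():
        if stat.startswith("Z"):  # Z marks a zombie in the STAT column
            parent = rows.get(ppid)
            result[pid] = (comm, ppid, parent[2] if parent else None)
    return result
```

    The parent named in the result is the process that has to call
wait(), or be killed so that init inherits the zombie.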
 
> I would swear that the parent *is* init in some cases, which is puzzling 
> since they should be reaped.

    But that's OK :))) When a parent dies without waiting for its
children, the zombies are reparented to init. That's correct. Then
init will wait for them. The problem is that sometimes the signals
don't arrive or the like. Then the zombies lie around a bit,
until a timer in 'init' reaps them. That's correct too: init can only
wait for children when it receives SIGCHLD or periodically, using a
timer. I've written an init program and that's the way I do it, just
in case some signal gets lost.

    If init is the parent, all works ok, just wait a bit and all
those zombies will really die ;)
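
    A minimal Python sketch of that reaping scheme follows (an
illustration only, not the code of any real init). It assumes a Unix
system with os.fork(); all names are made up. The signal handler and
the periodic sweep both end up in the same non-blocking waitpid() loop.

```python
import os
import time


def spawn_exiting_child(code):
    """Fork a child that exits immediately with the given code."""
    pid = os.fork()
    if pid == 0:
        os._exit(code)
    return pid


def reap_all():
    """Reap every child that has already exited, without blocking."""
    reaped = {}
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children left at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped[pid] = os.WEXITSTATUS(status) if os.WIFEXITED(status) else None
    return reaped


def on_sigchld(signum, frame):
    """Signal path; install with signal.signal(signal.SIGCHLD, on_sigchld)."""
    reap_all()


def periodic_sweep(wanted, timeout=5.0, interval=0.05):
    """Timer path: sweep until every PID in `wanted` has been reaped."""
    reaped = {}
    deadline = time.monotonic() + timeout
    while not set(wanted) <= set(reaped) and time.monotonic() < deadline:
        reaped.update(reap_all())
        time.sleep(interval)
    return reaped
```

    The loop in reap_all() matters because several exits can be
folded into one SIGCHLD; the sweep catches whatever the handler missed.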

    Raúl Núñez de Arenas Coronado

-- 
Linux Registered User 88736
http://www.dervishd.net & http://www.pleyades.net/

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-04  7:19     ` Jan Knutar
@ 2004-11-04 11:57       ` Gene Heskett
  2004-11-04 12:12         ` Jan Knutar
  0 siblings, 1 reply; 99+ messages in thread
From: Gene Heskett @ 2004-11-04 11:57 UTC (permalink / raw)
  To: linux-kernel; +Cc: Jan Knutar, Tom Felker

On Thursday 04 November 2004 02:19, Jan Knutar wrote:
>On Wednesday 03 November 2004 23:08, Gene Heskett wrote:
>> Its a dead horse Tom, lets bury it.  I've rebooted to 4 new
>> kernels since that time as I march toward getting caught up with
>> whatever bk(nn) is out today.  Other than that, which took place
>> on bk7's watch, its all working rather well.
>
>Since nobody else seems to have said it, it would be a good idea
>to enable sysrq and do a sysrq-T the next time (if) this happens,
>so that there would be atleast some information to go on.

I've had that turned on since forever, Jan, but usually, when it's hung 
someplace, it's well and truly hung, and it's hardware reset button time.

-- 
Cheers, Gene

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-04 11:57       ` Gene Heskett
@ 2004-11-04 12:12         ` Jan Knutar
  2004-11-04 12:18           ` Gene Heskett
  2004-11-04 12:39           ` Gene Heskett
  0 siblings, 2 replies; 99+ messages in thread
From: Jan Knutar @ 2004-11-04 12:12 UTC (permalink / raw)
  To: gene.heskett; +Cc: linux-kernel, Tom Felker

On Thursday 04 November 2004 13:57, Gene Heskett wrote:

> I'e had that turned on since forever Jan, but usually, when its hung 
> someplace, its well and truely hung, and hardware reset button time.

Are you saying that these zombies (or tasks stuck in state D) also make
sysrq-T hang, and not list all tasks?

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-04 12:12         ` Jan Knutar
@ 2004-11-04 12:18           ` Gene Heskett
  2004-11-04 12:29             ` Jan Knutar
  2004-11-04 12:39           ` Gene Heskett
  1 sibling, 1 reply; 99+ messages in thread
From: Gene Heskett @ 2004-11-04 12:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: Jan Knutar, Tom Felker

On Thursday 04 November 2004 07:12, Jan Knutar wrote:
>On Thursday 04 November 2004 13:57, Gene Heskett wrote:
>> I'e had that turned on since forever Jan, but usually, when its
>> hung someplace, its well and truely hung, and hardware reset
>> button time.
>
>Are you saying that these zombies (or tasks stuck in state D) also
> make sysrq-T hang, and not list all tasks?

The machine is hung.  No ssh, no ping response, the only button that 
works is the hardware reset on the front of the tower.

-- 
Cheers, Gene

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-04 12:18           ` Gene Heskett
@ 2004-11-04 12:29             ` Jan Knutar
  2004-11-04 13:56               ` Gene Heskett
  0 siblings, 1 reply; 99+ messages in thread
From: Jan Knutar @ 2004-11-04 12:29 UTC (permalink / raw)
  To: gene.heskett; +Cc: linux-kernel, Tom Felker

On Thursday 04 November 2004 14:18, Gene Heskett wrote:

> The machine is hung.  No ssh, no ping response, the only button that 
> works is the hardware reset on the front of the tower.

I must've missed where the thread went from zombies into totally hung
machine. My apologies for the noise.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-04 12:12         ` Jan Knutar
  2004-11-04 12:18           ` Gene Heskett
@ 2004-11-04 12:39           ` Gene Heskett
  2004-11-04 13:01             ` Ian Campbell
                               ` (2 more replies)
  1 sibling, 3 replies; 99+ messages in thread
From: Gene Heskett @ 2004-11-04 12:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: Jan Knutar, Tom Felker

On Thursday 04 November 2004 07:12, Jan Knutar wrote:
>On Thursday 04 November 2004 13:57, Gene Heskett wrote:
>> I'e had that turned on since forever Jan, but usually, when its
>> hung someplace, its well and truely hung, and hardware reset
>> button time.
>
>Are you saying that these zombies (or tasks stuck in state D) also
> make sysrq-T hang, and not list all tasks?

I thought I'd test it right now while the system is running normally, 
but I got only a beep from the console, so I went to 
Documentation/sysrq.txt to make sure I was doing it right, and it is 
_not_ working right now.  But it is compiled in according to a make 
xconfig, or a grep of the .config.

[root@coyote linux-2.6.10-rc1-bk13]# grep SYSRQ .config
CONFIG_MAGIC_SYSRQ=y

I get a couple of beeps from the console, but that's the limit of the 
response, and a tail -f on the log shows nothing.  I also logged into 
VC2 and tried it there several times, but those attempts didn't even 
get me a beep.

The keyboard is a cheap ($24) M$ with a few extra buttons that don't 
do anything along the top.  And getting a bit creaky in its old age, 
a lot like me, but I'm about 68 years older than the keyboard :)

-- 
Cheers, Gene

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-04 12:39           ` Gene Heskett
@ 2004-11-04 13:01             ` Ian Campbell
  2004-11-04 14:07               ` Gene Heskett
  2004-11-04 13:10             ` Doug McNaught
  2004-11-04 20:18             ` Bill Davidsen
  2 siblings, 1 reply; 99+ messages in thread
From: Ian Campbell @ 2004-11-04 13:01 UTC (permalink / raw)
  To: gene.heskett; +Cc: linux-kernel, Jan Knutar, Tom Felker

On Thu, 2004-11-04 at 07:39 -0500, Gene Heskett wrote:
> On Thursday 04 November 2004 07:12, Jan Knutar wrote:
> >On Thursday 04 November 2004 13:57, Gene Heskett wrote:
> >> I'e had that turned on since forever Jan, but usually, when its
> >> hung someplace, its well and truely hung, and hardware reset
> >> button time.
> >
> >Are you saying that these zombies (or tasks stuck in state D) also
> > make sysrq-T hang, and not list all tasks?
> 
> I thought I'd test it right now while the system is runnng normally, 
> but I got only a beep from the console, so I went to 
> Documentation/sysrq.txt to make sure I was doing it right, and it is 
> _not_ working right now.  But it is compiled in according to a make 
> xconfig, or a grep of the .config.

It can also be enabled/disabled at runtime, Documentation/sysrq.txt says
that the default now is on (but that it used to default to off). Perhaps
it is getting turned off somewhere in your boot scripts etc. 

You can check with

$ cat /proc/sys/kernel/sysrq
1
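
The same check can be scripted.  This is a small Python sketch, not
taken from Documentation/sysrq.txt; the function names are made up.  It
assumes only what the thread states: 0 means disabled and 1 means
enabled.  Treating any other non-zero value as "enabled" is an
assumption here.

```python
def sysrq_enabled(raw):
    """Interpret the text read from /proc/sys/kernel/sysrq.

    0 means the magic SysRq key is disabled; any non-zero value is
    treated as enabled (assumption for values other than 1).
    """
    return int(raw.strip()) != 0


def read_sysrq(path="/proc/sys/kernel/sysrq"):
    """Read the current setting from /proc."""
    with open(path) as handle:
        return sysrq_enabled(handle.read())


def enable_sysrq(path="/proc/sys/kernel/sysrq"):
    """Same effect as 'echo 1 > /proc/sys/kernel/sysrq'; needs root."""
    with open(path, "w") as handle:
        handle.write("1\n")
```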

> The keyboard is a cheap ($24) M$ with a few extra buttons that don't 
> do anything along the top.  And getting a bit creaky in its old age, 
> a lot like me, but I'm about 68 years older than the keyboard :)

Documentation/sysrq.txt also says:

*  How do I use the magic SysRq key?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
On x86   - You press the key combo 'ALT-SysRq-<command key>'. Note - Some
           keyboards may not have a key labeled 'SysRq'. The 'SysRq' key is
           also known as the 'Print Screen' key. Also some keyboards cannot
           handle so many keys being pressed at the same time, so you might
           have better luck with "press Alt", "press SysRq", "release Alt",
           "press <command key>", release everything.

Perhaps your keyboard is one of those that can't cope with all those
keys?

Ian.

-- 
Ian Campbell, Senior Design Engineer
                                        Web: http://www.arcom.com
Arcom, Clifton Road,                    Direct: +44 (0)1223 403 465
Cambridge CB1 7EA, United Kingdom       Phone:  +44 (0)1223 411 200


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-04 12:39           ` Gene Heskett
  2004-11-04 13:01             ` Ian Campbell
@ 2004-11-04 13:10             ` Doug McNaught
  2004-11-04 14:11               ` Gene Heskett
  2004-11-04 20:18             ` Bill Davidsen
  2 siblings, 1 reply; 99+ messages in thread
From: Doug McNaught @ 2004-11-04 13:10 UTC (permalink / raw)
  To: gene.heskett; +Cc: linux-kernel, Jan Knutar, Tom Felker

Gene Heskett <gene.heskett@verizon.net> writes:

> [root@coyote linux-2.6.10-rc1-bk13]# grep SYSRQ .config
> CONFIG_MAGIC_SYSRQ=y

Did you also enable it in /proc? 

-Doug

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-04 12:29             ` Jan Knutar
@ 2004-11-04 13:56               ` Gene Heskett
  0 siblings, 0 replies; 99+ messages in thread
From: Gene Heskett @ 2004-11-04 13:56 UTC (permalink / raw)
  To: linux-kernel; +Cc: Jan Knutar, Tom Felker

On Thursday 04 November 2004 07:29, Jan Knutar wrote:
>On Thursday 04 November 2004 14:18, Gene Heskett wrote:
>> The machine is hung.  No ssh, no ping response, the only button
>> that works is the hardware reset on the front of the tower.
>
>I must've missed where the thread went from zombies into totally
> hung machine. My apologies for the noise.

It went from an unkillable process (gnomeradio) that was blocking 
other programs like tvtime with its locks on /dev/video0, to 
completely hung at "stopping alsasound" when I tried to reboot.  That 
required the reset button to get going again.

-- 
Cheers, Gene

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-04 13:01             ` Ian Campbell
@ 2004-11-04 14:07               ` Gene Heskett
  2004-11-04 14:24                 ` Ian Campbell
  2004-11-04 14:26                 ` DervishD
  0 siblings, 2 replies; 99+ messages in thread
From: Gene Heskett @ 2004-11-04 14:07 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ian Campbell, Jan Knutar, Tom Felker

On Thursday 04 November 2004 08:01, Ian Campbell wrote:
>On Thu, 2004-11-04 at 07:39 -0500, Gene Heskett wrote:
>> On Thursday 04 November 2004 07:12, Jan Knutar wrote:
>> >On Thursday 04 November 2004 13:57, Gene Heskett wrote:
>> >> I'e had that turned on since forever Jan, but usually, when its
>> >> hung someplace, its well and truely hung, and hardware reset
>> >> button time.
>> >
>> >Are you saying that these zombies (or tasks stuck in state D)
>> > also make sysrq-T hang, and not list all tasks?
>>
>> I thought I'd test it right now while the system is runnng
>> normally, but I got only a beep from the console, so I went to
>> Documentation/sysrq.txt to make sure I was doing it right, and it
>> is _not_ working right now.  But it is compiled in according to a
>> make xconfig, or a grep of the .config.
>
>It can also be enabled/disabled at runtime, Documentation/sysrq.txt
> says that the default now is on (but that it used to default to
> off). Perhaps it is getting turned off somewhere in your boot
> scripts etc.
>
>You can check with
>
>$ cat /proc/sys/kernel/sysrq
>1
>
>> The keyboard is a cheap ($24) M$ with a few extra buttons that
>> don't do anything along the top.  And getting a bit creaky in its
>> old age, a lot like me, but I'm about 68 years older than the
>> keyboard :)
>
>Documentation/sysrq.txt also says:
>
>*  How do I use the magic SysRq key?
>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>On x86   - You press the key combo 'ALT-SysRq-<command key>'. Note -
> Some keyboards may not have a key labeled 'SysRq'. The 'SysRq' key
> is also known as the 'Print Screen' key. Also some keyboards cannot
> handle so many keys being pressed at the same time, so you might
> have better luck with "press Alt", "press SysRq", "release Alt",
> "press <command key>", release everything.
>
>Perhaps your keyboard is one of those that can't cope with all those
>keys?
>
>Ian.

Possibly, but OTOH,
[root@coyote root]#  cat /proc/sys/kernel/sysrq
0

And no, I'm not turning it off anyplace in the boot procedure.  An
'echo 1 >/proc/sys/kernel/sysrq', and repeating the keypresses now
gets a boatload of stuff in the logs, but nothing on the console.

The logs look something like this:

Nov  4 08:59:29 coyote kernel: kdeinit       S C0453F08     0 18964   3327         18965 18963 (NOTLB)
Nov  4 08:59:29 coyote kernel: c657ae8c 00200082 c6820120 c0453f08 0000202c 00000000 b4d18366 0000202c
Nov  4 08:59:29 coyote kernel:        00002ecd b4d1e78f 0000202c c6820600 c682075c 0217d045 c657aea0 fffffff5
Nov  4 08:59:29 coyote kernel:        c657aedc c033bca3 c657aea0 0217d045 c657aec4 dfa88ea0 ee3aeea0 0217d045
Nov  4 08:59:29 coyote kernel: Call Trace:
Nov  4 08:59:29 coyote kernel:  [<c033bca3>] schedule_timeout+0x63/0xc0
Nov  4 08:59:29 coyote kernel:  [<c0120150>] process_timeout+0x0/0x10
Nov  4 08:59:29 coyote kernel:  [<c012c12f>] futex_wait+0x12f/0x1a0
Nov  4 08:59:29 coyote kernel:  [<c0114160>] default_wake_function+0x0/0x20
Nov  4 08:59:29 coyote kernel:  [<c0114160>] default_wake_function+0x0/0x20
Nov  4 08:59:29 coyote kernel:  [<c012c418>] do_futex+0x48/0xa0
Nov  4 08:59:29 coyote kernel:  [<c012c55e>] sys_futex+0xee/0x100
Nov  4 08:59:29 coyote kernel:  [<c01040a9>] sysenter_past_esp+0x52/0x71
Nov  4 08:59:29 coyote kernel: kdeinit       S C0453A60     0 18965   3327         18966 18964 (NOTLB)
Nov  4 08:59:29 coyote kernel: dfa88e8c 00200082 c6820120 c0453a60 dfa88eac 00000000 ed99e990 00000000
Nov  4 08:59:29 coyote kernel:        00006be7 b816258b 0000202c c6820120 c682027c 0217d07c dfa88ea0 fffffff5
Nov  4 08:59:29 coyote kernel:        dfa88edc c033bca3 dfa88ea0 0217d07c dfa88ec4 c0459928 c657aea0 0217d07c
Nov  4 08:59:29 coyote kernel: Call Trace:
Nov  4 08:59:29 coyote kernel:  [<c033bca3>] schedule_timeout+0x63/0xc0
Nov  4 08:59:29 coyote kernel:  [<c0120150>] process_timeout+0x0/0x10
Nov  4 08:59:29 coyote kernel:  [<c012c12f>] futex_wait+0x12f/0x1a0
Nov  4 08:59:29 coyote kernel:  [<c0114160>] default_wake_function+0x0/0x20
Nov  4 08:59:29 coyote kernel:  [<c0114160>] default_wake_function+0x0/0x20
Nov  4 08:59:29 coyote kernel:  [<c012c418>] do_futex+0x48/0xa0
Nov  4 08:59:29 coyote kernel:  [<c012c55e>] sys_futex+0xee/0x100
Nov  4 08:59:29 coyote kernel:  [<c01040a9>] sysenter_past_esp+0x52/0x71
Nov  4 08:59:29 coyote kernel: kdeinit       S C0453A60     0 18966   3327               18965 (NOTLB)
Nov  4 08:59:29 coyote kernel: ee3aee8c 00200082 e770fb00 c0453a60 ee3aeeac 00000000 ed99e990 00000000
Nov  4 08:59:29 coyote kernel:        00001e29 b4b250fe 0000202c e770fb00 e770fc5c 0217d043 ee3aeea0 fffffff5
Nov  4 08:59:29 coyote kernel:        ee3aeedc c033bca3 ee3aeea0 0217d043 666c6573 c657aea0 c039be78 0217d043
Nov  4 08:59:29 coyote kernel: Call Trace:
Nov  4 08:59:29 coyote kernel:  [<c033bca3>] schedule_timeout+0x63/0xc0
Nov  4 08:59:29 coyote kernel:  [<c0120150>] process_timeout+0x0/0x10
Nov  4 08:59:29 coyote kernel:  [<c012c12f>] futex_wait+0x12f/0x1a0
Nov  4 08:59:29 coyote kernel:  [<c0114160>] default_wake_function+0x0/0x20
Nov  4 08:59:29 coyote kernel:  [<c0114160>] default_wake_function+0x0/0x20
Nov  4 08:59:29 coyote kernel:  [<c012c418>] do_futex+0x48/0xa0
Nov  4 08:59:29 coyote kernel:  [<c012c55e>] sys_futex+0xee/0x100
Nov  4 08:59:29 coyote kernel:  [<c01040a9>] sysenter_past_esp+0x52/0x71

There is a lot more of that above the snip, several pages.
And of course the system seems to be running fine ATM. :-)
But I'm learning, and that echo will go into my rc.local as soon as
I'm done here.
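
With several pages of this, a filter helps.  The Python sketch below
(added for illustration) pulls the per-task header lines out of such a
log and keeps the ones in D state.  It assumes the columns seen in the
snippet above are name, state, PC, free stack, PID and parent PID; that
reading of the layout, and the function names, are assumptions.

```python
import re

# Matches e.g. "kernel: kdeinit       S C0453F08     0 18964   3327 ..."
TASK_LINE = re.compile(
    r"kernel:\s+(?P<name>\S+)\s+(?P<state>[A-Z])\s+(?P<pc>[0-9A-Fa-f]{8,16})\s+"
    r"(?P<free>\d+)\s+(?P<pid>\d+)\s+(?P<ppid>\d+)"
)


def tasks_from_dump(log_text):
    """Return (name, state, pid, ppid) for each task header in the dump."""
    tasks = []
    for line in log_text.splitlines():
        match = TASK_LINE.search(line)
        if match:  # stack words and call-trace lines do not match
            tasks.append(
                (match["name"], match["state"], int(match["pid"]), int(match["ppid"]))
            )
    return tasks


def stuck_tasks(log_text):
    """Keep only tasks in uninterruptible sleep (state D)."""
    return [task for task in tasks_from_dump(log_text) if task[1] == "D"]
```

On the snippet above every task is in state S, so stuck_tasks() would
return an empty list, matching a system that is running fine.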

-- 
Cheers, Gene

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-04 13:10             ` Doug McNaught
@ 2004-11-04 14:11               ` Gene Heskett
  2004-11-04 14:42                 ` tlaurent
  0 siblings, 1 reply; 99+ messages in thread
From: Gene Heskett @ 2004-11-04 14:11 UTC (permalink / raw)
  To: linux-kernel; +Cc: Doug McNaught, Jan Knutar, Tom Felker

On Thursday 04 November 2004 08:10, Doug McNaught wrote:
>Gene Heskett <gene.heskett@verizon.net> writes:
>> [root@coyote linux-2.6.10-rc1-bk13]# grep SYSRQ .config
>> CONFIG_MAGIC_SYSRQ=y
>
>Did you also enable it in /proc?
>
>-Doug

I just now discovered it defaults to a 0, so I put an 
echo 1 >/proc/sys/kernel/sysrq
in rc.local just now.

Thanks for the heads up.

-- 
Cheers, Gene

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-04 10:26                     ` DervishD
@ 2004-11-04 14:23                       ` Paul Slootman
  2004-11-04 14:56                         ` Gene Heskett
  2004-11-04 18:24                         ` DervishD
  2004-11-04 19:22                       ` Bill Davidsen
  1 sibling, 2 replies; 99+ messages in thread
From: Paul Slootman @ 2004-11-04 14:23 UTC (permalink / raw)
  To: linux-kernel

DervishD  <lkml@dervishd.net> wrote:
>
>    If init is the parent, all works ok, just wait a bit and all
>those zombies will really die ;)

I recently had a system with a serial console where for some reason the
serial port was stopped. This meant that init blocked while writing some
message (e.g. "respawning too rapidly"), and that meant it stopped
reaping those zombie processes. The list of these zombie processes with
PPID == 1 was amazing. The only thing that helped was rebooting after
replacing the serial console cable.

(Kernel 2.4.25, sysvinit 2.85 in case you're wondering.)


Paul Slootman


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-04 14:07               ` Gene Heskett
@ 2004-11-04 14:24                 ` Ian Campbell
  2004-11-04 15:10                   ` Gene Heskett
  2004-11-04 14:26                 ` DervishD
  1 sibling, 1 reply; 99+ messages in thread
From: Ian Campbell @ 2004-11-04 14:24 UTC (permalink / raw)
  To: gene.heskett; +Cc: linux-kernel, Jan Knutar, Tom Felker

On Thu, 2004-11-04 at 09:07 -0500, Gene Heskett wrote:
> [root@coyote root]#  cat /proc/sys/kernel/sysrq
> 0

Aha :-)

> And no, I'm not turning it off anyplace in the boot proceedure.

Something must be -- you can see in drivers/char/sysrq.c that
sysrq_enabled is set to 1 by default and according to bkbits.net it has
been that way since at least 2.4.0.

Does the following not come up with any culprits?
	# grep -r sysrq /etc

Ian.

-- 
Ian Campbell, Senior Design Engineer
                                        Web: http://www.arcom.com
Arcom, Clifton Road,                    Direct: +44 (0)1223 403 465
Cambridge CB1 7EA, United Kingdom       Phone:  +44 (0)1223 411 200



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-04 14:07               ` Gene Heskett
  2004-11-04 14:24                 ` Ian Campbell
@ 2004-11-04 14:26                 ` DervishD
  2004-11-04 15:13                   ` Gene Heskett
  1 sibling, 1 reply; 99+ messages in thread
From: DervishD @ 2004-11-04 14:26 UTC (permalink / raw)
  To: Gene Heskett; +Cc: linux-kernel, Ian Campbell, Jan Knutar, Tom Felker

    Hi Gene :)

 * Gene Heskett <gene.heskett@verizon.net> dixit:
> Possibly, but OTOH,
> [root@coyote root]#  cat /proc/sys/kernel/sysrq
> 0
> 
> And no, I'm not turning it off anyplace in the boot proceedure.  An
> 'echo 1 >/proc/sys/kernel/sysrq', and repeating the keypresses now
> gets a boatload of stuff in the logs, but nothing on the console.

    Well, the stuff goes to the logs and not the console because of
the console log level. You can change that using proc, too. Look in
/proc/sys/kernel/printk (well, at least under 2.4.x). You'll see
four numbers. The first one is the console loglevel. Any kernel message
with a priority higher than this (that is, a numerically lower level)
will be printed on the console. Otherwise it won't.

    The second number is the default message level. Any message
without a priority will get this priority.

    The third number is the lowest value (that is, the highest
priority) you can assign to the first number (the console loglevel).

    The fourth number is the default value for the first number.

    The interesting number for you is the first one. Set it to a
correct value for you (see syslog(2) to see what the numbers mean).
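
    To make the four numbers concrete, here is a small Python sketch
(added for illustration, not from the original mail). The field names
are labels chosen here, and the sample values in the usage line are
hypothetical. It encodes the rule that a message reaches the console
only if its level is numerically below the console loglevel.

```python
from collections import namedtuple

PrintkLevels = namedtuple(
    "PrintkLevels",
    "console default_message minimum_console default_console",
)


def parse_printk(raw):
    """Parse the four numbers read from /proc/sys/kernel/printk."""
    return PrintkLevels(*(int(field) for field in raw.split()))


def reaches_console(message_level, levels):
    """True if a message at this level is shown on the console.

    Lower numbers mean higher priority, so the message level must be
    strictly below the console loglevel.
    """
    return message_level < levels.console
```

    For example, with parse_printk("4 4 1 7") a level 3 message is
shown on the console and a level 6 message only reaches the logs.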

    Raúl Núñez de Arenas Coronado

-- 
Linux Registered User 88736
http://www.dervishd.net & http://www.pleyades.net/

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-04 14:11               ` Gene Heskett
@ 2004-11-04 14:42                 ` tlaurent
  2004-11-04 15:14                   ` Gene Heskett
  0 siblings, 1 reply; 99+ messages in thread
From: tlaurent @ 2004-11-04 14:42 UTC (permalink / raw)
  To: gene.heskett; +Cc: linux-kernel, Doug McNaught, Jan Knutar, Tom Felker

Selon Gene Heskett <gene.heskett@verizon.net>: 
 
> On Thursday 04 November 2004 08:10, Doug McNaught wrote: 
> >Gene Heskett <gene.heskett@verizon.net> writes: 
> >> [root@coyote linux-2.6.10-rc1-bk13]# grep SYSRQ .config 
> >> CONFIG_MAGIC_SYSRQ=y 
> > 
> >Did you also enable it in /proc? 
> > 
> >-Doug 
>  
> I just now discovered it defaults to a 0, so I put an  
> echo 1 >/proc/sys/kernel/sysrq 
> in rc.local just now. 
 
You might also want to have a look at /etc/sysctl.conf. Some distros put a 
kernel.sysrq=0 in it... 
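
A quick way to check for such a line, sketched in Python. This is only
an illustration: find_sysrq_settings() is a made-up helper, and it
assumes the simple "key = value" form with "#" or ";" comment lines
(it does not handle the slash-separated key spelling).

```python
from typing import List, Tuple


def find_sysrq_settings(conf_text: str) -> List[Tuple[int, str]]:
    """Return (line_number, value) for every active kernel.sysrq line.

    Blank lines and lines starting with '#' or ';' are skipped.
    """
    hits = []
    for lineno, raw in enumerate(conf_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line[0] in "#;":
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "kernel.sysrq":
            hits.append((lineno, value.strip()))
    return hits
```

Called on the text of /etc/sysctl.conf, any hit with value "0" is a
line that switches sysrq back off at boot, whatever rc.local does
before sysctl runs.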
 
Cheers, 
Thibaut 
 
>  
> Thanks for the heads up. 
>  
 


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-04 14:23                       ` Paul Slootman
@ 2004-11-04 14:56                         ` Gene Heskett
  2004-11-04 18:24                         ` DervishD
  1 sibling, 0 replies; 99+ messages in thread
From: Gene Heskett @ 2004-11-04 14:56 UTC (permalink / raw)
  To: linux-kernel; +Cc: Paul Slootman

On Thursday 04 November 2004 09:23, Paul Slootman wrote:
>DervishD  <lkml@dervishd.net> wrote:
>>    If init is the parent, all works ok, just wait a bit and all
>>those zombies will really die ;)
>
>I recently had a system with serial console where for some reason
> the serial port was stopped. This meant that init blocked while
> writing some message (e.g. "respawning too rapidly"), and that
> meant it stopped reaping those zombie processes. The list of these
> zombie processes with PPID == 1 was amazing. The only thing that
> helped was rebooting after replacing the serial console cable.
>
>(Kernel 2.4.25, sysvinit 2.85 in case you're wondering.)

Both serial ports are already in use here Paul, one for heyu and x10 
stuff related to my home automation (mostly the outside lights), and 
the other to my Belkin ups, whose usb interface has never worked, so 
I'm stuck using serial for the BullDog interface to gkrellm.  I'd 
like to find a cheap pci rocketport as I have another vintage box in 
the basement that could use this machine as a network gateway then.  
Right now it's on a PL2303 usb<->serial converter but something's wrong 
with the handshaking on that end.

>Paul Slootman
>

-- 
Cheers, Gene

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-04 14:24                 ` Ian Campbell
@ 2004-11-04 15:10                   ` Gene Heskett
  0 siblings, 0 replies; 99+ messages in thread
From: Gene Heskett @ 2004-11-04 15:10 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ian Campbell, Jan Knutar, Tom Felker

On Thursday 04 November 2004 09:24, Ian Campbell wrote:
>grep -r sysrq /etc

Gets me a bunch.  The relevant ones would be:
/etc/rc.d/rc3.d/K20iscsi:           if [ -e /proc/sys/kernel/sysrq ] ; then
/etc/rc.d/rc3.d/K20iscsi:               echo "1" > /proc/sys/kernel/sysrq

and
/etc/rc.d/rc.local:# Turn on the magic sysrq keys
/etc/rc.d/rc.local:echo 1 >/proc/sys/kernel/sysrq

But, what about this:
/etc/sysctl.conf:# Disables the magic-sysrq key
/etc/sysctl.conf:kernel.sysrq = 0
which I just commented out...

And this:
/etc/linuxconf/archive/Office/etc/sysctl.conf,v:kernel.sysrq = 0
But everything there is dated early 2001.  I think it's filesystem
cruft nowadays, subject to being a space patrol target eventually.

-- 
Cheers, Gene

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-04 14:26                 ` DervishD
@ 2004-11-04 15:13                   ` Gene Heskett
  0 siblings, 0 replies; 99+ messages in thread
From: Gene Heskett @ 2004-11-04 15:13 UTC (permalink / raw)
  To: linux-kernel; +Cc: DervishD, Ian Campbell, Jan Knutar, Tom Felker

On Thursday 04 November 2004 09:26, DervishD wrote:
>    Hi Gene :)
>
> * Gene Heskett <gene.heskett@verizon.net> dixit:
>> Possibly, but OTOH,
>> [root@coyote root]#  cat /proc/sys/kernel/sysrq
>> 0
>>
>> And no, I'm not turning it off anyplace in the boot procedure. 
>> An 'echo 1 >/proc/sys/kernel/sysrq', and repeating the keypresses
>> now gets a boatload of stuff in the logs, but nothing on the
>> console.
>
>    Well, the stuff goes to the logs and not the console because of
>the console log level. You can change that using proc, too. Look in
>/proc/sys/kernel/printk (well, at least under 2.4.x). You'll see
>four numbers. The first one is the console loglevel. Any message
>directed to syslog with a priority higher than this number (that is,
>a numerically lower level) will be printed on the console. Otherwise
>it won't.
>
>    The second number is the default message level. Any message
>without a priority will get this priority.
>
>    The third number is the lowest value you can assign to the
> first number (the console loglevel).
>
>    The fourth number is the default value for the first number.
>
>    The interesting number for you is the first one. Set it to a
>correct value for you (see syslog(2) to see what the numbers mean).
>
>    Raúl Núñez de Arenas Coronado

I have it going to the logs as the preferred method as that's permanent 
whereas the console output is 100% volatile.  That way I can look at 
the logs when the machine has been made functional again.

-- 
Cheers, Gene

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-04 14:42                 ` tlaurent
@ 2004-11-04 15:14                   ` Gene Heskett
  0 siblings, 0 replies; 99+ messages in thread
From: Gene Heskett @ 2004-11-04 15:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: tlaurent, gene.heskett, linux-kernel, Doug McNaught, Jan Knutar,
	Tom Felker

On Thursday 04 November 2004 09:42, tlaurent@linagora.com wrote:
>Selon Gene Heskett <gene.heskett@verizon.net>:
>> On Thursday 04 November 2004 08:10, Doug McNaught wrote:
>> >Gene Heskett <gene.heskett@verizon.net> writes:
>> >> [root@coyote linux-2.6.10-rc1-bk13]# grep SYSRQ .config
>> >> CONFIG_MAGIC_SYSRQ=y
>> >
>> >Did you also enable it in /proc?
>> >
>> >-Doug
>>
>> I just now discovered it defaults to a 0, so I put an
>> echo 1 >/proc/sys/kernel/sysrq
>> in rc.local just now.
>
>You might also want to have a look at /etc/sysctl.conf. Some distros
> put a kernel.sysrq=0 in it...

And I just put a comment in front of that puppy!

>Cheers,
>Thibaut
>
>> Thanks for the heads up.
>>

-- 
Cheers, Gene

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 16:47       ` Gene Heskett
  2004-11-03 17:44         ` DervishD
@ 2004-11-04 16:01         ` kernel
  2004-11-04 16:18           ` Gene Heskett
  1 sibling, 1 reply; 99+ messages in thread
From: kernel @ 2004-11-04 16:01 UTC (permalink / raw)
  To: gene.heskett; +Cc: linux-kernel, DervishD, Måns Rullgård

On Wed, 2004-11-03 at 11:47, Gene Heskett wrote:
> Finding them is usually an exercise in stretching the 
> top window out till it's about 20 screens high as it's always going to 
> be at the bottom of the list.

use 'htop' instead, more flexible in showing and parsing.


-fd




^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-04 16:01         ` kernel
@ 2004-11-04 16:18           ` Gene Heskett
  2004-11-04 16:47             ` kernel
  0 siblings, 1 reply; 99+ messages in thread
From: Gene Heskett @ 2004-11-04 16:18 UTC (permalink / raw)
  To: linux-kernel, kernel; +Cc: DervishD, Måns Rullgård

On Thursday 04 November 2004 11:01, kernel wrote:
>On Wed, 2004-11-03 at 11:47, Gene Heskett wrote:
>> Finding them is usually an exercise in stretching the
>> top window out till it's about 20 screens high as it's always going
>> to be at the bottom of the list.
>
>use 'htop' instead, more flexible in showing and parsing.
>
And where is htop, it apparently isn't part of an FC2 install.
>
>-fd

-- 
Cheers, Gene

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 22:15               ` Jim Nelson
  2004-11-03 22:44                 ` Russell Miller
@ 2004-11-04 16:30                 ` Pedro Venda (SYSADM)
  2004-11-04 22:28                   ` Helge Hafting
  1 sibling, 1 reply; 99+ messages in thread
From: Pedro Venda (SYSADM) @ 2004-11-04 16:30 UTC (permalink / raw)
  To: linux-kernel; +Cc: Jim Nelson

Jim Nelson wrote:
> DervishD wrote:
> 
>>     Hi Gene :)
>>
>>  * Gene Heskett <gene.heskett@verizon.net> dixit:
>>
>>>>   Then the children are reparented to 'init' and 'init' gets rid
>>>> of them. That's the way UNIX behaves.
>>>
>>>
>>> Unforch, I've *never* had it work that way.  Any dead process I've 
>>> ever had while running linux has only been disposable by a reboot.
>>
>>
>>
>>     Well, you know, shit happens... Anyway, could you define 'dead'?
>> Because if you're talking about zombies whose parent dies, they're
>> killable easily: just wait until init reaps them (usually in less
>> than 5 minutes after they die). If you are talking about zombies that
>> have their parent alive, then it's a bug in the application, not the
>> kernel. In fact I wouldn't like it if the kernel reaped my children
>> before I do, just in case I want to do something.
>>
>>     If you're talking about unkillable processes (those stuck in
>> disk-sleep state), you're right: only rebooting can kill them
>> (although sometimes they go out of D state and die normally). Bad
>> luck for you if any dead process you've ever had while running linux
>> has been of this kind :(
>>
> 
> I did this to myself a number of times when I was first learning Samba - 
> even an ls would become unkillable.  You couldn't rmmod smb, since it 
> was in use, and you couldn't kill the process, since it was waiting on a 
> syscall.  Ergh.

the exact same happened to me, but my case was with ntfs. zip processes 
just got stuck in "D" state because of some unhandled names... i 
couldn't kill the processes. i don't think this is an easy thing to do, 
though it should be possible to kill -9 these processes and make them exit.

is this feasible?

regards,
pedro venda.
-- 

Pedro João Lopes Venda
email: pjvenda@rnl.ist.utl.pt
http://maxwell.rnl.ist.utl.pt

Equipa de Administração de Sistemas
Rede das Novas Licenciaturas (RNL)
Instituto Superior Técnico
http://www.rnl.ist.utl.pt
http://mega.ist.utl.pt

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-04 16:18           ` Gene Heskett
@ 2004-11-04 16:47             ` kernel
  2004-11-04 17:58               ` Gene Heskett
  0 siblings, 1 reply; 99+ messages in thread
From: kernel @ 2004-11-04 16:47 UTC (permalink / raw)
  To: gene.heskett; +Cc: linux-kernel, DervishD, Måns Rullgård

On Thu, 2004-11-04 at 11:18, Gene Heskett wrote:

> And where is htop, it apparently isn't part of an FC2 install.
> >


http://htop.sourceforge.net/

from site above;
Comparison between htop and top
      * In 'htop' you can scroll the list vertically and horizontally to
        see all processes and complete command lines.
      * In 'top' you are subject to a delay for each unassigned key you
        press (especially annoying when multi-key escape sequences are
        triggered by accident).
      * 'htop' starts faster ('top' seems to collect data for a while
        before displaying anything).
      * In 'htop' you don't need to type the process number to kill a
        process, in 'top' you do.
      * In 'htop' you don't need to type the process number or the
        priority value to renice a process, in 'top' you do.
      * 'htop' supports mouse operation, 'top' doesn't
      * 'top' is older, hence, more used and tested.



cheers!

-fd


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-04 10:04                   ` Helge Hafting
@ 2004-11-04 17:16                     ` Alex Bennee
  0 siblings, 0 replies; 99+ messages in thread
From: Alex Bennee @ 2004-11-04 17:16 UTC (permalink / raw)
  To: Helge Hafting
  Cc: Russell Miller, Jim Nelson, DervishD, Gene Heskett,
	Linux Kernel Mailing List, Måns Rullgård

On Thu, 2004-11-04 at 10:04, Helge Hafting wrote:
> Russell Miller wrote:
> >On Wednesday 03 November 2004 16:15, Jim Nelson wrote:
> >
> >Anyway, is there a way to simply signal a syscall that it is to be interrupted 
> >and forcibly cause the syscall to end? 
> >
> There is a way.  Processes going into D state happens all the time
> when waiting for disk io or similar.  Then the io happens a few ms later,
> and the fs or device driver tells the kernel to wake up the process
> so it gets a chance at the next scheduling opportunity. So the mechanism to
> unstick a process exists, and is used by every device driver that
> uses sleeping.  Which is most of them.
> 
> Breakage happens when something never comes out of D-state.
> One could write a trivial syscall (or addition to "kill") that "wakes"
> processes waiting for io.  It isn't hard to do at all - just copy the
> waking code from any device driver.  This will allow one to kill and
> fully remove any process that hangs around in D-state.  This might
> also release other stuck resources as the syscall
> continues, returns to userspace, and allows the process to die.
> 
> Unfortunately, this isn't enough.  In some cases the syscall
> expects the io device interrupt handler to have done something
> vital - but this hasn't happened when we forcibly wake a process.
> We can hope for an io error, but might get a crash instead. This
> can be fixed with a lot of work - basically check at every wakeup
> if the process was woken by this new killing mechanism and
> act accordingly.  It shouldn't be hard, but _lots_ of work
> inspecting every sleeping point, at least every device driver.

Timeouts and interruptible sleeps are the two ways to solve the problem.
All good drivers should have covering timeouts in case the event they
were hoping for never happens.

If the code path that assumes magic has happened after it wakes up
doesn't check, it's not defensive enough. Also you can make tasks
interruptible so signals can get through:

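/* Sleep until dev_irq_event(dev) is true, but let a signal wake us early. */
result = wait_event_interruptible(dev->waitq, dev_irq_event(dev));

if (result) {
     /* Nonzero means a signal arrived before the event did. */
     printk(KERN_ALERT "dev_irq_wait: Interrupted by a signal\n");
     return -ERESTARTSYS;
}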

As you have noted you can't always make things interruptible, but decent
timeouts should always exist. Hardware has bugs too!
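
To make the timeout idea concrete, here is a small user-space analogy in
Python. It is not kernel code and not anything from a real driver: the
names wait_for_completion() and do_io() are invented for this sketch,
and threading.Event merely stands in for "the interrupt handler
signalled completion". In a driver the same role would be played by one
of the timeout variants of the wait macros, such as
wait_event_interruptible_timeout().

```python
import threading


def wait_for_completion(done: threading.Event, timeout_s: float) -> bool:
    """Wait for `done`, but give up after timeout_s seconds.

    Returns True if the event arrived, False if the wait timed out.
    """
    return done.wait(timeout=timeout_s)


def do_io(done: threading.Event, timeout_s: float = 0.5) -> str:
    """Hypothetical caller: report an error on timeout instead of hanging."""
    if not wait_for_completion(done, timeout_s):
        return "EIO: device did not answer in time"
    return "ok"
```

The point of the design is that the caller always gets control back and
can fail the request, so a device that never answers costs one error
return rather than a process parked forever.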
-- 
Alex, Kernel Hacker: http://www.bennee.com/~alex/

In English, every word can be verbed.  Would that it were so in our
programming languages.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-04 16:47             ` kernel
@ 2004-11-04 17:58               ` Gene Heskett
  0 siblings, 0 replies; 99+ messages in thread
From: Gene Heskett @ 2004-11-04 17:58 UTC (permalink / raw)
  To: linux-kernel, kernel; +Cc: DervishD, Måns Rullgård

On Thursday 04 November 2004 11:47, kernel wrote:
>On Thu, 2004-11-04 at 11:18, Gene Heskett wrote:
>> And where is htop, it apparently isn't part of an FC2 install.
>
>http://htop.sourceforge.net/
>
Thanks, got it.  Looks good, more thanks...

-- 
Cheers, Gene

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-04 14:23                       ` Paul Slootman
  2004-11-04 14:56                         ` Gene Heskett
@ 2004-11-04 18:24                         ` DervishD
  1 sibling, 0 replies; 99+ messages in thread
From: DervishD @ 2004-11-04 18:24 UTC (permalink / raw)
  To: Paul Slootman; +Cc: linux-kernel

    Hi Paul :)

 * Paul Slootman <paul+nospam@wurtel.net> dixit:
> >    If init is the parent, all works ok, just wait a bit and all
> >those zombies will really die ;)
> I recently had a system with serial console where for some reason the
> serial port was stopped. This meant that init blocked while writing some
> message (e.g. "respawning too rapidly"), and that meant it stopped
> reaping those zombie processes. The list of these zombie processes with
> PPID == 1 was amazing. The only thing that helped was rebooting after
> replacing the serial console cable.

    It looks like a bug in sysvinit: it shouldn't print anything on
the console but use syslog and specify that the console NEVER shall
be used to print anything even when there is no syslogd running. I'll
make sure that it doesn't happen in my VCinit.

    Thanks for the information :)

    Raúl Núñez de Arenas Coronado

-- 
Linux Registered User 88736
http://www.dervishd.net & http://www.pleyades.net/

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-04 10:26                     ` DervishD
  2004-11-04 14:23                       ` Paul Slootman
@ 2004-11-04 19:22                       ` Bill Davidsen
  2004-11-04 20:53                         ` DervishD
  1 sibling, 1 reply; 99+ messages in thread
From: Bill Davidsen @ 2004-11-04 19:22 UTC (permalink / raw)
  To: DervishD
  Cc: Gene Heskett, linux-kernel, Valdis.Kletnieks, Måns Rullgård

DervishD wrote:
>     Hi Bill :)
> 
>  * Bill Davidsen <davidsen@tmr.com> dixit:
> 
>>>   I think that the parent (which is whatever process did the fork
>>>when you clicked your mouse) is still alive and forgetting to do the
>>>'wait()' for its children.
>>
>>It would be good to know what the PPID is, from ps or similar. Things 
>>from X are a pain, the parent is often something you don't want to kill. 
>>Sometimes you can reparent from command line, "bash -c foo&" or similar, 
>>so the parent can be killed without logging out.
> 
> 
>     Just use ps to reveal the family tree. Is not that hard ;)

That's what I just said, the original poster should tell us what the 
PPID is, which may help someone help the OP.
>  
> 
>>I would swear that the parent *is* init in some cases, which is puzzling 
>>since they should be reaped.
> 
> 
>     But that's OK :))) When a parent dies without waiting for its
> children, the zombies are reparented to init. That's correct. Then
> init will wait for them. The problem is that sometimes the signals
> don't arrive or the like. Then the zombies are lying around a bit,
> until a timer in 'init' reaps them. That's correct too: init can only
> wait for children when it receives SIGCHLD or periodically, using a
> timer. I've written a init program and that's the way I do it, just
> in case some signal gets lost.
> 
>     If init is the parent, all works ok, just wait a bit and all
> those zombies will really die ;)

Actually the ones in i/o probably won't, since the kernel either missed 
the completion or didn't time out if the hardware missed sending the 
int. And even plain non-i/o zombies, just how long "a bit" are you 
proposing?
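
For reference, the "reap on SIGCHLD or on a timer" pattern described in
the quoted text boils down to a non-blocking wait loop. The Python
sketch below is only an illustration of that pattern, not the code of
sysvinit or of any other init; reap_children() is a made-up name.

```python
import os
from typing import Dict


def reap_children() -> Dict[int, int]:
    """Collect every child that has already exited, without blocking.

    Returns {pid: raw wait status}. Intended to be called after a
    SIGCHLD arrives and again from a periodic timer, so that a lost
    signal only delays the cleanup.
    """
    reaped = {}
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break            # ECHILD: no children at all
        if pid == 0:
            break            # children exist, but none has exited yet
        reaped[pid] = status
    return reaped
```

Looping until nothing is left matters because several children exiting
close together can produce a single SIGCHLD. With a timer as backstop,
"a bit" is at most one timer period, provided the parent itself is not
blocked.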

Over Thanksgiving weekend I will try to look at the init code and see if 
a signal could be used to initiate a forced reap without waiting for the 
timer. By "look at" I mean not only "could I do that" but is it a good 
thing to do, before someone starts trying to explain that it's going to 
do something evil not to wait for the timer...

-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 20:09                   ` Gene Heskett
@ 2004-11-04 19:24                     ` Bill Davidsen
  0 siblings, 0 replies; 99+ messages in thread
From: Bill Davidsen @ 2004-11-04 19:24 UTC (permalink / raw)
  To: gheskett
  Cc: linux-kernel, Valdis.Kletnieks, DervishD, Måns Rullgård

Gene Heskett wrote:
> On Wednesday 03 November 2004 14:33, Valdis.Kletnieks@vt.edu wrote:
> 
>>On Wed, 03 Nov 2004 14:26:23 EST, Gene Heskett said:
>>
>>>Well, since the "device", a bt878 based Hauppauge tv card is
>>>sitting in a pci socket, that's even more drastic than a reboot.
>>
>>Not if you have a good hot-swap PCI cage. ;)
>>
>>Anyhow, that points even more at a driver issue for the bt878 -
>>if you can get Sysrq-T output, where does it say the hung process is
>>inside the kernel?
> 
> 
> Thats another thing I've had compiled in since forever, but it so 
> seldom actually *works*, I've tended to forget about it.
> 
You have it enabled as well as compiled in, I'm sure.

-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-04 10:23         ` DervishD
@ 2004-11-04 19:32           ` Bill Davidsen
  2004-11-04 21:11             ` DervishD
  0 siblings, 1 reply; 99+ messages in thread
From: Bill Davidsen @ 2004-11-04 19:32 UTC (permalink / raw)
  To: DervishD; +Cc: Måns Rullgård, linux-kernel

DervishD wrote:
>     Hi Bill :)
> 
>  * Bill Davidsen <davidsen@tmr.com> dixit:
> 
>>>   Or write a little program that just 'wait()'s for the specified
>>>PID's. That is perfectly portable IMHO. But I must admit that the
>>>preferred way should be killing the parent. 'init' will reap the
>>>children after that.
>>
>>You can't wait() for the process, you have to use waitfor(), and the 
>>last time I tried that it didn't work, although I don't remember the 
>>symptom beyond that.
> 
> 
>     You can't wait for other's children. OTOH, if we talk about your
> children, you can do wait() or waitpid() (I assume that you referred
> to waitpid(), since there isn't waitfor() AFAIK). The only difference
> is that wait suspends the process until information from a child is
> available.

Yes, thank you, I was thinking "wait for the PID" and typed that.
> 
>     If you are talking about others' children, then your call to
> waitpid() (or wait()) failed with ECHILD: not your child.

That's what happened when I tried it a few months ago. I suppose one 
could try sending a SIGCHLD to the parent and see if it does something 
helpful.
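
The ECHILD behaviour is easy to see from user space. A minimal Python
sketch, with reap() being a name invented for this example:

```python
import os
from typing import Optional


def reap(pid: int) -> Optional[int]:
    """Block until `pid` exits and return its exit code.

    Returns None when the kernel refuses with ECHILD, i.e. `pid` is not
    a child of the calling process (or has already been reaped).
    """
    try:
        _, status = os.waitpid(pid, 0)
    except ChildProcessError:      # errno == ECHILD
        return None
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    # Killed by a signal; returned as a negative number by choice here.
    return -os.WTERMSIG(status)
```

Only the real parent gets an answer from the first branch, which is why
a separate "zombie reaper" program cannot clean up somebody else's
children.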

-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 23:33                     ` Russell Miller
  2004-11-03 23:47                       ` Mathieu Segaud
  2004-11-04  6:39                       ` Denis Vlasenko
@ 2004-11-04 20:06                       ` Bill Davidsen
  2 siblings, 0 replies; 99+ messages in thread
From: Bill Davidsen @ 2004-11-04 20:06 UTC (permalink / raw)
  To: Russell Miller
  Cc: Doug McNaught, Jim Nelson, DervishD, Gene Heskett, linux-kernel,
	Måns Rullgård

Russell Miller wrote:
> On Wednesday 03 November 2004 17:03, Doug McNaught wrote:
> 
> 
>>It was already mentioned in this thread that the bookkeeping required
>>to clean up properly from such an abort would add a lot of overhead
>>and slow down the normal, non-buggy case.
>>
> 
> I am going to continue pursuing this at the risk of making a bigger fool of 
> myself than I already am, but I want to make sure that I understand the 
> issues - and I did read the message you are referring to.
> 
> I think what you are saying is that there is kind of a race condition here.

At least in the usual sense, no. There is a condition from which there 
is no graceful way back, only forward.

> When something is on the wait queue, it has to be followed through to 
> completion.  An interrupt could be received at any time, and if it's taken 
> off of the wait queue prematurely, it'll crash the kernel, because the 
> interrupt has no way of telling that.

That's part of it, but in some cases there's also i/o in progress, the 
hardware may not have a way to HALT the transfer, so the memory in 
question can't be used for something else.
> 
> That's fine as it goes, I understand that.  But I submit that this is a 
> horrible design.  I've been bitten by this more than once - usually regarding 
> broken NFS connections.
> 
> But what I don't understand is why the bookkeeping would be so inefficient.  
> It seems to me that all that would be required is a bitfield of some sort.  
> If that position in the wait queue becomes invalid, when the interrupt is 
> received to process it, the kernel notes that a flag is set invalidating that 
> part of the wait queue, dumps the output to dave null, and goes on to the 
> next.  This doesn't seem inefficient to me, unless I'm missing something.
> A little more inefficient, yes, but not to near the cost that seems to be 
> implied.
> 
> And I also have to ask this question:  what is more inefficient, slowing down 
> processing of output waiting on the queue, or having to reboot when a process 
> gets stuck due to faulty drivers?  At the very least, a compile option seems 
> like it would be worthwhile for those that would like this behavior.
> 
> And I probably am.  Missing something, that is.

You are asking to program around a problem rather than fix it. These 
hangs (usually) happen because the hardware behaviour is either 
undocumented, incorrectly documented, or flat out broken. Second likely 
cause is a bug in the driver.

In the case of a real bug, adding code to bypass the error instead of 
fixing it is more effort, more complex in most cases, and therefore less 
reliable. Where the hardware does something unexpected, the driver needs 
to fit the behaviour rather than the spec. And where the hardware is 
broken, you fix or replace it. None of those cases suggest "pretend it 
didn't happen," because in most cases you can't.

What I think you are missing:

Processes hung in D state are the result of real problems, and ignoring 
rather than fixing them is like giving a cancer patient a face lift; it 
doesn't fix the problem, it just gives you a good looking corpse.

-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-04  1:59                 ` Mitchell Blank Jr
@ 2004-11-04 20:10                   ` Bill Davidsen
  0 siblings, 0 replies; 99+ messages in thread
From: Bill Davidsen @ 2004-11-04 20:10 UTC (permalink / raw)
  To: Mitchell Blank Jr; +Cc: Russell Miller, linux-kernel

Mitchell Blank Jr wrote:
> Russell Miller wrote:
> 
>>Couldn't ring 1 be used to make 
>>sure an errant driver doesn't drop the kernel, at least on x86 machines?
> 
> 
> Not really -- drivers could still do things like mis-program their associated
> hardware making it do DMA writes all over kernel memory (just as one example)
> 
> Basically it'd add a lot of complexity (and inefficiency) without adding
> much real safety.

It would be nice on x86 to run ring 1 for kernel debugging, getting 
faults at appropriate points. Sorry, I'm an old MULTICS guy, wish 
Honeywell would OS it.

-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: is killing zombies possible w/o a reboot?
  2004-11-04 12:39           ` Gene Heskett
  2004-11-04 13:01             ` Ian Campbell
  2004-11-04 13:10             ` Doug McNaught
@ 2004-11-04 20:18             ` Bill Davidsen
  2 siblings, 0 replies; 99+ messages in thread
From: Bill Davidsen @ 2004-11-04 20:18 UTC (permalink / raw)
  To: gene.heskett; +Cc: linux-kernel, Jan Knutar, Tom Felker

Gene Heskett wrote:
> On Thursday 04 November 2004 07:12, Jan Knutar wrote:
> 
>>On Thursday 04 November 2004 13:57, Gene Heskett wrote:
>>
>>>I've had that turned on since forever Jan, but usually, when it's
>>>hung someplace, it's well and truly hung, and hardware reset
>>>button time.
>>
>>Are you saying that these zombies (or tasks stuck in state D) also
>>make sysrq-T hang, and not list all tasks?
> 
> 
> I thought I'd test it right now while the system is running normally, 
> but I got only a beep from the console, so I went to 
> Documentation/sysrq.txt to make sure I was doing it right, and it is 
> _not_ working right now.  But it is compiled in according to a make 
> xconfig, or a grep of the .config.
> 
> [root@coyote linux-2.6.10-rc1-bk13]# grep SYSRQ .config
> CONFIG_MAGIC_SYSRQ=y
> 
> I get a couple of beeps from the console, but thats the limit of the 
> response, and a tail -f on the log shows nothing.  I also logged into  
> VC2, and tried it there, but that attempt didn't even get me a beep, 
> several times.
> 
> The keyboard is a cheap ($24) M$ with a few extra buttons that don't 
> do anything along the top.  And getting a bit creaky in its old age, 
> a lot like me, but I'm about 68 years older than the keyboard :)
> 
Don't need to log in, do need two hands to hit all the keys at once;-) 
It works for me on a VC and unhung system, but I agree, when the system 
is well and truly hung reset is the only thing left.
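
Besides CONFIG_MAGIC_SYSRQ there is also a run-time switch in
/proc/sys/kernel/sysrq. Nothing in this thread says that was the cause
here, so take the following only as a sketch for checking it. It assumes
the encoding documented for current kernels (0 = off, 1 = everything on,
larger values = a bitmask in which 8 enables process dumps such as
sysrq-T); older kernels may treat the value as a plain on/off flag. The
function name is made up for the example.

```python
from pathlib import Path

SYSRQ_PATH = Path("/proc/sys/kernel/sysrq")
SYSRQ_ENABLE_DUMP = 0x8  # bit documented as "debugging dumps of processes etc."


def task_dump_allowed(value: int) -> bool:
    """Return True if a kernel.sysrq value would permit the sysrq-T task dump."""
    if value == 0:
        return False          # sysrq switched off entirely
    if value == 1:
        return True           # all sysrq functions enabled
    return bool(value & SYSRQ_ENABLE_DUMP)  # bitmask form


if __name__ == "__main__":
    if SYSRQ_PATH.exists():
        current = int(SYSRQ_PATH.read_text().strip())
        print(f"kernel.sysrq = {current}, sysrq-T allowed: {task_dump_allowed(current)}")
    else:
        print("no /proc/sys/kernel/sysrq here (sysrq not compiled in, or not Linux)")
```

If it reports that sysrq-T is not allowed, writing 1 to that file as root
turns every function on.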


-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me

* Re: is killing zombies possible w/o a reboot?
  2004-11-04 19:22                       ` Bill Davidsen
@ 2004-11-04 20:53                         ` DervishD
  0 siblings, 0 replies; 99+ messages in thread
From: DervishD @ 2004-11-04 20:53 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Gene Heskett, linux-kernel, Valdis.Kletnieks, Måns Rullgård

    Hi Bill :)

 * Bill Davidsen <davidsen@tmr.com> dixit:
> >    If init is the parent, all works ok, just wait a bit and all
> >those zombies will really die ;)
> Actually the ones in i/o probably won't, since the kernel either missed 
> the completion or didn't time out if the hardware missed sending the 
> int. And even plain non-i/o zombies, just how long "a bit" are you 
> proposing?

    A zombie *is already dead*, not stuck in some uninterruptible
queue in the kernel, so it will be reaped, sure. My last sentence
in the paragraph above may be confusing: when I said 'really die' I
meant 'be reaped'.

> Over Thanksgiving weekend I will try to look at the init code and see if 
> a signal could be used to initiate a forced reap without waiting for the 
> timer. By "look at" I mean not only "could I do that" but is it a good 
> thing to do, before someone starts trying to explain that it's going to 
> do something evil not to wait for the timer...

    Don't look: just send SIGCHLD to init. That will do.

    Raúl Núñez de Arenas Coronado

-- 
Linux Registered User 88736
http://www.dervishd.net & http://www.pleyades.net/

* Re: is killing zombies possible w/o a reboot?
  2004-11-04 19:32           ` Bill Davidsen
@ 2004-11-04 21:11             ` DervishD
  2004-11-09 23:31               ` Bill Davidsen
  0 siblings, 1 reply; 99+ messages in thread
From: DervishD @ 2004-11-04 21:11 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Måns Rullgård, linux-kernel

    Hi Bill :)

 * Bill Davidsen <davidsen@tmr.com> dixit:
> >    If you are talking about others' children, then your call to
> >waitpid() (or wait()) failed with ECHILD: not your child.
> That's what happened when I tried it a few months ago. I suppose one 
> could try sending a SIGCHLD to the parent and see if it does something 
> helpful.

    Probably it won't do. If the zombies are there due to a signal
delivery problem, sending a SIGCHLD to the parent will (probably)
solve the problem. But the common case is that the parent is screwed
up or simply so badly programmed that the only way of getting rid of
the zombies is to kill the parent...

    Anyway I suppose that sending the SIGCHLD won't do any harm so it
may be worth trying.
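
    For reference, this is roughly what a parent that handles SIGCHLD
properly looks like, and why a hand-sent SIGCHLD can help it: the
handler does not care who sent the signal, it just collects whatever
has exited. It is only a sketch; reap_children, spawn and the reaped
dict are names invented for the example, not taken from any real
program.

```python
import os
import signal
import time

reaped = {}  # pid -> exit code, filled in by the handler


def reap_children(signum=None, frame=None):
    """Collect every child that has already exited, without blocking."""
    while True:  # loop: one SIGCHLD may stand for several exited children
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no children left at all
        if pid == 0:
            return  # children exist, but none has exited yet
        if os.WIFEXITED(status):
            reaped[pid] = os.WEXITSTATUS(status)
        else:
            reaped[pid] = -os.WTERMSIG(status)  # killed by a signal


def spawn(exit_code):
    """Fork a child that exits at once with exit_code; return its pid."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)
    return pid


if __name__ == "__main__":
    signal.signal(signal.SIGCHLD, reap_children)
    kids = [spawn(code) for code in (0, 3)]
    deadline = time.monotonic() + 5
    while len(reaped) < len(kids) and time.monotonic() < deadline:
        time.sleep(0.05)
    print(reaped)
```

    From a shell, 'kill -CHLD <pid of the parent>' triggers the same
handler. If the parent never installed one, the signal is simply
ignored and the zombies stay.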

    Raúl Núñez de Arenas Coronado

-- 
Linux Registered User 88736
http://www.dervishd.net & http://www.pleyades.net/

* Re: is killing zombies possible w/o a reboot?
  2004-11-04 16:30                 ` Pedro Venda (SYSADM)
@ 2004-11-04 22:28                   ` Helge Hafting
  0 siblings, 0 replies; 99+ messages in thread
From: Helge Hafting @ 2004-11-04 22:28 UTC (permalink / raw)
  To: Pedro Venda (SYSADM); +Cc: linux-kernel, Jim Nelson

On Thu, Nov 04, 2004 at 04:30:47PM +0000, Pedro Venda (SYSADM) wrote:
> Jim Nelson wrote:
> >DervishD wrote:
> 
> the exact same happened to me, but my case was with ntfs. zip processes 
> just got stuck in "D" state because of some unhandled names... i 
> couldn't kill the processes. i don't think this is an easy thing to do, 
> though it should be possible to kill -9 these processes and make them exit.
> 
> is this feasible?
> 
The correct approach here is to fix ntfs so it doesn't make processes
wait forever for anything.  There is no need for a workaround.

Helge Hafting

* Re: is killing zombies possible w/o a reboot?
  2004-11-04 10:07         ` Matthias Andree
@ 2004-11-04 22:31           ` Peter Chubb
  2004-11-04 23:33           ` Benno
  1 sibling, 0 replies; 99+ messages in thread
From: Peter Chubb @ 2004-11-04 22:31 UTC (permalink / raw)
  To: Matthias Andree; +Cc: Gene Heskett, linux-kernel

>>>>> "Matthias" == Matthias Andree <matthias.andree@gmx.de> writes:

Matthias> On Wed, 03 Nov 2004, Gene Heskett wrote:
>> >Yes it does - the problem is that not all resources are managed
>> >by processes.  Some allocations are managed by drivers, so a driver
>> >bug can get the device into an unusable state _and_ tie up the
>> >process(es) that were using the driver at the moment.
>> 
>> This from my viewpoint, is wrong.  The kernel, and only the kernel
>> should be ultimately responsible for handing out resources, and
>> reclaiming at its convenience.

Matthias> Linux's driver model is the way it is. If you want the
Matthias> kernel to clean up after a driver has puked, you need
Matthias> something like a microkernel I believe, where only a minimal
Matthias> core kernel is a real kernel and where all the drivers are
Matthias> actually in user-space, but that's no longer Linux then.

Matthias> I'm not reflecting on the down- and upsides of this as I
Matthias> have no experience with microkernels (and have never used
Matthias> OS9 or GNU Hurd either). I know there have been attempts to
Matthias> port Linux to a Microkernel but I don't know what's come out
Matthias> of it.

There are actually several ports of Linux onto microkernels, but the
only one I know anything about is the Wombat project here at UNSW.

Linux running on the L4 microkernel runs at around the same speed as
on the bare metal.  The home page is at
http://www.disy.cse.unsw.edu.au/Software/Wombat/ but there's not much
there yet.

-- 
Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
The technical we do immediately,  the political takes *forever*

* Re: is killing zombies possible w/o a reboot?
  2004-11-04 10:07         ` Matthias Andree
  2004-11-04 22:31           ` Peter Chubb
@ 2004-11-04 23:33           ` Benno
  1 sibling, 0 replies; 99+ messages in thread
From: Benno @ 2004-11-04 23:33 UTC (permalink / raw)
  To: Gene Heskett, linux-kernel

On Thu Nov 04, 2004 at 11:07:49 +0100, Matthias Andree wrote:
>On Wed, 03 Nov 2004, Gene Heskett wrote:
>
>> >Yes it does - the problem is that not all resources are managed
>> >by processes.  Some allocations are managed by drivers, so a driver
>> >bug can get the device into an unusable state _and_ tie up the
>> >process(es) that were using the driver at the moment.
>> 
>> This from my viewpoint, is wrong.  The kernel, and only the kernel 
>> should be ultimately responsible for handing out resources, and 
>> reclaiming at its convenience.
>
>Linux's driver model is the way it is. If you want the kernel to clean
>up after a driver has puked, you need something like a microkernel I
>believe, where only a minimal core kernel is a real kernel and where all
>the drivers are actually in user-space, but that's no longer Linux then.

Of course some drivers are already in user-space on Linux. (E.g: X
graphics cards). Work by the Gelato project has added support to the
Linux kernel to allow more complicated drivers (e.g: those requiring
interrupts) to be run outside the kernel on Linux.

http://www.gelato.unsw.edu.au/cgi-bin/viewcvs.cgi/cvs/kernel/usrdrivers/

Cheers,

Benno

* Re: is killing zombies possible w/o a reboot?
  2004-11-03 20:48 ` Tom Felker
  2004-11-03 21:08   ` Gene Heskett
@ 2004-11-05  0:29   ` Gene Heskett
  1 sibling, 0 replies; 99+ messages in thread
From: Gene Heskett @ 2004-11-05  0:29 UTC (permalink / raw)
  To: linux-kernel; +Cc: Tom Felker

On Wednesday 03 November 2004 15:48, Tom Felker wrote:
[...]
>> Isn't there some way to clean up a &^$#^#@)_ zombie?
>
>Ok, let me try to explain what probably happened.
>
>First, terminology.  When one process wants to become two
> processes, it fork()s.  One process is the parent, and one is the
> child.  The child usually exec()s to become a different program. 
> The parent sometimes wants to know when the child ends and whether
> it succeeded.  Thus, the wait() system calls. The parent can either
> check whether a child died, or go to sleep until one does.  When
> the parent is awakened, it's told which child died and what the
> child's exit status was (usually 0 for success).  But if the child
> dies before the parent wait()s, the kernel must keep a record of
> which child died and what its exit status was, and it can't
> reassign the late child's PID yet. This record is a "zombie," and
> shows up under top or ps with the 'Z' state. Zombies do _not_ hold
> open files, memory, or resources of any kind.
>
>That's the technical definition of a zombie, which I'm telling you
> because that's probably not your situation:  I assume you used
> "zombie" as an informal term for a process that you can't kill. 
> Your problem is a process in uninterruptible sleep (the "D" state).
>
>When a process executing in userspace wants information from a
> device, like a disk or TV capture card, it calls read(), and
> context switches into kernel space.  Usually, it will take a moment
> for the data to be available from the device, so the process gets
> put on a wait queue so other processes can run. Obviously nothing
> is deallocated, because everyone expects the process will get its
> data and proceed as normal.  When the device has the data, it
> interrupts the CPU, and the kernel figures out who wanted the data
> and puts them on the run queue.
>
>When a process is on a wait queue waiting for data from a device
> (the D state), it's impossible to kill.  This is because otherwise,
> when the interrupt did come, the structures associated with the
> process would have been freed, and the kernel would crash.  It
> would require an incredible amount of inefficient bookkeeping to
> avoid this, and it's unnecessary because normally, the data request
> will finish (successfully or not), and the process will be woken
> up, or if it was sent SIGKILL, it will be killed.
>
>Long story short, what happened was, some faulty hardware or some
> buggy driver, probably associated with the capture card, had a
> problem and left the process in D state.  Thus, it couldn't be
> killed, and since it had /dev/video open, tvtime couldn't run and
> failed gracefully, and because it held /dev/dsp open, and couldn't
> be killed as the init scripts would normally do in that situation,
> the audio drivers couldn't be unloaded and the boot process hung.
>
>So give us a bunch of information about what hardware you're using,
> output of dmesg, and steps to reproduce the driver bug (if it is
> that).

I cannot do that as it apparently was a transient thing.  After the 
reboot to the next kernel in the series, everything has been working 
as well as can be expected.  I've listened to the radio for about 30 
seconds, and the tv maybe 6 hours since.
Now that I know how to make the magic sysrq actually work and leave 
meaningful stuff in the logs, maybe I can report something that 
might be constructive the next time it happens.  Until then, I wait 
for the other shoe I guess.
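
For the record, the textbook zombie described in the quoted explanation
is easy to reproduce. The sketch below is only an illustration (the
function name is invented, and it assumes a Unix where Python offers
os.waitid with WNOWAIT, as Linux does): the child exits, the parent
holds off reaping, and during that window the PID stays reserved and a
SIGKILL changes nothing. Only the wait call removes the entry.

```python
import os
import signal


def zombie_demo(exit_code=7):
    """Create a zombie, poke at it, then reap it. Returns what was observed."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)  # child: die immediately

    # Block until the child has exited, but leave it unreaped (WNOWAIT),
    # so from here on it is a zombie.
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)

    os.kill(pid, signal.SIGKILL)      # accepted, but there is nothing left to kill
    try:
        os.kill(pid, 0)               # signal 0 only checks that the PID exists
        pid_still_reserved = True
    except ProcessLookupError:
        pid_still_reserved = False

    _, status = os.waitpid(pid, 0)    # the actual reaping
    try:
        os.kill(pid, 0)
        gone_after_wait = False
    except ProcessLookupError:
        gone_after_wait = True

    return pid_still_reserved, os.WEXITSTATUS(status), gone_after_wait
```

The exit status handed back by waitpid is still the child's own, not the
SIGKILL, because the kill arrived after the child was already dead.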

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
99.28% setiathome rank, not too shabby for a WV hillbilly
Yahoo.com attorneys please note, additions to this message
by Gene Heskett are:
Copyright 2004 by Maurice Eugene Heskett, all rights reserved.

* Re: is killing zombies possible w/o a reboot?
  2004-11-04  6:39                       ` Denis Vlasenko
@ 2004-11-05  2:38                         ` Elladan
  2004-11-05  3:10                           ` Tim Connors
  0 siblings, 1 reply; 99+ messages in thread
From: Elladan @ 2004-11-05  2:38 UTC (permalink / raw)
  To: Denis Vlasenko
  Cc: Russell Miller, Doug McNaught, Jim Nelson, DervishD,
	Gene Heskett, linux-kernel, Måns Rullgård

On Thu, Nov 04, 2004 at 08:39:34AM +0200, Denis Vlasenko wrote:
> On Thursday 04 November 2004 01:33, Russell Miller wrote:
> > On Wednesday 03 November 2004 17:03, Doug McNaught wrote:
> > 
> > > It was already mentioned in this thread that the bookkeeping required
> > > to clean up properly from such an abort would add a lot of overhead
> > > and slow down the normal, non-buggy case.
> > >
> > I am going to continue pursuing this at the risk of making a bigger fool of 
> > myself than I already am, but I want to make sure that I understand the 
> > issues - and I did read the message you are referring to.
> > 
> > I think what you are saying is that there is kind of a race condition here.  
> > When something is on the wait queue, it has to be followed through to 
> > completion.  An interrupt could be received at any time, and if it's taken 
> > off of the wait queue prematurely, it'll crash the kernel, because the 
> > interrupt has no way of telling that.
> 
> The problem is in locking. You must not kill a process while it is
> in uninterruptible state because it is uninterruptible
> for a reason - has taken semaphore, or get_cpu(), etc.
> You do want it to do put_cpu(), right?
> 
> Processes must never get stuck in D, it's a kernel bug.
> 
> Find out how the process ended up in D state forever,
> and fix it - that's what I'm trying to do
> in these cases.

Perhaps it would be useful to add some debugging to the kernel for these
cases, somewhat akin to Ingo's preempt trace stuff?

If a process is in D state and receives a SIGKILL, assume it must exit
within a few seconds or it's a bug, and dump as much information about
it as is practical...?
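
Short of kernel support, a rough userspace approximation is to scan
/proc for tasks sitting in D state. This is not the in-kernel debugging
suggested above, just a sketch of the idea; it assumes a Linux /proc,
and the function names are invented for the example.

```python
import os


def parse_stat(line):
    """Split a /proc/<pid>/stat line into (pid, comm, state).

    comm is wrapped in parentheses and may itself contain spaces or ')',
    so locate it via the first '(' and the last ')'.
    """
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren])
    comm = line[open_paren + 1:close_paren]
    state = line[close_paren + 1:].split()[0]
    return pid, comm, state


def tasks_in_state(wanted="D", proc="/proc"):
    """Return [(pid, comm)] for every process currently in the given state."""
    try:
        entries = os.listdir(proc)
    except OSError:
        return []  # no /proc on this system
    found = []
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError):
            continue  # process vanished or unreadable: skip it
        if state == wanted:
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in tasks_in_state("D"):
        print(f"{pid} {comm} is in uninterruptible sleep")
```

A single hit means little, since healthy processes pass through D all the
time. The same PID showing up run after run is the interesting case.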

-J


* Re: is killing zombies possible w/o a reboot?
  2004-11-05  2:38                         ` Elladan
@ 2004-11-05  3:10                           ` Tim Connors
  2004-11-05  3:17                             ` Russell Miller
                                               ` (2 more replies)
  0 siblings, 3 replies; 99+ messages in thread
From: Tim Connors @ 2004-11-05  3:10 UTC (permalink / raw)
  To: Elladan
  Cc: Denis Vlasenko, Russell Miller, Doug McNaught, Jim Nelson,
	DervishD, Gene Heskett, linux-kernel, Måns Rullgård

Elladan <elladan@eskimo.com> said on Thu, 4 Nov 2004 18:38:50 -0800:
> If a process is in D state and receives a SIGKILL, assume it must exit
> within a few seconds or it's a bug, and dump as much information about
> it as is practical...?

Of course, it's not necessarily a bug. Someone could have just kicked
the ethernet, and so your process is stuck waiting for a read/write.

-- 
TimC -- http://astronomy.swin.edu.au/staff/tconnors/
Theoretically one might have been wearing pants at work.
        -- Anthony de Boer in Scary Devil Monastry

* Re: is killing zombies possible w/o a reboot?
  2004-11-05  3:10                           ` Tim Connors
@ 2004-11-05  3:17                             ` Russell Miller
  2004-11-05  4:38                             ` Elladan
  2004-11-05  5:00                             ` Kyle Moffett
  2 siblings, 0 replies; 99+ messages in thread
From: Russell Miller @ 2004-11-05  3:17 UTC (permalink / raw)
  To: Tim Connors
  Cc: Elladan, Denis Vlasenko, Doug McNaught, Jim Nelson, DervishD,
	Gene Heskett, linux-kernel, Måns Rullgård

On Thursday 04 November 2004 21:10, Tim Connors wrote:

> Of course, it's not necessarily a bug. Someone could have just kicked
> the ethernet, and so your process is stuck waiting for a read/write.

But it *is* a process hung in D state after you sent it a kill.  It's safe to 
assume, at least, that something is screwed up somewhere.  More information 
is always a good thing.

--Russell

-- 

Russell Miller - rmiller@duskglow.com - Le Mars, IA
Duskglow Consulting - Helping companies just like you to succeed for ~ 10 yrs.
http://www.duskglow.com - 712-546-5886

* Re: is killing zombies possible w/o a reboot?
  2004-11-05  3:10                           ` Tim Connors
  2004-11-05  3:17                             ` Russell Miller
@ 2004-11-05  4:38                             ` Elladan
  2004-11-05  5:00                             ` Kyle Moffett
  2 siblings, 0 replies; 99+ messages in thread
From: Elladan @ 2004-11-05  4:38 UTC (permalink / raw)
  To: Tim Connors
  Cc: Elladan, Denis Vlasenko, Russell Miller, Doug McNaught,
	Jim Nelson, DervishD, Gene Heskett, linux-kernel, Måns Rullgård

On Fri, Nov 05, 2004 at 02:10:35PM +1100, Tim Connors wrote:
> Elladan <elladan@eskimo.com> said on Thu, 4 Nov 2004 18:38:50 -0800:
> > If a process is in D state and receives a SIGKILL, assume it must exit
> > within a few seconds or it's a bug, and dump as much information about
> > it as is practical...?
> 
> Of course, it's not necessarily a bug. Someone could have just kicked
> the ethernet, and so your process is stuck waiting for a read/write.

Sounds like a bug to me.  Kernel resource leak due to network activity?

-J

* Re: is killing zombies possible w/o a reboot?
  2004-11-05  3:10                           ` Tim Connors
  2004-11-05  3:17                             ` Russell Miller
  2004-11-05  4:38                             ` Elladan
@ 2004-11-05  5:00                             ` Kyle Moffett
  2 siblings, 0 replies; 99+ messages in thread
From: Kyle Moffett @ 2004-11-05  5:00 UTC (permalink / raw)
  To: Tim Connors
  Cc: Denis Vlasenko, DervishD, Russell Miller, Elladan, linux-kernel,
	Jim Nelson, Måns Rullgård, Gene Heskett, Doug McNaught

On Nov 04, 2004, at 22:10, Tim Connors wrote:
> Elladan <elladan@eskimo.com> said on Thu, 4 Nov 2004 18:38:50 -0800:
>> If a process is in D state and receives a SIGKILL, assume it must exit
>> within a few seconds or it's a bug, and dump as much information about
>> it as is practical...?
>
> Of course, it's not necessarily a bug. Someone could have just kicked
> the ethernet, and so your process is stuck waiting for a read/write.

In any case, if a process is sleeping in-kernel, I expect that either
it's an interruptible sleep or a guaranteed-short sleep.  If it's
neither, it's a bug.  If I kick out an ethernet and it makes "ping" hang
in "D", that's bad.  I think that eventually _all_ kernel sleeps on the
behalf of user-space processes will become interruptible.

Cheers,
Kyle Moffett

-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GCM/CS/IT/U d- s++: a17 C++++>$ UB/L/X/*++++(+)>$ P+++(++++)>$
L++++(+++) E W++(+) N+++(++) o? K? w--- O? M++ V? PS+() PE+(-) Y+
PGP+++ t+(+++) 5 X R? tv-(--) b++++(++) DI+ D+ G e->++++$ h!*()>++$ r  
!y?(-)
------END GEEK CODE BLOCK------



* Re: is killing zombies possible w/o a reboot?
  2004-11-04 21:11             ` DervishD
@ 2004-11-09 23:31               ` Bill Davidsen
  2004-11-10  9:11                 ` DervishD
  0 siblings, 1 reply; 99+ messages in thread
From: Bill Davidsen @ 2004-11-09 23:31 UTC (permalink / raw)
  To: DervishD; +Cc: Måns Rullgård, linux-kernel

DervishD wrote:
>     Hi Bill :)
> 
>  * Bill Davidsen <davidsen@tmr.com> dixit:
> 
>>>   If you are talking about others' children, then your call to
>>>waitpid() (or wait()) failed with ECHILD: not your child.
>>
>>That's what happened when I tried it a few months ago. I suppose one 
>>could try sending a SIGCHLD to the parent and see if it does something 
>>helpful.
> 
> 
>     Probably it won't do. If the zombies are there due to a signal
> delivery problem, sending a SIGCHLD to the parent will (probably)
> solve the problem. But the common case is that the parent is screwed
> up or simply so badly programmed that the only way of getting rid of
> the zombies is to kill the parent...

Wait a minute, in another message you just suggested that a SIGCHLD to 
init would cause the status to be reaped.
> 
>     Anyway I suppose that sending the SIGCHLD won't do any harm so it
> may be worth trying.

It won't hurt init, but some processes do use the SIGCHLD to trigger a 
wait(), which might hang the parent.

-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me

* Re: is killing zombies possible w/o a reboot?
  2004-11-09 23:31               ` Bill Davidsen
@ 2004-11-10  9:11                 ` DervishD
  0 siblings, 0 replies; 99+ messages in thread
From: DervishD @ 2004-11-10  9:11 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Måns Rullgård, linux-kernel

    Hi Bill :)

 * Bill Davidsen <davidsen@tmr.com> dixit:
> >    Probably it won't do. If the zombies are there due to a signal
> >delivery problem, sending a SIGCHLD to the parent will (probably)
> >solve the problem. But the common case is that the parent is screwed
> >up or simply so badly programmed that the only way of getting rid of
> >the zombies is to kill the parent...
> Wait a minute, in another message you just suggested that a SIGCHLD to 
> init would cause the status to be reaped.

    I don't consider init the parent of such processes. It just
'adopts' them when the real parent doesn't care for them. I was
talking, in the paragraph above, about the *real* parent. I don't see
any contradiction, although sending SIGCHLD to a program that has not
waited for its children is risky: if the programmer was so clueless
that children were not waited for in the first place, chances are
that SIGCHLD handling is damaged, too.

> >    Anyway I suppose that sending the SIGCHLD won't do any harm so it
> >may be worth trying.
> It won't hurt init, but some processes do use the SIGCHLD to trigger a 
> wait(), which might hang the parent.

    If a parent does 'wait()' instead of 'waitpid', that's lazy
programming. The signal won't hurt anyway: if the parent blocks (bug
in the program), then a 'kill -9' is the correct medication (it's
what I use for buggy programs), the children are reparented to init
and correctly handled (because a good init should, IMHO, use waitpid
instead of wait). Let's say that sending SIGCHLD is 'mostly harmless'
;))
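
    To make the wait()/waitpid() point concrete: a blocking wait()
called while a child is still running sits there until some child
exits, which is how a stray SIGCHLD can hang a parent. A waitpid()
with WNOHANG returns at once instead. A small sketch, with invented
names (poll_child, demo) and a pipe used only to keep the child alive
for the duration of the check:

```python
import os


def poll_child(pid):
    """Non-blocking check: None while the child runs, else its exit code.

    Simplified: assumes the child ends via exit, not via a signal.
    """
    done, status = os.waitpid(pid, os.WNOHANG)
    if done == 0:
        return None
    return os.WEXITSTATUS(status)


def demo():
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(write_end)
        os.read(read_end, 1)   # child blocks here until the parent closes the pipe
        os._exit(0)
    os.close(read_end)

    while_running = poll_child(pid)   # returns at once; os.wait() would block here
    os.close(write_end)               # EOF lets the child finish
    _, status = os.waitpid(pid, 0)    # now a blocking wait is safe
    return while_running, os.WEXITSTATUS(status)
```

    demo() gets None from the poll while the child is alive and exit
code 0 once it has been allowed to finish.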

    Raúl Núñez de Arenas Coronado

-- 
Linux Registered User 88736
http://www.dervishd.net & http://www.pleyades.net/

Thread overview: 99+ messages
2004-11-03 12:51 is killing zombies possible w/o a reboot? Gene Heskett
2004-11-03 14:33 ` bert hubert
2004-11-03 14:49   ` Måns Rullgård
2004-11-03 15:25     ` DervishD
2004-11-03 15:25       ` Måns Rullgård
2004-11-03 17:49         ` DervishD
2004-11-03 16:47       ` Gene Heskett
2004-11-03 17:44         ` DervishD
2004-11-03 18:53           ` Gene Heskett
2004-11-03 19:01             ` Doug McNaught
2004-11-03 19:03             ` Måns Rullgård
2004-11-03 19:24               ` Gene Heskett
2004-11-03 19:33                 ` Doug McNaught
2004-11-03 19:34                 ` Måns Rullgård
2004-11-03 19:06             ` Valdis.Kletnieks
2004-11-03 19:26               ` Gene Heskett
2004-11-03 19:33                 ` Valdis.Kletnieks
2004-11-03 20:09                   ` Gene Heskett
2004-11-04 19:24                     ` Bill Davidsen
2004-11-03 19:42                 ` DervishD
2004-11-03 23:12                   ` Bill Davidsen
2004-11-04 10:26                     ` DervishD
2004-11-04 14:23                       ` Paul Slootman
2004-11-04 14:56                         ` Gene Heskett
2004-11-04 18:24                         ` DervishD
2004-11-04 19:22                       ` Bill Davidsen
2004-11-04 20:53                         ` DervishD
2004-11-03 19:26             ` DervishD
2004-11-03 20:18               ` Gene Heskett
2004-11-03 22:15               ` Jim Nelson
2004-11-03 22:44                 ` Russell Miller
2004-11-03 23:03                   ` Doug McNaught
2004-11-03 23:33                     ` Russell Miller
2004-11-03 23:47                       ` Mathieu Segaud
2004-11-03 23:56                         ` Russell Miller
2004-11-04  0:05                           ` Mathieu Segaud
2004-11-04  6:39                       ` Denis Vlasenko
2004-11-05  2:38                         ` Elladan
2004-11-05  3:10                           ` Tim Connors
2004-11-05  3:17                             ` Russell Miller
2004-11-05  4:38                             ` Elladan
2004-11-05  5:00                             ` Kyle Moffett
2004-11-04 20:06                       ` Bill Davidsen
2004-11-03 23:06                   ` vlobanov
2004-11-04 10:04                   ` Helge Hafting
2004-11-04 17:16                     ` Alex Bennee
2004-11-04 16:30                 ` Pedro Venda (SYSADM)
2004-11-04 22:28                   ` Helge Hafting
2004-11-03 23:07               ` Bill Davidsen
2004-11-04  1:19                 ` Michael Clark
2004-11-04 16:01         ` kernel
2004-11-04 16:18           ` Gene Heskett
2004-11-04 16:47             ` kernel
2004-11-04 17:58               ` Gene Heskett
2004-11-03 22:58       ` Bill Davidsen
2004-11-04 10:23         ` DervishD
2004-11-04 19:32           ` Bill Davidsen
2004-11-04 21:11             ` DervishD
2004-11-09 23:31               ` Bill Davidsen
2004-11-10  9:11                 ` DervishD
2004-11-03 23:18       ` Adam Heath
2004-11-03 16:38     ` Gene Heskett
2004-11-03 16:24   ` Gene Heskett
2004-11-03 16:46     ` linux-os
2004-11-03 19:12       ` Gene Heskett
2004-11-03 19:56       ` Måns Rullgård
2004-11-03 20:13     ` Helge Hafting
2004-11-03 20:40       ` Gene Heskett
2004-11-04  0:43         ` Kurt Wall
2004-11-04  1:01           ` Russell Miller
2004-11-04  1:38             ` Doug McNaught
2004-11-04  1:45               ` Russell Miller
2004-11-04  1:56                 ` Doug McNaught
2004-11-04  1:59                 ` Mitchell Blank Jr
2004-11-04 20:10                   ` Bill Davidsen
2004-11-04 10:07         ` Matthias Andree
2004-11-04 22:31           ` Peter Chubb
2004-11-04 23:33           ` Benno
2004-11-03 20:48 ` Tom Felker
2004-11-03 21:08   ` Gene Heskett
2004-11-04  7:19     ` Jan Knutar
2004-11-04 11:57       ` Gene Heskett
2004-11-04 12:12         ` Jan Knutar
2004-11-04 12:18           ` Gene Heskett
2004-11-04 12:29             ` Jan Knutar
2004-11-04 13:56               ` Gene Heskett
2004-11-04 12:39           ` Gene Heskett
2004-11-04 13:01             ` Ian Campbell
2004-11-04 14:07               ` Gene Heskett
2004-11-04 14:24                 ` Ian Campbell
2004-11-04 15:10                   ` Gene Heskett
2004-11-04 14:26                 ` DervishD
2004-11-04 15:13                   ` Gene Heskett
2004-11-04 13:10             ` Doug McNaught
2004-11-04 14:11               ` Gene Heskett
2004-11-04 14:42                 ` tlaurent
2004-11-04 15:14                   ` Gene Heskett
2004-11-04 20:18             ` Bill Davidsen
2004-11-05  0:29   ` Gene Heskett