* [bug] child processes stall forever and don't get killed
       [not found] <1139550397.1201862.1473415639192.JavaMail.zimbra@redhat.com>
@ 2016-09-09 10:30 ` Jan Stancek
  2016-09-09 13:32   ` Dave Jones
  0 siblings, 1 reply; 5+ messages in thread
From: Jan Stancek @ 2016-09-09 10:30 UTC
  To: trinity; +Cc: jstancek, davej

Hi,

I'm running v1.6-643-gecea2b06d5f3 on RHEL7.3 and I'm seeing an issue
where all child processes stall and none of them gets killed.
They are usually stuck in syscalls like read, recv, nanosleep, etc.

I suspect this commit introduced the problem, because any syscall
that has started but not completed is now considered to be "making progress":

  commit ecf6dfd83d4c886d78d4605163cb8c3f1728db62
  Author: Dave Jones <davej@codemonkey.org.uk>
  Date:   Fri Aug 12 15:05:01 2016 -0400

    if we haven't done a syscall yet, treat child as "making progress".
    
    Chances are that we haven't been scheduled because some other
    children are hogging the cpu.

I'm seeing pretty much the opposite of what the commit above says. Most
CPUs are idle, because N-1 children are stuck in recv/read/...
and the last child manages to keep going. Then, by chance, it also hits
a syscall that doesn't complete, and the system stays idle
(I gave up waiting after about an hour).

Regards,
Jan


* Re: [bug] child processes stall forever and don't get killed
  2016-09-09 10:30 ` [bug] child processes stall forever and don't get killed Jan Stancek
@ 2016-09-09 13:32   ` Dave Jones
  2016-09-09 14:16     ` Jan Stancek
  0 siblings, 1 reply; 5+ messages in thread
From: Dave Jones @ 2016-09-09 13:32 UTC
  To: Jan Stancek; +Cc: trinity

On Fri, Sep 09, 2016 at 06:30:16AM -0400, Jan Stancek wrote:
 > Hi,
 > 
 > I'm running v1.6-643-gecea2b06d5f3 on RHEL7.3 and I'm seeing an issue
 > where all child processes stall and none of them is getting killed.
 > They are usually in a syscalls like read, recv, nanosleep, etc.
 > 
 > I suspect this commit introduced the problem, because any syscall
 > that started but not completed is now considered to "make progress":
 > 
 >   commit ecf6dfd83d4c886d78d4605163cb8c3f1728db62
 >   Author: Dave Jones <davej@codemonkey.org.uk>
 >   Date:   Fri Aug 12 15:05:01 2016 -0400
 > 
 >     if we haven't done a syscall yet, treat child as "making progress".
 >     
 >     Chances are that we haven't been scheduled because some other
 >     children are hogging the cpu.
 > 
 > I'm seeing more the opposite of what commit above says. Most CPUs
 > are idle, because N-1 children are stuck in recv/read/...
 > and last child manages to keep going. Then by a chance it also hits
 > a syscall that doesn't complete and system stays idle
 > (after ~hour I gave up waiting).

Need to think some more on this, but as a quick guess...
try replacing the <= BEFORE with < BEFORE
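
To illustrate why that one-character change matters, here is a minimal sketch. It assumes the watchdog compares each child's state against an ordered enum and exempts "early" states from being killed. The enum, the timeout and the function name are hypothetical and are not taken from the trinity source.

```c
#include <stdbool.h>

/* Hypothetical per-child states, ordered by progress through one syscall. */
enum child_state {
    STATE_UNKNOWN,  /* child just spawned, nothing attempted yet */
    STATE_PREP,     /* preparing syscall arguments */
    STATE_BEFORE,   /* syscall entered, has not returned yet */
    STATE_AFTER     /* syscall returned */
};

/* Hypothetical stall timeout, in seconds. */
#define STALL_LIMIT_SECS 30u

/*
 * Decide whether the watchdog should kill a child.
 * strict == false models the "<= BEFORE" check: a child blocked inside
 * a syscall is still treated as making progress and is never killed.
 * strict == true models the "< BEFORE" check: only children that have
 * not entered a syscall yet are exempt.
 */
static bool should_kill(enum child_state s, unsigned int stalled_secs, bool strict)
{
    bool exempt = strict ? (s < STATE_BEFORE) : (s <= STATE_BEFORE);

    if (exempt)
        return false;
    return stalled_secs >= STALL_LIMIT_SECS;
}
```

Under these assumptions, a child that has sat in read() for an hour is in STATE_BEFORE: should_kill(STATE_BEFORE, 3600, false) returns false, so it is never reaped, while should_kill(STATE_BEFORE, 3600, true) returns true.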

I'll try to find some time to look into this soon. I'm surprised I
haven't also seen it happen, though. How many CPUs and how many child
processes?

	Dave


* Re: [bug] child processes stall forever and don't get killed
  2016-09-09 13:32   ` Dave Jones
@ 2016-09-09 14:16     ` Jan Stancek
  2016-09-10  1:46       ` Dave Jones
  0 siblings, 1 reply; 5+ messages in thread
From: Jan Stancek @ 2016-09-09 14:16 UTC
  To: Dave Jones; +Cc: trinity



----- Original Message -----
> From: "Dave Jones" <davej@codemonkey.org.uk>
> To: "Jan Stancek" <jstancek@redhat.com>
> Cc: trinity@vger.kernel.org
> Sent: Friday, 9 September, 2016 3:32:36 PM
> Subject: Re: [bug] child processes stall forever and don't get killed
> 
> On Fri, Sep 09, 2016 at 06:30:16AM -0400, Jan Stancek wrote:
>  > Hi,
>  > 
>  > I'm running v1.6-643-gecea2b06d5f3 on RHEL7.3 and I'm seeing an issue
>  > where all child processes stall and none of them is getting killed.
>  > They are usually in a syscalls like read, recv, nanosleep, etc.
>  > 
>  > I suspect this commit introduced the problem, because any syscall
>  > that started but not completed is now considered to "make progress":
>  > 
>  >   commit ecf6dfd83d4c886d78d4605163cb8c3f1728db62
>  >   Author: Dave Jones <davej@codemonkey.org.uk>
>  >   Date:   Fri Aug 12 15:05:01 2016 -0400
>  > 
>  >     if we haven't done a syscall yet, treat child as "making progress".
>  >     
>  >     Chances are that we haven't been scheduled because some other
>  >     children are hogging the cpu.
>  > 
>  > I'm seeing more the opposite of what commit above says. Most CPUs
>  > are idle, because N-1 children are stuck in recv/read/...
>  > and last child manages to keep going. Then by a chance it also hits
>  > a syscall that doesn't complete and system stays idle
>  > (after ~hour I gave up waiting).
> 
> Need to think some more on this, but as a quick guess...
> try replacing the <= BEFORE with < BEFORE

I've started a new test with the patch above reverted, and that looks
good so far: no stalls after 1 hour. Previously it stalled after ~20-30
minutes, which I noticed when the syscall stat messages (those which
show the number of iterations) stopped appearing.

> 
> I'll try and find some time to look into this soon. I'm surprised I
> haven't also seen it happen though.  How many CPUs & how many child
> processes ?

Anywhere from 2-8 CPUs and 8-32 children, on x86_64, ppc64le and s390x
systems (RHEL7.3 Beta). It usually happened within 20-30 minutes.

Regards,
Jan

> 
> 	Dave
> 
> 


* Re: [bug] child processes stall forever and don't get killed
  2016-09-09 14:16     ` Jan Stancek
@ 2016-09-10  1:46       ` Dave Jones
  2016-09-13 12:00         ` Jan Stancek
  0 siblings, 1 reply; 5+ messages in thread
From: Dave Jones @ 2016-09-10  1:46 UTC
  To: Jan Stancek; +Cc: trinity

On Fri, Sep 09, 2016 at 10:16:17AM -0400, Jan Stancek wrote:
 
 > >  > I'm seeing more the opposite of what commit above says. Most CPUs
 > >  > are idle, because N-1 children are stuck in recv/read/...
 > >  > and last child manages to keep going. Then by a chance it also hits
 > >  > a syscall that doesn't complete and system stays idle
 > >  > (after ~hour I gave up waiting).
 > > 
 > > Need to think some more on this, but as a quick guess...
 > > try replacing the <= BEFORE with < BEFORE
 > 
 > I've started new test with patch above reverted and that looks good
 > so far. No stalls after 1 hour. Previously it stalled after ~20-30
 > minutes. I noticed that when syscall stat messages (those which show
 > number of iteration) stopped appearing.

Ok, I committed that, but with a minor change to slightly widen how long
we spend in the BEFORE state. I doubt that part will have a negative
effect, but holler if it does.

 > > I'll try and find some time to look into this soon. I'm surprised I
 > > haven't also seen it happen though.  How many CPUs & how many child
 > > processes ?
 > 
 > Anywhere from 2-8 CPUs, 8-32 children on x86_64, ppc64le and s390x
 > systems (RHEL7.3 Beta). It happened usually within 20-30 minutes.

Weird. I'm doing 24/7 runs on one quad core and didn't hit it.
But I wonder if I was just fortunate enough that I had some children
always making progress even if N-1 were stuck.

	Dave


* Re: [bug] child processes stall forever and don't get killed
  2016-09-10  1:46       ` Dave Jones
@ 2016-09-13 12:00         ` Jan Stancek
  0 siblings, 0 replies; 5+ messages in thread
From: Jan Stancek @ 2016-09-13 12:00 UTC
  To: Dave Jones; +Cc: trinity



----- Original Message -----
> From: "Dave Jones" <davej@codemonkey.org.uk>
> To: "Jan Stancek" <jstancek@redhat.com>
> Cc: trinity@vger.kernel.org
> Sent: Saturday, 10 September, 2016 3:46:30 AM
> Subject: Re: [bug] child processes stall forever and don't get killed
> 
> On Fri, Sep 09, 2016 at 10:16:17AM -0400, Jan Stancek wrote:
>  
>  > >  > I'm seeing more the opposite of what commit above says. Most CPUs
>  > >  > are idle, because N-1 children are stuck in recv/read/...
>  > >  > and last child manages to keep going. Then by a chance it also hits
>  > >  > a syscall that doesn't complete and system stays idle
>  > >  > (after ~hour I gave up waiting).
>  > > 
>  > > Need to think some more on this, but as a quick guess...
>  > > try replacing the <= BEFORE with < BEFORE
>  > 
>  > I've started new test with patch above reverted and that looks good
>  > so far. No stalls after 1 hour. Previously it stalled after ~20-30
>  > minutes. I noticed that when syscall stat messages (those which show
>  > number of iteration) stopped appearing.
> 
> Ok, I committed that, but with a minor change to widen how long we spend
> in BEFORE state slightly. I doubt that part will have a negative effect,
> but holler if it does..

I applied this patch and haven't seen any stalls in an overnight test.

Thanks,
Jan

> 
>  > > I'll try and find some time to look into this soon. I'm surprised I
>  > > haven't also seen it happen though.  How many CPUs & how many child
>  > > processes ?
>  > 
>  > Anywhere from 2-8 CPUs, 8-32 children on x86_64, ppc64le and s390x
>  > systems (RHEL7.3 Beta). It happened usually within 20-30 minutes.
> 
> Weird. I'm doing 24/7 runs on one quad core and didn't hit it.
> But I wonder if I was just fortunate enough that I had some children
> always making progress even if N-1 were stuck.
> 
> 	Dave
> 
> 
