* Threads stuck in zap_pid_ns_processes()
@ 2017-05-11 17:11 Guenter Roeck
  2017-05-11 17:31 ` Eric W. Biederman
  0 siblings, 1 reply; 28+ messages in thread
From: Guenter Roeck @ 2017-05-11 17:11 UTC (permalink / raw)
  To: Eric W. Biederman, Ingo Molnar, linux-kernel; +Cc: Vovo Yang

Hi all,

the test program attached below almost always results in one of the child
processes being stuck in zap_pid_ns_processes(). When this happens, I can
see from test logs that nr_hashed == 2 and init_pids==1, but there is only
a single thread left in the pid namespace (the one that is stuck).
Traceback from /proc/<pid>/stack is

[<ffffffff811c385e>] zap_pid_ns_processes+0x1ee/0x2a0
[<ffffffff810c1ba4>] do_exit+0x10d4/0x1330
[<ffffffff810c1ee6>] do_group_exit+0x86/0x130
[<ffffffff810d4347>] get_signal+0x367/0x8a0
[<ffffffff81046e73>] do_signal+0x83/0xb90
[<ffffffff81004475>] exit_to_usermode_loop+0x75/0xc0
[<ffffffff810055b6>] syscall_return_slowpath+0xc6/0xd0
[<ffffffff81ced488>] entry_SYSCALL_64_fastpath+0xab/0xad
[<ffffffffffffffff>] 0xffffffffffffffff

After 120 seconds, I get the "hung task" message.

Example from v4.11:

...
[ 3263.379545] INFO: task clone:27910 blocked for more than 120 seconds.
[ 3263.379561]       Not tainted 4.11.0+ #1
[ 3263.379569] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3263.379577] clone           D    0 27910  27909 0x00000000
[ 3263.379587] Call Trace:
[ 3263.379608]  __schedule+0x677/0xda0
[ 3263.379621]  ? pci_mmcfg_check_reserved+0xc0/0xc0
[ 3263.379634]  ? task_stopped_code+0x70/0x70
[ 3263.379643]  schedule+0x4d/0xd0
[ 3263.379653]  zap_pid_ns_processes+0x1ee/0x2a0
[ 3263.379659]  ? copy_pid_ns+0x4d0/0x4d0
[ 3263.379670]  do_exit+0x10d4/0x1330
...

The problem is seen in all kernels up to v4.11.

Any idea what might be going on and how to fix the problem?

Thanks,
Guenter

---
This test program was kindly provided by Vovo Yang <vovoy@google.com>.

Note that the ptrace() call in child1() is not necessary for the problem
to be seen, though it seems to make it a bit more likely.
---

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <errno.h>
#include <string.h>
#include <sched.h>

#define STACK_SIZE 65536

int child1(void* arg);
int child2(void* arg);

int main(int argc, char **argv)
{
  int child_pid;
  char* child_stack = malloc(STACK_SIZE);
  char* stack_top = child_stack + STACK_SIZE;
  char command[256];

  child_pid = clone(&child1, stack_top, CLONE_NEWPID, NULL);
  if (child_pid == -1) {
    printf("parent: clone failed: %s\n", strerror(errno));
    return EXIT_FAILURE;
  }
  printf("parent: child1_pid: %d\n", child_pid);

  sleep(2);
  printf("child state, if it's D (disk sleep), the child process is hung\n");
  sprintf(command, "cat /proc/%d/status | grep State:", child_pid);
  system(command);
  sleep(3600);
  return EXIT_SUCCESS;
}

int child1(void* arg)
{
  int flags = CLONE_FILES | CLONE_FS | CLONE_VM | CLONE_SIGHAND | CLONE_THREAD;
  char* child_stack = malloc(STACK_SIZE);
  char* stack_top = child_stack + STACK_SIZE;
  long ret;

  ret = ptrace(PTRACE_TRACEME, 0, NULL, NULL);
  if (ret == -1) {
    printf("child1: ptrace failed: %s\n", strerror(errno));
    return EXIT_FAILURE;
  }

  ret = clone(&child2, stack_top, flags, NULL);
  if (ret == -1) {
    printf("child1: clone failed: %s\n", strerror(errno));
    return EXIT_FAILURE;
  }
  printf("child1: child2 pid: %ld\n", ret);

  sleep(1);
  printf("child1: end\n");
  return EXIT_SUCCESS;
}

int child2(void* arg)
{
  long ret = ptrace(PTRACE_TRACEME, 0, NULL, NULL);
  if (ret == -1) {
    printf("child2: ptrace failed: %s\n", strerror(errno));
    return EXIT_FAILURE;
  }

  printf("child2: end\n");
  return EXIT_SUCCESS;
}


* Re: Threads stuck in zap_pid_ns_processes()
  2017-05-11 17:11 Threads stuck in zap_pid_ns_processes() Guenter Roeck
@ 2017-05-11 17:31 ` Eric W. Biederman
  2017-05-11 18:35   ` Guenter Roeck
  2017-05-11 20:21   ` Guenter Roeck
  0 siblings, 2 replies; 28+ messages in thread
From: Eric W. Biederman @ 2017-05-11 17:31 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: Ingo Molnar, linux-kernel, Vovo Yang

Guenter Roeck <linux@roeck-us.net> writes:

> Hi all,
>
> the test program attached below almost always results in one of the child
> processes being stuck in zap_pid_ns_processes(). When this happens, I can
> see from test logs that nr_hashed == 2 and init_pids==1, but there is only
> a single thread left in the pid namespace (the one that is stuck).
> Traceback from /proc/<pid>/stack is
>
> [<ffffffff811c385e>] zap_pid_ns_processes+0x1ee/0x2a0
> [<ffffffff810c1ba4>] do_exit+0x10d4/0x1330
> [<ffffffff810c1ee6>] do_group_exit+0x86/0x130
> [<ffffffff810d4347>] get_signal+0x367/0x8a0
> [<ffffffff81046e73>] do_signal+0x83/0xb90
> [<ffffffff81004475>] exit_to_usermode_loop+0x75/0xc0
> [<ffffffff810055b6>] syscall_return_slowpath+0xc6/0xd0
> [<ffffffff81ced488>] entry_SYSCALL_64_fastpath+0xab/0xad
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> After 120 seconds, I get the "hung task" message.
>
> Example from v4.11:
>
> ...
> [ 3263.379545] INFO: task clone:27910 blocked for more than 120 seconds.
> [ 3263.379561]       Not tainted 4.11.0+ #1
> [ 3263.379569] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 3263.379577] clone           D    0 27910  27909 0x00000000
> [ 3263.379587] Call Trace:
> [ 3263.379608]  __schedule+0x677/0xda0
> [ 3263.379621]  ? pci_mmcfg_check_reserved+0xc0/0xc0
> [ 3263.379634]  ? task_stopped_code+0x70/0x70
> [ 3263.379643]  schedule+0x4d/0xd0
> [ 3263.379653]  zap_pid_ns_processes+0x1ee/0x2a0
> [ 3263.379659]  ? copy_pid_ns+0x4d0/0x4d0
> [ 3263.379670]  do_exit+0x10d4/0x1330
> ...
>
> The problem is seen in all kernels up to v4.11.
>
> Any idea what might be going on and how to fix the problem?

Let me see.  Reading the code it looks like we have three tasks;
let's call them main, child1, and child2.

child1 and child2 are started using CLONE_THREAD and are
thus clones of one another.

child2 exits first but is ptraced by main so is not reaped.
       Further child2 calls do_group_exit forcing child1 to
       exit making for fun races.

       A pthread_exit() or syscall(SYS_exit, 0) would skip
       the group exit and make the window larger.

child1 exits next and calls zap_pid_ns_processes and is
       waiting for child2 to be reaped by main.

main is just sitting around doing nothing for 3600 seconds
not reaping anyone.

I would expect that when main exits everything would be cleaned up
and the only real issue is that we have a hung task warning.

Does everything cleanup when main exits?

Eric


>
> Thanks,
> Guenter
>
> ---
> This test program was kindly provided by Vovo Yang <vovoy@google.com>.
>
> Note that the ptrace() call in child1() is not necessary for the problem
> to be seen, though it seems to make it a bit more likely.

That would appear to just slow things down a smidge, as nothing
substantial happens ptrace-wise until after zap_pid_ns_processes().


> ---
>
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/ptrace.h>
> #include <errno.h>
> #include <string.h>
> #include <sched.h>
>
> #define STACK_SIZE 65536
>
> int child1(void* arg);
> int child2(void* arg);
>
> int main(int argc, char **argv)
> {
>   int child_pid;
>   char* child_stack = malloc(STACK_SIZE);
>   char* stack_top = child_stack + STACK_SIZE;
>   char command[256];
>
>   child_pid = clone(&child1, stack_top, CLONE_NEWPID, NULL);
>   if (child_pid == -1) {
>     printf("parent: clone failed: %s\n", strerror(errno));
>     return EXIT_FAILURE;
>   }
>   printf("parent: child1_pid: %d\n", child_pid);
>
>   sleep(2);
>   printf("child state, if it's D (disk sleep), the child process is hung\n");
>   sprintf(command, "cat /proc/%d/status | grep State:", child_pid);
>   system(command);
>   sleep(3600);
>   return EXIT_SUCCESS;
> }
>
> int child1(void* arg)
> {
>   int flags = CLONE_FILES | CLONE_FS | CLONE_VM | CLONE_SIGHAND | CLONE_THREAD;
>   char* child_stack = malloc(STACK_SIZE);
>   char* stack_top = child_stack + STACK_SIZE;
>   long ret;
>
>   ret = ptrace(PTRACE_TRACEME, 0, NULL, NULL);
>   if (ret == -1) {
>     printf("child1: ptrace failed: %s\n", strerror(errno));
>     return EXIT_FAILURE;
>   }
>
>   ret = clone(&child2, stack_top, flags, NULL);
>   if (ret == -1) {
>     printf("child1: clone failed: %s\n", strerror(errno));
>     return EXIT_FAILURE;
>   }
>   printf("child1: child2 pid: %ld\n", ret);
>
>   sleep(1);
>   printf("child1: end\n");
>   return EXIT_SUCCESS;
> }
>
> int child2(void* arg)
> {
>   long ret = ptrace(PTRACE_TRACEME, 0, NULL, NULL);
>   if (ret == -1) {
>     printf("child2: ptrace failed: %s\n", strerror(errno));
>     return EXIT_FAILURE;
>   }
>
>   printf("child2: end\n");
>   return EXIT_SUCCESS;
> }


* Re: Threads stuck in zap_pid_ns_processes()
  2017-05-11 17:31 ` Eric W. Biederman
@ 2017-05-11 18:35   ` Guenter Roeck
  2017-05-11 20:23     ` Eric W. Biederman
  2017-05-11 20:21   ` Guenter Roeck
  1 sibling, 1 reply; 28+ messages in thread
From: Guenter Roeck @ 2017-05-11 18:35 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Ingo Molnar, linux-kernel, Vovo Yang

On Thu, May 11, 2017 at 12:31:21PM -0500, Eric W. Biederman wrote:
> Guenter Roeck <linux@roeck-us.net> writes:
> 
> > Hi all,
> >
> > the test program attached below almost always results in one of the child
> > processes being stuck in zap_pid_ns_processes(). When this happens, I can
> > see from test logs that nr_hashed == 2 and init_pids==1, but there is only
> > a single thread left in the pid namespace (the one that is stuck).
> > Traceback from /proc/<pid>/stack is
> >
> > [<ffffffff811c385e>] zap_pid_ns_processes+0x1ee/0x2a0
> > [<ffffffff810c1ba4>] do_exit+0x10d4/0x1330
> > [<ffffffff810c1ee6>] do_group_exit+0x86/0x130
> > [<ffffffff810d4347>] get_signal+0x367/0x8a0
> > [<ffffffff81046e73>] do_signal+0x83/0xb90
> > [<ffffffff81004475>] exit_to_usermode_loop+0x75/0xc0
> > [<ffffffff810055b6>] syscall_return_slowpath+0xc6/0xd0
> > [<ffffffff81ced488>] entry_SYSCALL_64_fastpath+0xab/0xad
> > [<ffffffffffffffff>] 0xffffffffffffffff
> >
> > After 120 seconds, I get the "hung task" message.
> >
> > Example from v4.11:
> >
> > ...
> > [ 3263.379545] INFO: task clone:27910 blocked for more than 120 seconds.
> > [ 3263.379561]       Not tainted 4.11.0+ #1
> > [ 3263.379569] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [ 3263.379577] clone           D    0 27910  27909 0x00000000
> > [ 3263.379587] Call Trace:
> > [ 3263.379608]  __schedule+0x677/0xda0
> > [ 3263.379621]  ? pci_mmcfg_check_reserved+0xc0/0xc0
> > [ 3263.379634]  ? task_stopped_code+0x70/0x70
> > [ 3263.379643]  schedule+0x4d/0xd0
> > [ 3263.379653]  zap_pid_ns_processes+0x1ee/0x2a0
> > [ 3263.379659]  ? copy_pid_ns+0x4d0/0x4d0
> > [ 3263.379670]  do_exit+0x10d4/0x1330
> > ...
> >
> > The problem is seen in all kernels up to v4.11.
> >
> > Any idea what might be going on and how to fix the problem?
> 
> Let me see.  Reading the code it looks like we have three tasks;
> let's call them main, child1, and child2.
> 
> child1 and child2 are started using CLONE_THREAD and are
> thus clones of one another.
> 
> child2 exits first but is ptraced by main so is not reaped.
>        Further child2 calls do_group_exit forcing child1 to
>        exit making for fun races.
> 
> >        A pthread_exit() or syscall(SYS_exit, 0) would skip
>        the group exit and make the window larger.
> 
> child1 exits next and calls zap_pid_ns_processes and is
>        waiting for child2 to be reaped by main.
> 
> main is just sitting around doing nothing for 3600 seconds
> not reaping anyone.
> 
> I would expect that when main exits everything would be cleaned up
> and the only real issue is that we have a hung task warning.
> 
> Does everything cleanup when main exits?
> 
Yes, it does.

Problem is that if the main task doesn't exit, it hangs forever.
Chrome OS (where we see the problem in the field, and the application
is chrome) is configured to reboot on hung tasks - if a task is hung
for 120 seconds on those systems, it tends to be in a bad shape.
This makes it a quite severe problem for us.

I browsed through a number of stack traces - the ones I looked at
all have the same traceback from do_group_exit(), though the
caller of do_group_exit() is not always the same (that may depend
on the kernel version, though). Sometimes it is __wake_up_parent(),
sometimes it is get_signal_to_deliver(), sometimes it is
get_signal().

Are there other conditions besides ptrace where a task isn't reaped?

Thanks,
Guenter

> Eric
> 
> 
> >
> > Thanks,
> > Guenter
> >
> > ---
> > This test program was kindly provided by Vovo Yang <vovoy@google.com>.
> >
> > Note that the ptrace() call in child1() is not necessary for the problem
> > to be seen, though it seems to make it a bit more likely.
> 
> That would appear to just slow things down a smidge, as nothing
> substantial happens ptrace-wise until after zap_pid_ns_processes().
> 
> 
> > ---
> >
> > #define _GNU_SOURCE
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <unistd.h>
> > #include <sys/ptrace.h>
> > #include <errno.h>
> > #include <string.h>
> > #include <sched.h>
> >
> > #define STACK_SIZE 65536
> >
> > int child1(void* arg);
> > int child2(void* arg);
> >
> > int main(int argc, char **argv)
> > {
> >   int child_pid;
> >   char* child_stack = malloc(STACK_SIZE);
> >   char* stack_top = child_stack + STACK_SIZE;
> >   char command[256];
> >
> >   child_pid = clone(&child1, stack_top, CLONE_NEWPID, NULL);
> >   if (child_pid == -1) {
> >     printf("parent: clone failed: %s\n", strerror(errno));
> >     return EXIT_FAILURE;
> >   }
> >   printf("parent: child1_pid: %d\n", child_pid);
> >
> >   sleep(2);
> >   printf("child state, if it's D (disk sleep), the child process is hung\n");
> >   sprintf(command, "cat /proc/%d/status | grep State:", child_pid);
> >   system(command);
> >   sleep(3600);
> >   return EXIT_SUCCESS;
> > }
> >
> > int child1(void* arg)
> > {
> >   int flags = CLONE_FILES | CLONE_FS | CLONE_VM | CLONE_SIGHAND | CLONE_THREAD;
> >   char* child_stack = malloc(STACK_SIZE);
> >   char* stack_top = child_stack + STACK_SIZE;
> >   long ret;
> >
> >   ret = ptrace(PTRACE_TRACEME, 0, NULL, NULL);
> >   if (ret == -1) {
> >     printf("child1: ptrace failed: %s\n", strerror(errno));
> >     return EXIT_FAILURE;
> >   }
> >
> >   ret = clone(&child2, stack_top, flags, NULL);
> >   if (ret == -1) {
> >     printf("child1: clone failed: %s\n", strerror(errno));
> >     return EXIT_FAILURE;
> >   }
> >   printf("child1: child2 pid: %ld\n", ret);
> >
> >   sleep(1);
> >   printf("child1: end\n");
> >   return EXIT_SUCCESS;
> > }
> >
> > int child2(void* arg)
> > {
> >   long ret = ptrace(PTRACE_TRACEME, 0, NULL, NULL);
> >   if (ret == -1) {
> >     printf("child2: ptrace failed: %s\n", strerror(errno));
> >     return EXIT_FAILURE;
> >   }
> >
> >   printf("child2: end\n");
> >   return EXIT_SUCCESS;
> > }


* Re: Threads stuck in zap_pid_ns_processes()
  2017-05-11 17:31 ` Eric W. Biederman
  2017-05-11 18:35   ` Guenter Roeck
@ 2017-05-11 20:21   ` Guenter Roeck
  2017-05-11 21:25     ` Eric W. Biederman
  1 sibling, 1 reply; 28+ messages in thread
From: Guenter Roeck @ 2017-05-11 20:21 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Ingo Molnar, linux-kernel, Vovo Yang

On Thu, May 11, 2017 at 12:31:21PM -0500, Eric W. Biederman wrote:
> Guenter Roeck <linux@roeck-us.net> writes:
> 
> > Hi all,
> >
> > the test program attached below almost always results in one of the child
> > processes being stuck in zap_pid_ns_processes(). When this happens, I can
> > see from test logs that nr_hashed == 2 and init_pids==1, but there is only
> > a single thread left in the pid namespace (the one that is stuck).
> > Traceback from /proc/<pid>/stack is
> >
> > [<ffffffff811c385e>] zap_pid_ns_processes+0x1ee/0x2a0
> > [<ffffffff810c1ba4>] do_exit+0x10d4/0x1330
> > [<ffffffff810c1ee6>] do_group_exit+0x86/0x130
> > [<ffffffff810d4347>] get_signal+0x367/0x8a0
> > [<ffffffff81046e73>] do_signal+0x83/0xb90
> > [<ffffffff81004475>] exit_to_usermode_loop+0x75/0xc0
> > [<ffffffff810055b6>] syscall_return_slowpath+0xc6/0xd0
> > [<ffffffff81ced488>] entry_SYSCALL_64_fastpath+0xab/0xad
> > [<ffffffffffffffff>] 0xffffffffffffffff
> >
> > After 120 seconds, I get the "hung task" message.
> >
> > Example from v4.11:
> >
> > ...
> > [ 3263.379545] INFO: task clone:27910 blocked for more than 120 seconds.
> > [ 3263.379561]       Not tainted 4.11.0+ #1
> > [ 3263.379569] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [ 3263.379577] clone           D    0 27910  27909 0x00000000
> > [ 3263.379587] Call Trace:
> > [ 3263.379608]  __schedule+0x677/0xda0
> > [ 3263.379621]  ? pci_mmcfg_check_reserved+0xc0/0xc0
> > [ 3263.379634]  ? task_stopped_code+0x70/0x70
> > [ 3263.379643]  schedule+0x4d/0xd0
> > [ 3263.379653]  zap_pid_ns_processes+0x1ee/0x2a0
> > [ 3263.379659]  ? copy_pid_ns+0x4d0/0x4d0
> > [ 3263.379670]  do_exit+0x10d4/0x1330
> > ...
> >
> > The problem is seen in all kernels up to v4.11.
> >
> > Any idea what might be going on and how to fix the problem?
> 
> Let me see.  Reading the code it looks like we have three tasks;
> let's call them main, child1, and child2.
> 
> child1 and child2 are started using CLONE_THREAD and are
> thus clones of one another.
> 
> child2 exits first but is ptraced by main so is not reaped.
>        Further child2 calls do_group_exit forcing child1 to
>        exit making for fun races.
> 
> >        A pthread_exit() or syscall(SYS_exit, 0) would skip
>        the group exit and make the window larger.
> 
> child1 exits next and calls zap_pid_ns_processes and is
>        waiting for child2 to be reaped by main.
> 
> main is just sitting around doing nothing for 3600 seconds
> not reaping anyone.
> 
> I would expect that when main exits everything would be cleaned up
> and the only real issue is that we have a hung task warning.
> 
> Does everything cleanup when main exits?
> 

As an add-on to my previous mail: I added a function to count
the number of threads in the pid namespace, using next_pidmap().
Even though nr_hashed == 2, only the hanging thread is still
present.

Is there maybe a better way to terminate the wait loop than
with "if (pid_ns->nr_hashed == init_pids)"?
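For context, the tail of zap_pid_ns_processes() around v4.11 looks roughly
like the sketch below (simplified and from memory, not the verbatim kernel
source, and not compilable on its own). The wait4() loop can only reap
children of the namespace's init task itself; the final loop then sleeps
uninterruptibly until nr_hashed drops to init_pids, which is exactly where
the hung-task watchdog fires while an unreaped traced child keeps
nr_hashed elevated:

```c
/* Sketch of the end of zap_pid_ns_processes(), kernel/pid_namespace.c
 * (approximate, v4.11-era): */
do {
	clear_thread_flag(TIF_SIGPENDING);
	rc = sys_wait4(-1, NULL, __WALL, NULL);
} while (rc != -ECHILD);

for (;;) {
	set_current_state(TASK_UNINTERRUPTIBLE);
	if (pid_ns->nr_hashed == init_pids)
		break;
	schedule();
}
__set_current_state(TASK_RUNNING);
```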

Thanks,
Guenter


* Re: Threads stuck in zap_pid_ns_processes()
  2017-05-11 18:35   ` Guenter Roeck
@ 2017-05-11 20:23     ` Eric W. Biederman
  2017-05-11 20:48       ` Guenter Roeck
  0 siblings, 1 reply; 28+ messages in thread
From: Eric W. Biederman @ 2017-05-11 20:23 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: Ingo Molnar, linux-kernel, Vovo Yang

Guenter Roeck <linux@roeck-us.net> writes:

> On Thu, May 11, 2017 at 12:31:21PM -0500, Eric W. Biederman wrote:
>> Guenter Roeck <linux@roeck-us.net> writes:
>> 
>> > Hi all,
>> >
>> > the test program attached below almost always results in one of the child
>> > processes being stuck in zap_pid_ns_processes(). When this happens, I can
>> > see from test logs that nr_hashed == 2 and init_pids==1, but there is only
>> > a single thread left in the pid namespace (the one that is stuck).
>> > Traceback from /proc/<pid>/stack is
>> >
>> > [<ffffffff811c385e>] zap_pid_ns_processes+0x1ee/0x2a0
>> > [<ffffffff810c1ba4>] do_exit+0x10d4/0x1330
>> > [<ffffffff810c1ee6>] do_group_exit+0x86/0x130
>> > [<ffffffff810d4347>] get_signal+0x367/0x8a0
>> > [<ffffffff81046e73>] do_signal+0x83/0xb90
>> > [<ffffffff81004475>] exit_to_usermode_loop+0x75/0xc0
>> > [<ffffffff810055b6>] syscall_return_slowpath+0xc6/0xd0
>> > [<ffffffff81ced488>] entry_SYSCALL_64_fastpath+0xab/0xad
>> > [<ffffffffffffffff>] 0xffffffffffffffff
>> >
>> > After 120 seconds, I get the "hung task" message.
>> >
>> > Example from v4.11:
>> >
>> > ...
>> > [ 3263.379545] INFO: task clone:27910 blocked for more than 120 seconds.
>> > [ 3263.379561]       Not tainted 4.11.0+ #1
>> > [ 3263.379569] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> > [ 3263.379577] clone           D    0 27910  27909 0x00000000
>> > [ 3263.379587] Call Trace:
>> > [ 3263.379608]  __schedule+0x677/0xda0
>> > [ 3263.379621]  ? pci_mmcfg_check_reserved+0xc0/0xc0
>> > [ 3263.379634]  ? task_stopped_code+0x70/0x70
>> > [ 3263.379643]  schedule+0x4d/0xd0
>> > [ 3263.379653]  zap_pid_ns_processes+0x1ee/0x2a0
>> > [ 3263.379659]  ? copy_pid_ns+0x4d0/0x4d0
>> > [ 3263.379670]  do_exit+0x10d4/0x1330
>> > ...
>> >
>> > The problem is seen in all kernels up to v4.11.
>> >
>> > Any idea what might be going on and how to fix the problem?
>> 
>> Let me see.  Reading the code it looks like we have three tasks;
>> let's call them main, child1, and child2.
>> 
>> child1 and child2 are started using CLONE_THREAD and are
>> thus clones of one another.
>> 
>> child2 exits first but is ptraced by main so is not reaped.
>>        Further child2 calls do_group_exit forcing child1 to
>>        exit making for fun races.
>> 
>>        A pthread_exit() or syscall(SYS_exit, 0) would skip
>>        the group exit and make the window larger.
>> 
>> child1 exits next and calls zap_pid_ns_processes and is
>>        waiting for child2 to be reaped by main.
>> 
>> main is just sitting around doing nothing for 3600 seconds
>> not reaping anyone.
>> 
>> I would expect that when main exits everything would be cleaned up
>> and the only real issue is that we have a hung task warning.
>> 
>> Does everything cleanup when main exits?
>> 
> Yes, it does.
>
> Problem is that if the main task doesn't exit, it hangs forever.
> Chrome OS (where we see the problem in the field, and the application
> is chrome) is configured to reboot on hung tasks - if a task is hung
> for 120 seconds on those systems, it tends to be in a bad shape.
> This makes it a quite severe problem for us.

We definitely need to change the code in such a way as to not
trigger the hung task warning.

> I browsed through a number of stack traces - the ones I looked at
> all have the same traceback from do_group_exit(), though the
> caller of do_group_exit() is not always the same (that may depend
> on the kernel version, though). Sometimes it is __wake_up_parent(),
> sometimes it is get_signal_to_deliver(), sometimes it is
> get_signal().
>
> Are there other conditions besides ptrace where a task isn't reaped?

The problem here is that the parent was outside of the pid namespace
and was choosing not to reap the child.  The main task could easily
have reaped the child when it received SIGCHLD.

Arranging for forked children to run in a pid namespace with setns
can also create processes with parents outside the pid namespace,
which may or may not be reaped.

I suspect it is a bug to allow PTRACE_TRACEME if either the parent
or the child has called exec.  That is not enough to block your test
case but I think it would block real world problems.

For getting to the root cause of this I would see if I could get
a full process list.  It would be interesting to know what
the unreaped zombie is.

In the meantime I am going to see what it takes to change the code
so it doesn't trigger the hung task warning.  I don't want to change
the actual observed behavior of the code but I can avoid triggering
a warning that does not apply.

Eric


>
> Thanks,
> Guenter
>
>> Eric
>> 
>> 
>> >
>> > Thanks,
>> > Guenter
>> >
>> > ---
>> > This test program was kindly provided by Vovo Yang <vovoy@google.com>.
>> >
>> > Note that the ptrace() call in child1() is not necessary for the problem
>> > to be seen, though it seems to make it a bit more likely.
>> 
>> That would appear to just slow things down a smidge, as nothing
>> substantial happens ptrace-wise until after zap_pid_ns_processes().
>> 
>> 
>> > ---
>> >
>> > #define _GNU_SOURCE
>> > #include <stdio.h>
>> > #include <stdlib.h>
>> > #include <unistd.h>
>> > #include <sys/ptrace.h>
>> > #include <errno.h>
>> > #include <string.h>
>> > #include <sched.h>
>> >
>> > #define STACK_SIZE 65536
>> >
>> > int child1(void* arg);
>> > int child2(void* arg);
>> >
>> > int main(int argc, char **argv)
>> > {
>> >   int child_pid;
>> >   char* child_stack = malloc(STACK_SIZE);
>> >   char* stack_top = child_stack + STACK_SIZE;
>> >   char command[256];
>> >
>> >   child_pid = clone(&child1, stack_top, CLONE_NEWPID, NULL);
>> >   if (child_pid == -1) {
>> >     printf("parent: clone failed: %s\n", strerror(errno));
>> >     return EXIT_FAILURE;
>> >   }
>> >   printf("parent: child1_pid: %d\n", child_pid);
>> >
>> >   sleep(2);
>> >   printf("child state, if it's D (disk sleep), the child process is hung\n");
>> >   sprintf(command, "cat /proc/%d/status | grep State:", child_pid);
>> >   system(command);
>> >   sleep(3600);
>> >   return EXIT_SUCCESS;
>> > }
>> >
>> > int child1(void* arg)
>> > {
>> >   int flags = CLONE_FILES | CLONE_FS | CLONE_VM | CLONE_SIGHAND | CLONE_THREAD;
>> >   char* child_stack = malloc(STACK_SIZE);
>> >   char* stack_top = child_stack + STACK_SIZE;
>> >   long ret;
>> >
>> >   ret = ptrace(PTRACE_TRACEME, 0, NULL, NULL);
>> >   if (ret == -1) {
>> >     printf("child1: ptrace failed: %s\n", strerror(errno));
>> >     return EXIT_FAILURE;
>> >   }
>> >
>> >   ret = clone(&child2, stack_top, flags, NULL);
>> >   if (ret == -1) {
>> >     printf("child1: clone failed: %s\n", strerror(errno));
>> >     return EXIT_FAILURE;
>> >   }
>> >   printf("child1: child2 pid: %ld\n", ret);
>> >
>> >   sleep(1);
>> >   printf("child1: end\n");
>> >   return EXIT_SUCCESS;
>> > }
>> >
>> > int child2(void* arg)
>> > {
>> >   long ret = ptrace(PTRACE_TRACEME, 0, NULL, NULL);
>> >   if (ret == -1) {
>> >     printf("child2: ptrace failed: %s\n", strerror(errno));
>> >     return EXIT_FAILURE;
>> >   }
>> >
>> >   printf("child2: end\n");
>> >   return EXIT_SUCCESS;
>> > }


* Re: Threads stuck in zap_pid_ns_processes()
  2017-05-11 20:23     ` Eric W. Biederman
@ 2017-05-11 20:48       ` Guenter Roeck
  2017-05-11 21:39         ` Eric W. Biederman
  0 siblings, 1 reply; 28+ messages in thread
From: Guenter Roeck @ 2017-05-11 20:48 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Ingo Molnar, linux-kernel, Vovo Yang

On Thu, May 11, 2017 at 03:23:17PM -0500, Eric W. Biederman wrote:
> Guenter Roeck <linux@roeck-us.net> writes:
> 
> > On Thu, May 11, 2017 at 12:31:21PM -0500, Eric W. Biederman wrote:
> >> Guenter Roeck <linux@roeck-us.net> writes:
> >> 
> >> > Hi all,
> >> >
> >> > the test program attached below almost always results in one of the child
> >> > processes being stuck in zap_pid_ns_processes(). When this happens, I can
> >> > see from test logs that nr_hashed == 2 and init_pids==1, but there is only
> >> > a single thread left in the pid namespace (the one that is stuck).
> >> > Traceback from /proc/<pid>/stack is
> >> >
> >> > [<ffffffff811c385e>] zap_pid_ns_processes+0x1ee/0x2a0
> >> > [<ffffffff810c1ba4>] do_exit+0x10d4/0x1330
> >> > [<ffffffff810c1ee6>] do_group_exit+0x86/0x130
> >> > [<ffffffff810d4347>] get_signal+0x367/0x8a0
> >> > [<ffffffff81046e73>] do_signal+0x83/0xb90
> >> > [<ffffffff81004475>] exit_to_usermode_loop+0x75/0xc0
> >> > [<ffffffff810055b6>] syscall_return_slowpath+0xc6/0xd0
> >> > [<ffffffff81ced488>] entry_SYSCALL_64_fastpath+0xab/0xad
> >> > [<ffffffffffffffff>] 0xffffffffffffffff
> >> >
> >> > After 120 seconds, I get the "hung task" message.
> >> >
> >> > Example from v4.11:
> >> >
> >> > ...
> >> > [ 3263.379545] INFO: task clone:27910 blocked for more than 120 seconds.
> >> > [ 3263.379561]       Not tainted 4.11.0+ #1
> >> > [ 3263.379569] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> > [ 3263.379577] clone           D    0 27910  27909 0x00000000
> >> > [ 3263.379587] Call Trace:
> >> > [ 3263.379608]  __schedule+0x677/0xda0
> >> > [ 3263.379621]  ? pci_mmcfg_check_reserved+0xc0/0xc0
> >> > [ 3263.379634]  ? task_stopped_code+0x70/0x70
> >> > [ 3263.379643]  schedule+0x4d/0xd0
> >> > [ 3263.379653]  zap_pid_ns_processes+0x1ee/0x2a0
> >> > [ 3263.379659]  ? copy_pid_ns+0x4d0/0x4d0
> >> > [ 3263.379670]  do_exit+0x10d4/0x1330
> >> > ...
> >> >
> >> > The problem is seen in all kernels up to v4.11.
> >> >
> >> > Any idea what might be going on and how to fix the problem?
> >> 
> >> Let me see.  Reading the code it looks like we have three tasks;
> >> let's call them main, child1, and child2.
> >> 
> >> child1 and child2 are started using CLONE_THREAD and are
> >> thus clones of one another.
> >> 
> >> child2 exits first but is ptraced by main so is not reaped.
> >>        Further child2 calls do_group_exit forcing child1 to
> >>        exit making for fun races.
> >> 
> >>        A pthread_exit() or syscall(SYS_exit, 0) would skip
> >>        the group exit and make the window larger.
> >> 
> >> child1 exits next and calls zap_pid_ns_processes and is
> >>        waiting for child2 to be reaped by main.
> >> 
> >> main is just sitting around doing nothing for 3600 seconds
> >> not reaping anyone.
> >> 
> >> I would expect that when main exits everything would be cleaned up
> >> and the only real issue is that we have a hung task warning.
> >> 
> >> Does everything cleanup when main exits?
> >> 
> > Yes, it does.
> >
> > Problem is that if the main task doesn't exit, it hangs forever.
> > Chrome OS (where we see the problem in the field, and the application
> > is chrome) is configured to reboot on hung tasks - if a task is hung
> > for 120 seconds on those systems, it tends to be in a bad shape.
> > This makes it a quite severe problem for us.
> 
> We definitely need to change the code in such a way as to not
> trigger the hung task warning.
> 
> > I browsed through a number of stack traces - the ones I looked at
> > all have the same traceback from do_group_exit(), though the
> > caller of do_group_exit() is not always the same (that may depend
> > on the kernel version, though). Sometimes it is __wake_up_parent(),
> > sometimes it is get_signal_to_deliver(), sometimes it is
> > get_signal().
> >
> > Are there other conditions besides ptrace where a task isn't reaped?
> 
> The problem here is that the parent was outside of the pid namespace
> and was choosing not to reap the child.  The main task could easily
> have reaped the child when it received SIGCHLD.
> 
> Arranging for forked children to run in a pid namespace with setns
> can also create processes with parents outside the pid namespace,
> which may or may not be reaped.
> 
> I suspect it is a bug to allow PTRACE_TRACEME if either the parent
> or the child has called exec.  That is not enough to block your test
> case but I think it would block real world problems.
> 
> For getting to the root cause of this I would see if I could get
> a full process list.  It would be interesting to know what
> the unreaped zombie is.
> 
root     11579 11472  0 13:38 pts/15   00:00:00 sudo ./clone
root     11580 11579  0 13:38 pts/15   00:00:00 ./clone
root     11581 11580  0 13:38 pts/15   00:00:00 [clone]

I assume this means that it is child1? What other information
would be needed?

Thanks,
Guenter

> In the meantime I am going to see what it takes to change the code
> so it doesn't trigger the hung task warning.  I don't want to change
> the actual observed behavior of the code but I can avoid triggering
> a warning that does not apply.
> 
> Eric
> 
> 
> >
> > Thanks,
> > Guenter
> >
> >> Eric
> >> 
> >> 
> >> >
> >> > Thanks,
> >> > Guenter
> >> >
> >> > ---
> >> > This test program was kindly provided by Vovo Yang <vovoy@google.com>.
> >> >
> >> > Note that the ptrace() call in child1() is not necessary for the problem
> >> > to be seen, though it seems to make it a bit more likely.
> >> 
> >> That would appear to just slow things down a smidge, as there is
> >> nothing substantial that happens ptrace-wise until after
> >> zap_pid_ns_processes().
> >> 
> >> 
> >> > ---
> >> >
> >> > #define _GNU_SOURCE
> >> > #include <stdio.h>
> >> > #include <stdlib.h>
> >> > #include <unistd.h>
> >> > #include <sys/ptrace.h>
> >> > #include <errno.h>
> >> > #include <string.h>
> >> > #include <sched.h>
> >> >
> >> > #define STACK_SIZE 65536
> >> >
> >> > int child1(void* arg);
> >> > int child2(void* arg);
> >> >
> >> > int main(int argc, char **argv)
> >> > {
> >> >   int child_pid;
> >> >   char* child_stack = malloc(STACK_SIZE);
> >> >   char* stack_top = child_stack + STACK_SIZE;
> >> >   char command[256];
> >> >
> >> >   child_pid = clone(&child1, stack_top, CLONE_NEWPID, NULL);
> >> >   if (child_pid == -1) {
> >> >     printf("parent: clone failed: %s\n", strerror(errno));
> >> >     return EXIT_FAILURE;
> >> >   }
> >> >   printf("parent: child1_pid: %d\n", child_pid);
> >> >
> >> >   sleep(2);
> >> >   printf("child state, if it's D (disk sleep), the child process is hung\n");
> >> >   sprintf(command, "cat /proc/%d/status | grep State:", child_pid);
> >> >   system(command);
> >> >   sleep(3600);
> >> >   return EXIT_SUCCESS;
> >> > }
> >> >
> >> > int child1(void* arg)
> >> > {
> >> >   int flags = CLONE_FILES | CLONE_FS | CLONE_VM | CLONE_SIGHAND | CLONE_THREAD;
> >> >   char* child_stack = malloc(STACK_SIZE);
> >> >   char* stack_top = child_stack + STACK_SIZE;
> >> >   long ret;
> >> >
> >> >   ret = ptrace(PTRACE_TRACEME, 0, NULL, NULL);
> >> >   if (ret == -1) {
> >> >     printf("child1: ptrace failed: %s\n", strerror(errno));
> >> >     return EXIT_FAILURE;
> >> >   }
> >> >
> >> >   ret = clone(&child2, stack_top, flags, NULL);
> >> >   if (ret == -1) {
> >> >     printf("child1: clone failed: %s\n", strerror(errno));
> >> >     return EXIT_FAILURE;
> >> >   }
> >> >   printf("child1: child2 pid: %ld\n", ret);
> >> >
> >> >   sleep(1);
> >> >   printf("child1: end\n");
> >> >   return EXIT_SUCCESS;
> >> > }
> >> >
> >> > int child2(void* arg)
> >> > {
> >> >   long ret = ptrace(PTRACE_TRACEME, 0, NULL, NULL);
> >> >   if (ret == -1) {
> >> >     printf("child2: ptrace failed: %s\n", strerror(errno));
> >> >     return EXIT_FAILURE;
> >> >   }
> >> >
> >> >   printf("child2: end\n");
> >> >   return EXIT_SUCCESS;
> >> > }

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Threads stuck in zap_pid_ns_processes()
  2017-05-11 20:21   ` Guenter Roeck
@ 2017-05-11 21:25     ` Eric W. Biederman
  2017-05-11 22:47       ` Guenter Roeck
  0 siblings, 1 reply; 28+ messages in thread
From: Eric W. Biederman @ 2017-05-11 21:25 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: Ingo Molnar, linux-kernel, Vovo Yang

Guenter Roeck <linux@roeck-us.net> writes:

> On Thu, May 11, 2017 at 12:31:21PM -0500, Eric W. Biederman wrote:
>> Guenter Roeck <linux@roeck-us.net> writes:
>> 
>> > Hi all,
>> >
>> > the test program attached below almost always results in one of the child
>> > processes being stuck in zap_pid_ns_processes(). When this happens, I can
>> > see from test logs that nr_hashed == 2 and init_pids==1, but there is only
>> > a single thread left in the pid namespace (the one that is stuck).
>> > Traceback from /proc/<pid>/stack is
>> >
>> > [<ffffffff811c385e>] zap_pid_ns_processes+0x1ee/0x2a0
>> > [<ffffffff810c1ba4>] do_exit+0x10d4/0x1330
>> > [<ffffffff810c1ee6>] do_group_exit+0x86/0x130
>> > [<ffffffff810d4347>] get_signal+0x367/0x8a0
>> > [<ffffffff81046e73>] do_signal+0x83/0xb90
>> > [<ffffffff81004475>] exit_to_usermode_loop+0x75/0xc0
>> > [<ffffffff810055b6>] syscall_return_slowpath+0xc6/0xd0
>> > [<ffffffff81ced488>] entry_SYSCALL_64_fastpath+0xab/0xad
>> > [<ffffffffffffffff>] 0xffffffffffffffff
>> >
>> > After 120 seconds, I get the "hung task" message.
>> >
>> > Example from v4.11:
>> >
>> > ...
>> > [ 3263.379545] INFO: task clone:27910 blocked for more than 120 seconds.
>> > [ 3263.379561]       Not tainted 4.11.0+ #1
>> > [ 3263.379569] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> > [ 3263.379577] clone           D    0 27910  27909 0x00000000
>> > [ 3263.379587] Call Trace:
>> > [ 3263.379608]  __schedule+0x677/0xda0
>> > [ 3263.379621]  ? pci_mmcfg_check_reserved+0xc0/0xc0
>> > [ 3263.379634]  ? task_stopped_code+0x70/0x70
>> > [ 3263.379643]  schedule+0x4d/0xd0
>> > [ 3263.379653]  zap_pid_ns_processes+0x1ee/0x2a0
>> > [ 3263.379659]  ? copy_pid_ns+0x4d0/0x4d0
>> > [ 3263.379670]  do_exit+0x10d4/0x1330
>> > ...
>> >
>> > The problem is seen in all kernels up to v4.11.
>> >
>> > Any idea what might be going on and how to fix the problem ?
>> 
>> Let me see.  Reading the code it looks like we have three tasks
>> let's call them main, child1, and child2.
>> 
>> child1 and child2 are started using CLONE_THREAD and are
>> thus clones of one another.
>> 
>> child2 exits first but is ptraced by main so is not reaped.
>>        Further, child2 calls do_group_exit(), forcing child1
>>        to exit, making for fun races.
>> 
>>        A pthread_exit() or syscall(SYS_exit, 0) would skip
>>        the group exit and make the window larger.
>> 
>> child1 exits next and calls zap_pid_ns_processes and is
>>        waiting for child2 to be reaped by main.
>> 
>> main is just sitting around doing nothing for 3600 seconds
>> not reaping anyone.
>> 
>> I would expect that when main exits everything would be cleaned up
>> and the only real issue is that we have a hung task warning.
>> 
>> Does everything cleanup when main exits?
>> 
>
> As an add-on to my previous mail: I added a function to count
> the number of threads in the pid namespace, using next_pidmap().
> Even though nr_hashed == 2, only the hanging thread is still
> present.

For your testcase?  I suspect you copied the code from
zap_pid_ns_processes and skipped pid 1.  It is going to be pid 1 that is
calling zap_pid_ns_processes.

> Is there maybe a better way to terminate the wait loop than
> with "if (pid_ns->nr_hashed == init_pids)" ?

Not right now.  nr_hashed counts the number of "struct pid"s that are
attached to processes and in the hash table for looking up tasks.

Waiting until that drops to just the last task in the pid namespace
really does do a good job of counting what we are looking for.  It
is a little tricky because if our task_pid(p) != task_tgid(p) then
nr_hashed will be 2, for various reasons.

That said it isn't as obvious as it should be and I would like
to improve that.

I think the immediate solution here is to just use TASK_INTERRUPTIBLE
instead of TASK_UNINTERRUPTIBLE.  This really is not a short-term
disk sleep nor anything guaranteed to complete in any sort of
finite time.  Plus the hung task timeout warning only triggers
if your task state is TASK_UNINTERRUPTIBLE, so this will prevent
the warning and keep a long delay in zap_pid_ns_processes from
bumping up your load average.

That still leaves the question of what strange state
you are getting into, but this should at least prevent the unexpected
reboots.

diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 77403c157157..971e7bc6939b 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -277,7 +277,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
         * if reparented.
         */
        for (;;) {
-               set_current_state(TASK_UNINTERRUPTIBLE);
+               set_current_state(TASK_INTERRUPTIBLE);
                if (pid_ns->nr_hashed == init_pids)
                        break;
                schedule();


Eric

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: Threads stuck in zap_pid_ns_processes()
  2017-05-11 20:48       ` Guenter Roeck
@ 2017-05-11 21:39         ` Eric W. Biederman
  0 siblings, 0 replies; 28+ messages in thread
From: Eric W. Biederman @ 2017-05-11 21:39 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: Ingo Molnar, linux-kernel, Vovo Yang

Guenter Roeck <linux@roeck-us.net> writes:

> On Thu, May 11, 2017 at 03:23:17PM -0500, Eric W. Biederman wrote:
>> Guenter Roeck <linux@roeck-us.net> writes:
>> 
>> > On Thu, May 11, 2017 at 12:31:21PM -0500, Eric W. Biederman wrote:
>> >> Guenter Roeck <linux@roeck-us.net> writes:
>> >> 
>> >> > Hi all,
>> >> >
>> >> > the test program attached below almost always results in one of the child
>> >> > processes being stuck in zap_pid_ns_processes(). When this happens, I can
>> >> > see from test logs that nr_hashed == 2 and init_pids==1, but there is only
>> >> > a single thread left in the pid namespace (the one that is stuck).
>> >> > Traceback from /proc/<pid>/stack is
>> >> >
>> >> > [<ffffffff811c385e>] zap_pid_ns_processes+0x1ee/0x2a0
>> >> > [<ffffffff810c1ba4>] do_exit+0x10d4/0x1330
>> >> > [<ffffffff810c1ee6>] do_group_exit+0x86/0x130
>> >> > [<ffffffff810d4347>] get_signal+0x367/0x8a0
>> >> > [<ffffffff81046e73>] do_signal+0x83/0xb90
>> >> > [<ffffffff81004475>] exit_to_usermode_loop+0x75/0xc0
>> >> > [<ffffffff810055b6>] syscall_return_slowpath+0xc6/0xd0
>> >> > [<ffffffff81ced488>] entry_SYSCALL_64_fastpath+0xab/0xad
>> >> > [<ffffffffffffffff>] 0xffffffffffffffff
>> >> >
>> >> > After 120 seconds, I get the "hung task" message.
>> >> >
>> >> > Example from v4.11:
>> >> >
>> >> > ...
>> >> > [ 3263.379545] INFO: task clone:27910 blocked for more than 120 seconds.
>> >> > [ 3263.379561]       Not tainted 4.11.0+ #1
>> >> > [ 3263.379569] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> >> > [ 3263.379577] clone           D    0 27910  27909 0x00000000
>> >> > [ 3263.379587] Call Trace:
>> >> > [ 3263.379608]  __schedule+0x677/0xda0
>> >> > [ 3263.379621]  ? pci_mmcfg_check_reserved+0xc0/0xc0
>> >> > [ 3263.379634]  ? task_stopped_code+0x70/0x70
>> >> > [ 3263.379643]  schedule+0x4d/0xd0
>> >> > [ 3263.379653]  zap_pid_ns_processes+0x1ee/0x2a0
>> >> > [ 3263.379659]  ? copy_pid_ns+0x4d0/0x4d0
>> >> > [ 3263.379670]  do_exit+0x10d4/0x1330
>> >> > ...
>> >> >
>> >> > The problem is seen in all kernels up to v4.11.
>> >> >
>> >> > Any idea what might be going on and how to fix the problem ?
>> >> 
>> >> Let me see.  Reading the code it looks like we have three tasks
>> >> let's call them main, child1, and child2.
>> >> 
>> >> child1 and child2 are started using CLONE_THREAD and are
>> >> thus clones of one another.
>> >> 
>> >> child2 exits first but is ptraced by main so is not reaped.
> >>        Further, child2 calls do_group_exit(), forcing child1
> >>        to exit, making for fun races.
> >> 
> >>        A pthread_exit() or syscall(SYS_exit, 0) would skip
> >>        the group exit and make the window larger.
>> >> 
>> >> child1 exits next and calls zap_pid_ns_processes and is
>> >>        waiting for child2 to be reaped by main.
>> >> 
>> >> main is just sitting around doing nothing for 3600 seconds
>> >> not reaping anyone.
>> >> 
>> >> I would expect that when main exits everything would be cleaned up
>> >> and the only real issue is that we have a hung task warning.
>> >> 
>> >> Does everything cleanup when main exits?
>> >> 
>> > Yes, it does.
>> >
>> > Problem is that if the main task doesn't exit, it hangs forever.
>> > Chrome OS (where we see the problem in the field, and the application
>> > is chrome) is configured to reboot on hung tasks - if a task is hung
>> > for 120 seconds on those systems, it tends to be in a bad shape.
>> > This makes it a quite severe problem for us.
>> 
>> We definitely need to change the code in such a way as to not
>> trigger the hung task warning.
>> 
>> > I browsed through a number of stack traces - the ones I looked at
>> > all have the same traceback from do_group_exit(), though the
>> > caller of do_group_exit() is not always the same (that may depend
>> > on the kernel version, though). Sometimes it is __wake_up_parent(),
>> > sometimes it is get_signal_to_deliver(), sometimes it is
>> > get_signal().
>> >
>> > Are there other conditions besides ptrace where a task isn't reaped ?
>> 
>> The problem here is that the parent was outside of the pid namespace
>> and was choosing not to reap the child.  The main task could easily
>> have reaped the child when it received SIGCHLD.
>> 
>> Arranging for forked children to run in a pid namespace with setns
>> can also create processes with parents outside the pid namespace,
>> which may or may not be reaped.
>> 
>> I suspect it is a bug to allow PTRACE_TRACEME if either the parent
>> or the child has called exec.  That is not enough to block your test
>> case but I think it would block real world problems.
>> 
>> For getting to the root cause of this I would see if I could get
>> a full process list.  It would be interesting to know what
>> the unreaped zombie is.
>> 
> root     11579 11472  0 13:38 pts/15   00:00:00 sudo ./clone
> root     11580 11579  0 13:38 pts/15   00:00:00 ./clone
> root     11581 11580  0 13:38 pts/15   00:00:00 [clone]
>
> I assume this means that it is child1 ? What other information
> would be needed ?

My question is: does the test case cover the real-life cause of this?
I assume it is simplified from what you are seeing on Chrome OS.

More specifically, the question is: is it a process doing the
PTRACE_TRACEME?  A process created after setns? Or something else.

There may be a bug I have not seen.
There may be something we can create defenses against.

Of course if it is just someone dreaming up evil test cases
where the kernel does an undesirable thing there is less to worry about.
But it sounds like something you have seen in the field that you
are trying to track down and understand.

Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Threads stuck in zap_pid_ns_processes()
  2017-05-11 21:25     ` Eric W. Biederman
@ 2017-05-11 22:47       ` Guenter Roeck
  2017-05-11 23:19         ` Eric W. Biederman
  2017-05-12  3:42         ` Eric W. Biederman
  0 siblings, 2 replies; 28+ messages in thread
From: Guenter Roeck @ 2017-05-11 22:47 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Ingo Molnar, linux-kernel, Vovo Yang

On Thu, May 11, 2017 at 04:25:23PM -0500, Eric W. Biederman wrote:
> Guenter Roeck <linux@roeck-us.net> writes:
> 
> > On Thu, May 11, 2017 at 12:31:21PM -0500, Eric W. Biederman wrote:
> >> Guenter Roeck <linux@roeck-us.net> writes:
> >> 
> >> > Hi all,
> >> >
> >> > the test program attached below almost always results in one of the child
> >> > processes being stuck in zap_pid_ns_processes(). When this happens, I can
> >> > see from test logs that nr_hashed == 2 and init_pids==1, but there is only
> >> > a single thread left in the pid namespace (the one that is stuck).
> >> > Traceback from /proc/<pid>/stack is
> >> >
> >> > [<ffffffff811c385e>] zap_pid_ns_processes+0x1ee/0x2a0
> >> > [<ffffffff810c1ba4>] do_exit+0x10d4/0x1330
> >> > [<ffffffff810c1ee6>] do_group_exit+0x86/0x130
> >> > [<ffffffff810d4347>] get_signal+0x367/0x8a0
> >> > [<ffffffff81046e73>] do_signal+0x83/0xb90
> >> > [<ffffffff81004475>] exit_to_usermode_loop+0x75/0xc0
> >> > [<ffffffff810055b6>] syscall_return_slowpath+0xc6/0xd0
> >> > [<ffffffff81ced488>] entry_SYSCALL_64_fastpath+0xab/0xad
> >> > [<ffffffffffffffff>] 0xffffffffffffffff
> >> >
> >> > After 120 seconds, I get the "hung task" message.
> >> >
> >> > Example from v4.11:
> >> >
> >> > ...
> >> > [ 3263.379545] INFO: task clone:27910 blocked for more than 120 seconds.
> >> > [ 3263.379561]       Not tainted 4.11.0+ #1
> >> > [ 3263.379569] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> > [ 3263.379577] clone           D    0 27910  27909 0x00000000
> >> > [ 3263.379587] Call Trace:
> >> > [ 3263.379608]  __schedule+0x677/0xda0
> >> > [ 3263.379621]  ? pci_mmcfg_check_reserved+0xc0/0xc0
> >> > [ 3263.379634]  ? task_stopped_code+0x70/0x70
> >> > [ 3263.379643]  schedule+0x4d/0xd0
> >> > [ 3263.379653]  zap_pid_ns_processes+0x1ee/0x2a0
> >> > [ 3263.379659]  ? copy_pid_ns+0x4d0/0x4d0
> >> > [ 3263.379670]  do_exit+0x10d4/0x1330
> >> > ...
> >> >
> >> > The problem is seen in all kernels up to v4.11.
> >> >
> >> > Any idea what might be going on and how to fix the problem ?
> >> 
> >> Let me see.  Reading the code it looks like we have three tasks
> >> let's call them main, child1, and child2.
> >> 
> >> child1 and child2 are started using CLONE_THREAD and are
> >> thus clones of one another.
> >> 
> >> child2 exits first but is ptraced by main so is not reaped.
> >>        Further, child2 calls do_group_exit(), forcing child1
> >>        to exit, making for fun races.
> >> 
> >>        A pthread_exit() or syscall(SYS_exit, 0) would skip
> >>        the group exit and make the window larger.
> >> 
> >> child1 exits next and calls zap_pid_ns_processes and is
> >>        waiting for child2 to be reaped by main.
> >> 
> >> main is just sitting around doing nothing for 3600 seconds
> >> not reaping anyone.
> >> 
> >> I would expect that when main exits everything would be cleaned up
> >> and the only real issue is that we have a hung task warning.
> >> 
> >> Does everything cleanup when main exits?
> >> 
> >
> > As an add-on to my previous mail: I added a function to count
> > the number of threads in the pid namespace, using next_pidmap().
> > Even though nr_hashed == 2, only the hanging thread is still
> > present.
> 
> For your testcase?  I suspect you copied the code from
> zap_pid_ns_processes and skipped pid 1.  It is going to be pid 1 that is
> calling zap_pid_ns_processes.
> 

Almost. Something along the line of

	count = 0;
	nr = next_pidmap(pid_ns, 0);
	while (nr > 0) {
		count++;
		nr = next_pidmap(pid_ns, nr);
	}

except that I also call sched_show_task() for each thread, and the only
one printed is the one that hangs in zap_pid_ns_processes().

> > Is there maybe a better way to terminate the wait loop than
> > with "if (pid_ns->nr_hashed == init_pids)" ?
> 
> Not right now.  nr_hashed counts the number of "struct pid"s that are
> attached to processes and in the hash table for looking up tasks.
> 
> Waiting until that drops to just the last task in the pid namespace
> really does do a good job of counting what we are looking for.  It
> is a little tricky because if our task_pid(p) != task_tgid(p) then
> nr_hashed will be 2, for various reasons.
> 
> That said it isn't as obvious as it should be and I would like
> to improve that.
> 
> I think the immediate solution here is to just use TASK_INTERRUPTIBLE
> instead of TASK_UNINTERRUPTIBLE.  This really is not a short-term
> disk sleep nor anything guaranteed to complete in any sort of
> finite time.  Plus the hung task timeout warning only triggers
> if your task state is TASK_UNINTERRUPTIBLE, so this will prevent
> the warning and keep a long delay in zap_pid_ns_processes from
> bumping up your load average.
> 
> That still leaves the question of what strange state
> you are getting into, but this should at least prevent the unexpected
> reboots.
> 

What I know so far is:
- We see this condition on a regular basis in the field. Regular is
  relative, of course - let's say maybe 1 in a million Chromebooks
  per day reports a crash because of it. That is not that many,
  but it adds up.
- We are able to reproduce the problem with a performance benchmark
  which opens 100 chrome tabs. While that is a lot, it should not
  result in a kernel hang/crash.
- Vovo provided the test code last night. I don't know if this is
  exactly what is observed in the benchmark, or how it relates to the
  benchmark in the first place, but it is the first time we are actually
  able to reliably create a condition where the problem is seen.

> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index 77403c157157..971e7bc6939b 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -277,7 +277,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>          * if reparented.
>          */
>         for (;;) {
> -               set_current_state(TASK_UNINTERRUPTIBLE);
> +               set_current_state(TASK_INTERRUPTIBLE);
>                 if (pid_ns->nr_hashed == init_pids)
>                         break;
>                 schedule();
> 

I'll give it a try. Are there any potential unintended side effects ?

Thanks,
Guenter

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Threads stuck in zap_pid_ns_processes()
  2017-05-11 22:47       ` Guenter Roeck
@ 2017-05-11 23:19         ` Eric W. Biederman
  2017-05-12  9:30           ` Vovo Yang
  2017-05-12  3:42         ` Eric W. Biederman
  1 sibling, 1 reply; 28+ messages in thread
From: Eric W. Biederman @ 2017-05-11 23:19 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: Ingo Molnar, linux-kernel, Vovo Yang

Guenter Roeck <linux@roeck-us.net> writes:

> On Thu, May 11, 2017 at 04:25:23PM -0500, Eric W. Biederman wrote:
>> Guenter Roeck <linux@roeck-us.net> writes:
>> 
>> > On Thu, May 11, 2017 at 12:31:21PM -0500, Eric W. Biederman wrote:
>> >> Guenter Roeck <linux@roeck-us.net> writes:
>> >> 
>> >> > Hi all,
>> >> >
>> >> > the test program attached below almost always results in one of the child
>> >> > processes being stuck in zap_pid_ns_processes(). When this happens, I can
>> >> > see from test logs that nr_hashed == 2 and init_pids==1, but there is only
>> >> > a single thread left in the pid namespace (the one that is stuck).
>> >> > Traceback from /proc/<pid>/stack is
>> >> >
>> >> > [<ffffffff811c385e>] zap_pid_ns_processes+0x1ee/0x2a0
>> >> > [<ffffffff810c1ba4>] do_exit+0x10d4/0x1330
>> >> > [<ffffffff810c1ee6>] do_group_exit+0x86/0x130
>> >> > [<ffffffff810d4347>] get_signal+0x367/0x8a0
>> >> > [<ffffffff81046e73>] do_signal+0x83/0xb90
>> >> > [<ffffffff81004475>] exit_to_usermode_loop+0x75/0xc0
>> >> > [<ffffffff810055b6>] syscall_return_slowpath+0xc6/0xd0
>> >> > [<ffffffff81ced488>] entry_SYSCALL_64_fastpath+0xab/0xad
>> >> > [<ffffffffffffffff>] 0xffffffffffffffff
>> >> >
>> >> > After 120 seconds, I get the "hung task" message.
>> >> >
>> >> > Example from v4.11:
>> >> >
>> >> > ...
>> >> > [ 3263.379545] INFO: task clone:27910 blocked for more than 120 seconds.
>> >> > [ 3263.379561]       Not tainted 4.11.0+ #1
>> >> > [ 3263.379569] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> >> > [ 3263.379577] clone           D    0 27910  27909 0x00000000
>> >> > [ 3263.379587] Call Trace:
>> >> > [ 3263.379608]  __schedule+0x677/0xda0
>> >> > [ 3263.379621]  ? pci_mmcfg_check_reserved+0xc0/0xc0
>> >> > [ 3263.379634]  ? task_stopped_code+0x70/0x70
>> >> > [ 3263.379643]  schedule+0x4d/0xd0
>> >> > [ 3263.379653]  zap_pid_ns_processes+0x1ee/0x2a0
>> >> > [ 3263.379659]  ? copy_pid_ns+0x4d0/0x4d0
>> >> > [ 3263.379670]  do_exit+0x10d4/0x1330
>> >> > ...
>> >> >
>> >> > The problem is seen in all kernels up to v4.11.
>> >> >
>> >> > Any idea what might be going on and how to fix the problem ?
>> >> 
>> >> Let me see.  Reading the code it looks like we have three tasks
>> >> let's call them main, child1, and child2.
>> >> 
>> >> child1 and child2 are started using CLONE_THREAD and are
>> >> thus clones of one another.
>> >> 
>> >> child2 exits first but is ptraced by main so is not reaped.
>> >>        Further, child2 calls do_group_exit(), forcing child1
>> >>        to exit, making for fun races.
>> >> 
>> >>        A pthread_exit() or syscall(SYS_exit, 0) would skip
>> >>        the group exit and make the window larger.
>> >> 
>> >> child1 exits next and calls zap_pid_ns_processes and is
>> >>        waiting for child2 to be reaped by main.
>> >> 
>> >> main is just sitting around doing nothing for 3600 seconds
>> >> not reaping anyone.
>> >> 
>> >> I would expect that when main exits everything would be cleaned up
>> >> and the only real issue is that we have a hung task warning.
>> >> 
>> >> Does everything cleanup when main exits?
>> >> 
>> >
>> > As an add-on to my previous mail: I added a function to count
>> > the number of threads in the pid namespace, using next_pidmap().
>> > Even though nr_hashed == 2, only the hanging thread is still
>> > present.
>> 
>> For your testcase?  I suspect you copied the code from
>> zap_pid_ns_processes and skipped pid 1.  It is going to be pid 1 that is
>> calling zap_pid_ns_processes.
>> 
>
> Almost. Something along the line of
>
> 	count = 0;
> 	nr = next_pidmap(pid_ns, 0);
> 	while (nr > 0) {
> 		count++;
> 		nr = next_pidmap(pid_ns, nr);
> 	}
>
> except that I also call sched_show_task() for each thread, and the only
> one printed is the one that hangs in zap_pid_ns_processes().
>
>> > Is there maybe a better way to terminate the wait loop than
>> > with "if (pid_ns->nr_hashed == init_pids)" ?
>> 
>> Not right now.  nr_hashed counts the number of "struct pid"s that are
>> attached to processes and in the hash table for looking up tasks.
>> 
>> Waiting until that drops to just the last task in the pid namespace
>> really does do a good job of counting what we are looking for.  It
>> is a little tricky because if our task_pid(p) != task_tgid(p) then
>> nr_hashed will be 2, for various reasons.
>> 
>> That said it isn't as obvious as it should be and I would like
>> to improve that.
>> 
>> I think the immediate solution here is to just use TASK_INTERRUPTIBLE
>> instead of TASK_UNINTERRUPTIBLE.  This really is not a short-term
>> disk sleep nor anything guaranteed to complete in any sort of
>> finite time.  Plus the hung task timeout warning only triggers
>> if your task state is TASK_UNINTERRUPTIBLE, so this will prevent
>> the warning and keep a long delay in zap_pid_ns_processes from
>> bumping up your load average.
>> 
>> That still leaves the question of what strange state
>> you are getting into, but this should at least prevent the unexpected
>> reboots.
>> 
>
> What I know so far is:
> - We see this condition on a regular basis in the field. Regular is
>   relative, of course - let's say maybe 1 in a million Chromebooks
>   per day reports a crash because of it. That is not that many,
>   but it adds up.
> - We are able to reproduce the problem with a performance benchmark
>   which opens 100 chrome tabs. While that is a lot, it should not
>   result in a kernel hang/crash.
> - Vovo provided the test code last night. I don't know if this is
>   exactly what is observed in the benchmark, or how it relates to the
>   benchmark in the first place, but it is the first time we are actually
>   able to reliably create a condition where the problem is seen.

Thank you.  It will be interesting to hear what is happening in the
chrome performance benchmark that triggers this.

There may be some other steps that are worth taking.

>> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
>> index 77403c157157..971e7bc6939b 100644
>> --- a/kernel/pid_namespace.c
>> +++ b/kernel/pid_namespace.c
>> @@ -277,7 +277,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>>          * if reparented.
>>          */
>>         for (;;) {
>> -               set_current_state(TASK_UNINTERRUPTIBLE);
>> +               set_current_state(TASK_INTERRUPTIBLE);
>>                 if (pid_ns->nr_hashed == init_pids)
>>                         break;
>>                 schedule();
>> 
>
> I'll give it a try. Are there any potential unintended side effects ?

No.  The only reason I didn't implement it this way in the first place is
that I have cognitive dissonance between the name of that state and
the fact that the code isn't interruptible.

In the long term it is probably better to refactor that code path so it
doesn't exist in the first place, and use more common kernel idioms.
But for right now it should not be a problem.
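The more common idiom alluded to here would be a waitqueue-based wait rather than an open-coded schedule() loop. A purely hypothetical kernel-style sketch: struct pid_namespace has no such waitqueue, and the wake-up site is assumed, not real:

```c
/* hypothetical refactor sketch only: assumes a wait_queue_head_t
 * "zap_wq" were added to struct pid_namespace and woken whenever
 * nr_hashed drops, e.g. from __unhash_process() */
wait_event_interruptible(pid_ns->zap_wq,
			 pid_ns->nr_hashed == init_pids);
```

wait_event_interruptible() handles the set_current_state()/schedule() dance and the final __set_current_state(TASK_RUNNING) internally, which is part of why it reads more idiomatically.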


Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Threads stuck in zap_pid_ns_processes()
  2017-05-11 22:47       ` Guenter Roeck
  2017-05-11 23:19         ` Eric W. Biederman
@ 2017-05-12  3:42         ` Eric W. Biederman
  1 sibling, 0 replies; 28+ messages in thread
From: Eric W. Biederman @ 2017-05-12  3:42 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: Ingo Molnar, linux-kernel, Vovo Yang

Guenter Roeck <linux@roeck-us.net> writes:

> On Thu, May 11, 2017 at 04:25:23PM -0500, Eric W. Biederman wrote:
>> Guenter Roeck <linux@roeck-us.net> writes:

>> > As an add-on to my previous mail: I added a function to count
>> > the number of threads in the pid namespace, using next_pidmap().
>> > Even though nr_hashed == 2, only the hanging thread is still
>> > present.
>> 
>> For your testcase?  I suspect you copied the code from
>> zap_pid_ns_processes and skipped pid 1.  It is going to be pid 1 that is
>> calling zap_pid_ns_processes.
>> 
>
> Almost. Something along the line of
>
> 	count = 0;
> 	nr = next_pidmap(pid_ns, 0);
> 	while (nr > 0) {
> 		count++;
> 		nr = next_pidmap(pid_ns, nr);
> 	}
>
> except that I also call sched_show_task() for each thread, and the only
> one printed is the one that hangs in zap_pid_ns_processes().

The function sched_show_task() does:
	if (!try_get_task_stack(p))
		return;

Which won't work on a zombie that has already released its stack.
Which is exactly what child2 should be at that point.

Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Threads stuck in zap_pid_ns_processes()
  2017-05-11 23:19         ` Eric W. Biederman
@ 2017-05-12  9:30           ` Vovo Yang
  2017-05-12 13:26             ` Eric W. Biederman
  0 siblings, 1 reply; 28+ messages in thread
From: Vovo Yang @ 2017-05-12  9:30 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Guenter Roeck, Ingo Molnar, linux-kernel

On Fri, May 12, 2017 at 7:19 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Guenter Roeck <linux@roeck-us.net> writes:
>
>> On Thu, May 11, 2017 at 04:25:23PM -0500, Eric W. Biederman wrote:
>>> Guenter Roeck <linux@roeck-us.net> writes:
>>>
>>> > On Thu, May 11, 2017 at 12:31:21PM -0500, Eric W. Biederman wrote:
>>> >> Guenter Roeck <linux@roeck-us.net> writes:
>>> >>
>>> >> > Hi all,
>>> >> >
>>> >> > the test program attached below almost always results in one of the child
>>> >> > processes being stuck in zap_pid_ns_processes(). When this happens, I can
>>> >> > see from test logs that nr_hashed == 2 and init_pids==1, but there is only
>>> >> > a single thread left in the pid namespace (the one that is stuck).
>>> >> > Traceback from /proc/<pid>/stack is
>>> >> >
>>> >> > [<ffffffff811c385e>] zap_pid_ns_processes+0x1ee/0x2a0
>>> >> > [<ffffffff810c1ba4>] do_exit+0x10d4/0x1330
>>> >> > [<ffffffff810c1ee6>] do_group_exit+0x86/0x130
>>> >> > [<ffffffff810d4347>] get_signal+0x367/0x8a0
>>> >> > [<ffffffff81046e73>] do_signal+0x83/0xb90
>>> >> > [<ffffffff81004475>] exit_to_usermode_loop+0x75/0xc0
>>> >> > [<ffffffff810055b6>] syscall_return_slowpath+0xc6/0xd0
>>> >> > [<ffffffff81ced488>] entry_SYSCALL_64_fastpath+0xab/0xad
>>> >> > [<ffffffffffffffff>] 0xffffffffffffffff
>>> >> >
>>> >> > After 120 seconds, I get the "hung task" message.
>>> >> >
>>> >> > Example from v4.11:
>>> >> >
>>> >> > ...
>>> >> > [ 3263.379545] INFO: task clone:27910 blocked for more than 120 seconds.
>>> >> > [ 3263.379561]       Not tainted 4.11.0+ #1
>>> >> > [ 3263.379569] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> >> > [ 3263.379577] clone           D    0 27910  27909 0x00000000
>>> >> > [ 3263.379587] Call Trace:
>>> >> > [ 3263.379608]  __schedule+0x677/0xda0
>>> >> > [ 3263.379621]  ? pci_mmcfg_check_reserved+0xc0/0xc0
>>> >> > [ 3263.379634]  ? task_stopped_code+0x70/0x70
>>> >> > [ 3263.379643]  schedule+0x4d/0xd0
>>> >> > [ 3263.379653]  zap_pid_ns_processes+0x1ee/0x2a0
>>> >> > [ 3263.379659]  ? copy_pid_ns+0x4d0/0x4d0
>>> >> > [ 3263.379670]  do_exit+0x10d4/0x1330
>>> >> > ...
>>> >> >
>>> >> > The problem is seen in all kernels up to v4.11.
>>> >> >
>>> >> > Any idea what might be going on and how to fix the problem ?
>>> >>
>>> >> Let me see.  Reading the code it looks like we have three tasks
>>> >> let's call them main, child1, and child2.
>>> >>
>>> >> child1 and child2 are started using CLONE_THREAD and are
>>> >> thus clones of one another.
>>> >>
>>> >> child2 exits first but is ptraced by main so is not reaped.
>>> >>        Further child2 calls do_group_exit forcing child1 to
>>> >>        exit making for fun races.
>>> >>
>>> >>        A pthread_exit() or syscall(SYS_exit, 0); would skip
>>> >>        the group exit and make the window larger.
>>> >>
>>> >> child1 exits next and calls zap_pid_ns_processes and is
>>> >>        waiting for child2 to be reaped by main.
>>> >>
>>> >> main is just sitting around doing nothing for 3600 seconds
>>> >> not reaping anyone.
>>> >>
>>> >> I would expect that when main exits everything would be cleaned up
>>> >> and the only real issue is that we have a hung task warning.
>>> >>
>>> >> Does everything cleanup when main exits?
>>> >>
>>> >
>>> > As an add-on to my previous mail: I added a function to count
>>> > the number of threads in the pid namespace, using next_pidmap().
>>> > Even though nr_hashed == 2, only the hanging thread is still
>>> > present.
>>>
>>> For your testcase?  I suspect you copied the code from
>>> zap_pid_ns_processes and skipped pid 1.  It is going to be pid 1 that is
>>> calling zap_pid_ns_processes.
>>>
>>
>> Almost. Something along the line of
>>
>>       count = 0;
>>       nr = next_pidmap(pid_ns, 0);
>>       while (nr > 0) {
>>               count++;
>>               nr = next_pidmap(pid_ns, nr);
>>       }
>>
>> only I also call sched_show_task() for each thread, and the only
>> one printed is the one that hangs in zap_pid_ns_processes().
>>
>>> > Is there maybe a better way to terminate the wait loop than
>>> > with "if (pid_ns->nr_hashed == init_pids)" ?
>>>
>>> Not right now.  nr_hashed counts the number of "struct pid"s that are
>>> attached to processes and in the hash table for looking up tasks.
>>>
>>> Waiting until that drops to just the last task in the pid namespace
>>> really does do a good job of counting what we are looking for.  It
>>> is a little tricky because if our task_pid(p) != task_tgid(p) then
>>> nr_hashed will be 2.  For various reasons.
>>>
>>> That said it isn't as obvious as it should be and I would like
>>> to improve that.
>>>
>>> I think the immediate solution here is to just use TASK_INTERRUPTIBLE
>>> instead of TASK_UNINTERRUPTIBLE.  This really is not a short term
>>> disk sleep nor anything guaranteed to complete in any sort of
>>> finite time.  Plus the hung task timeout warning only triggers
>>> if your task state is TASK_UNINTERRUPTIBLE so this will prevent
>>> the warning and keep a long delay in zap_pid_ns_processes from
>>> bumping up your load average.
>>>
>>> That still leaves the question of what strange state
>>> you are getting into, but this should at least prevent the unexpected
>>> reboots.
>>>
>>
>> What I know so far is
>> - We see this condition on a regular basis in the field. Regular is
>>   relative, of course - let's say maybe 1 in a Million Chromebooks
>>   per day reports a crash because of it. That is not that many,
>>   but it adds up.
>> - We are able to reproduce the problem with a performance benchmark
>>   which opens 100 chrome tabs. While that is a lot, it should not
>>   result in a kernel hang/crash.
>> - Vovo provided the test code last night. I don't know if this is
>>   exactly what is observed in the benchmark, or how it relates to the
>>   benchmark in the first place, but it is the first time we are actually
>>   able to reliably create a condition where the problem is seen.
>
> Thank you.  It will be interesting to hear what is happening in the
> Chrome performance benchmark that triggers this.
>

What's happening in the benchmark:
1. A chrome renderer process was created with CLONE_NEWPID
2. The process crashed
3. The Chrome breakpad service calls ptrace(PTRACE_ATTACH, ..) to attach to every
  thread of the crashed process to dump info
4. When breakpad detaches from the crashed process, the crashed process gets stuck in
  zap_pid_ns_processes()

Thanks,
Vovo

> There may be some other steps that are worth taking.
>
>>> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
>>> index 77403c157157..971e7bc6939b 100644
>>> --- a/kernel/pid_namespace.c
>>> +++ b/kernel/pid_namespace.c
>>> @@ -277,7 +277,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>>>          * if reparented.
>>>          */
>>>         for (;;) {
>>> -               set_current_state(TASK_UNINTERRUPTIBLE);
>>> +               set_current_state(TASK_INTERRUPTIBLE);
>>>                 if (pid_ns->nr_hashed == init_pids)
>>>                         break;
>>>                 schedule();
>>>
>>
>> I'll give it a try. Are there any potential unintended side effects ?
>
> No.  The only reason I didn't implement this way in the first place is
> that I have cognitive dissonance between the name of that state and
> the fact the code isn't interruptible.
>
> In the long term it is probably better to refactor that code path so it
> doesn't exist in the first place, and use more common kernel idioms.
> But for right now it should not be a problem.
>
>
> Eric
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Threads stuck in zap_pid_ns_processes()
  2017-05-12  9:30           ` Vovo Yang
@ 2017-05-12 13:26             ` Eric W. Biederman
  2017-05-12 16:52               ` Guenter Roeck
  0 siblings, 1 reply; 28+ messages in thread
From: Eric W. Biederman @ 2017-05-12 13:26 UTC (permalink / raw)
  To: Vovo Yang; +Cc: Guenter Roeck, Ingo Molnar, linux-kernel

Vovo Yang <vovoy@google.com> writes:

> On Fri, May 12, 2017 at 7:19 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> Guenter Roeck <linux@roeck-us.net> writes:
>>
>>> What I know so far is
>>> - We see this condition on a regular basis in the field. Regular is
>>>   relative, of course - let's say maybe 1 in a Million Chromebooks
>>>   per day reports a crash because of it. That is not that many,
>>>   but it adds up.
>>> - We are able to reproduce the problem with a performance benchmark
>>>   which opens 100 chrome tabs. While that is a lot, it should not
>>>   result in a kernel hang/crash.
>>> - Vovo provided the test code last night. I don't know if this is
>>>   exactly what is observed in the benchmark, or how it relates to the
>>>   benchmark in the first place, but it is the first time we are actually
>>>   able to reliably create a condition where the problem is seen.
>>
>> Thank you.  It will be interesting to hear what is happening in the
>> Chrome performance benchmark that triggers this.
>>
> What's happening in the benchmark:
> 1. A chrome renderer process was created with CLONE_NEWPID
> 2. The process crashed
> 3. The Chrome breakpad service calls ptrace(PTRACE_ATTACH, ..) to attach to every
>   thread of the crashed process to dump info
> 4. When breakpad detaches from the crashed process, the crashed process gets stuck in
>   zap_pid_ns_processes()

Very interesting thank you.

So the question is specifically which interaction is causing this.

In the test case provided it was a sibling task in the pid namespace
dying and not being reaped.  Which may be what is happening with
breakpad.  So far I have yet to see a kernel bug, but I won't rule one out.

Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Threads stuck in zap_pid_ns_processes()
  2017-05-12 13:26             ` Eric W. Biederman
@ 2017-05-12 16:52               ` Guenter Roeck
  2017-05-12 17:33                 ` Eric W. Biederman
  0 siblings, 1 reply; 28+ messages in thread
From: Guenter Roeck @ 2017-05-12 16:52 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Vovo Yang, Ingo Molnar, linux-kernel

Hi Eric,

On Fri, May 12, 2017 at 08:26:27AM -0500, Eric W. Biederman wrote:
> Vovo Yang <vovoy@google.com> writes:
> 
> > On Fri, May 12, 2017 at 7:19 AM, Eric W. Biederman
> > <ebiederm@xmission.com> wrote:
> >> Guenter Roeck <linux@roeck-us.net> writes:
> >>
> >>> What I know so far is
> >>> - We see this condition on a regular basis in the field. Regular is
> >>>   relative, of course - let's say maybe 1 in a Million Chromebooks
> >>>   per day reports a crash because of it. That is not that many,
> >>>   but it adds up.
> >>> - We are able to reproduce the problem with a performance benchmark
> >>>   which opens 100 chrome tabs. While that is a lot, it should not
> >>>   result in a kernel hang/crash.
> >>> - Vovo provided the test code last night. I don't know if this is
> >>>   exactly what is observed in the benchmark, or how it relates to the
> >>>   benchmark in the first place, but it is the first time we are actually
> >>>   able to reliably create a condition where the problem is seen.
> >>
> >> Thank you.  It will be interesting to hear what is happening in the
> >> Chrome performance benchmark that triggers this.
> >>
> > What's happening in the benchmark:
> > 1. A chrome renderer process was created with CLONE_NEWPID
> > 2. The process crashed
> > 3. The Chrome breakpad service calls ptrace(PTRACE_ATTACH, ..) to attach to every
> >   thread of the crashed process to dump info
> > 4. When breakpad detaches from the crashed process, the crashed process gets stuck in
> >   zap_pid_ns_processes()
> 
> Very interesting thank you.
> 
> So the question is specifically which interaction is causing this.
> 
> In the test case provided it was a sibling task in the pid namespace
> dying and not being reaped.  Which may be what is happening with
> breakpad.  So far I have yet to see a kernel bug, but I won't rule one out.
> 

I am trying to understand what you are looking for. I would have thought
that both the test application as well as the Chrome functionality
described above show that there are situations where zap_pid_ns_processes()
can get stuck and cause hung task timeouts in conjunction with the use of
ptrace().

Your last sentence seems to suggest that you believe that the kernel might
do what it is expected to do. Assuming this is the case, what else would
you like to see ? A test application which matches exactly the Chrome use
case ? We can try to provide that, but I don't entirely understand how
that would change the situation. After all, we already know that it is
possible to get a thread into this condition, and we already have one
means to reproduce it.

Replacing TASK_UNINTERRUPTIBLE with TASK_INTERRUPTIBLE works for both the
test application and the Chrome benchmark. The thread is still stuck in
zap_pid_ns_processes(), but it is now in S (sleep) state instead of D,
and no longer results in a hung task timeout. It remains in that state
until the parent process terminates. I am not entirely happy with it
since the processes are still stuck and may pile up over time, but at
least it solves the immediate problem for us.

Question now is what to do with that solution. We can of course apply
it locally to Chrome OS, but I would rather have it upstream - especially
since we have to assume that any users of Chrome on Linux, or more
generically anyone using ptrace in conjunction with CLONE_NEWPID, may
experience the same problem. Right now I have no idea how to get there,
though. Can you provide some guidance ?

Thanks,
Guenter

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Threads stuck in zap_pid_ns_processes()
  2017-05-12 16:52               ` Guenter Roeck
@ 2017-05-12 17:33                 ` Eric W. Biederman
       [not found]                   ` <874lwqyo8i.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2017-05-12 19:43                   ` Threads stuck in zap_pid_ns_processes() Guenter Roeck
  0 siblings, 2 replies; 28+ messages in thread
From: Eric W. Biederman @ 2017-05-12 17:33 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: Vovo Yang, Ingo Molnar, linux-kernel

Guenter Roeck <linux@roeck-us.net> writes:

> Hi Eric,
>
> On Fri, May 12, 2017 at 08:26:27AM -0500, Eric W. Biederman wrote:
>> Vovo Yang <vovoy@google.com> writes:
>> 
>> > On Fri, May 12, 2017 at 7:19 AM, Eric W. Biederman
>> > <ebiederm@xmission.com> wrote:
>> >> Guenter Roeck <linux@roeck-us.net> writes:
>> >>
>> >>> What I know so far is
>> >>> - We see this condition on a regular basis in the field. Regular is
>> >>>   relative, of course - let's say maybe 1 in a Million Chromebooks
>> >>>   per day reports a crash because of it. That is not that many,
>> >>>   but it adds up.
>> >>> - We are able to reproduce the problem with a performance benchmark
>> >>>   which opens 100 chrome tabs. While that is a lot, it should not
>> >>>   result in a kernel hang/crash.
>> >>> - Vovo provided the test code last night. I don't know if this is
>> >>>   exactly what is observed in the benchmark, or how it relates to the
>> >>>   benchmark in the first place, but it is the first time we are actually
>> >>>   able to reliably create a condition where the problem is seen.
>> >>
>> >> Thank you.  It will be interesting to hear what is happening in the
>> >> Chrome performance benchmark that triggers this.
>> >>
>> > What's happening in the benchmark:
>> > 1. A chrome renderer process was created with CLONE_NEWPID
>> > 2. The process crashed
>> > 3. The Chrome breakpad service calls ptrace(PTRACE_ATTACH, ..) to attach to every
>> >   thread of the crashed process to dump info
>> > 4. When breakpad detaches from the crashed process, the crashed process gets stuck in
>> >   zap_pid_ns_processes()
>> 
>> Very interesting thank you.
>> 
>> So the question is specifically which interaction is causing this.
>> 
>> In the test case provided it was a sibling task in the pid namespace
>> dying and not being reaped.  Which may be what is happening with
>> breakpad.  So far I have yet to see a kernel bug, but I won't rule one out.
>> 
>
> I am trying to understand what you are looking for. I would have thought
> that both the test application as well as the Chrome functionality
> described above show that there are situations where zap_pid_ns_processes()
> can get stuck and cause hung task timeouts in conjunction with the use of
> ptrace().
>
> Your last sentence seems to suggest that you believe that the kernel might
> do what it is expected to do. Assuming this is the case, what else would
> you like to see ? A test application which matches exactly the Chrome use
> case ? We can try to provide that, but I don't entirely understand how
> that would change the situation. After all, we already know that it is
> possible to get a thread into this condition, and we already have one
> means to reproduce it.
>
> Replacing TASK_UNINTERRUPTIBLE with TASK_INTERRUPTIBLE works for both the
> test application and the Chrome benchmark. The thread is still stuck in
> zap_pid_ns_processes(), but it is now in S (sleep) state instead of D,
> and no longer results in a hung task timeout. It remains in that state
> until the parent process terminates. I am not entirely happy with it
> since the processes are still stuck and may pile up over time, but at
> least it solves the immediate problem for us.
>
> Question now is what to do with that solution. We can of course apply
> it locally to Chrome OS, but I would rather have it upstream - especially
> since we have to assume that any users of Chrome on Linux, or more
> generically anyone using ptrace in conjunction with CLONE_NEWPID, may
> experience the same problem. Right now I have no idea how to get there,
> though. Can you provide some guidance ?

Apologies for not being clear.  I intend to send a pull request with
the TASK_UNINTERRUPTIBLE to TASK_INTERRUPTIBLE change to Linus in the
next week or so with a Cc stable and an appropriate Fixes tag.  So the
fix can be backported.

I have a more comprehensive change queued that I will probably merge
for 4.13, but it just changes what kind of zombies you see.  It won't
remove the ``stuck'' zombies.

So what I am looking for now is:
Why are things getting stuck in your benchmark?

-  Is it a userspace bug?

   In which case we can figure out what userspace (aka breakpad) needs
   to do to avoid the problem.
   
-  Is it a kernel bug with ptrace?

   There have been a lot of little subtle bugs with ptrace over the
   years, so one more would not surprise me.

So I am just looking to make certain we fix the root issue not just
the hung task timeout warning.

Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [REVIEW][PATCH] pid_ns: Sleep in TASK_INTERRUPTIBLE in zap_pid_ns_processes
  2017-05-12 17:33                 ` Eric W. Biederman
@ 2017-05-12 17:55                       ` Eric W. Biederman
  2017-05-12 19:43                   ` Threads stuck in zap_pid_ns_processes() Guenter Roeck
  1 sibling, 0 replies; 28+ messages in thread
From: Eric W. Biederman @ 2017-05-12 17:55 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Linux Containers,
	Ingo Molnar, Vovo Yang

Date: Thu, 11 May 2017 18:21:01 -0500

The code can potentially sleep for an indefinite amount of time in
zap_pid_ns_processes triggering the hung task timeout, and increasing
the system load average.  This is undesirable.  Sleep with a task state of
TASK_INTERRUPTIBLE instead of TASK_UNINTERRUPTIBLE to remove these
undesirable side effects.

Apparently under heavy load this has been allowing Chrome to trigger
the hung task timeout error and cause ChromeOS to reboot.

Reported-by: Vovo Yang <vovoy-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Reported-by: Guenter Roeck <linux-0h96xk9xTtrk1uMJSBkQmQ@public.gmane.org>
Fixes: 6347e9009104 ("pidns: guarantee that the pidns init will be the last pidns process reaped")
Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---

This is what I have queued up posting so it is public knowledge and so
that if anyone can see a flaw it can get fixed.

 kernel/pid_namespace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index de461aa0bf9a..6e51b8820495 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -277,7 +277,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
 	 * if reparented.
 	 */
 	for (;;) {
-		set_current_state(TASK_UNINTERRUPTIBLE);
+		set_current_state(TASK_INTERRUPTIBLE);
 		if (pid_ns->nr_hashed == init_pids)
 			break;
 		schedule();
-- 
2.10.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [REVIEW][PATCH] pid_ns: Sleep in TASK_INTERRUPTIBLE in zap_pid_ns_processes
@ 2017-05-12 17:55                       ` Eric W. Biederman
  0 siblings, 0 replies; 28+ messages in thread
From: Eric W. Biederman @ 2017-05-12 17:55 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: Vovo Yang, Ingo Molnar, linux-kernel, Linux Containers

Date: Thu, 11 May 2017 18:21:01 -0500

The code can potentially sleep for an indefinite amount of time in
zap_pid_ns_processes triggering the hung task timeout, and increasing
the system load average.  This is undesirable.  Sleep with a task state of
TASK_INTERRUPTIBLE instead of TASK_UNINTERRUPTIBLE to remove these
undesirable side effects.

Apparently under heavy load this has been allowing Chrome to trigger
the hung task timeout error and cause ChromeOS to reboot.

Reported-by: Vovo Yang <vovoy@google.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Fixes: 6347e9009104 ("pidns: guarantee that the pidns init will be the last pidns process reaped")
Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---

This is what I have queued up posting so it is public knowledge and so
that if anyone can see a flaw it can get fixed.

 kernel/pid_namespace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index de461aa0bf9a..6e51b8820495 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -277,7 +277,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
 	 * if reparented.
 	 */
 	for (;;) {
-		set_current_state(TASK_UNINTERRUPTIBLE);
+		set_current_state(TASK_INTERRUPTIBLE);
 		if (pid_ns->nr_hashed == init_pids)
 			break;
 		schedule();
-- 
2.10.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [REVIEW][PATCH] pid_ns: Sleep in TASK_INTERRUPTIBLE in zap_pid_ns_processes
  2017-05-12 17:55                       ` Eric W. Biederman
@ 2017-05-12 19:33                           ` Guenter Roeck
  -1 siblings, 0 replies; 28+ messages in thread
From: Guenter Roeck @ 2017-05-12 19:33 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Linux Containers,
	Ingo Molnar, Vovo Yang

On Fri, May 12, 2017 at 12:55:22PM -0500, Eric W. Biederman wrote:
> Date: Thu, 11 May 2017 18:21:01 -0500
> 
> The code can potentially sleep for an indefinite amount of time in
> zap_pid_ns_processes triggering the hung task timeout, and increasing
> the system load average.  This is undesirable.  Sleep with a task state of
> TASK_INTERRUPTIBLE instead of TASK_UNINTERRUPTIBLE to remove these
> undesirable side effects.
> 
> Apparently under heavy load this has been allowing Chrome to trigger
> the hung task timeout error and cause ChromeOS to reboot.
> 
> Reported-by: Vovo Yang <vovoy-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> Reported-by: Guenter Roeck <linux-0h96xk9xTtrk1uMJSBkQmQ@public.gmane.org>

Tested-by: Guenter Roeck <linux-0h96xk9xTtrk1uMJSBkQmQ@public.gmane.org>

> Fixes: 6347e9009104 ("pidns: guarantee that the pidns init will be the last pidns process reaped")
> Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> ---
> 
> This is what I have queued up posting so it is public knowledge and so
> that if anyone can see a flaw it can get fixed.
> 
>  kernel/pid_namespace.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index de461aa0bf9a..6e51b8820495 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -277,7 +277,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>  	 * if reparented.
>  	 */
>  	for (;;) {
> -		set_current_state(TASK_UNINTERRUPTIBLE);
> +		set_current_state(TASK_INTERRUPTIBLE);
>  		if (pid_ns->nr_hashed == init_pids)
>  			break;
>  		schedule();
> -- 
> 2.10.1
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [REVIEW][PATCH] pid_ns: Sleep in TASK_INTERRUPTIBLE in zap_pid_ns_processes
@ 2017-05-12 19:33                           ` Guenter Roeck
  0 siblings, 0 replies; 28+ messages in thread
From: Guenter Roeck @ 2017-05-12 19:33 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Vovo Yang, Ingo Molnar, linux-kernel, Linux Containers

On Fri, May 12, 2017 at 12:55:22PM -0500, Eric W. Biederman wrote:
> Date: Thu, 11 May 2017 18:21:01 -0500
> 
> The code can potentially sleep for an indefinite amount of time in
> zap_pid_ns_processes triggering the hung task timeout, and increasing
> the system load average.  This is undesirable.  Sleep with a task state of
> TASK_INTERRUPTIBLE instead of TASK_UNINTERRUPTIBLE to remove these
> undesirable side effects.
> 
> Apparently under heavy load this has been allowing Chrome to trigger
> the hung task timeout error and cause ChromeOS to reboot.
> 
> Reported-by: Vovo Yang <vovoy@google.com>
> Reported-by: Guenter Roeck <linux@roeck-us.net>

Tested-by: Guenter Roeck <linux@roeck-us.net>

> Fixes: 6347e9009104 ("pidns: guarantee that the pidns init will be the last pidns process reaped")
> Cc: stable@vger.kernel.org
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
> 
> This is what I have queued up posting so it is public knowledge and so
> that if anyone can see a flaw it can get fixed.
> 
>  kernel/pid_namespace.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index de461aa0bf9a..6e51b8820495 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -277,7 +277,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>  	 * if reparented.
>  	 */
>  	for (;;) {
> -		set_current_state(TASK_UNINTERRUPTIBLE);
> +		set_current_state(TASK_INTERRUPTIBLE);
>  		if (pid_ns->nr_hashed == init_pids)
>  			break;
>  		schedule();
> -- 
> 2.10.1
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Threads stuck in zap_pid_ns_processes()
  2017-05-12 17:33                 ` Eric W. Biederman
       [not found]                   ` <874lwqyo8i.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2017-05-12 19:43                   ` Guenter Roeck
  2017-05-12 20:03                     ` Eric W. Biederman
  1 sibling, 1 reply; 28+ messages in thread
From: Guenter Roeck @ 2017-05-12 19:43 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Vovo Yang, Ingo Molnar, linux-kernel

Hi Eric,

On Fri, May 12, 2017 at 12:33:01PM -0500, Eric W. Biederman wrote:
> Guenter Roeck <linux@roeck-us.net> writes:
> 
> > Hi Eric,
> >
> > On Fri, May 12, 2017 at 08:26:27AM -0500, Eric W. Biederman wrote:
> >> Vovo Yang <vovoy@google.com> writes:
> >> 
> >> > On Fri, May 12, 2017 at 7:19 AM, Eric W. Biederman
> >> > <ebiederm@xmission.com> wrote:
> >> >> Guenter Roeck <linux@roeck-us.net> writes:
> >> >>
> >> >>> What I know so far is
> >> >>> - We see this condition on a regular basis in the field. Regular is
> >> >>>   relative, of course - let's say maybe 1 in a Million Chromebooks
> >> >>>   per day reports a crash because of it. That is not that many,
> >> >>>   but it adds up.
> >> >>> - We are able to reproduce the problem with a performance benchmark
> >> >>>   which opens 100 chrome tabs. While that is a lot, it should not
> >> >>>   result in a kernel hang/crash.
> >> >>> - Vovo provided the test code last night. I don't know if this is
> >> >>>   exactly what is observed in the benchmark, or how it relates to the
> >> >>>   benchmark in the first place, but it is the first time we are actually
> >> >>>   able to reliably create a condition where the problem is seen.
> >> >>
> >> >> Thank you.  It will be interesting to hear what is happening in the
> >> >> Chrome performance benchmark that triggers this.
> >> >>
> >> > What's happening in the benchmark:
> >> > 1. A chrome renderer process was created with CLONE_NEWPID
> >> > 2. The process crashed
> >> > 3. The Chrome breakpad service calls ptrace(PTRACE_ATTACH, ..) to attach to every
> >> >   thread of the crashed process to dump info
> >> > 4. When breakpad detaches from the crashed process, the crashed process gets stuck in
> >> >   zap_pid_ns_processes()
> >> 
> >> Very interesting thank you.
> >> 
> >> So the question is specifically which interaction is causing this.
> >> 
> >> In the test case provided it was a sibling task in the pid namespace
> >> dying and not being reaped.  Which may be what is happening with
> >> breakpad.  So far I have yet to see a kernel bug, but I won't rule one out.
> >> 
> >
> > I am trying to understand what you are looking for. I would have thought
> > that both the test application as well as the Chrome functionality
> > described above show that there are situations where zap_pid_ns_processes()
> > can get stuck and cause hung task timeouts in conjunction with the use of
> > ptrace().
> >
> > Your last sentence seems to suggest that you believe that the kernel might
> > do what it is expected to do. Assuming this is the case, what else would
> > you like to see ? A test application which matches exactly the Chrome use
> > case ? We can try to provide that, but I don't entirely understand how
> > that would change the situation. After all, we already know that it is
> > possible to get a thread into this condition, and we already have one
> > means to reproduce it.
> >
> > Replacing TASK_UNINTERRUPTIBLE with TASK_INTERRUPTIBLE works for both the
> > test application and the Chrome benchmark. The thread is still stuck in
> > zap_pid_ns_processes(), but it is now in S (sleep) state instead of D,
> > and no longer results in a hung task timeout. It remains in that state
> > until the parent process terminates. I am not entirely happy with it
> > since the processes are still stuck and may pile up over time, but at
> > least it solves the immediate problem for us.
> >
> > Question now is what to do with that solution. We can of course apply
> > it locally to Chrome OS, but I would rather have it upstream - especially
> > since we have to assume that any users of Chrome on Linux, or more
> > generically anyone using ptrace in conjunction with CLONE_NEWPID, may
> > experience the same problem. Right now I have no idea how to get there,
> > though. Can you provide some guidance ?
> 
> Apologies for not being clear.  I intend to send a pull request with

No worries.

> the TASK_UNINTERRUPTIBLE to TASK_INTERRUPTIBLE change to Linus in the
> next week or so with a Cc stable and an appropriate Fixes tag.  So the
> fix can be backported.
> 
Great, that is good to know.

> I have a more comprehensive change queued that I will probably merge
> for 4.13, but it just changes what kind of zombies you see.  It won't
> remove the ``stuck'' zombies.
> 
> So what I am looking for now is:
> Why are things getting stuck in your benchmark?
> 
> -  Is it a userspace bug?
> 
>   In which case we can figure out what userspace (aka breakpad) needs
>    to do to avoid the problem.
>    
I spent some time trying to understand what breakpad is doing. I don't
really see how it can be blamed. It attaches itself to a crashing thread,
gets the information it is looking for, and detaches itself. At the
same time, whatever it does, it seems to me that userspace should not
be able to create such a situation in the first place.

> -  Is it a kernel bug with ptrace?
> 
>    There have been a lot of little subtle bugs with ptrace over the
>    years, so one more would not surprise me.
> 
Good question. I suspect that PTRACE_ATTACH (or PTRACE_TRACEME)
changes the reaper to be the calling process, and a subsequent
PTRACE_DETACH or exit of the process calling PTRACE_TRACEME doesn't
change it back or changes it to the process group owner. Is that
what happens, and if so is it what should happen ? I don't know (yet).

> So I am just looking to make certain we fix the root issue not just
> the hung task timeout warning.
> 
Makes sense. At least with TASK_INTERRUPTIBLE, I suspect we are just
fixing the symptom, but on the other hand that isn't really my area
of expertise, so I don't really know.

One additional data point: If I time the threads in the test code by
adding various calls to sleep(), it is still more or less random whether the
task status ends up in Z or D, meaning the child thread is sometimes
reaped and sometimes not. I would have expected it to be well defined.
I may be wrong, but that smells racy to me.

Anyway, I'll spend some more time on this. I'll let you know if I find
anything.

Thanks,
Guenter

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Threads stuck in zap_pid_ns_processes()
  2017-05-12 19:43                   ` Threads stuck in zap_pid_ns_processes() Guenter Roeck
@ 2017-05-12 20:03                     ` Eric W. Biederman
  2017-05-13 14:34                       ` Guenter Roeck
  0 siblings, 1 reply; 28+ messages in thread
From: Eric W. Biederman @ 2017-05-12 20:03 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: Vovo Yang, Ingo Molnar, linux-kernel

Guenter Roeck <linux@roeck-us.net> writes:

> Hi Eric,
>
> On Fri, May 12, 2017 at 12:33:01PM -0500, Eric W. Biederman wrote:
>> Guenter Roeck <linux@roeck-us.net> writes:
>> 
>> > Hi Eric,
>> >
>> > On Fri, May 12, 2017 at 08:26:27AM -0500, Eric W. Biederman wrote:
>> >> Vovo Yang <vovoy@google.com> writes:
>> >> 
>> >> > On Fri, May 12, 2017 at 7:19 AM, Eric W. Biederman
>> >> > <ebiederm@xmission.com> wrote:
>> >> >> Guenter Roeck <linux@roeck-us.net> writes:
>> >> >>
>> >> >>> What I know so far is
>> >> >>> - We see this condition on a regular basis in the field. Regular is
>> >> >>>   relative, of course - let's say maybe 1 in a million Chromebooks
>> >> >>>   per day reports a crash because of it. That is not that many,
>> >> >>>   but it adds up.
>> >> >>> - We are able to reproduce the problem with a performance benchmark
>> >> >>>   which opens 100 chrome tabs. While that is a lot, it should not
>> >> >>>   result in a kernel hang/crash.
>> >> >>> - Vovo provided the test code last night. I don't know if this is
>> >> >>>   exactly what is observed in the benchmark, or how it relates to the
>> >> >>>   benchmark in the first place, but it is the first time we are actually
>> >> >>>   able to reliably create a condition where the problem is seen.
>> >> >>
>> >> >> Thank you.  It will be interesting to hear what is happening in the
>> >> >> chrome performance benchmark that triggers this.
>> >> >>
>> >> > What's happening in the benchmark:
>> >> > 1. A chrome renderer process was created with CLONE_NEWPID
>> >> > 2. The process crashed
>> >> > 3. Chrome breakpad service calls ptrace(PTRACE_ATTACH, ..) to attach to every
>> >> >   thread of the crashed process to dump info
>> >> > 4. When breakpad detaches from the crashed process, the crashed process gets stuck in
>> >> >   zap_pid_ns_processes()
>> >> 
>> >> Very interesting thank you.
>> >> 
>> >> So the question is specifically which interaction is causing this.
>> >> 
>> >> In the test case provided it was a sibling task in the pid namespace
>> >> dying and not being reaped.  Which may be what is happening with
>> >> breakpad.  So far I have yet to see a kernel bug but I won't rule one out.
>> >> 
>> >
>> > I am trying to understand what you are looking for. I would have thought
>> > that both the test application as well as the Chrome functionality
>> > described above show that there are situations where zap_pid_ns_processes()
>> > can get stuck and cause hung task timeouts in conjunction with the use of
>> > ptrace().
>> >
>> > Your last sentence seems to suggest that you believe that the kernel might
>> > do what it is expected to do. Assuming this is the case, what else would
>> > you like to see ? A test application which matches exactly the Chrome use
>> > case ? We can try to provide that, but I don't entirely understand how
>> > that would change the situation. After all, we already know that it is
>> > possible to get a thread into this condition, and we already have one
>> > means to reproduce it.
>> >
>> > Replacing TASK_UNINTERRUPTIBLE with TASK_INTERRUPTIBLE works for both the
>> > test application and the Chrome benchmark. The thread is still stuck in
>> > zap_pid_ns_processes(), but it is now in S (sleep) state instead of D,
>> > and no longer results in a hung task timeout. It remains in that state
>> > until the parent process terminates. I am not entirely happy with it
>> > since the processes are still stuck and may pile up over time, but at
>> > least it solves the immediate problem for us.
>> >
>> > Question now is what to do with that solution. We can of course apply
>> > it locally to Chrome OS, but I would rather have it upstream - especially
>> > since we have to assume that any users of Chrome on Linux, or more
>> > generically anyone using ptrace in conjunction with CLONE_NEWPID, may
>> > experience the same problem. Right now I have no idea how to get there,
>> > though. Can you provide some guidance ?
>> 
>> Apologies for not being clear.  I intend to send a pull request with the
>
> No worries.
>
>> TASK_UNINTERRUPTIBLE to TASK_INTERRUPTIBLE change to Linus in the
>> next week or so with a Cc stable and an appropriate Fixes tag.  So the
>> fix can be backported.
>> 
> Great, that is good to know.
>
>> I have a more comprehensive change queued I will probably merge for 4.13
>> already but it just changes what kind of zombies you see.  It won't
>> remove the ``stuck'' zombies.
>> 
>> So what I am looking for now is:
>> Why are things getting stuck in your benchmark?
>> 
>> -  Is it a userspace bug?
>> 
>>   In which case we can figure out what userspace (aka breakpad) needs
>>    to do to avoid the problem.
>>    
> I spent some time trying to understand what breakpad is doing. I don't
> really see how it can be blamed. It attaches itself to a crashing thread,
> gets the information it is looking for, and detaches itself. At the
> same time, whatever it does, it seems to me that userspace should not
> be able to create such a situation in the first place.

I suspect what is happening is that the thread breakpad is attached to
is killed by zap_pid_ns_processes.  At which point PTRACE_DETACH won't
work because the thread has been killed and waitpid(thread_pid,...)
needs to be used instead.

Certainly that is what is happening in the reproducer.

So I suspect breakpad just needs a waitpid when PTRACE_DETACH fails,
or possibly just a general loop that listens for SIGCHLD and calls
some variant of wait.

>> -  Is it a kernel bug with ptrace?
>> 
>>    There have been a lot of little subtle bugs with ptrace over the
>>    years so one more would not surprise
>> 
> Good question. I suspect that PTRACE_ATTACH (or PTRACE_TRACEME)
> changes the reaper to be the calling process, and a subsequent
> PTRACE_DETACH or exit of the process calling PTRACE_TRACEME doesn't
> change it back or changes it to the process group owner. Is that
> what happens, and if so is it what should happen ? I don't know (yet).

Not the reaper, but ptrace does change the parent to be the tracer.
PTRACE_DETACH won't work if the process is not stopped waiting
for the tracer.  And being killed and being a zombie is the ultimate
``not waiting for the tracer'' state.

What needs to happen is for the tracer (aka breakpad) to call waitpid.

>> So I am just looking to make certain we fix the root issue not just
>> the hung task timeout warning.
>> 
> Makes sense. At least with TASK_INTERRUPTIBLE, I suspect we are just
> fixing the symptom, but on the other hand that isn't really my area
> of expertise, so I don't really know.
>
> One additional data point: If I time the threads in the test code by
> adding various calls to sleep(), it is still more or less random whether the
> task status ends up in Z or D, meaning the child thread is sometimes
> reaped and sometimes not. I would have expected it to be well defined.
> I may be wrong, but that smells racy to me.

If you have the threads exit with the exit system call, rather than
the exit_group path that in the kernel winds up in do_group_exit, it
won't be racy.  The easy version is to use pthreads from child1 to
create child2 and have them exit with pthread_exit.

Alternatively you can call syscall(SYS_exit, exit_code); directly, which
should also do the trick.

> Anyway, I'll spend some more time on this. I'll let you know if I find
> anything.

Thanks.

Eric


* Re: Threads stuck in zap_pid_ns_processes()
  2017-05-12 20:03                     ` Eric W. Biederman
@ 2017-05-13 14:34                       ` Guenter Roeck
  2017-05-13 18:21                         ` Eric W. Biederman
  2017-06-01 17:08                         ` Eric W. Biederman
  0 siblings, 2 replies; 28+ messages in thread
From: Guenter Roeck @ 2017-05-13 14:34 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Vovo Yang, Ingo Molnar, linux-kernel

On 05/12/2017 01:03 PM, Eric W. Biederman wrote:
> Guenter Roeck <linux@roeck-us.net> writes:
>
>> Hi Eric,
>>
>> On Fri, May 12, 2017 at 12:33:01PM -0500, Eric W. Biederman wrote:
>>> Guenter Roeck <linux@roeck-us.net> writes:
>>>
>>>> Hi Eric,
>>>>
>>>> On Fri, May 12, 2017 at 08:26:27AM -0500, Eric W. Biederman wrote:
>>>>> Vovo Yang <vovoy@google.com> writes:
>>>>>
>>>>>> On Fri, May 12, 2017 at 7:19 AM, Eric W. Biederman
>>>>>> <ebiederm@xmission.com> wrote:
>>>>>>> Guenter Roeck <linux@roeck-us.net> writes:
>>>>>>>
>>>>>>>> What I know so far is
>>>>>>>> - We see this condition on a regular basis in the field. Regular is
>>>>>>>>   relative, of course - let's say maybe 1 in a million Chromebooks
>>>>>>>>   per day reports a crash because of it. That is not that many,
>>>>>>>>   but it adds up.
>>>>>>>> - We are able to reproduce the problem with a performance benchmark
>>>>>>>>   which opens 100 chrome tabs. While that is a lot, it should not
>>>>>>>>   result in a kernel hang/crash.
>>>>>>>> - Vovo provided the test code last night. I don't know if this is
>>>>>>>>   exactly what is observed in the benchmark, or how it relates to the
>>>>>>>>   benchmark in the first place, but it is the first time we are actually
>>>>>>>>   able to reliably create a condition where the problem is seen.
>>>>>>>
>>>>>>> Thank you.  It will be interesting to hear what is happening in the
>>>>>>> chrome performance benchmark that triggers this.
>>>>>>>
>>>>>> What's happening in the benchmark:
>>>>>> 1. A chrome renderer process was created with CLONE_NEWPID
>>>>>> 2. The process crashed
>>>>>> 3. Chrome breakpad service calls ptrace(PTRACE_ATTACH, ..) to attach to every
>>>>>>   thread of the crashed process to dump info
>>>>>> 4. When breakpad detaches from the crashed process, the crashed process gets stuck in
>>>>>>   zap_pid_ns_processes()
>>>>>
>>>>> Very interesting thank you.
>>>>>
>>>>> So the question is specifically which interaction is causing this.
>>>>>
>>>>> In the test case provided it was a sibling task in the pid namespace
>>>>> dying and not being reaped.  Which may be what is happening with
>>>>> breakpad.  So far I have yet to see a kernel bug but I won't rule one out.
>>>>>
>>>>
>>>> I am trying to understand what you are looking for. I would have thought
>>>> that both the test application as well as the Chrome functionality
>>>> described above show that there are situations where zap_pid_ns_processes()
>>>> can get stuck and cause hung task timeouts in conjunction with the use of
>>>> ptrace().
>>>>
>>>> Your last sentence seems to suggest that you believe that the kernel might
>>>> do what it is expected to do. Assuming this is the case, what else would
>>>> you like to see ? A test application which matches exactly the Chrome use
>>>> case ? We can try to provide that, but I don't entirely understand how
>>>> that would change the situation. After all, we already know that it is
>>>> possible to get a thread into this condition, and we already have one
>>>> means to reproduce it.
>>>>
>>>> Replacing TASK_UNINTERRUPTIBLE with TASK_INTERRUPTIBLE works for both the
>>>> test application and the Chrome benchmark. The thread is still stuck in
>>>> zap_pid_ns_processes(), but it is now in S (sleep) state instead of D,
>>>> and no longer results in a hung task timeout. It remains in that state
>>>> until the parent process terminates. I am not entirely happy with it
>>>> since the processes are still stuck and may pile up over time, but at
>>>> least it solves the immediate problem for us.
>>>>
>>>> Question now is what to do with that solution. We can of course apply
>>>> it locally to Chrome OS, but I would rather have it upstream - especially
>>>> since we have to assume that any users of Chrome on Linux, or more
>>>> generically anyone using ptrace in conjunction with CLONE_NEWPID, may
>>>> experience the same problem. Right now I have no idea how to get there,
>>>> though. Can you provide some guidance ?
>>>
>>> Apologies for not being clear.  I intend to send a pull request with the
>>
>> No worries.
>>
>>> TASK_UNINTERRUPTIBLE to TASK_INTERRUPTIBLE change to Linus in the
>>> next week or so with a Cc stable and an appropriate Fixes tag.  So the
>>> fix can be backported.
>>>
>> Great, that is good to know.
>>
>>> I have a more comprehensive change queued I will probably merge for 4.13
>>> already but it just changes what kind of zombies you see.  It won't
>>> remove the ``stuck'' zombies.
>>>
>>> So what I am looking for now is:
>>> Why are things getting stuck in your benchmark?
>>>
>>> -  Is it a userspace bug?
>>>
>>>   In which case we can figure out what userspace (aka breakpad) needs
>>>    to do to avoid the problem.
>>>
>> I spent some time trying to understand what breakpad is doing. I don't
>> really see how it can be blamed. It attaches itself to a crashing thread,
>> gets the information it is looking for, and detaches itself. At the
>> same time, whatever it does, it seems to me that userspace should not
>> be able to create such a situation in the first place.
>
> I suspect what is happening is that the thread breakpad is attached to
> is killed by zap_pid_ns_processes.  At which point PTRACE_DETACH won't
> work because the thread has been killed and waitpid(thread_pid,...)
> needs to be used instead.
>

I think you nailed it. If I drop CLONE_NEWPID from the reproducer I get
a zombie process.

I guess the only question left is if zap_pid_ns_processes() should (or could)
somehow detect that situation and return instead of waiting forever.
What do you think ?

Thanks,
Guenter


* Re: Threads stuck in zap_pid_ns_processes()
  2017-05-13 14:34                       ` Guenter Roeck
@ 2017-05-13 18:21                         ` Eric W. Biederman
  2017-06-01 17:08                         ` Eric W. Biederman
  1 sibling, 0 replies; 28+ messages in thread
From: Eric W. Biederman @ 2017-05-13 18:21 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: Vovo Yang, Ingo Molnar, linux-kernel

Guenter Roeck <linux@roeck-us.net> writes:

> On 05/12/2017 01:03 PM, Eric W. Biederman wrote:
>> Guenter Roeck <linux@roeck-us.net> writes:
>>
>>> Hi Eric,
>>>
>>> On Fri, May 12, 2017 at 12:33:01PM -0500, Eric W. Biederman wrote:
>>>> Guenter Roeck <linux@roeck-us.net> writes:
>>>>
>>>>> Hi Eric,
>>>>>
>>>>> On Fri, May 12, 2017 at 08:26:27AM -0500, Eric W. Biederman wrote:
>>>>>> Vovo Yang <vovoy@google.com> writes:
>>>>>>
>>>>>>> On Fri, May 12, 2017 at 7:19 AM, Eric W. Biederman
>>>>>>> <ebiederm@xmission.com> wrote:
>>>>>>>> Guenter Roeck <linux@roeck-us.net> writes:
>>>>>>>>
>>>>>>>>> What I know so far is
>>>>>>>>> - We see this condition on a regular basis in the field. Regular is
>>>>>>>>>   relative, of course - let's say maybe 1 in a million Chromebooks
>>>>>>>>>   per day reports a crash because of it. That is not that many,
>>>>>>>>>   but it adds up.
>>>>>>>>> - We are able to reproduce the problem with a performance benchmark
>>>>>>>>>   which opens 100 chrome tabs. While that is a lot, it should not
>>>>>>>>>   result in a kernel hang/crash.
>>>>>>>>> - Vovo provided the test code last night. I don't know if this is
>>>>>>>>>   exactly what is observed in the benchmark, or how it relates to the
>>>>>>>>>   benchmark in the first place, but it is the first time we are actually
>>>>>>>>>   able to reliably create a condition where the problem is seen.
>>>>>>>>
>>>>>>>> Thank you.  It will be interesting to hear what is happening in the
>>>>>>>> chrome performance benchmark that triggers this.
>>>>>>>>
>>>>>>> What's happening in the benchmark:
>>>>>>> 1. A chrome renderer process was created with CLONE_NEWPID
>>>>>>> 2. The process crashed
>>>>>>> 3. Chrome breakpad service calls ptrace(PTRACE_ATTACH, ..) to attach to every
>>>>>>>   thread of the crashed process to dump info
>>>>>>> 4. When breakpad detaches from the crashed process, the crashed process gets stuck in
>>>>>>>   zap_pid_ns_processes()
>>>>>>
>>>>>> Very interesting thank you.
>>>>>>
>>>>>> So the question is specifically which interaction is causing this.
>>>>>>
>>>>>> In the test case provided it was a sibling task in the pid namespace
>>>>>> dying and not being reaped.  Which may be what is happening with
>>>>>> breakpad.  So far I have yet to see a kernel bug but I won't rule one out.
>>>>>>
>>>>>
>>>>> I am trying to understand what you are looking for. I would have thought
>>>>> that both the test application as well as the Chrome functionality
>>>>> described above show that there are situations where zap_pid_ns_processes()
>>>>> can get stuck and cause hung task timeouts in conjunction with the use of
>>>>> ptrace().
>>>>>
>>>>> Your last sentence seems to suggest that you believe that the kernel might
>>>>> do what it is expected to do. Assuming this is the case, what else would
>>>>> you like to see ? A test application which matches exactly the Chrome use
>>>>> case ? We can try to provide that, but I don't entirely understand how
>>>>> that would change the situation. After all, we already know that it is
>>>>> possible to get a thread into this condition, and we already have one
>>>>> means to reproduce it.
>>>>>
>>>>> Replacing TASK_UNINTERRUPTIBLE with TASK_INTERRUPTIBLE works for both the
>>>>> test application and the Chrome benchmark. The thread is still stuck in
>>>>> zap_pid_ns_processes(), but it is now in S (sleep) state instead of D,
>>>>> and no longer results in a hung task timeout. It remains in that state
>>>>> until the parent process terminates. I am not entirely happy with it
>>>>> since the processes are still stuck and may pile up over time, but at
>>>>> least it solves the immediate problem for us.
>>>>>
>>>>> Question now is what to do with that solution. We can of course apply
>>>>> it locally to Chrome OS, but I would rather have it upstream - especially
>>>>> since we have to assume that any users of Chrome on Linux, or more
>>>>> generically anyone using ptrace in conjunction with CLONE_NEWPID, may
>>>>> experience the same problem. Right now I have no idea how to get there,
>>>>> though. Can you provide some guidance ?
>>>>
>>>> Apologies for not being clear.  I intend to send a pull request with the
>>>
>>> No worries.
>>>
>>>> TASK_UNINTERRUPTIBLE to TASK_INTERRUPTIBLE change to Linus in the
>>>> next week or so with a Cc stable and an appropriate Fixes tag.  So the
>>>> fix can be backported.
>>>>
>>> Great, that is good to know.
>>>
>>>> I have a more comprehensive change queued I will probably merge for 4.13
>>>> already but it just changes what kind of zombies you see.  It won't
>>>> remove the ``stuck'' zombies.
>>>>
>>>> So what I am looking for now is:
>>>> Why are things getting stuck in your benchmark?
>>>>
>>>> -  Is it a userspace bug?
>>>>
>>>>   In which case we can figure out what userspace (aka breakpad) needs
>>>>    to do to avoid the problem.
>>>>
>>> I spent some time trying to understand what breakpad is doing. I don't
>>> really see how it can be blamed. It attaches itself to a crashing thread,
>>> gets the information it is looking for, and detaches itself. At the
>>> same time, whatever it does, it seems to me that userspace should not
>>> be able to create such a situation in the first place.
>>
>> I suspect what is happening is that the thread breakpad is attached to
>> is killed by zap_pid_ns_processes.  At which point PTRACE_DETACH won't
>> work because the thread has been killed and waitpid(thread_pid,...)
>> needs to be used instead.
>>
>
> I think you nailed it. If I drop CLONE_NEWPID from the reproducer I get
> a zombie process.
>
> I guess the only question left is if zap_pid_ns_processes() should (or could)
> somehow detect that situation and return instead of waiting forever.
> What do you think ?

In this case zap_pid_ns_processes is waiting for everything to be done
with the processes in the pid namespace.  As something outside the
namespace is tracing processes inside the namespace, things aren't yet
done with the processes in the pid namespace.

So in general I think the current behavior is correct.  If the tracer
(breakpad?) had set SA_NOCLDWAIT or set SIGCHLD to SIG_IGN, I would
consider the failure a kernel bug.

At this point it just appears that the tracer has a bug (ignoring
SIGCHLD and not reaping dead children) that should be fixed.

Eric


* Re: Threads stuck in zap_pid_ns_processes()
  2017-05-13 14:34                       ` Guenter Roeck
  2017-05-13 18:21                         ` Eric W. Biederman
@ 2017-06-01 17:08                         ` Eric W. Biederman
  2017-06-01 18:45                           ` Guenter Roeck
  1 sibling, 1 reply; 28+ messages in thread
From: Eric W. Biederman @ 2017-06-01 17:08 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: Vovo Yang, Ingo Molnar, linux-kernel

Guenter Roeck <linux@roeck-us.net> writes:
>
> I think you nailed it. If I drop CLONE_NEWPID from the reproducer I get
> a zombie process.
>
> I guess the only question left is if zap_pid_ns_processes() should (or could)
> somehow detect that situation and return instead of waiting forever.
> What do you think ?

Any chance you can point me at the chromium code that is performing the
ptrace?

I want to conduct a review of the kernel semantics to see if the current
semantics make it unnecessarily easy to get into hang situations.  If
the semantics make it really easy to get into a hang situation I want
to see if there is anything we can do to delicately change the semantics
to avoid the hangs without breaking existing userspace.

We have a real problem in exec which has similar semantics and as long
as I am looking at one I figure I should look at the other.

Eric


* Re: Threads stuck in zap_pid_ns_processes()
  2017-06-01 17:08                         ` Eric W. Biederman
@ 2017-06-01 18:45                           ` Guenter Roeck
  2017-06-01 19:36                             ` Eric W. Biederman
  0 siblings, 1 reply; 28+ messages in thread
From: Guenter Roeck @ 2017-06-01 18:45 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Vovo Yang, Ingo Molnar, linux-kernel

On Thu, Jun 01, 2017 at 12:08:58PM -0500, Eric W. Biederman wrote:
> Guenter Roeck <linux@roeck-us.net> writes:
> >
> > I think you nailed it. If I drop CLONE_NEWPID from the reproducer I get
> > a zombie process.
> >
> > I guess the only question left is if zap_pid_ns_processes() should (or could)
> > somehow detect that situation and return instead of waiting forever.
> > What do you think ?
> 
> Any chance you can point me at the chromium code that is performing the
> ptrace?
> 
> I want to conduct a review of the kernel semantics to see if the current
> semantics make it unnecessarily easy to get into hang situations.  If
> the semantics make it really easy to get into a hang situation I want
> to see if there is anything we can do to delicately change the semantics
> to avoid the hangs without breaking existing userspace.
> 
The internal bug should be accessible to you.

https://bugs.chromium.org/p/chromium/issues/detail?id=721298&desc=2

It has some additional information, and points to the following code in Chrome.

https://cs.chromium.org/chromium/src/breakpad/src/client/linux/minidump_writer/linux_ptrace_dumper.cc?rcl=47e51739fd00badbceba5bc26b8abc8bbd530989&l=85

With the information we have, I don't really have a good idea what we could or
should change in Chrome to make the problem disappear, so I just concluded that
we'll have to live with the forever-sleeping task.

Thanks,
Guenter


* Re: Threads stuck in zap_pid_ns_processes()
  2017-06-01 18:45                           ` Guenter Roeck
@ 2017-06-01 19:36                             ` Eric W. Biederman
  2017-06-01 21:43                               ` Guenter Roeck
  0 siblings, 1 reply; 28+ messages in thread
From: Eric W. Biederman @ 2017-06-01 19:36 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: Vovo Yang, Ingo Molnar, linux-kernel

Guenter Roeck <linux@roeck-us.net> writes:

> On Thu, Jun 01, 2017 at 12:08:58PM -0500, Eric W. Biederman wrote:
>> Guenter Roeck <linux@roeck-us.net> writes:
>> >
>> > I think you nailed it. If I drop CLONE_NEWPID from the reproducer I get
>> > a zombie process.
>> >
>> > I guess the only question left is if zap_pid_ns_processes() should (or could)
>> > somehow detect that situation and return instead of waiting forever.
>> > What do you think ?
>> 
>> Any chance you can point me at the chromium code that is performing the
>> ptrace?
>> 
>> I want to conduct a review of the kernel semantics to see if the current
>> semantics make it unnecessarily easy to get into hang situations.  If
>> the semantics make it really easy to get into a hang situation I want
>> to see if there is anything we can do to delicately change the semantics
>> to avoid the hangs without breaking existing userspace.
>> 
> The internal bug should be accessible to you.
>
> https://bugs.chromium.org/p/chromium/issues/detail?id=721298&desc=2
>
> It has some additional information, and points to the following code in Chrome.
>
> https://cs.chromium.org/chromium/src/breakpad/src/client/linux/minidump_writer/linux_ptrace_dumper.cc?rcl=47e51739fd00badbceba5bc26b8abc8bbd530989&l=85
>
> With the information we have, I don't really have a good idea what we could or
> should change in Chrome to make the problem disappear, so I just concluded that
> we'll have to live with the forever-sleeping task.

I believe I see what is happening.  The code makes the assumption that a
thread will stay stopped and will not go away once ptrace attach
completes.

Unfortunately if someone sends SIGKILL to the process or exec sends
SIGKILL to the individual thread then PTRACE_DETACH will fail.

At which point you can use waitpid to reap the zombie and detach
from the thread.

So I think the forever-sleeping can be fixed with something as simple
as changing ResumeThread to say:

// Resumes a thread by detaching from it.
static bool ResumeThread(pid_t pid) {
  if (sys_ptrace(PTRACE_DETACH, pid, NULL, NULL) >= 0)
    return true;
  /* Someone killed the thread? */
  return waitpid(pid, NULL, 0) == pid;
}

It almost certainly makes sense to fix PTRACE_DETACH in the kernel to
allow this case to work.  And odds are good that we could make that
change without breaking anyone.  So it is worth a try.

Eric


* Re: Threads stuck in zap_pid_ns_processes()
  2017-06-01 19:36                             ` Eric W. Biederman
@ 2017-06-01 21:43                               ` Guenter Roeck
  2017-06-02  1:06                                 ` Eric W. Biederman
  0 siblings, 1 reply; 28+ messages in thread
From: Guenter Roeck @ 2017-06-01 21:43 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Vovo Yang, Ingo Molnar, linux-kernel

On Thu, Jun 01, 2017 at 02:36:38PM -0500, Eric W. Biederman wrote:
> Guenter Roeck <linux@roeck-us.net> writes:
> 
> > On Thu, Jun 01, 2017 at 12:08:58PM -0500, Eric W. Biederman wrote:
> >> Guenter Roeck <linux@roeck-us.net> writes:
> >> >
> >> > I think you nailed it. If I drop CLONE_NEWPID from the reproducer I get
> >> > a zombie process.
> >> >
> >> > I guess the only question left is if zap_pid_ns_processes() should (or could)
> >> > somehow detect that situation and return instead of waiting forever.
> >> > What do you think ?
> >> 
> >> Any chance you can point me at the chromium code that is performing the
> >> ptrace?
> >> 
> >> I want to conduct a review of the kernel semantics to see if the current
> >> semantics make it unnecessarily easy to get into hang situations.  If
> >> the semantics make it really easy to get into a hang situation I want
> >> to see if there is anything we can do to delicately change the semantics
> >> to avoid the hangs without breaking existing userspace.
> >> 
> > The internal bug should be accessible to you.
> >
> > https://bugs.chromium.org/p/chromium/issues/detail?id=721298&desc=2
> >
> > It has some additional information, and points to the following code in Chrome.
> >
> > https://cs.chromium.org/chromium/src/breakpad/src/client/linux/minidump_writer/linux_ptrace_dumper.cc?rcl=47e51739fd00badbceba5bc26b8abc8bbd530989&l=85
> >
> > With the information we have, I don't really have a good idea what we could or
> > should change in Chrome to make the problem disappear, so I just concluded that
> > we'll have to live with the forever-sleeping task.
> 
> I believe I see what is happening.  The code makes the assumption that a
> thread will stay stopped and will not go away once ptrace attach
> completes.
> 
> Unfortunately if someone sends SIGKILL to the process or exec sends
> SIGKILL to the individual thread then PTRACE_DETACH will fail.
> 
> At which point you can use waitpid to reap the zombie and detach
> from the thread.
> 
> So I think the forever-sleeping can be fixed with something as simple
> as changing ResumeThread to say:
> 
> // Resumes a thread by detaching from it.
> static bool ResumeThread(pid_t pid) {
>   if (sys_ptrace(PTRACE_DETACH, pid, NULL, NULL) >= 0)
>   	return true;
>   /* Someone killed the thread? */
>   return waitpid(pid, NULL, 0) == pid;
> }
> 
> It almost certainly makes sense to fix PTRACE_DETACH in the kernel to
> allow this case to work.  And odds are good that we could make that
> change without breaking anyone.  So it is worth a try.
> 

Do I interpret this correctly as "the above code should work, but currently
doesn't" ?

Thanks,
Guenter


* Re: Threads stuck in zap_pid_ns_processes()
  2017-06-01 21:43                               ` Guenter Roeck
@ 2017-06-02  1:06                                 ` Eric W. Biederman
  0 siblings, 0 replies; 28+ messages in thread
From: Eric W. Biederman @ 2017-06-02  1:06 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: Vovo Yang, Ingo Molnar, linux-kernel

Guenter Roeck <linux@roeck-us.net> writes:

> On Thu, Jun 01, 2017 at 02:36:38PM -0500, Eric W. Biederman wrote:
>> Guenter Roeck <linux@roeck-us.net> writes:
>> 
>> > On Thu, Jun 01, 2017 at 12:08:58PM -0500, Eric W. Biederman wrote:
>> >> Guenter Roeck <linux@roeck-us.net> writes:
>> >> >
>> >> > I think you nailed it. If I drop CLONE_NEWPID from the reproducer I get
>> >> > a zombie process.
>> >> >
>> >> > I guess the only question left is if zap_pid_ns_processes() should (or could)
>> >> > somehow detect that situation and return instead of waiting forever.
>> >> > What do you think ?
>> >> 
>> >> Any chance you can point me at the chromium code that is performing the
>> >> ptrace?
>> >> 
>> >> I want to conduct a review of the kernel semantics to see if the current
>> >> semantics make it unnecessarily easy to get into hang situations.  If
>> >> the semantics make it really easy to get into a hang situation I want
>> >> to see if there is anything we can do to delicately change the semantics
>> >> to avoid the hangs without breaking existing userspace.
>> >> 
>> > The internal bug should be accessible to you.
>> >
>> > https://bugs.chromium.org/p/chromium/issues/detail?id=721298&desc=2
>> >
>> > It has some additional information, and points to the following code in Chrome.
>> >
>> > https://cs.chromium.org/chromium/src/breakpad/src/client/linux/minidump_writer/linux_ptrace_dumper.cc?rcl=47e51739fd00badbceba5bc26b8abc8bbd530989&l=85
>> >
>> > With the information we have, I don't really have a good idea what we could or
>> > should change in Chrome to make the problem disappear, so I just concluded that
>> > we'll have to live with the forever-sleeping task.
>> 
>> I believe I see what is happening.  The code makes the assumption that a
>> thread will stay stopped and will not go away once ptrace attach
>> completes.
>> 
>> Unfortunately if someone sends SIGKILL to the process or exec sends
>> SIGKILL to the individual thread then PTRACE_DETACH will fail.
>> 
>> At which point you can use waitpid to reap the zombie and detach
>> from the thread.
>> 
>> So I think the forever-sleeping can be fixed with something as simple
>> as changing ResumeThread to say:
>> 
>> // Resumes a thread by detaching from it.
>> static bool ResumeThread(pid_t pid) {
>>   if (sys_ptrace(PTRACE_DETACH, pid, NULL, NULL) >= 0)
>>   	return true;
>>   /* Someone killed the thread? */
>>   return waitpid(pid, NULL, 0) == pid;
>> }
>> 
>> It almost certainly makes sense to fix PTRACE_DETACH in the kernel to
>> allow this case to work.  And odds are good that we could make that
>> change without breaking anyone.  So it is worth a try.
>> 
>
> Do I interpret this correctly as "the above code should work, but currently
> doesn't"?

I added the early exit and the fallback waitpid clause.  So I am saying
with a trivial modification the code can be made to work.

Eric



Thread overview: 28+ messages
2017-05-11 17:11 Threads stuck in zap_pid_ns_processes() Guenter Roeck
2017-05-11 17:31 ` Eric W. Biederman
2017-05-11 18:35   ` Guenter Roeck
2017-05-11 20:23     ` Eric W. Biederman
2017-05-11 20:48       ` Guenter Roeck
2017-05-11 21:39         ` Eric W. Biederman
2017-05-11 20:21   ` Guenter Roeck
2017-05-11 21:25     ` Eric W. Biederman
2017-05-11 22:47       ` Guenter Roeck
2017-05-11 23:19         ` Eric W. Biederman
2017-05-12  9:30           ` Vovo Yang
2017-05-12 13:26             ` Eric W. Biederman
2017-05-12 16:52               ` Guenter Roeck
2017-05-12 17:33                 ` Eric W. Biederman
     [not found]                   ` <874lwqyo8i.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2017-05-12 17:55                     ` [REVIEW][PATCH] pid_ns: Sleep in TASK_INTERRUPTIBLE in zap_pid_ns_processes Eric W. Biederman
2017-05-12 17:55                       ` Eric W. Biederman
     [not found]                       ` <87d1bex8mt.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2017-05-12 19:33                         ` Guenter Roeck
2017-05-12 19:33                           ` Guenter Roeck
2017-05-12 19:43                   ` Threads stuck in zap_pid_ns_processes() Guenter Roeck
2017-05-12 20:03                     ` Eric W. Biederman
2017-05-13 14:34                       ` Guenter Roeck
2017-05-13 18:21                         ` Eric W. Biederman
2017-06-01 17:08                         ` Eric W. Biederman
2017-06-01 18:45                           ` Guenter Roeck
2017-06-01 19:36                             ` Eric W. Biederman
2017-06-01 21:43                               ` Guenter Roeck
2017-06-02  1:06                                 ` Eric W. Biederman
2017-05-12  3:42         ` Eric W. Biederman
