linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* strace lockup when tracing exec in go
@ 2016-09-21 15:29 Michal Hocko
  2016-09-22  4:15 ` Mike Galbraith
  2016-09-23  9:50 ` Oleg Nesterov
  0 siblings, 2 replies; 19+ messages in thread
From: Michal Hocko @ 2016-09-21 15:29 UTC (permalink / raw)
  To: LKML; +Cc: strace-devel, Oleg Nesterov, Aleksa Sarai

[I am CCing strace mailing list because even if this turns out to be a
 kernel bug strace seems to be doing something unexpected - more on that
 below]

Hi,
Aleksa has reported the following lockup when stracing the following go
program

% cat exec.go
package main

import (
    "os"
    "syscall"
)

func main() {
    syscall.Exec("/bin/echo", []string{"/bin/echo", "Hello"}, os.Environ())
}
$ go version
go version go1.6.3 linux/amd64
$ go build -o exec exec.go

$ strace -f ./exec
[snip]
[pid 10349] select(0, NULL, NULL, NULL, {0, 100} <unfinished ...>
[pid 10346] <... select resumed> )      = 0 (Timeout)
[pid 10346] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 10345] execve("/bin/echo", ["/bin/echo", "Hello"], [/* 95 vars */] <unfinished ...>
[pid 10346] <... select resumed> )      = 0 (Timeout)
[pid 10349] <... select resumed> )      = 0 (Timeout)

execve will never finish unless the strace process get killed with
SIGKILL. The following was debugged on 3.12 kernel and the current
kernel seems to not trigger the issue as easily but I believe the same
problem is there as well.

The further investigation shown that the tracer (strace) is stuck
waiting for cred_guard_mutex
[<0000000000000000>] mm_access+0x22/0xa0
[<0000000000000000>] process_vm_rw_core.isra.1+0x112/0x6c0
[<0000000000000000>] process_vm_rw+0xab/0x110
[<0000000000000000>] SyS_process_vm_readv+0x15/0x20
[<0000000000000000>] system_call_fastpath+0x16/0x1b
[<0000000000000000>] 0x7f186f031c3a
[<0000000000000000>] 0xffffffffffffffff

while the traced process (3 threads) are trying to perform the exec.
That means that 2 threads are dead (in fact zombies) waiting in their
final schedule.
Call Trace:
 [<ffffffff81057c17>] do_exit+0x6f7/0xa70
 [<ffffffff81058009>] do_group_exit+0x39/0xa0
 [<ffffffff810670d8>] get_signal_to_deliver+0x1e8/0x5c0
 [<ffffffff81002342>] do_signal+0x42/0x670
 [<ffffffff810029e8>] do_notify_resume+0x78/0xc0
 [<ffffffff8151f4c2>] int_signal+0x12/0x17
 [<00007f3a33f3ffb9>] 0x7f3a33f3ffb8

and one is
Call Trace:
 [<ffffffff811a80bf>] flush_old_exec+0xdf/0x890
 [<ffffffff811f4197>] load_elf_binary+0x307/0xda0
 [<ffffffff811a7b8e>] search_binary_handler+0xae/0x1f0
 [<ffffffff811a917e>] do_execve_common.isra.26+0x64e/0x810
 [<ffffffff811a95c1>] SyS_execve+0x31/0x50
 [<ffffffff8151f7b9>] stub_execve+0x69/0xa0
 [<00007f3a33f16527>] 0x7f3a33f16526

waiting for notify_count to drop down to 0
        while (sig->notify_count) {
                __set_current_state(TASK_KILLABLE);
                spin_unlock_irq(lock);
                schedule(); <<<<<<<<<
                if (unlikely(__fatal_signal_pending(tsk)))
                        goto killed;
                spin_lock_irq(lock);
        }

this however doesn't happen because both threads which are dead
are zombies waiting to be reaped by the parent and to call their
release_task->__exit_signal. The tracer is blocked on the lock held by
exec (in prepare_bprm_creds). This is the case in the current kernel as
well AFAICS so the same should be possible as well. So is this a bug or
something somewhere else makes sure that this will not happen in newer
kernels?

Now, I've tried to fix this by
diff --git a/kernel/fork.c b/kernel/fork.c
index 5a57b9bab85c..d5b7c3aea187 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -837,7 +837,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
 	struct mm_struct *mm;
 	int err;
 
-	err =  mutex_lock_killable(&task->signal->cred_guard_mutex);
+	err =  mutex_lock_interruptible(&task->signal->cred_guard_mutex);
 	if (err)
 		return ERR_PTR(err);
 
but it didn't help and I really couldn't see why until I sprinkled
printks all over the exit path and to learn that the parent (strace)
doesn't get notified about SIGCHLD when those threads are dying. Why is that?
exit_notify
  do_notify_parent -> return false
    __group_send_sig_info
      send_signal
        __send_signal
          prepare_signal -> fails because the signal is ignored

this is more than unexpected (at least to me) because I would expect
that the strace wouldn't ignore SIGCHLD! Is this a bug in strace?

Btw. in a successful (not hung) case the tracer is in do_wait
when do_notify_parent->__wake_up_parent does the right thing.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: strace lockup when tracing exec in go
  2016-09-21 15:29 strace lockup when tracing exec in go Michal Hocko
@ 2016-09-22  4:15 ` Mike Galbraith
  2016-09-22  8:01   ` Michal Hocko
  2016-09-23  9:50 ` Oleg Nesterov
  1 sibling, 1 reply; 19+ messages in thread
From: Mike Galbraith @ 2016-09-22  4:15 UTC (permalink / raw)
  To: Michal Hocko, LKML; +Cc: strace-devel, Oleg Nesterov, Aleksa Sarai

On Wed, 2016-09-21 at 17:29 +0200, Michal Hocko wrote:
> [I am CCing strace mailing list because even if this turns out to be a
>  kernel bug strace seems to be doing something unexpected - more on that
>  below]
> 
> Hi,
> Aleksa has reported the following lockup when stracing the following go
> program
> 
> % cat exec.go
> package main
> 
> import (
>     "os"
>     "syscall"
> )
> 
> func main() {
>     syscall.Exec("/bin/echo", []string{"/bin/echo", "Hello"}, os.Environ())
> }
> $ go version
> go version go1.6.3 linux/amd64
> $ go build -o exec exec.go
> 
> $ strace -f ./exec
> [snip]
> [pid 10349] select(0, NULL, NULL, NULL, {0, 100} 
> [pid 10346] <... select resumed> )      = 0 (Timeout)
> [pid 10346] select(0, NULL, NULL, NULL, {0, 20} 
> [pid 10345] execve("/bin/echo", ["/bin/echo", "Hello"], [/* 95 vars */] 
> [pid 10346] <... select resumed> )      = 0 (Timeout)
> [pid 10349] <... select resumed> )      = 0 (Timeout)
> 
> execve will never finish unless the strace process get killed with
> SIGKILL. The following was debugged on 3.12 kernel and the current
> kernel seems to not trigger the issue as easily but I believe the same
> problem is there as well.
> 
> The further investigation shown that the tracer (strace) is stuck
> waiting for cred_guard_mutex
> [<0000000000000000>] mm_access+0x22/0xa0
> [<0000000000000000>] process_vm_rw_core.isra.1+0x112/0x6c0
> [<0000000000000000>] process_vm_rw+0xab/0x110
> [<0000000000000000>] SyS_process_vm_readv+0x15/0x20
> [<0000000000000000>] system_call_fastpath+0x16/0x1b
> [<0000000000000000>] 0x7f186f031c3a
> [<0000000000000000>] 0xffffffffffffffff
> 
> while the traced process (3 threads) are trying to perform the exec.
> That means that 2 threads are dead (in fact zombies) waiting in their
> final schedule.
> Call Trace:
>  [] do_exit+0x6f7/0xa70
>  [] do_group_exit+0x39/0xa0
>  [] get_signal_to_deliver+0x1e8/0x5c0
>  [] do_signal+0x42/0x670
>  [] do_notify_resume+0x78/0xc0
>  [] int_signal+0x12/0x17
>  [<00007f3a33f3ffb9>] 0x7f3a33f3ffb8
> 
> and one is
> Call Trace:
>  [] flush_old_exec+0xdf/0x890
>  [] load_elf_binary+0x307/0xda0
>  [] search_binary_handler+0xae/0x1f0
>  [] do_execve_common.isra.26+0x64e/0x810
>  [] SyS_execve+0x31/0x50
>  [] stub_execve+0x69/0xa0
>  [<00007f3a33f16527>] 0x7f3a33f16526
> 
> waiting for notify_count to drop down to 0
>         while (sig->notify_count) {
>                 __set_current_state(TASK_KILLABLE);
>                 spin_unlock_irq(lock);
>                 schedule(); <<<<<<<<<
>                 if (unlikely(__fatal_signal_pending(tsk)))
>                         goto killed;
>                 spin_lock_irq(lock);
>         }
> 
> this however doesn't happen because both threads which are dead
> are zombies waiting to be reaped by the parent and to call their
> release_task->__exit_signal. The tracer is blocked on the lock held by
> exec (in prepare_bprm_creds). This is the case in the current kernel as
> well AFAICS so the same should be possible as well. So is this a bug or
> something somewhere else makes sure that this will not happen in newer
> kernels?

master.today...

crash> bt 5138
PID: 5138   TASK: ffff8803fb4e4c80  CPU: 6   COMMAND: "strace"
 #0 [ffff88038e323b68] __schedule at ffffffff81624660
 #1 [ffff88038e323bb0] schedule at ffffffff81624b35
 #2 [ffff88038e323bc8] schedule_preempt_disabled at ffffffff81624e0e
 #3 [ffff88038e323bd8] __mutex_lock_killable_slowpath at ffffffff81625f9d
 #4 [ffff88038e323c30] mutex_lock_killable at ffffffff8162605a
 #5 [ffff88038e323c48] mm_access at ffffffff8105bbf7
 #6 [ffff88038e323c70] process_vm_rw_core at ffffffff811823d3
 #7 [ffff88038e323dc0] process_vm_rw at ffffffff8118287d
 #8 [ffff88038e323f38] sys_process_vm_readv at ffffffff811829b9
 #9 [ffff88038e323f50] entry_SYSCALL_64_fastpath at ffffffff81628736
    RIP: 00007faee6b2ac3a  RSP: 00007ffdc95b62b8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 00000000019e7030  RCX: 00007faee6b2ac3a
    RDX: 0000000000000001  RSI: 00007ffdc95b62d0  RDI: 0000000000001418
    RBP: 0000000000001414   R8: 0000000000000001   R9: 0000000000000000
    R10: 00007ffdc95b62e0  R11: 0000000000000246  R12: 00007faee73e66c0
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: 0000000000000136  CS: 0033  SS: 002b
crash> bt 5140
PID: 5140   TASK: ffff8803e703b300  CPU: 4   COMMAND: "exec"
 #0 [ffff88038e0b3ce8] __schedule at ffffffff81624660
 #1 [ffff88038e0b3d30] schedule at ffffffff81624b35
 #2 [ffff88038e0b3d48] flush_old_exec at ffffffff811b1f26
 #3 [ffff88038e0b3d88] load_elf_binary at ffffffff811fd70f
 #4 [ffff88038e0b3e60] search_binary_handler at ffffffff811b13ac
 #5 [ffff88038e0b3e98] do_execveat_common at ffffffff811b2f8f
 #6 [ffff88038e0b3f00] sys_execve at ffffffff811b33ba
 #7 [ffff88038e0b3f28] do_syscall_64 at ffffffff8100194e
    RIP: 000000000045c1ef  RSP: 000000c820043e00  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 000000c82000e6c0  RCX: 000000000045c1ef
    RDX: 000000c820076380  RSI: 000000c82000e6c0  RDI: 000000c82000a400
    RBP: 000000c820076380   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000246  R12: 000000000000006b
    R13: 00000000004c4b40  R14: 0000000000003fff  R15: 0000000000000039
    ORIG_RAX: 000000000000003b  CS: 0033  SS: 002b
crash> bt 5141
PID: 5141   TASK: ffff8803b0221980  CPU: 0   COMMAND: "exec"
 #0 [ffff88038e33fc60] __schedule at ffffffff81624660
 #1 [ffff88038e33fca8] schedule at ffffffff81624b35
 #2 [ffff88038e33fcc0] do_exit at ffffffff81061292
 #3 [ffff88038e33fd28] do_group_exit at ffffffff8106247f
 #4 [ffff88038e33fd58] get_signal at ffffffff8106c7fc
 #5 [ffff88038e33fdf0] do_signal at ffffffff81018468
 #6 [ffff88038e33fef8] exit_to_usermode_loop at ffffffff81059820
 #7 [ffff88038e33ff28] do_syscall_64 at ffffffff81001a35
    RIP: 00000000004538d3  RSP: 000000c820041f38  RFLAGS: 00000246
    RAX: 0000000000000000  RBX: 000000c820041f70  RCX: 00000000004538d3
    RDX: 0000000000000000  RSI: 000000c820041f70  RDI: 0000000000000000
    RBP: ffffffffffffffff   R8: 000000c820032000   R9: 000000c820000480
    R10: 0000000000000000  R11: 0000000000000246  R12: 000000000042aa30
    R13: 00000000004c4bc4  R14: 0000000000000000  R15: 0000000000000008
    ORIG_RAX: 0000000000000083  CS: 0033  SS: 002b
crash>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: strace lockup when tracing exec in go
  2016-09-22  4:15 ` Mike Galbraith
@ 2016-09-22  8:01   ` Michal Hocko
  2016-09-22  8:20     ` Aleksa Sarai
  2016-09-22  8:36     ` Michal Hocko
  0 siblings, 2 replies; 19+ messages in thread
From: Michal Hocko @ 2016-09-22  8:01 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: LKML, strace-devel, Oleg Nesterov, Aleksa Sarai

On Thu 22-09-16 06:15:02, Mike Galbraith wrote:
[...]
> master.today...

Thanks for trying to reproduce this. My tiny laptop (2 cores, 2 threads
per core) cannot reproduce even in 10 minutes or so. I've tried to use
the same machine I was testing with 3.12 kernel (2 sockets, 8 cores per
soc. and 2 threas per core) and it hit almost instantly. I have tried 
mutex_lock_killable -> interruptible and it didn't help as I've
expected. So the current kernel doesn't do any magic to prevent from the
issue as well.

So I've stared into do_notify_parent some more and the following was
just very confusing

	if (!tsk->ptrace && sig == SIGCHLD &&
	    (psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN ||
	     (psig->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDWAIT))) {
		/*
		 * We are exiting and our parent doesn't care.  POSIX.1
		 * defines special semantics for setting SIGCHLD to SIG_IGN
		 * or setting the SA_NOCLDWAIT flag: we should be reaped
		 * automatically and not left for our parent's wait4 call.
		 * Rather than having the parent do it as a magic kind of
		 * signal handler, we just set this to tell do_exit that we
		 * can be cleaned up without becoming a zombie.  Note that
		 * we still call __wake_up_parent in this case, because a
		 * blocked sys_wait4 might now return -ECHILD.
		 *
		 * Whether we send SIGCHLD or not for SA_NOCLDWAIT
		 * is implementation-defined: we do (if you don't want
		 * it, just use SIG_IGN instead).
		 */
		autoreap = true;
		if (psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN)
			sig = 0;
	}

it tries to prevent from what I am seeing in a way. If the SIGCHLD is
ignored then it just does autoreap and everything is fine. But this
doesn't seem to be the case here. In fact we are not sending the signal
because sig_task_ignored is true resp. sig_handler_ignored which can
fail even for handler == SIG_DFL && sig_kernel_ignore() and SIGCHLD
seems to be in SIG_KERNEL_IGNORE_MASK. So I've tried

diff --git a/kernel/fork.c b/kernel/fork.c
index 5a57b9bab85c..d5b7c3aea187 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -837,7 +837,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
 	struct mm_struct *mm;
 	int err;
 
-	err =  mutex_lock_killable(&task->signal->cred_guard_mutex);
+	err =  mutex_lock_interruptible(&task->signal->cred_guard_mutex);
 	if (err)
 		return ERR_PTR(err);
 
diff --git a/kernel/signal.c b/kernel/signal.c
index 96e9bc40667f..1840c7f4e3c2 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1630,7 +1630,7 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
 	psig = tsk->parent->sighand;
 	spin_lock_irqsave(&psig->siglock, flags);
 	if (!tsk->ptrace && sig == SIGCHLD &&
-	    (psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN ||
+	    (sig_handler_ignored(&psig->action[SIGCHLD-1].sa.sa_handler, SIGCHLD) ||
 	     (psig->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDWAIT))) {
 		/*
 		 * We are exiting and our parent doesn't care.  POSIX.1

And it worked out. Now I do not have any idea why the check is explicit
for SIG_IGN rather than sig_handler_ignored, maybe there is a strong
reason for that. Oleg? I will cook up a full patch if this looks ok.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: strace lockup when tracing exec in go
  2016-09-22  8:01   ` Michal Hocko
@ 2016-09-22  8:20     ` Aleksa Sarai
  2016-09-22  8:36     ` Michal Hocko
  1 sibling, 0 replies; 19+ messages in thread
From: Aleksa Sarai @ 2016-09-22  8:20 UTC (permalink / raw)
  To: Michal Hocko, Mike Galbraith; +Cc: LKML, strace-devel, Oleg Nesterov

> So I've stared into do_notify_parent some more and the following was
> just very confusing
>
> 	if (!tsk->ptrace && sig == SIGCHLD &&
> 	    (psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN ||
> 	     (psig->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDWAIT))) {
> 		/*
> 		 * We are exiting and our parent doesn't care.  POSIX.1
> 		 * defines special semantics for setting SIGCHLD to SIG_IGN
> 		 * or setting the SA_NOCLDWAIT flag: we should be reaped
> 		 * automatically and not left for our parent's wait4 call.
> 		 * Rather than having the parent do it as a magic kind of
> 		 * signal handler, we just set this to tell do_exit that we
> 		 * can be cleaned up without becoming a zombie.  Note that
> 		 * we still call __wake_up_parent in this case, because a
> 		 * blocked sys_wait4 might now return -ECHILD.
> 		 *
> 		 * Whether we send SIGCHLD or not for SA_NOCLDWAIT
> 		 * is implementation-defined: we do (if you don't want
> 		 * it, just use SIG_IGN instead).
> 		 */
> 		autoreap = true;
> 		if (psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN)
> 			sig = 0;
> 	}
>
> it tries to prevent from what I am seeing in a way. If the SIGCHLD is
> ignored then it just does autoreap and everything is fine. But this
> doesn't seem to be the case here. In fact we are not sending the signal
> because sig_task_ignored is true resp. sig_handler_ignored which can
> fail even for handler == SIG_DFL && sig_kernel_ignore() and SIGCHLD
> seems to be in SIG_KERNEL_IGNORE_MASK. So I've tried

I was looking at the same code this morning. I thought maybe we should 
drop the !tsk->ptrace condition (or make it so that the condition still 
succeeds if the tracer also happens to be tsk->real_parent) -- since 
this is only happening when the process is being traced? I tried this 
and the issue still persists, but I didn't apply your other proposed 
change to this conditional.

Or am I misunderstanding what tsk->ptrace refers to?

-- 
Aleksa Sarai
Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: strace lockup when tracing exec in go
  2016-09-22  8:01   ` Michal Hocko
  2016-09-22  8:20     ` Aleksa Sarai
@ 2016-09-22  8:36     ` Michal Hocko
  2016-09-22  9:40       ` Mike Galbraith
  1 sibling, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2016-09-22  8:36 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: LKML, strace-devel, Oleg Nesterov, Aleksa Sarai

On Thu 22-09-16 10:01:26, Michal Hocko wrote:
> On Thu 22-09-16 06:15:02, Mike Galbraith wrote:
> [...]
> > master.today...
> 
> Thanks for trying to reproduce this. My tiny laptop (2 cores, 2 threads
> per core) cannot reproduce even in 10 minutes or so. I've tried to use
> the same machine I was testing with 3.12 kernel (2 sockets, 8 cores per
> soc. and 2 threas per core) and it hit almost instantly. I have tried 
> mutex_lock_killable -> interruptible and it didn't help as I've
> expected. So the current kernel doesn't do any magic to prevent from the
> issue as well.
> 
> So I've stared into do_notify_parent some more and the following was
> just very confusing
> 
> 	if (!tsk->ptrace && sig == SIGCHLD &&
> 	    (psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN ||
> 	     (psig->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDWAIT))) {
> 		/*
> 		 * We are exiting and our parent doesn't care.  POSIX.1
> 		 * defines special semantics for setting SIGCHLD to SIG_IGN
> 		 * or setting the SA_NOCLDWAIT flag: we should be reaped
> 		 * automatically and not left for our parent's wait4 call.
> 		 * Rather than having the parent do it as a magic kind of
> 		 * signal handler, we just set this to tell do_exit that we
> 		 * can be cleaned up without becoming a zombie.  Note that
> 		 * we still call __wake_up_parent in this case, because a
> 		 * blocked sys_wait4 might now return -ECHILD.
> 		 *
> 		 * Whether we send SIGCHLD or not for SA_NOCLDWAIT
> 		 * is implementation-defined: we do (if you don't want
> 		 * it, just use SIG_IGN instead).
> 		 */
> 		autoreap = true;
> 		if (psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN)
> 			sig = 0;
> 	}
> 
> it tries to prevent from what I am seeing in a way. If the SIGCHLD is
> ignored then it just does autoreap and everything is fine. But this
> doesn't seem to be the case here. In fact we are not sending the signal
> because sig_task_ignored is true resp. sig_handler_ignored which can
> fail even for handler == SIG_DFL && sig_kernel_ignore() and SIGCHLD
> seems to be in SIG_KERNEL_IGNORE_MASK. So I've tried

Dohh, I've missed !tsk->ptrace check there. So we are not even going
that via that path. So the sig_handler_ignored cannot possible help.
I was just too lucky and didn't hit the lockup with the patch.

So what else might be wrong here? sig_ignored seems to be quite
confusing

	/*
	 * Tracers may want to know about even ignored signals.
	 */
	return !t->ptrace;

t is the tracer here but it shouldn't have t->ptrace because the child
is not stopped. So do we need something like the following? Not tested
yet

diff --git a/kernel/signal.c b/kernel/signal.c
index 1840c7f4e3c2..bd236ce4a29c 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -91,6 +91,10 @@ static int sig_ignored(struct task_struct *t, int sig, bool force)
 	if (!sig_task_ignored(t, sig, force))
 		return 0;
 
+	/* Do not ignore signals sent from child to the parent */
+	if (current->ptrace && current->parent == t)
+		return 0;
+
 	/*
 	 * Tracers may want to know about even ignored signals.
 	 */
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: strace lockup when tracing exec in go
  2016-09-22  8:36     ` Michal Hocko
@ 2016-09-22  9:40       ` Mike Galbraith
  2016-09-22  9:53         ` Michal Hocko
  0 siblings, 1 reply; 19+ messages in thread
From: Mike Galbraith @ 2016-09-22  9:40 UTC (permalink / raw)
  To: Michal Hocko; +Cc: LKML, strace-devel, Oleg Nesterov, Aleksa Sarai

On Thu, 2016-09-22 at 10:36 +0200, Michal Hocko wrote:
> On Thu 22-09-16 10:01:26, Michal Hocko wrote:
> > On Thu 22-09-16 06:15:02, Mike Galbraith wrote:
> > [...]
> > > master.today...
> > 
> > Thanks for trying to reproduce this. My tiny laptop (2 cores, 2 threads
> > per core) cannot reproduce even in 10 minutes or so. I've tried to use
> > the same machine I was testing with 3.12 kernel (2 sockets, 8 cores per
> > soc. and 2 threas per core) and it hit almost instantly. I have tried 
> > mutex_lock_killable -> interruptible and it didn't help as I've
> > expected. So the current kernel doesn't do any magic to prevent from the
> > issue as well.
> > 
> > So I've stared into do_notify_parent some more and the following was
> > just very confusing
> > 
> > 	> > if (!tsk->ptrace && sig == SIGCHLD &&
> > 	> >     (psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN ||
> > 	> >      (psig->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDWAIT))) {
> > 	> > 	> > /*
> > 	> > 	> >  * We are exiting and our parent doesn't care.  POSIX.1
> > 	> > 	> >  * defines special semantics for setting SIGCHLD to SIG_IGN
> > 	> > 	> >  * or setting the SA_NOCLDWAIT flag: we should be reaped
> > 	> > 	> >  * automatically and not left for our parent's wait4 call.
> > 	> > 	> >  * Rather than having the parent do it as a magic kind of
> > 	> > 	> >  * signal handler, we just set this to tell do_exit that we
> > 	> > 	> >  * can be cleaned up without becoming a zombie.  Note that
> > 	> > 	> >  * we still call __wake_up_parent in this case, because a
> > 	> > 	> >  * blocked sys_wait4 might now return -ECHILD.
> > 	> > 	> >  *
> > 	> > 	> >  * Whether we send SIGCHLD or not for SA_NOCLDWAIT
> > 	> > 	> >  * is implementation-defined: we do (if you don't want
> > 	> > 	> >  * it, just use SIG_IGN instead).
> > 	> > 	> >  */
> > 	> > 	> > autoreap = true;
> > 	> > 	> > if (psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN)
> > 	> > 	> > 	> > sig = 0;
> > 	> > }
> > 
> > it tries to prevent from what I am seeing in a way. If the SIGCHLD is
> > ignored then it just does autoreap and everything is fine. But this
> > doesn't seem to be the case here. In fact we are not sending the signal
> > because sig_task_ignored is true resp. sig_handler_ignored which can
> > fail even for handler == SIG_DFL && sig_kernel_ignore() and SIGCHLD
> > seems to be in SIG_KERNEL_IGNORE_MASK. So I've tried
> 
> Dohh, I've missed !tsk->ptrace check there. So we are not even going
> that via that path. So the sig_handler_ignored cannot possible help.
> I was just too lucky and didn't hit the lockup with the patch.
> 
> So what else might be wrong here? sig_ignored seems to be quite
> confusing
> 
> 	> /*
> 	>  * Tracers may want to know about even ignored signals.
> 	>  */
> 	> return !t->ptrace;
> 
> t is the tracer here but it shouldn't have t->ptrace because the child
> is not stopped. So do we need something like the following? Not tested
> yet
> 
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 1840c7f4e3c2..bd236ce4a29c 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -91,6 +91,10 @@ static int sig_ignored(struct task_struct *t, int sig, bool force)
>  > 	> if (!sig_task_ignored(t, sig, force))
>  > 	> 	> return 0;
>  
> +> 	> /* Do not ignore signals sent from child to the parent */
> +> 	> if (current->ptrace && current->parent == t)
> +> 	> 	> return 0;
> +
>  > 	> /*
>  > 	>  * Tracers may want to know about even ignored signals.
>  > 	>  */

This patch doesn't help, nor does the previous patch... but with both
applied, all is well.  All you have to do now is figure out why :)

	-Mike

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: strace lockup when tracing exec in go
  2016-09-22  9:40       ` Mike Galbraith
@ 2016-09-22  9:53         ` Michal Hocko
  2016-09-22 10:09           ` Mike Galbraith
  0 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2016-09-22  9:53 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: LKML, strace-devel, Oleg Nesterov, Aleksa Sarai

On Thu 22-09-16 11:40:09, Mike Galbraith wrote:
> On Thu, 2016-09-22 at 10:36 +0200, Michal Hocko wrote:
> > On Thu 22-09-16 10:01:26, Michal Hocko wrote:
> > > On Thu 22-09-16 06:15:02, Mike Galbraith wrote:
> > > [...]
> > > > master.today...
> > > 
> > > Thanks for trying to reproduce this. My tiny laptop (2 cores, 2 threads
> > > per core) cannot reproduce even in 10 minutes or so. I've tried to use
> > > the same machine I was testing with 3.12 kernel (2 sockets, 8 cores per
> > > soc. and 2 threas per core) and it hit almost instantly. I have tried 
> > > mutex_lock_killable -> interruptible and it didn't help as I've
> > > expected. So the current kernel doesn't do any magic to prevent from the
> > > issue as well.
> > > 
> > > So I've stared into do_notify_parent some more and the following was
> > > just very confusing
> > > 
> > > 	> > if (!tsk->ptrace && sig == SIGCHLD &&
> > > 	> >     (psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN ||
> > > 	> >      (psig->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDWAIT))) {
> > > 	> > 	> > /*
> > > 	> > 	> >  * We are exiting and our parent doesn't care.  POSIX.1
> > > 	> > 	> >  * defines special semantics for setting SIGCHLD to SIG_IGN
> > > 	> > 	> >  * or setting the SA_NOCLDWAIT flag: we should be reaped
> > > 	> > 	> >  * automatically and not left for our parent's wait4 call.
> > > 	> > 	> >  * Rather than having the parent do it as a magic kind of
> > > 	> > 	> >  * signal handler, we just set this to tell do_exit that we
> > > 	> > 	> >  * can be cleaned up without becoming a zombie.  Note that
> > > 	> > 	> >  * we still call __wake_up_parent in this case, because a
> > > 	> > 	> >  * blocked sys_wait4 might now return -ECHILD.
> > > 	> > 	> >  *
> > > 	> > 	> >  * Whether we send SIGCHLD or not for SA_NOCLDWAIT
> > > 	> > 	> >  * is implementation-defined: we do (if you don't want
> > > 	> > 	> >  * it, just use SIG_IGN instead).
> > > 	> > 	> >  */
> > > 	> > 	> > autoreap = true;
> > > 	> > 	> > if (psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN)
> > > 	> > 	> > 	> > sig = 0;
> > > 	> > }
> > > 
> > > it tries to prevent from what I am seeing in a way. If the SIGCHLD is
> > > ignored then it just does autoreap and everything is fine. But this
> > > doesn't seem to be the case here. In fact we are not sending the signal
> > > because sig_task_ignored is true resp. sig_handler_ignored which can
> > > fail even for handler == SIG_DFL && sig_kernel_ignore() and SIGCHLD
> > > seems to be in SIG_KERNEL_IGNORE_MASK. So I've tried
> > 
> > Dohh, I've missed !tsk->ptrace check there. So we are not even going
> > that via that path. So the sig_handler_ignored cannot possible help.
> > I was just too lucky and didn't hit the lockup with the patch.
> > 
> > So what else might be wrong here? sig_ignored seems to be quite
> > confusing
> > 
> > 	> /*
> > 	>  * Tracers may want to know about even ignored signals.
> > 	>  */
> > 	> return !t->ptrace;
> > 
> > t is the tracer here but it shouldn't have t->ptrace because the child
> > is not stopped. So do we need something like the following? Not tested
> > yet
> > 
> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index 1840c7f4e3c2..bd236ce4a29c 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -91,6 +91,10 @@ static int sig_ignored(struct task_struct *t, int sig, bool force)
> >  > 	> if (!sig_task_ignored(t, sig, force))
> >  > 	> 	> return 0;
> >  
> > +> 	> /* Do not ignore signals sent from child to the parent */
> > +> 	> if (current->ptrace && current->parent == t)
> > +> 	> 	> return 0;
> > +
> >  > 	> /*
> >  > 	>  * Tracers may want to know about even ignored signals.
> >  > 	>  */
> 
> This patch doesn't help, nor does the previous patch... but with both
> applied, all is well.  All you have to do now is figure out why :)

Ohh, I should be more explicit, this needs the mm_access part as well.
Sorry for not being clear enough. So the full change is
---
diff --git a/kernel/fork.c b/kernel/fork.c
index 5a57b9bab85c..d5b7c3aea187 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -837,7 +837,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
 	struct mm_struct *mm;
 	int err;
 
-	err =  mutex_lock_killable(&task->signal->cred_guard_mutex);
+	err =  mutex_lock_interruptible(&task->signal->cred_guard_mutex);
 	if (err)
 		return ERR_PTR(err);
 
diff --git a/kernel/signal.c b/kernel/signal.c
index 96e9bc40667f..5c8b84b76f0b 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -91,6 +91,10 @@ static int sig_ignored(struct task_struct *t, int sig, bool force)
 	if (!sig_task_ignored(t, sig, force))
 		return 0;
 
+	/* Do not ignore signals sent from child to the parent */
+	if (current->ptrace && current->parent == t)
+		return 0;
+
 	/*
 	 * Tracers may want to know about even ignored signals.
 	 */
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: strace lockup when tracing exec in go
  2016-09-22  9:53         ` Michal Hocko
@ 2016-09-22 10:09           ` Mike Galbraith
  2016-09-22 11:09             ` Michal Hocko
  0 siblings, 1 reply; 19+ messages in thread
From: Mike Galbraith @ 2016-09-22 10:09 UTC (permalink / raw)
  To: Michal Hocko; +Cc: LKML, strace-devel, Oleg Nesterov, Aleksa Sarai

On Thu, 2016-09-22 at 11:53 +0200, Michal Hocko wrote:
> On Thu 22-09-16 11:40:09, Mike Galbraith wrote:

> > This patch doesn't help, nor does the previous patch... but with both
> > applied, all is well.  All you have to do now is figure out why :)
> 
> Ohh, I should be more explicit, this needs the mm_access part as well.
> Sorry for not being clear enough. So the full change is

Ah.  That was gonna happen after lunch, but since you already know it
works, I can get back to un-b0rking one of my trees.

	-Mike

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: strace lockup when tracing exec in go
  2016-09-22 10:09           ` Mike Galbraith
@ 2016-09-22 11:09             ` Michal Hocko
  2016-09-22 13:53               ` Michal Hocko
  0 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2016-09-22 11:09 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: LKML, strace-devel, Oleg Nesterov, Aleksa Sarai

On Thu 22-09-16 12:09:05, Mike Galbraith wrote:
> On Thu, 2016-09-22 at 11:53 +0200, Michal Hocko wrote:
> > On Thu 22-09-16 11:40:09, Mike Galbraith wrote:
> 
> > > This patch doesn't help, nor does the previous patch... but with both
> > > applied, all is well.  All you have to do now is figure out why :)
> > 
> > Ohh, I should be more explicit, this needs the mm_access part as well.
> > Sorry for not being clear enough. So the full change is
> 
> Ah.  That was gonna happen after lunch, but since you already know it
> works, I can get back to un-b0rking one of my trees.

Yeah, it should work. The testcase is running in a loop for more than
hour already and everything seems to be ok. Now the question is whether
the fix is really correct which is something for Oleg.

Also I think there is at least one more issue lurking there. Without or
without the patch I can make tracer get stuck in do_wait waiting for
task which doesn't seem to be alive anymore. I will report it separately
after this one gets resolved to not pull more confusion in.

Anyway, thanks for your time Mike!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: strace lockup when tracing exec in go
  2016-09-22 11:09             ` Michal Hocko
@ 2016-09-22 13:53               ` Michal Hocko
  2016-09-23  1:17                 ` Aleksa Sarai
  2016-09-23 10:21                 ` Oleg Nesterov
  0 siblings, 2 replies; 19+ messages in thread
From: Michal Hocko @ 2016-09-22 13:53 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: LKML, strace-devel, Mike Galbraith, Aleksa Sarai

On Thu 22-09-16 13:09:25, Michal Hocko wrote:
> On Thu 22-09-16 12:09:05, Mike Galbraith wrote:
> > On Thu, 2016-09-22 at 11:53 +0200, Michal Hocko wrote:
> > > On Thu 22-09-16 11:40:09, Mike Galbraith wrote:
> > 
> > > > This patch doesn't help, nor does the previous patch... but with both
> > > > applied, all is well.  All you have to do now is figure out why :)
> > > 
> > > Ohh, I should be more explicit, this needs the mm_access part as well.
> > > Sorry for not being clear enough. So the full change is
> > 
> > Ah.  That was gonna happen after lunch, but since you already know it
> > works, I can get back to un-b0rking one of my trees.
> 
> Yeah, it should work. The testcase is running in a loop for more than
> hour already and everything seems to be ok. Now the question is whether
> the fix is really correct which is something for Oleg.
> 
> Also I think there is at least one more issue lurking there. Without or
> without the patch I can make tracer get stuck in do_wait waiting for
> task which doesn't seem to be alive anymore. I will report it separately
> after this one gets resolved to not pull more confusion in.

OK, the test in the loop has survived 3h of runtime without a single
lockup so the patch seems to be working for this case. I am posting the
patch with the full changelog, let's see if it is correct as well. As
I've said earlier this might be a strace bug which does an unexpected
syscall while it should be doing wait on the child process instead.

If the patch is correct then I would mark it for stable as well.
---
>From fe82d463fd2ef1585d2c37bf9fa6a1761e6ee0e5 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Thu, 22 Sep 2016 10:09:34 +0200
Subject: [PATCH] signal: always signal tracer from the ptraced task

Aleksa has reported the following lockup when stracing the following go
program

% cat exec.go
package main

import (
    "os"
    "syscall"
)

func main() {
    syscall.Exec("/bin/echo", []string{"/bin/echo", "Hello"}, os.Environ())
}
$ go version
go version go1.6.3 linux/amd64
$ go build -o exec exec.go

$ strace -f ./exec
[snip]
[pid 10349] select(0, NULL, NULL, NULL, {0, 100} <unfinished ...>
[pid 10346] <... select resumed> )      = 0 (Timeout)
[pid 10346] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 10345] execve("/bin/echo", ["/bin/echo", "Hello"], [/* 95 vars */] <unfinished ...>
[pid 10346] <... select resumed> )      = 0 (Timeout)
[pid 10349] <... select resumed> )      = 0 (Timeout)

execve will never finish unless the strace process get killed with
SIGKILL. The reason is the following deadlock

tracer				thread_A				thread_$N
SyS_process_vm_readv		SyS_execve				do_exit
				  do_execveat_common			  exit_notify
				    prepare_bprm_creds			    do_notify_parent
				      mutex_lock(cred_guard_mutex)	      __group_send_sig_info
				    search_binary_handler		        send_signal
				      load_elf_binary			          prepare_signal -> fail SIGHCHLD is SGL_DFL
				        flush_old_exec
				          # wait for sig->notify_count
  process_vm_rw
    process_vm_rw_core
      mm_access
        mutex_lock_killable(cred_guard_mutex)

So there seems to be 2 issues here. The first one is that exiting
threads (because of the ongoing exec) are not sending SIGCHLD signal
to the tracer but they rely on the tracer to reap them and call
release_task->__exit_signal which in turn would wake up the thread_A and
move on with the exec. The other part of the story is that the tracer
is not in do_wait but rather calls process_vm_readv (presumably to get
arguments of the syscall) and it waits for a lock in killable rather
than interruptible sleep.

The fix is therefore twofold. We want to teach mm_access to sleep in
interruptible sleep and we want to make sure that the traced child
will send the signal to the parent even when it is ignored or SIG_DFL.
sig_ignored already seems to be doing something along that line except
it doesn't handle when a traced child sends a signal to the tracer.
Fix this by checking the current ptrace status and whether the target
task is the tracer.

Reported-by: Aleksa Sarai <asarai@suse.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 kernel/fork.c   | 2 +-
 kernel/signal.c | 4 ++++
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 5a57b9bab85c..d5b7c3aea187 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -837,7 +837,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
 	struct mm_struct *mm;
 	int err;
 
-	err =  mutex_lock_killable(&task->signal->cred_guard_mutex);
+	err =  mutex_lock_interruptible(&task->signal->cred_guard_mutex);
 	if (err)
 		return ERR_PTR(err);
 
diff --git a/kernel/signal.c b/kernel/signal.c
index 96e9bc40667f..5c8b84b76f0b 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -91,6 +91,10 @@ static int sig_ignored(struct task_struct *t, int sig, bool force)
 	if (!sig_task_ignored(t, sig, force))
 		return 0;
 
+	/* Do not ignore signals sent from child to the parent */
+	if (current->ptrace && current->parent == t)
+		return 0;
+
 	/*
 	 * Tracers may want to know about even ignored signals.
 	 */
-- 
2.9.3

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: strace lockup when tracing exec in go
  2016-09-22 13:53               ` Michal Hocko
@ 2016-09-23  1:17                 ` Aleksa Sarai
  2016-09-23 10:21                 ` Oleg Nesterov
  1 sibling, 0 replies; 19+ messages in thread
From: Aleksa Sarai @ 2016-09-23  1:17 UTC (permalink / raw)
  To: Michal Hocko, Oleg Nesterov; +Cc: LKML, strace-devel, Mike Galbraith

>>>>> This patch doesn't help, nor does the previous patch... but with both
>>>>> applied, all is well.  All you have to do now is figure out why :)
>>>>
>>>> Ohh, I should be more explicit, this needs the mm_access part as well.
>>>> Sorry for not being clear enough. So the full change is
>>>
>>> Ah.  That was gonna happen after lunch, but since you already know it
>>> works, I can get back to un-b0rking one of my trees.
>>
>> Yeah, it should work. The testcase is running in a loop for more than
>> hour already and everything seems to be ok. Now the question is whether
>> the fix is really correct which is something for Oleg.
>>
>> Also I think there is at least one more issue lurking there. Without or
>> without the patch I can make tracer get stuck in do_wait waiting for
>> task which doesn't seem to be alive anymore. I will report it separately
>> after this one gets resolved to not pull more confusion in.
>
> OK, the test in the loop has survived 3h of runtime without a single
> lockup so the patch seems to be working for this case. I am posting the
> patch with the full changelog, let's see if it is correct as well. As
> I've said earlier this might be a strace bug which does an unexpected
> syscall while it should be doing wait on the child process instead.
>
> If the patch is correct then I would mark it for stable as well.

This patch fixes the test case, but it also fixes the original issue 
(that strace of runC would fail when the container init process is 
exec-ing the user init process). Thanks Michal, I'll apply it to my 
local kernel. :D

-- 
Aleksa Sarai
Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: strace lockup when tracing exec in go
  2016-09-21 15:29 strace lockup when tracing exec in go Michal Hocko
  2016-09-22  4:15 ` Mike Galbraith
@ 2016-09-23  9:50 ` Oleg Nesterov
  2016-09-23 11:23   ` Michal Hocko
  1 sibling, 1 reply; 19+ messages in thread
From: Oleg Nesterov @ 2016-09-23  9:50 UTC (permalink / raw)
  To: Michal Hocko; +Cc: LKML, strace-devel, Aleksa Sarai

Sorry for delay, I was offline. I'll try to return to this problem next
week, currently I can't even read this thread but at first glance the
proposed patch(es) do not look right...

On 09/21, Michal Hocko wrote:
>
> The further investigation shown that the tracer (strace) is stuck
> waiting for cred_guard_mutex
> [<0000000000000000>] mm_access+0x22/0xa0
> [<0000000000000000>] process_vm_rw_core.isra.1+0x112/0x6c0
> [<0000000000000000>] process_vm_rw+0xab/0x110
> [<0000000000000000>] SyS_process_vm_readv+0x15/0x20
> [<0000000000000000>] system_call_fastpath+0x16/0x1b
> [<0000000000000000>] 0x7f186f031c3a
> [<0000000000000000>] 0xffffffffffffffff
...
> this however doesn't happen because both threads which are dead
> are zombies waiting to be reaped by the parent and to call their
> release_task->__exit_signal.

Yes, I know, this was already reported. And so far I do not see any
reasonable fix. I _think_ that the "real" fix should rework the
security_bprm_* helpers, but unlikely this is possible.

The trivial test-case:

	void *thread(void *arg)
	{
		ptrace(PTRACE_TRACEME, 0,0,0);
		return NULL;
	}

	int main(void)
	{
		int pid = fork();

		if (!pid) {
			pthread_t pt;
			pthread_create(&pt, NULL, thread, NULL);
			pthread_join(pt, NULL);
			execlp("echo", "echo", "passed", NULL);
		}

		sleep(1);
		// or anything else which needs ->cred_guard_mutex,
		// say open(/proc/$pid/mem)
		ptrace(PTRACE_ATTACH, pid, 0,0);
		kill(pid, SIGCONT);

		return 0;
	}

Oleg.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: strace lockup when tracing exec in go
  2016-09-22 13:53               ` Michal Hocko
  2016-09-23  1:17                 ` Aleksa Sarai
@ 2016-09-23 10:21                 ` Oleg Nesterov
  2016-09-23 11:18                   ` Michal Hocko
  1 sibling, 1 reply; 19+ messages in thread
From: Oleg Nesterov @ 2016-09-23 10:21 UTC (permalink / raw)
  To: Michal Hocko; +Cc: LKML, strace-devel, Mike Galbraith, Aleksa Sarai

On 09/22, Michal Hocko wrote:
>
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -91,6 +91,10 @@ static int sig_ignored(struct task_struct *t, int sig, bool force)
>  	if (!sig_task_ignored(t, sig, force))
>  		return 0;
>  
> +	/* Do not ignore signals sent from child to the parent */
> +	if (current->ptrace && current->parent == t)
> +		return 0;

This doesn't look right in general, and this can't really help.

This assumes that the tracer will call do_wait() after mm_access()
fails, but this is not necessarily true.

Note also ptrace_attach(), -ERESTARTNOINTR means that the tracer won't
even return to user-space if SIGCHLD is ignored, the tracer will silently
restart the syscall.

Oleg.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: strace lockup when tracing exec in go
  2016-09-23 10:21                 ` Oleg Nesterov
@ 2016-09-23 11:18                   ` Michal Hocko
  2016-09-23 13:21                     ` Oleg Nesterov
  0 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2016-09-23 11:18 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: LKML, strace-devel, Mike Galbraith, Aleksa Sarai

On Fri 23-09-16 12:21:41, Oleg Nesterov wrote:
> On 09/22, Michal Hocko wrote:
> >
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -91,6 +91,10 @@ static int sig_ignored(struct task_struct *t, int sig, bool force)
> >  	if (!sig_task_ignored(t, sig, force))
> >  		return 0;
> >  
> > +	/* Do not ignore signals sent from child to the parent */
> > +	if (current->ptrace && current->parent == t)
> > +		return 0;
> 
> This doesn't look right in general, and this can't really help.
> 
> This assumes that the tracer will call do_wait() after mm_access()
> fails, but this is not necessarily true.
> 
> Note also ptrace_attach(), -ERESTARTNOINTR means that the tracer won't
> even return to user-space if SIGCHLD is ignored, the tracer will silently
> restart the syscall.

Well, it apparently does help the strace case. So I am not arguing this
is the best fix but can it be harmful?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: strace lockup when tracing exec in go
  2016-09-23  9:50 ` Oleg Nesterov
@ 2016-09-23 11:23   ` Michal Hocko
  0 siblings, 0 replies; 19+ messages in thread
From: Michal Hocko @ 2016-09-23 11:23 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: LKML, strace-devel, Aleksa Sarai

On Fri 23-09-16 11:50:32, Oleg Nesterov wrote:
> Sorry for delay, I was offline. I'll try to return to this problem next
> week, currently I can't even read this thread but at first glance the
> proposed patch(es) do not look right...
> 
> On 09/21, Michal Hocko wrote:
> >
> > The further investigation shown that the tracer (strace) is stuck
> > waiting for cred_guard_mutex
> > [<0000000000000000>] mm_access+0x22/0xa0
> > [<0000000000000000>] process_vm_rw_core.isra.1+0x112/0x6c0
> > [<0000000000000000>] process_vm_rw+0xab/0x110
> > [<0000000000000000>] SyS_process_vm_readv+0x15/0x20
> > [<0000000000000000>] system_call_fastpath+0x16/0x1b
> > [<0000000000000000>] 0x7f186f031c3a
> > [<0000000000000000>] 0xffffffffffffffff
> ...
> > this however doesn't happen because both threads which are dead
> > are zombies waiting to be reaped by the parent and to call their
> > release_task->__exit_signal.
> 
> Yes, I know, this was already reported. And so far I do not see any
> reasonable fix. I _think_ that the "real" fix should rework the
> security_bprm_* helpers, but unlikely this is possible.

Rework them to not rely on the cred_guard_mutex?  Is there any way to
workaround this in the strace code?

> The trivial test-case:
> 
> 	void *thread(void *arg)
> 	{
> 		ptrace(PTRACE_TRACEME, 0,0,0);
> 		return NULL;
> 	}
> 
> 	int main(void)
> 	{
> 		int pid = fork();
> 
> 		if (!pid) {
> 			pthread_t pt;
> 			pthread_create(&pt, NULL, thread, NULL);
> 			pthread_join(pt, NULL);
> 			execlp("echo", "echo", "passed", NULL);
> 		}
> 
> 		sleep(1);
> 		// or anything else which needs ->cred_guard_mutex,
> 		// say open(/proc/$pid/mem)
> 		ptrace(PTRACE_ATTACH, pid, 0,0);
> 		kill(pid, SIGCONT);
> 
> 		return 0;
> 	}
> 
> Oleg.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: strace lockup when tracing exec in go
  2016-09-23 11:18                   ` Michal Hocko
@ 2016-09-23 13:21                     ` Oleg Nesterov
  2016-09-23 13:40                       ` Michal Hocko
  0 siblings, 1 reply; 19+ messages in thread
From: Oleg Nesterov @ 2016-09-23 13:21 UTC (permalink / raw)
  To: Michal Hocko; +Cc: LKML, strace-devel, Mike Galbraith, Aleksa Sarai

On 09/23, Michal Hocko wrote:
>
> On Fri 23-09-16 12:21:41, Oleg Nesterov wrote:
> > On 09/22, Michal Hocko wrote:
> > >
> > > --- a/kernel/signal.c
> > > +++ b/kernel/signal.c
> > > @@ -91,6 +91,10 @@ static int sig_ignored(struct task_struct *t, int sig, bool force)
> > >  	if (!sig_task_ignored(t, sig, force))
> > >  		return 0;
> > >
> > > +	/* Do not ignore signals sent from child to the parent */
> > > +	if (current->ptrace && current->parent == t)
> > > +		return 0;
> >
> > This doesn't look right in general, and this can't really help.
> >
> > This assumes that the tracer will call do_wait() after mm_access()
> > fails, but this is not necessarily true.
> >
> > Note also ptrace_attach(), -ERESTARTNOINTR means that the tracer won't
> > even return to user-space if SIGCHLD is ignored, the tracer will silently
> > restart the syscall.
>
> Well, it apparently does help the strace case.

Only because strace doesn't even try to handle -EINTR; it assumes this is not
possible, gives up, and calls wait() after that. So this change actually
breaks strace.

And once again, this can't really help. SIGCHLD can come before strace calls
process_vm_readv(), and in this case it will enter the syscall without
signal_pending() == T. IOW, this hack can only help if the tracer already
sleeps in process_vm_readv().

Plus, again, "strace -f" can equally hang if mt-exec races with PTRACE_ATTACH.

> So I am not arguing this
> is the best fix but can it be harmful?

This change is simply wrong no matter what. We could change do_notify_parent()
to call signal_wake_up() if tsk->ptrace, but see above, this won't help.

Oleg.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: strace lockup when tracing exec in go
  2016-09-23 13:21                     ` Oleg Nesterov
@ 2016-09-23 13:40                       ` Michal Hocko
  2016-09-23 14:07                         ` Oleg Nesterov
  0 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2016-09-23 13:40 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: LKML, strace-devel, Mike Galbraith, Aleksa Sarai

On Fri 23-09-16 15:21:02, Oleg Nesterov wrote:
> On 09/23, Michal Hocko wrote:
> >
> > On Fri 23-09-16 12:21:41, Oleg Nesterov wrote:
> > > On 09/22, Michal Hocko wrote:
> > > >
> > > > --- a/kernel/signal.c
> > > > +++ b/kernel/signal.c
> > > > @@ -91,6 +91,10 @@ static int sig_ignored(struct task_struct *t, int sig, bool force)
> > > >  	if (!sig_task_ignored(t, sig, force))
> > > >  		return 0;
> > > >
> > > > +	/* Do not ignore signals sent from child to the parent */
> > > > +	if (current->ptrace && current->parent == t)
> > > > +		return 0;
> > >
> > > This doesn't look right in general, and this can't really help.
> > >
> > > This assumes that the tracer will call do_wait() after mm_access()
> > > fails, but this is not necessarily true.
> > >
> > > Note also ptrace_attach(), -ERESTARTNOINTR means that the tracer won't
> > > even return to user-space if SIGCHLD is ignored, the tracer will silently
> > > restart the syscall.
> >
> > Well, it apparently does help the strace case.
> 
> Only because strace doesn't even try to handle -EINTR; it assumes this is not
> possible, gives up, and calls wait() after that. So this change actually
> breaks strace.

Hmm, OK. In this case process_vm_readv wouldn't give a usuful data which
still sounds better to me than a hang.

> And once again, this can't really help. SIGCHLD can come before strace calls
> process_vm_readv(), and in this case it will enter the syscall without
> signal_pending() == T. IOW, this hack can only help if the tracer already
> sleeps in process_vm_readv().

True.

> Plus, again, "strace -f" can equally hang if mt-exec races with PTRACE_ATTACH.
> 
> > So I am not arguing this
> > is the best fix but can it be harmful?
> 
> This change is simply wrong no matter what.

I've just tried to extend the existing

	/*
	 * Tracers may want to know about even ignored signals.
	 */
	return !t->ptrace;

but I probably just do not understand what that actually means. I
thought that the tracer is _really_ interested in hearing about the
signal.

> We could change do_notify_parent()
> to call signal_wake_up() if tsk->ptrace, but see above, this won't help.

So does this mean WONTFIX? Can we at least document this behavior? It
surely is unexpected.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: strace lockup when tracing exec in go
  2016-09-23 13:40                       ` Michal Hocko
@ 2016-09-23 14:07                         ` Oleg Nesterov
  2016-09-23 14:19                           ` Michal Hocko
  0 siblings, 1 reply; 19+ messages in thread
From: Oleg Nesterov @ 2016-09-23 14:07 UTC (permalink / raw)
  To: Michal Hocko; +Cc: LKML, strace-devel, Mike Galbraith, Aleksa Sarai

On 09/23, Michal Hocko wrote:
>
> On Fri 23-09-16 15:21:02, Oleg Nesterov wrote:
> >
> > This change is simply wrong no matter what.
>
> I've just tried to extend the existing
>
> 	/*
> 	 * Tracers may want to know about even ignored signals.
> 	 */
> 	return !t->ptrace;
>
> but I probably just do not understand what that actually means. I
> thought that the tracer is _really_ interested in hearing about the
> signal.

Yes, the tracer is really interested to know that a signal was sent to
the _tracee_, not the tracer ;)

>
> > We could change do_notify_parent()
> > to call signal_wake_up() if tsk->ptrace, but see above, this won't help.
>
> So does this mean WONTFIX? Can we at least document this behavior? It
> surely is unexpected.

No, no, no. Of course this must be fixed. The only problem is that I still
do not know what should we do. I'll try to return to this problem next week.
I'm afraid we will need to change de_thread() to wait until all other sub-
threads have passed exit_notify() or even exit_signals(), but ooh I don't
like this. Plus in this case we will need to finally define what
PTRACE_EVENT_EXIT should actually do.

Oleg.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: strace lockup when tracing exec in go
  2016-09-23 14:07                         ` Oleg Nesterov
@ 2016-09-23 14:19                           ` Michal Hocko
  0 siblings, 0 replies; 19+ messages in thread
From: Michal Hocko @ 2016-09-23 14:19 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: LKML, strace-devel, Mike Galbraith, Aleksa Sarai

On Fri 23-09-16 16:07:24, Oleg Nesterov wrote:
> On 09/23, Michal Hocko wrote:
> >
> > On Fri 23-09-16 15:21:02, Oleg Nesterov wrote:
> > >
> > > This change is simply wrong no matter what.
> >
> > I've just tried to extend the existing
> >
> > 	/*
> > 	 * Tracers may want to know about even ignored signals.
> > 	 */
> > 	return !t->ptrace;
> >
> > but I probably just do not understand what that actually means. I
> > thought that the tracer is _really_ interested in hearing about the
> > signal.
> 
> Yes, the tracer is really interested to know that a signal was sent to
> the _tracee_, not the tracer ;)

OK, now it makes more sense. I was really scratching my head to
understand this part...
 
> > > We could change do_notify_parent()
> > > to call signal_wake_up() if tsk->ptrace, but see above, this won't help.
> >
> > So does this mean WONTFIX? Can we at least document this behavior? It
> > surely is unexpected.
> 
> No, no, no. Of course this must be fixed. The only problem is that I still
> do not know what should we do. I'll try to return to this problem next week.
> I'm afraid we will need to change de_thread() to wait until all other sub-
> threads have passed exit_notify() or even exit_signals(),

Yes making de_thread completely independent on the state of the tracer
would be a huge improvement. While playing with this test case I
triggered some other interesting hangs (e.g. strace hanging in tty while
trying to print something). So just few interruptible waits is not a
full solution.

> but ooh I don't
> like this. Plus in this case we will need to finally define what
> PTRACE_EVENT_EXIT should actually do.

OK, considering this has been broken for quite some time I do not think
we are in hurry. I am slightly worried about how such a solution would
be stable kernel safe but ohh well.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2016-09-23 14:19 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-21 15:29 strace lockup when tracing exec in go Michal Hocko
2016-09-22  4:15 ` Mike Galbraith
2016-09-22  8:01   ` Michal Hocko
2016-09-22  8:20     ` Aleksa Sarai
2016-09-22  8:36     ` Michal Hocko
2016-09-22  9:40       ` Mike Galbraith
2016-09-22  9:53         ` Michal Hocko
2016-09-22 10:09           ` Mike Galbraith
2016-09-22 11:09             ` Michal Hocko
2016-09-22 13:53               ` Michal Hocko
2016-09-23  1:17                 ` Aleksa Sarai
2016-09-23 10:21                 ` Oleg Nesterov
2016-09-23 11:18                   ` Michal Hocko
2016-09-23 13:21                     ` Oleg Nesterov
2016-09-23 13:40                       ` Michal Hocko
2016-09-23 14:07                         ` Oleg Nesterov
2016-09-23 14:19                           ` Michal Hocko
2016-09-23  9:50 ` Oleg Nesterov
2016-09-23 11:23   ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).