linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] wait/ptrace: always assume __WALL if the child is traced
@ 2015-10-20 17:17 Oleg Nesterov
  2015-10-20 17:17 ` [PATCH 1/2] " Oleg Nesterov
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Oleg Nesterov @ 2015-10-20 17:17 UTC (permalink / raw)
  To: Andrew Morton, Dmitry Vyukov
  Cc: Alexander Potapenko, Denys Vlasenko, Eric Dumazet,
	Jan Kratochvil, Julien Tinnes, Kees Cook, Kostya Serebryany,
	Linus Torvalds, Michael Kerrisk (man-pages),
	Pedro Alves, Robert Swiecki, Roland McGrath, syzkaller,
	linux-kernel

Damn. I simply do not know what should/can we do. From the change
log:

	And I can only hope that this won't break something.

yet this patch cc's -stable.


Please see the changelog, but in short: this is not a kernel bug
but unlikely we can fix all distributions, so I think we have to
change the kernel.

HOWEVER. With this change __WCLONE and __WALL have no effect for
debugger, do_wait() works as if __WALL is set if the child (natural
or not) is traced.


Jan, Pedro, could you please confirm this won't break gdb? I tried
to look into gdb-7.1, and at first glance gdb uses __WCLONE only
because __WALL doesn't work on older kernels, iow it seems to me
that gdb actually wants __WALL so this change should be fine.


Any other ideas?

Oleg.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/2] wait/ptrace: always assume __WALL if the child is traced
  2015-10-20 17:17 [PATCH 0/2] wait/ptrace: always assume __WALL if the child is traced Oleg Nesterov
@ 2015-10-20 17:17 ` Oleg Nesterov
  2015-10-20 22:31   ` Andrew Morton
  2015-10-22 13:51   ` Denys Vlasenko
  2015-10-20 17:17 ` [PATCH 2/2] wait: allow sys_waitid() to use __WNOTHREAD/__WCLONE/__WALL Oleg Nesterov
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 21+ messages in thread
From: Oleg Nesterov @ 2015-10-20 17:17 UTC (permalink / raw)
  To: Andrew Morton, Dmitry Vyukov
  Cc: Alexander Potapenko, Denys Vlasenko, Eric Dumazet,
	Jan Kratochvil, Julien Tinnes, Kees Cook, Kostya Serebryany,
	Linus Torvalds, Michael Kerrisk (man-pages),
	Pedro Alves, Robert Swiecki, Roland McGrath, syzkaller,
	linux-kernel

The following program (simplified version of generated by syzkaller)

	#include <pthread.h>
	#include <unistd.h>
	#include <sys/ptrace.h>
	#include <stdio.h>
	#include <signal.h>

	void *thread_func(void *arg)
	{
		ptrace(PTRACE_TRACEME, 0,0,0);
		return 0;
	}

	int main(void)
	{
		pthread_t thread;

		if (fork())
			return 0;

		while (getppid() != 1)
			;

		pthread_create(&thread, NULL, thread_func, NULL);
		pthread_join(thread, NULL);
		return 0;
	}

creates the unreapable zombie if /sbin/init doesn't use __WALL.

This is not a kernel bug, at least in a sense that everything works as
expected: debugger should reap a traced sub-thread before it can reap
the leader, but without __WALL/__WCLONE do_wait() ignores sub-threads.

Unfortunately, it seems that /sbin/init in most (all?) distributions
doesn't use it and we have to change the kernel to avoid the problem.

This patch just adds the "ptrace" check into eligible_child(). To some
degree this matches the "tsk->ptrace" in exit_notify(), ->exit_signal
is mostly ignored when the tracee reports to debugger.

This obviously means the user-visible change: __WCLONE and __WALL no
longer have any meaning for debugger. And I can only hope that this
won't break something.

We could make a more conservative change. Say, we can take __WCLONE
into account, or !thread_group_leader(). But it would be nice to not
complicate these historical/confusing checks.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: <stable@vger.kernel.org>
---
 kernel/exit.c |   29 ++++++++++++++++++++---------
 1 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 22fcc05..819f51e 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -914,17 +914,28 @@ static int eligible_pid(struct wait_opts *wo, struct task_struct *p)
 		task_pid_type(p, wo->wo_type) == wo->wo_pid;
 }
 
-static int eligible_child(struct wait_opts *wo, struct task_struct *p)
+static int
+eligible_child(struct wait_opts *wo, bool ptrace, struct task_struct *p)
 {
 	if (!eligible_pid(wo, p))
 		return 0;
-	/* Wait for all children (clone and not) if __WALL is set;
-	 * otherwise, wait for clone children *only* if __WCLONE is
-	 * set; otherwise, wait for non-clone children *only*.  (Note:
-	 * A "clone" child here is one that reports to its parent
-	 * using a signal other than SIGCHLD.) */
-	if (((p->exit_signal != SIGCHLD) ^ !!(wo->wo_flags & __WCLONE))
-	    && !(wo->wo_flags & __WALL))
+
+	/*
+	 * Wait for all children (clone and not) if __WALL is set or
+	 * if it is traced by us.
+	 */
+	if (ptrace || (wo->wo_flags & __WALL))
+		return 1;
+
+	/*
+	 * Otherwise, wait for clone children *only* if __WCLONE is set;
+	 * otherwise, wait for non-clone children *only*.
+	 *
+	 * Note: a "clone" child here is one that reports to its parent
+	 * using a signal other than SIGCHLD, or a non-leader thread which
+	 * we can only see if it is traced by us.
+	 */
+	if ((p->exit_signal != SIGCHLD) ^ !!(wo->wo_flags & __WCLONE))
 		return 0;
 
 	return 1;
@@ -1297,7 +1308,7 @@ static int wait_consider_task(struct wait_opts *wo, int ptrace,
 	if (unlikely(exit_state == EXIT_DEAD))
 		return 0;
 
-	ret = eligible_child(wo, p);
+	ret = eligible_child(wo, ptrace, p);
 	if (!ret)
 		return ret;
 
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/2] wait: allow sys_waitid() to use __WNOTHREAD/__WCLONE/__WALL
  2015-10-20 17:17 [PATCH 0/2] wait/ptrace: always assume __WALL if the child is traced Oleg Nesterov
  2015-10-20 17:17 ` [PATCH 1/2] " Oleg Nesterov
@ 2015-10-20 17:17 ` Oleg Nesterov
  2015-10-20 17:36 ` [PATCH 0/2] wait/ptrace: always assume __WALL if the child is traced Oleg Nesterov
  2015-10-22 14:40 ` Pedro Alves
  3 siblings, 0 replies; 21+ messages in thread
From: Oleg Nesterov @ 2015-10-20 17:17 UTC (permalink / raw)
  To: Andrew Morton, Dmitry Vyukov
  Cc: Alexander Potapenko, Denys Vlasenko, Eric Dumazet,
	Jan Kratochvil, Julien Tinnes, Kees Cook, Kostya Serebryany,
	Linus Torvalds, Michael Kerrisk (man-pages),
	Pedro Alves, Robert Swiecki, Roland McGrath, syzkaller,
	linux-kernel

I see no reason why waitid() can't support other linux-specific flags
allowed in sys_wait4().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/exit.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 819f51e..c090738 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1532,7 +1532,8 @@ SYSCALL_DEFINE5(waitid, int, which, pid_t, upid, struct siginfo __user *,
 	enum pid_type type;
 	long ret;
 
-	if (options & ~(WNOHANG|WNOWAIT|WEXITED|WSTOPPED|WCONTINUED))
+	if (options & ~(WNOHANG|WNOWAIT|WEXITED|WSTOPPED|WCONTINUED|
+			__WNOTHREAD|__WCLONE|__WALL))
 		return -EINVAL;
 	if (!(options & (WEXITED|WSTOPPED|WCONTINUED)))
 		return -EINVAL;
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/2] wait/ptrace: always assume __WALL if the child is traced
  2015-10-20 17:17 [PATCH 0/2] wait/ptrace: always assume __WALL if the child is traced Oleg Nesterov
  2015-10-20 17:17 ` [PATCH 1/2] " Oleg Nesterov
  2015-10-20 17:17 ` [PATCH 2/2] wait: allow sys_waitid() to use __WNOTHREAD/__WCLONE/__WALL Oleg Nesterov
@ 2015-10-20 17:36 ` Oleg Nesterov
  2015-10-22 14:40 ` Pedro Alves
  3 siblings, 0 replies; 21+ messages in thread
From: Oleg Nesterov @ 2015-10-20 17:36 UTC (permalink / raw)
  To: Andrew Morton, Dmitry Vyukov
  Cc: Alexander Potapenko, Denys Vlasenko, Eric Dumazet,
	Jan Kratochvil, Julien Tinnes, Kees Cook, Kostya Serebryany,
	Linus Torvalds, Michael Kerrisk (man-pages),
	Pedro Alves, Robert Swiecki, Roland McGrath, syzkaller,
	linux-kernel

Forgot to say...

Another question is why PTRACE_TRACEME succeeds in this case. I guess
it is to late to change (break) the rules, but I never understood the
security checks. The comment above cap_ptrace_traceme() says:

	Determine whether another process may trace the current

and "another process" is parent. To me this looks strange, imo we should
determine whether the current may abuse its parent. So perhaps we could
change ptrace_traceme() to fail if

	current->parent_exec_id != parent->self_exec_id

?

But this too can break something. Although I can't imagine why the
child reaper or a PR_SET_CHILD_SUBREAPER process may want to trace
the reparented tasks.

On 10/20, Oleg Nesterov wrote:
>
> Damn. I simply do not know what should/can we do. From the change
> log:
>
> 	And I can only hope that this won't break something.
>
> yet this patch cc's -stable.
>
>
> Please see the changelog, but in short: this is not a kernel bug
> but unlikely we can fix all distributions, so I think we have to
> change the kernel.
>
> HOWEVER. With this change __WCLONE and __WALL have no effect for
> debugger, do_wait() works as if __WALL is set if the child (natural
> or not) is traced.
>
>
> Jan, Pedro, could you please confirm this won't break gdb? I tried
> to look into gdb-7.1, and at first glance gdb uses __WCLONE only
> because __WALL doesn't work on older kernels, iow it seems to me
> that gdb actually wants __WALL so this change should be fine.
>
>
> Any other ideas?
>
> Oleg.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] wait/ptrace: always assume __WALL if the child is traced
  2015-10-20 17:17 ` [PATCH 1/2] " Oleg Nesterov
@ 2015-10-20 22:31   ` Andrew Morton
  2015-10-21  3:27     ` Vasily Averin
                       ` (2 more replies)
  2015-10-22 13:51   ` Denys Vlasenko
  1 sibling, 3 replies; 21+ messages in thread
From: Andrew Morton @ 2015-10-20 22:31 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Dmitry Vyukov, Alexander Potapenko, Denys Vlasenko, Eric Dumazet,
	Jan Kratochvil, Julien Tinnes, Kees Cook, Kostya Serebryany,
	Linus Torvalds, Michael Kerrisk (man-pages),
	Pedro Alves, Robert Swiecki, Roland McGrath, syzkaller,
	linux-kernel

On Tue, 20 Oct 2015 19:17:54 +0200 Oleg Nesterov <oleg@redhat.com> wrote:

> The following program (simplified version of generated by syzkaller)
> 
> 	#include <pthread.h>
> 	#include <unistd.h>
> 	#include <sys/ptrace.h>
> 	#include <stdio.h>
> 	#include <signal.h>
> 
> 	void *thread_func(void *arg)
> 	{
> 		ptrace(PTRACE_TRACEME, 0,0,0);
> 		return 0;
> 	}
> 
> 	int main(void)
> 	{
> 		pthread_t thread;
> 
> 		if (fork())
> 			return 0;
> 
> 		while (getppid() != 1)
> 			;
> 
> 		pthread_create(&thread, NULL, thread_func, NULL);
> 		pthread_join(thread, NULL);
> 		return 0;
> 	}
> 
> creates the unreapable zombie if /sbin/init doesn't use __WALL.
> 
> This is not a kernel bug, at least in a sense that everything works as
> expected: debugger should reap a traced sub-thread before it can reap
> the leader, but without __WALL/__WCLONE do_wait() ignores sub-threads.
> 
> Unfortunately, it seems that /sbin/init in most (all?) distributions
> doesn't use it and we have to change the kernel to avoid the problem.

Well, to fix this a distro needs to roll out a new kernel.  Or a new
init(8).  Is there any reason to believe that distributing/deploying a
new kernel is significantly easier for everyone?  Because fixing init
sounds like a much preferable solution to this problem.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] wait/ptrace: always assume __WALL if the child is traced
  2015-10-20 22:31   ` Andrew Morton
@ 2015-10-21  3:27     ` Vasily Averin
  2015-10-21 17:41     ` Oleg Nesterov
  2015-10-21 19:59     ` Denys Vlasenko
  2 siblings, 0 replies; 21+ messages in thread
From: Vasily Averin @ 2015-10-21  3:27 UTC (permalink / raw)
  To: syzkaller, Oleg Nesterov
  Cc: Dmitry Vyukov, Alexander Potapenko, Denys Vlasenko, Eric Dumazet,
	Jan Kratochvil, Julien Tinnes, Kees Cook, Kostya Serebryany,
	Linus Torvalds, Michael Kerrisk (man-pages),
	Pedro Alves, Robert Swiecki, Roland McGrath, linux-kernel

On 21.10.2015 01:31, Andrew Morton wrote:
> On Tue, 20 Oct 2015 19:17:54 +0200 Oleg Nesterov <oleg@redhat.com> wrote:
> 
>> The following program (simplified version of generated by syzkaller)
>>
>> 	#include <pthread.h>
>> 	#include <unistd.h>
>> 	#include <sys/ptrace.h>
>> 	#include <stdio.h>
>> 	#include <signal.h>
>>
>> 	void *thread_func(void *arg)
>> 	{
>> 		ptrace(PTRACE_TRACEME, 0,0,0);
>> 		return 0;
>> 	}
>>
>> 	int main(void)
>> 	{
>> 		pthread_t thread;
>>
>> 		if (fork())
>> 			return 0;
>>
>> 		while (getppid() != 1)
>> 			;
>>
>> 		pthread_create(&thread, NULL, thread_func, NULL);
>> 		pthread_join(thread, NULL);
>> 		return 0;
>> 	}
>>
>> creates the unreapable zombie if /sbin/init doesn't use __WALL.
>>
>> This is not a kernel bug, at least in a sense that everything works as
>> expected: debugger should reap a traced sub-thread before it can reap
>> the leader, but without __WALL/__WCLONE do_wait() ignores sub-threads.
>>
>> Unfortunately, it seems that /sbin/init in most (all?) distributions
>> doesn't use it and we have to change the kernel to avoid the problem.
> 
> Well, to fix this a distro needs to roll out a new kernel.  Or a new
> init(8).  Is there any reason to believe that distributing/deploying a
> new kernel is significantly easier for everyone?  Because fixing init
> sounds like a much preferable solution to this problem.

Patched kernel allows to run obsoleted distro inside containers.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] wait/ptrace: always assume __WALL if the child is traced
  2015-10-20 22:31   ` Andrew Morton
  2015-10-21  3:27     ` Vasily Averin
@ 2015-10-21 17:41     ` Oleg Nesterov
  2015-10-21 19:47       ` Andrew Morton
  2015-10-21 19:59     ` Denys Vlasenko
  2 siblings, 1 reply; 21+ messages in thread
From: Oleg Nesterov @ 2015-10-21 17:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dmitry Vyukov, Alexander Potapenko, Denys Vlasenko, Eric Dumazet,
	Jan Kratochvil, Julien Tinnes, Kees Cook, Kostya Serebryany,
	Linus Torvalds, Michael Kerrisk (man-pages),
	Pedro Alves, Robert Swiecki, Roland McGrath, syzkaller,
	linux-kernel

On 10/20, Andrew Morton wrote:
>
> On Tue, 20 Oct 2015 19:17:54 +0200 Oleg Nesterov <oleg@redhat.com> wrote:
>
> > The following program (simplified version of generated by syzkaller)
> >
> > 	#include <pthread.h>
> > 	#include <unistd.h>
> > 	#include <sys/ptrace.h>
> > 	#include <stdio.h>
> > 	#include <signal.h>
> >
> > 	void *thread_func(void *arg)
> > 	{
> > 		ptrace(PTRACE_TRACEME, 0,0,0);
> > 		return 0;
> > 	}
> >
> > 	int main(void)
> > 	{
> > 		pthread_t thread;
> >
> > 		if (fork())
> > 			return 0;
> >
> > 		while (getppid() != 1)
> > 			;
> >
> > 		pthread_create(&thread, NULL, thread_func, NULL);
> > 		pthread_join(thread, NULL);
> > 		return 0;
> > 	}
> >
> > creates the unreapable zombie if /sbin/init doesn't use __WALL.
> >
> > This is not a kernel bug, at least in a sense that everything works as
> > expected: debugger should reap a traced sub-thread before it can reap
> > the leader, but without __WALL/__WCLONE do_wait() ignores sub-threads.
> >
> > Unfortunately, it seems that /sbin/init in most (all?) distributions
> > doesn't use it and we have to change the kernel to avoid the problem.
>
> Well, to fix this a distro needs to roll out a new kernel.  Or a new
> init(8).  Is there any reason to believe that distributing/deploying a
> new kernel is significantly easier for everyone? Because fixing init
> sounds like a much preferable solution to this problem.

I will be happy if we decide that this is userpace problem and we should
not fix the kernel. I simply do not know.

However, please look at 2/2 which imho makes sense regardless and looks
"obviously safe". Without this patch waitid() can not use __WALL, so if
/sbin/init uses waitid() then the userspace fix won't be one-liner. And
at least Fedora22 and Ubuntu use waitid().

So personally I'd prefer 2/2 + fix-init, not sure if this can work...

Oleg.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] wait/ptrace: always assume __WALL if the child is traced
  2015-10-21 17:41     ` Oleg Nesterov
@ 2015-10-21 19:47       ` Andrew Morton
  2015-10-21 20:44         ` Oleg Nesterov
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2015-10-21 19:47 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Dmitry Vyukov, Alexander Potapenko, Denys Vlasenko, Eric Dumazet,
	Jan Kratochvil, Julien Tinnes, Kees Cook, Kostya Serebryany,
	Linus Torvalds, Michael Kerrisk (man-pages),
	Pedro Alves, Robert Swiecki, Roland McGrath, syzkaller,
	linux-kernel

On Wed, 21 Oct 2015 19:41:50 +0200 Oleg Nesterov <oleg@redhat.com> wrote:

> On 10/20, Andrew Morton wrote:
> >
> > On Tue, 20 Oct 2015 19:17:54 +0200 Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > > The following program (simplified version of generated by syzkaller)
> > >
> > > 	#include <pthread.h>
> > > 	#include <unistd.h>
> > > 	#include <sys/ptrace.h>
> > > 	#include <stdio.h>
> > > 	#include <signal.h>
> > >
> > > 	void *thread_func(void *arg)
> > > 	{
> > > 		ptrace(PTRACE_TRACEME, 0,0,0);
> > > 		return 0;
> > > 	}
> > >
> > > 	int main(void)
> > > 	{
> > > 		pthread_t thread;
> > >
> > > 		if (fork())
> > > 			return 0;
> > >
> > > 		while (getppid() != 1)
> > > 			;
> > >
> > > 		pthread_create(&thread, NULL, thread_func, NULL);
> > > 		pthread_join(thread, NULL);
> > > 		return 0;
> > > 	}
> > >
> > > creates the unreapable zombie if /sbin/init doesn't use __WALL.
> > >
> > > This is not a kernel bug, at least in a sense that everything works as
> > > expected: debugger should reap a traced sub-thread before it can reap
> > > the leader, but without __WALL/__WCLONE do_wait() ignores sub-threads.
> > >
> > > Unfortunately, it seems that /sbin/init in most (all?) distributions
> > > doesn't use it and we have to change the kernel to avoid the problem.
> >
> > Well, to fix this a distro needs to roll out a new kernel.  Or a new
> > init(8).  Is there any reason to believe that distributing/deploying a
> > new kernel is significantly easier for everyone? Because fixing init
> > sounds like a much preferable solution to this problem.
> 
> I will be happy if we decide that this is userpace problem and we should
> not fix the kernel. I simply do not know.

The kernel patch sounds pretty sketchy - something we should avoid
doing if at all possible.

> However, please look at 2/2 which imho makes sense regardless and looks
> "obviously safe". Without this patch waitid() can not use __WALL, so if
> /sbin/init uses waitid() then the userspace fix won't be one-liner. And
> at least Fedora22 and Ubuntu use waitid().

2/2 does look sensible (needs a better changelog if it's to be a
standalone thing), but if we're expecting distros to fix this with an
updated init(8) only, then they can't assume that the kernel's waitid()
has been altered.  So init will need the not-one-liner version of the
fix.

> So personally I'd prefer 2/2 + fix-init, not sure if this can work...

I'm just guessing here.  Are you (or someone) able to find out which
approach the distros will prefer, and why?

And what has to be done to init(8) to fix this bug when running current
kernels?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] wait/ptrace: always assume __WALL if the child is traced
  2015-10-20 22:31   ` Andrew Morton
  2015-10-21  3:27     ` Vasily Averin
  2015-10-21 17:41     ` Oleg Nesterov
@ 2015-10-21 19:59     ` Denys Vlasenko
  2015-10-21 20:31       ` Denys Vlasenko
  2 siblings, 1 reply; 21+ messages in thread
From: Denys Vlasenko @ 2015-10-21 19:59 UTC (permalink / raw)
  To: Andrew Morton, Oleg Nesterov
  Cc: Dmitry Vyukov, Alexander Potapenko, Eric Dumazet, Jan Kratochvil,
	Julien Tinnes, Kees Cook, Kostya Serebryany, Linus Torvalds,
	Michael Kerrisk (man-pages),
	Pedro Alves, Robert Swiecki, Roland McGrath, syzkaller,
	linux-kernel

On 10/21/2015 12:31 AM, Andrew Morton wrote:
> On Tue, 20 Oct 2015 19:17:54 +0200 Oleg Nesterov <oleg@redhat.com> wrote:
> 
>> The following program (simplified version of generated by syzkaller)
>>
>> 	#include <pthread.h>
>> 	#include <unistd.h>
>> 	#include <sys/ptrace.h>
>> 	#include <stdio.h>
>> 	#include <signal.h>
>>
>> 	void *thread_func(void *arg)
>> 	{
>> 		ptrace(PTRACE_TRACEME, 0,0,0);
>> 		return 0;
>> 	}
>>
>> 	int main(void)
>> 	{
>> 		pthread_t thread;
>>
>> 		if (fork())
>> 			return 0;
>>
>> 		while (getppid() != 1)
>> 			;
>>
>> 		pthread_create(&thread, NULL, thread_func, NULL);
>> 		pthread_join(thread, NULL);
>> 		return 0;
>> 	}
>>
>> creates the unreapable zombie if /sbin/init doesn't use __WALL.
>>
>> This is not a kernel bug, at least in a sense that everything works as
>> expected: debugger should reap a traced sub-thread before it can reap
>> the leader, but without __WALL/__WCLONE do_wait() ignores sub-threads.
>>
>> Unfortunately, it seems that /sbin/init in most (all?) distributions
>> doesn't use it and we have to change the kernel to avoid the problem.
> 
> Well, to fix this a distro needs to roll out a new kernel.  Or a new
> init(8).  Is there any reason to believe that distributing/deploying a
> new kernel is significantly easier for everyone?  Because fixing init
> sounds like a much preferable solution to this problem.

People will continue to write new init(8) implementations,
and they will miss this obscure case.

Before this bug was found, it was considered possible to use
a shell script as init process. What now, every shell needs to add
__WALL to its waitpids?

The use of PTRACE_TRACEME in this reproducer is clearly pathological:
PTRACE_TRACEME was never intended to be used to attach to unsuspecting
processes.

How about making PTRACE_TRACEME fail in this case?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] wait/ptrace: always assume __WALL if the child is traced
  2015-10-21 19:59     ` Denys Vlasenko
@ 2015-10-21 20:31       ` Denys Vlasenko
  2015-10-21 21:47         ` Oleg Nesterov
  0 siblings, 1 reply; 21+ messages in thread
From: Denys Vlasenko @ 2015-10-21 20:31 UTC (permalink / raw)
  To: Andrew Morton, Oleg Nesterov
  Cc: Dmitry Vyukov, Alexander Potapenko, Eric Dumazet, Jan Kratochvil,
	Julien Tinnes, Kees Cook, Kostya Serebryany, Linus Torvalds,
	Michael Kerrisk (man-pages),
	Pedro Alves, Robert Swiecki, Roland McGrath, syzkaller,
	linux-kernel

On 10/21/2015 09:59 PM, Denys Vlasenko wrote:
> On 10/21/2015 12:31 AM, Andrew Morton wrote:
>> On Tue, 20 Oct 2015 19:17:54 +0200 Oleg Nesterov <oleg@redhat.com> wrote:
>>
>>> The following program (simplified version of generated by syzkaller)
>>>
>>> 	#include <pthread.h>
>>> 	#include <unistd.h>
>>> 	#include <sys/ptrace.h>
>>> 	#include <stdio.h>
>>> 	#include <signal.h>
>>>
>>> 	void *thread_func(void *arg)
>>> 	{
>>> 		ptrace(PTRACE_TRACEME, 0,0,0);
>>> 		return 0;
>>> 	}
>>>
>>> 	int main(void)
>>> 	{
>>> 		pthread_t thread;
>>>
>>> 		if (fork())
>>> 			return 0;
>>>
>>> 		while (getppid() != 1)
>>> 			;
>>>
>>> 		pthread_create(&thread, NULL, thread_func, NULL);
>>> 		pthread_join(thread, NULL);
>>> 		return 0;
>>> 	}
>>>
>>> creates the unreapable zombie if /sbin/init doesn't use __WALL.
>>>
>>> This is not a kernel bug, at least in a sense that everything works as
>>> expected: debugger should reap a traced sub-thread before it can reap
>>> the leader, but without __WALL/__WCLONE do_wait() ignores sub-threads.
>>>
>>> Unfortunately, it seems that /sbin/init in most (all?) distributions
>>> doesn't use it and we have to change the kernel to avoid the problem.
>>
>> Well, to fix this a distro needs to roll out a new kernel.  Or a new
>> init(8).  Is there any reason to believe that distributing/deploying a
>> new kernel is significantly easier for everyone?  Because fixing init
>> sounds like a much preferable solution to this problem.
> 
> People will continue to write new init(8) implementations,
> and they will miss this obscure case.
> 
> Before this bug was found, it was considered possible to use
> a shell script as init process. What now, every shell needs to add
> __WALL to its waitpids?
> 
> The use of PTRACE_TRACEME in this reproducer is clearly pathological:
> PTRACE_TRACEME was never intended to be used to attach to unsuspecting
> processes.
> 
> How about making PTRACE_TRACEME fail in this case?

something like this


diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 787320d..285a58c 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -385,6 +385,17 @@ static int ptrace_traceme(void)
 	write_lock_irq(&tasklist_lock);
 	/* Are we already being traced? */
 	if (!current->ptrace) {
+		struct pid_namespace *pid_ns;
+
+		pid_ns = task_active_pid_ns(current->parent);
+		if (current->parent == pid_ns->child_reaper) {
+			/*
+			 * Our parent is init. We may be a reparented process
+			 * used for PTRACE_TRACEME zombie attack on init.
+			 */
+			goto nope;
+		}
+
 		ret = security_ptrace_traceme(current->parent);
 		/*
 		 * Check PF_EXITING to ensure ->real_parent has not passed
@@ -395,6 +406,7 @@ static int ptrace_traceme(void)
 			current->ptrace = PT_PTRACED;
 			__ptrace_link(current, current->real_parent);
 		}
+ nope: ;
 	}
 	write_unlock_irq(&tasklist_lock);



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] wait/ptrace: always assume __WALL if the child is traced
  2015-10-21 19:47       ` Andrew Morton
@ 2015-10-21 20:44         ` Oleg Nesterov
  0 siblings, 0 replies; 21+ messages in thread
From: Oleg Nesterov @ 2015-10-21 20:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dmitry Vyukov, Alexander Potapenko, Denys Vlasenko, Eric Dumazet,
	Jan Kratochvil, Julien Tinnes, Kees Cook, Kostya Serebryany,
	Linus Torvalds, Michael Kerrisk (man-pages),
	Pedro Alves, Robert Swiecki, Roland McGrath, syzkaller,
	linux-kernel

On 10/21, Andrew Morton wrote:
>
> On Wed, 21 Oct 2015 19:41:50 +0200 Oleg Nesterov <oleg@redhat.com> wrote:
>
> > On 10/20, Andrew Morton wrote:
> > >
> > > On Tue, 20 Oct 2015 19:17:54 +0200 Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > > > This is not a kernel bug, at least in a sense that everything works as
> > > > expected: debugger should reap a traced sub-thread before it can reap
> > > > the leader, but without __WALL/__WCLONE do_wait() ignores sub-threads.
> > > >
> > > > Unfortunately, it seems that /sbin/init in most (all?) distributions
> > > > doesn't use it and we have to change the kernel to avoid the problem.
> > >
> > > Well, to fix this a distro needs to roll out a new kernel.  Or a new
> > > init(8).  Is there any reason to believe that distributing/deploying a
> > > new kernel is significantly easier for everyone? Because fixing init
> > > sounds like a much preferable solution to this problem.
> >
> > I will be happy if we decide that this is userpace problem and we should
> > not fix the kernel. I simply do not know.
>
> The kernel patch sounds pretty sketchy - something we should avoid
> doing if at all possible.

Yes, I agree.

> > However, please look at 2/2 which imho makes sense regardless and looks
> > "obviously safe". Without this patch waitid() can not use __WALL, so if
> > /sbin/init uses waitid() then the userspace fix won't be one-liner. And
> > at least Fedora22 and Ubuntu use waitid().
>
> 2/2 does look sensible (needs a better changelog if it's to be a
> standalone thing),

Yes. Without 1/2 the changlelog should menetion that at least __WALL
makes sense because /sbin/init has a good reason to use waitid(WALL).

Plus it should cc -stable.

> but if we're expecting distros to fix this with an
> updated init(8) only, then they can't assume that the kernel's waitid()
> has been altered.

Well, 2/2 looks safe for every kernel version... starting from git
history at least.

> So init will need the not-one-liner version of the
> fix.

Then I think this fix will stay forever ;)

> > So personally I'd prefer 2/2 + fix-init, not sure if this can work...
>
> I'm just guessing here.  Are you (or someone) able to find out which
> approach the distros will prefer, and why?

No, I have no idea, sorry.

> And what has to be done to init(8) to fix this bug when running current
> kernels?

Say, http://git.busybox.net/busybox/tree/init/init.c
at first glance it just needs


	-	wpid = waitpid(-1, NULL, maybe_WNOHANG);
	+	wpid = waitpid(-1, NULL, maybe_WNOHANG | __WALL);


I have a testing machine running Fedora22, according to strace
/bin/systemd does

	waitid(P_ALL, 0, {}, WNOHANG|WEXITED|WNOWAIT, NULL);
	...
	waitid(P_PID, 25558, {INFO}, WEXITED, NULL)

so it probably wants siginfo and thus it can't use waitpid. Without
2/2 systemd can probably just do something like

	while (waitpid(-1, NULL, __WCLONE | WNOHANG) != ESRCH) {
		log("Dmitry Vyukov detected");
	}

every time it does waitid() to reap the traced subthreads.

Unless of course systemd itself uses ptrace or forks a child with
(clone_flags & CSIGNAL) != SIGCHLD. Unlikely, but who knows.

In any case I think the fix should be simple. 2/2 can help, most
probably systemd can too just add __WALL to wait options.

Oleg.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] wait/ptrace: always assume __WALL if the child is traced
  2015-10-21 20:31       ` Denys Vlasenko
@ 2015-10-21 21:47         ` Oleg Nesterov
  2015-10-21 23:27           ` Denys Vlasenko
  0 siblings, 1 reply; 21+ messages in thread
From: Oleg Nesterov @ 2015-10-21 21:47 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Andrew Morton, Dmitry Vyukov, Alexander Potapenko, Eric Dumazet,
	Jan Kratochvil, Julien Tinnes, Kees Cook, Kostya Serebryany,
	Linus Torvalds, Michael Kerrisk (man-pages),
	Pedro Alves, Robert Swiecki, Roland McGrath, syzkaller,
	linux-kernel

On 10/21, Denys Vlasenko wrote:
>
> On 10/21/2015 09:59 PM, Denys Vlasenko wrote:
> > On 10/21/2015 12:31 AM, Andrew Morton wrote:
> >> Well, to fix this a distro needs to roll out a new kernel.  Or a new
> >> init(8).  Is there any reason to believe that distributing/deploying a
> >> new kernel is significantly easier for everyone?  Because fixing init
> >> sounds like a much preferable solution to this problem.
> >
> > People will continue to write new init(8) implementations,
> > and they will miss this obscure case.
> >
> > Before this bug was found, it was considered possible to use
> > a shell script as init process. What now, every shell needs to add
> > __WALL to its waitpids?

Why not? I think it can safely use __WALL too.

> > The use of PTRACE_TRACEME in this reproducer is clearly pathological:
> > PTRACE_TRACEME was never intended to be used to attach to unsuspecting
> > processes.

Sure. But people do the things which were never intended to be
used all the time. We simply can not know if this "feature"
already has a creative user or not.

As for the patch,

> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -385,6 +385,17 @@ static int ptrace_traceme(void)
>  	write_lock_irq(&tasklist_lock);
>  	/* Are we already being traced? */
>  	if (!current->ptrace) {
> +		struct pid_namespace *pid_ns;
> +
> +		pid_ns = task_active_pid_ns(current->parent);
> +		if (current->parent == pid_ns->child_reaper) {

Well, at least this needs same_thread_group(parent, child_reaper).

Plus we have PR_SET_CHILD_SUBREAPER so we also need to traverse the
->real_parent list if has_subreaper.

Finally it is not clear which ->child_reaper we should use after
setns(pidns_fd).

This all is fixable (although this again reminds me about a bug
with CHILD_SUBREAPER we probably need to fix first). But I didn't
even try to consider this option because it can break something.



And honestly, personally I don't like it. If we believe that we
can do this because "PTRACE_TRACEME was never intended to be used
to attach to unsuspecting processes", then we need a more generic
change, imo.

See http://marc.info/?l=linux-kernel&m=144536282305282 . Just in
case, it is not that I think "parent_exec_id != self_exec_id" is
all we need. This needs more discussion.

Oleg.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] wait/ptrace: always assume __WALL if the child is traced
  2015-10-21 21:47         ` Oleg Nesterov
@ 2015-10-21 23:27           ` Denys Vlasenko
  2015-10-25 15:54             ` Oleg Nesterov
  0 siblings, 1 reply; 21+ messages in thread
From: Denys Vlasenko @ 2015-10-21 23:27 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Denys Vlasenko, Andrew Morton, Dmitry Vyukov,
	Alexander Potapenko, Eric Dumazet, Jan Kratochvil, Julien Tinnes,
	Kees Cook, Kostya Serebryany, Linus Torvalds,
	Michael Kerrisk (man-pages),
	Pedro Alves, Robert Swiecki, Roland McGrath, syzkaller,
	Linux Kernel Mailing List

On Wed, Oct 21, 2015 at 11:47 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 10/21, Denys Vlasenko wrote:
>>
>> On 10/21/2015 09:59 PM, Denys Vlasenko wrote:
>> > On 10/21/2015 12:31 AM, Andrew Morton wrote:
>> >> Well, to fix this a distro needs to roll out a new kernel.  Or a new
>> >> init(8).  Is there any reason to believe that distributing/deploying a
>> >> new kernel is significantly easier for everyone?  Because fixing init
>> >> sounds like a much preferable solution to this problem.
>> >
>> > People will continue to write new init(8) implementations,
>> > and they will miss this obscure case.
>> >
>> > Before this bug was found, it was considered possible to use
>> > a shell script as init process. What now, every shell needs to add
>> > __WALL to its waitpids?
>
> Why not? I think it can safely use __WALL too.

Because having any userspace program which can happen to be init,
which includes all shells out there in the wild
(bash, dash, ash, ksh, zsh, msh, hush,...)
learn about __WALL is wrong: apart from this wart, they do not have
to use any Linux-specific code. It can all be perfectly legitimate POSIX.

>> > The use of PTRACE_TRACEME in this reproducer is clearly pathological:
>> > PTRACE_TRACEME was never intended to be used to attach to unsuspecting
>> > processes.
>
> Sure. But people do the things which were never intended to be
> used all the time. We simply can not know if this "feature"
> already has a creative user or not.

It won't be unfixable: they will just have to switch from PTRACE_TRACEME
to PTRACE_ATTACH.

As of now we do not know any people craz^W creative enough
to create a cross between init and strace. If such specimens would
materialize, don't they deserve to have to make that change?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] wait/ptrace: always assume __WALL if the child is traced
  2015-10-20 17:17 ` [PATCH 1/2] " Oleg Nesterov
  2015-10-20 22:31   ` Andrew Morton
@ 2015-10-22 13:51   ` Denys Vlasenko
  1 sibling, 0 replies; 21+ messages in thread
From: Denys Vlasenko @ 2015-10-22 13:51 UTC (permalink / raw)
  To: Oleg Nesterov, Andrew Morton, Dmitry Vyukov
  Cc: Alexander Potapenko, Eric Dumazet, Jan Kratochvil, Julien Tinnes,
	Kees Cook, Kostya Serebryany, Linus Torvalds,
	Michael Kerrisk (man-pages),
	Pedro Alves, Robert Swiecki, Roland McGrath, syzkaller,
	linux-kernel

On 10/20/2015 07:17 PM, Oleg Nesterov wrote:
> creates the unreapable zombie if /sbin/init doesn't use __WALL.
> 
> This is not a kernel bug, at least in a sense that everything works as
> expected: debugger should reap a traced sub-thread before it can reap
> the leader, but without __WALL/__WCLONE do_wait() ignores sub-threads.
> 
> Unfortunately, it seems that /sbin/init in most (all?) distributions
> doesn't use it and we have to change the kernel to avoid the problem.
> 
> This patch just adds the "ptrace" check into eligible_child(). To some
> degree this matches the "tsk->ptrace" in exit_notify(), ->exit_signal
> is mostly ignored when the tracee reports to debugger.
> 
> This obviously means the user-visible change: __WCLONE and __WALL no
> longer have any meaning for debugger. And I can only hope that this
> won't break something.
> 
> We could make a more conservative change. Say, we can take __WCLONE
> into account, or !thread_group_leader(). But it would be nice to not
> complicate these historical/confusing checks.

For the record, I like this way of fixing the problem too.

It removes the need to ever use __WALL in userspace:
IIRC the only case where it was needed are debuggers.

Which in turn makes *all* underscored wait flags (__WCLONE, __WALL,
__WNOTHREAD) unnecessary for any sane application use.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/2] wait/ptrace: always assume __WALL if the child is traced
  2015-10-20 17:17 [PATCH 0/2] wait/ptrace: always assume __WALL if the child is traced Oleg Nesterov
                   ` (2 preceding siblings ...)
  2015-10-20 17:36 ` [PATCH 0/2] wait/ptrace: always assume __WALL if the child is traced Oleg Nesterov
@ 2015-10-22 14:40 ` Pedro Alves
  2015-10-25 15:42   ` Oleg Nesterov
  3 siblings, 1 reply; 21+ messages in thread
From: Pedro Alves @ 2015-10-22 14:40 UTC (permalink / raw)
  To: Oleg Nesterov, Andrew Morton, Dmitry Vyukov
  Cc: Alexander Potapenko, Denys Vlasenko, Eric Dumazet,
	Jan Kratochvil, Julien Tinnes, Kees Cook, Kostya Serebryany,
	Linus Torvalds, Michael Kerrisk (man-pages),
	Robert Swiecki, Roland McGrath, syzkaller, linux-kernel, gdb

On 10/20/2015 06:17 PM, Oleg Nesterov wrote:

> Jan, Pedro, could you please confirm this won't break gdb? I tried
> to look into gdb-7.1, and at first glance gdb uses __WCLONE only
> because __WALL doesn't work on older kernels, iow it seems to me
> that gdb actually wants __WALL so this change should be fine.

Right, gdb actually wants __WALL, but it doesn't use it to keep
compatibility with kernels that predate it.

gdb nowadays has an __WALL emulation waitpid wrapper
(alternates __WCLONE+WNOHANG with 0+WNOHANG, blocks
on sigsuspend/SIGCHLD):

 https://sourceware.org/git/?p=binutils-gdb.git;a=blob;f=gdb/nat/linux-waitpid.c;h=cbcdd95afa9c664993542b0b3851e79fbae4e1df;hb=HEAD#l77

Though it's not used everywhere.  Some older code in the
ptrace backend open codes the "try __WCLONE, then try !__WCLONE."
dance.

Seems like __WALL was added in Linux 2.4; gdb could probably
assume it's available nowadays...

In any case, to make sure existing gdb binaries would still work
with your kernel change, I ran GDB's testsuite with this:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
diff --git a/gdb/nat/linux-waitpid.c b/gdb/nat/linux-waitpid.c
index cbcdd95..864ba2e 100644
--- a/gdb/nat/linux-waitpid.c
+++ b/gdb/nat/linux-waitpid.c
@@ -149,3 +149,17 @@ my_waitpid (int pid, int *status, int flags)
   errno = out_errno;
   return ret;
 }
+
+#include <dlfcn.h>
+
+pid_t
+waitpid (pid_t pid, int *status, int options)
+{
+  static pid_t (*waitpid2) (pid_t pid, int *status, int options) = NULL;
+
+  if (waitpid2 == NULL)
+    waitpid2 = dlsym (RTLD_NEXT, "waitpid");
+
+  options |= __WALL;
+  return waitpid2 (pid, status, options);
+}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

and got no regressions.  So seems like all would be well from
GDB's perspective.

Thanks,
Pedro Alves


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/2] wait/ptrace: always assume __WALL if the child is traced
  2015-10-22 14:40 ` Pedro Alves
@ 2015-10-25 15:42   ` Oleg Nesterov
  0 siblings, 0 replies; 21+ messages in thread
From: Oleg Nesterov @ 2015-10-25 15:42 UTC (permalink / raw)
  To: Pedro Alves
  Cc: Andrew Morton, Dmitry Vyukov, Alexander Potapenko,
	Denys Vlasenko, Eric Dumazet, Jan Kratochvil, Julien Tinnes,
	Kees Cook, Kostya Serebryany, Linus Torvalds,
	Michael Kerrisk (man-pages),
	Robert Swiecki, Roland McGrath, syzkaller, linux-kernel, gdb

On 10/22, Pedro Alves wrote:
>
> In any case, to make sure existing gdb binaries would still work
> with your kernel change, I ran GDB's testsuite with this:
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> diff --git a/gdb/nat/linux-waitpid.c b/gdb/nat/linux-waitpid.c
> index cbcdd95..864ba2e 100644
> --- a/gdb/nat/linux-waitpid.c
> +++ b/gdb/nat/linux-waitpid.c
> @@ -149,3 +149,17 @@ my_waitpid (int pid, int *status, int flags)
>    errno = out_errno;
>    return ret;
>  }
> +
> +#include <dlfcn.h>
> +
> +pid_t
> +waitpid (pid_t pid, int *status, int options)
> +{
> +  static pid_t (*waitpid2) (pid_t pid, int *status, int options) = NULL;
> +
> +  if (waitpid2 == NULL)
> +    waitpid2 = dlsym (RTLD_NEXT, "waitpid");
> +
> +  options |= __WALL;
> +  return waitpid2 (pid, status, options);
> +}
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Thanks a lot Pedro!

So gdb should be fine, strace too. Perhaps we should change the kernel
this way and forget about /sbin/init fixes.

Oleg.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] wait/ptrace: always assume __WALL if the child is traced
  2015-10-21 23:27           ` Denys Vlasenko
@ 2015-10-25 15:54             ` Oleg Nesterov
  2015-10-26 12:08               ` Pedro Alves
  0 siblings, 1 reply; 21+ messages in thread
From: Oleg Nesterov @ 2015-10-25 15:54 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Denys Vlasenko, Andrew Morton, Dmitry Vyukov,
	Alexander Potapenko, Eric Dumazet, Jan Kratochvil, Julien Tinnes,
	Kees Cook, Kostya Serebryany, Linus Torvalds,
	Michael Kerrisk (man-pages),
	Pedro Alves, Robert Swiecki, Roland McGrath, syzkaller,
	Linux Kernel Mailing List

On 10/22, Denys Vlasenko wrote:
>
> On Wed, Oct 21, 2015 at 11:47 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> > On 10/21, Denys Vlasenko wrote:
> >>
> >> On 10/21/2015 09:59 PM, Denys Vlasenko wrote:
> >> > On 10/21/2015 12:31 AM, Andrew Morton wrote:
> >> >> Well, to fix this a distro needs to roll out a new kernel.  Or a new
> >> >> init(8).  Is there any reason to believe that distributing/deploying a
> >> >> new kernel is significantly easier for everyone?  Because fixing init
> >> >> sounds like a much preferable solution to this problem.
> >> >
> >> > People will continue to write new init(8) implementations,
> >> > and they will miss this obscure case.
> >> >
> >> > Before this bug was found, it was considered possible to use
> >> > a shell script as init process. What now, every shell needs to add
> >> > __WALL to its waitpids?
> >
> > Why not? I think it can safely use __WALL too.
>
> Because having any userspace program which can happen to be init,
> which includes all shells out there in the wild
> (bash, dash, ash, ksh, zsh, msh, hush,...)
> learn about __WALL is wrong: apart from this wart, they do not have
> to use any Linux-specific code. It can all be perfectly legitimate POSIX.

Yes, this is true. I meant that they could safely use __WALL to, but I
understand that this change can be painful.

> > Sure. But people do the things which were never intended to be
> > used all the time. We simply can not know if this "feature"
> > already has a creative user or not.
>
> It won't be unfixable: they will just have to switch from PTRACE_TRACEME
> to PTRACE_ATTACH.
>
> As of now we do not know any people craz^W creative enough
> to create a cross between init and strace. If such specimens would
> materialize, don't they deserve to have to make that change?

This also applies to people who use bash/whatever as /sbin/init and allow
the untrusted users to run the exploits ;) I do not know who is more crazy.

In any case, the real question is whether we should change the kernel to
fix the problem, or ask the distros to fix their init's. In the former
case 1/2 looks simpler/safer to me than the change in ptrace_traceme(),
and you seem to agree that 1/2 is not that bad.

Oleg.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] wait/ptrace: always assume __WALL if the child is traced
  2015-10-25 15:54             ` Oleg Nesterov
@ 2015-10-26 12:08               ` Pedro Alves
  2015-10-28 16:11                 ` Oleg Nesterov
  0 siblings, 1 reply; 21+ messages in thread
From: Pedro Alves @ 2015-10-26 12:08 UTC (permalink / raw)
  To: Oleg Nesterov, Denys Vlasenko
  Cc: Denys Vlasenko, Andrew Morton, Dmitry Vyukov,
	Alexander Potapenko, Eric Dumazet, Jan Kratochvil, Julien Tinnes,
	Kees Cook, Kostya Serebryany, Linus Torvalds,
	Michael Kerrisk (man-pages),
	Robert Swiecki, Roland McGrath, syzkaller,
	Linux Kernel Mailing List, gdb

On 10/25/2015 03:54 PM, Oleg Nesterov wrote:
> On 10/22, Denys Vlasenko wrote:
>>
>> On Wed, Oct 21, 2015 at 11:47 PM, Oleg Nesterov <oleg@redhat.com> wrote:
>>> On 10/21, Denys Vlasenko wrote:
>>>>
>>>> On 10/21/2015 09:59 PM, Denys Vlasenko wrote:
>>>>> On 10/21/2015 12:31 AM, Andrew Morton wrote:
>>>>>> Well, to fix this a distro needs to roll out a new kernel.  Or a new
>>>>>> init(8).  Is there any reason to believe that distributing/deploying a
>>>>>> new kernel is significantly easier for everyone?  Because fixing init
>>>>>> sounds like a much preferable solution to this problem.
>>>>>
>>>>> People will continue to write new init(8) implementations,
>>>>> and they will miss this obscure case.
>>>>>
>>>>> Before this bug was found, it was considered possible to use
>>>>> a shell script as init process. What now, every shell needs to add
>>>>> __WALL to its waitpids?
>>>
>>> Why not? I think it can safely use __WALL too.
>>
>> Because having any userspace program which can happen to be init,
>> which includes all shells out there in the wild
>> (bash, dash, ash, ksh, zsh, msh, hush,...)
>> learn about __WALL is wrong: apart from this wart, they do not have
>> to use any Linux-specific code. It can all be perfectly legitimate POSIX.
> 
> Yes, this is true. I meant that they could safely use __WALL to, but I
> understand that this change can be painful.
> 
>>> Sure. But people do the things which were never intended to be
>>> used all the time. We simply can not know if this "feature"
>>> already has a creative user or not.
>>
>> It won't be unfixable: they will just have to switch from PTRACE_TRACEME
>> to PTRACE_ATTACH.
>>
>> As of now we do not know any people craz^W creative enough
>> to create a cross between init and strace. If such specimens would
>> materialize, don't they deserve to have to make that change?
> 
> This also applies to people who use bash/whatever as /sbin/init and allow
> the untrusted users to run the exploits ;) I do not know who is more crazy.
> 
> In any case, the real question is whether we should change the kernel to
> fix the problem, or ask the distros to fix their init's. In the former
> case 1/2 looks simpler/safer to me than the change in ptrace_traceme(),
> and you seem to agree that 1/2 is not that bad.

A risk here seems to be that waitpid will start returning unexpected
(thread) PIDs to parent processes, and it's not unreasonable to assume
that e.g., a program asserts that waitpid either returns error or a
known (process) PID.

That's not an init-only issue, but something that might affect any
process that runs a child that happens to decide to
call PTRACE_TRACEME.

The ptrace man page says:

 "A process can initiate a trace by calling fork(2) and having the resulting
 child do a PTRACE_TRACEME, followed (typically) by an execve(2)."

Given that, can we instead make the kernel error out on PTRACE_TRACEME issued
from a non-leader thread?  Then between PTRACE_TRACEME and the parent's
waitpid, __WALL or !__WALL should make no difference.

(Also, in the original test case, if the child gets/raises a signal or execs
before exiting, the bash/init/whatever process won't be issuing PTRACE_CONT,
and the child will thus end up stuck (though should be SIGKILLable,
I believe).  All this because PTRACE_TRACEME is broken by design by making it
be the child's choice whether to be traced...)

Thanks,
Pedro Alves


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] wait/ptrace: always assume __WALL if the child is traced
  2015-10-28 16:11                 ` Oleg Nesterov
@ 2015-10-28 15:43                   ` Pedro Alves
  2015-10-28 19:02                     ` Oleg Nesterov
  0 siblings, 1 reply; 21+ messages in thread
From: Pedro Alves @ 2015-10-28 15:43 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Denys Vlasenko, Denys Vlasenko, Andrew Morton, Dmitry Vyukov,
	Alexander Potapenko, Eric Dumazet, Jan Kratochvil, Julien Tinnes,
	Kees Cook, Kostya Serebryany, Linus Torvalds,
	Michael Kerrisk (man-pages),
	Robert Swiecki, Roland McGrath, syzkaller,
	Linux Kernel Mailing List, gdb

On 10/28/2015 04:11 PM, Oleg Nesterov wrote:
> On 10/26, Pedro Alves wrote:
>>
>> On 10/25/2015 03:54 PM, Oleg Nesterov wrote:
>>>
>>> In any case, the real question is whether we should change the kernel to
>>> fix the problem, or ask the distros to fix their init's. In the former
>>> case 1/2 looks simpler/safer to me than the change in ptrace_traceme(),
>>> and you seem to agree that 1/2 is not that bad.
>>
>> A risk here seems to be that waitpid will start returning unexpected
>> (thread) PIDs to parent processes,
> 
> I don't see how this change can make the things worse,
> 
>> and it's not unreasonable to assume
>> that e.g., a program asserts that waitpid either returns error or a
>> known (process) PID.
> 
> Well. /sbin/init can never assume this, obviously.

Right.  I was actually thinking of !init processes -- basically code
that spawns helper processes, keeps a data structure indexed by pid, then
discards the structure when the child exits.  Something like:

 pid = waitpid(-1, &status, 0);
 if (pid > 0)
 {
   struct child_process *child = find_process(pid);
   assert (child != NULL);
 }

As in, before your change, the child could get stuck forever, but after your
change, the parent could die/assert instead.

But ...

> 
>> That's not an init-only issue,
> 
> Yes. Because we have CLONE_PARENT. So "waitpid either returns error or a
> known (process) PID" is only true if you trust your children.

... OK, that's indeed a good point.

>> (Also, in the original test case, if the child gets/raises a signal or execs
>> before exiting, the bash/init/whatever process won't be issuing PTRACE_CONT,
>> and the child will thus end up stuck (though should be SIGKILLable,
>
> Oh, but if it is killable everything is fine. How does this differ from the
> case when, say, you jusr reparent to init and do kill(getpid(), SIGSTOP) ?

The difference is that if the child called PTRACE_TRACEME, then it goes
to ptrace-stop instead and no amount of SIGCONT unstucks it -- the only way
out is force killing.  I agree it's not a major issue as there's a way out
(and thus made it a parens), but I wouldn't call it nice either.

>> All this because PTRACE_TRACEME is broken by design
>
> Heh. I agree. But we can't fix it now.

Perhaps the man page could document it as deprecated, suggesting
PTRACE_ATTACH/PTRACE_SEIZE instead?

Thanks,
Pedro Alves


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] wait/ptrace: always assume __WALL if the child is traced
  2015-10-26 12:08               ` Pedro Alves
@ 2015-10-28 16:11                 ` Oleg Nesterov
  2015-10-28 15:43                   ` Pedro Alves
  0 siblings, 1 reply; 21+ messages in thread
From: Oleg Nesterov @ 2015-10-28 16:11 UTC (permalink / raw)
  To: Pedro Alves
  Cc: Denys Vlasenko, Denys Vlasenko, Andrew Morton, Dmitry Vyukov,
	Alexander Potapenko, Eric Dumazet, Jan Kratochvil, Julien Tinnes,
	Kees Cook, Kostya Serebryany, Linus Torvalds,
	Michael Kerrisk (man-pages),
	Robert Swiecki, Roland McGrath, syzkaller,
	Linux Kernel Mailing List, gdb

On 10/26, Pedro Alves wrote:
>
> On 10/25/2015 03:54 PM, Oleg Nesterov wrote:
> >
> > In any case, the real question is whether we should change the kernel to
> > fix the problem, or ask the distros to fix their init's. In the former
> > case 1/2 looks simpler/safer to me than the change in ptrace_traceme(),
> > and you seem to agree that 1/2 is not that bad.
>
> A risk here seems to be that waitpid will start returning unexpected
> (thread) PIDs to parent processes,

I don't see how this change can make the things worse,

> and it's not unreasonable to assume
> that e.g., a program asserts that waitpid either returns error or a
> known (process) PID.

Well. /sbin/init can never assume this, obviously.

> That's not an init-only issue,

Yes. Because we have CLONE_PARENT. So "waitpid either returns error or a
known (process) PID" is only true if you trust your children.

> but something that might affect any
> process that runs a child that happens to decide to
> call PTRACE_TRACEME.
>
> The ptrace man page says:
>
>  "A process can initiate a trace by calling fork(2) and having the resulting
>  child do a PTRACE_TRACEME, followed (typically) by an execve(2)."
>
> Given that, can we instead make the kernel error out on PTRACE_TRACEME issued
> from a non-leader thread?  Then between PTRACE_TRACEME and the parent's
> waitpid, __WALL or !__WALL should make no difference.

Afaics not really. A group leader can do PTRACE_TRACEME and then
clone(CLONE_THREAD | CLONE_PTRACE) with the same effect.

> (Also, in the original test case, if the child gets/raises a signal or execs
> before exiting, the bash/init/whatever process won't be issuing PTRACE_CONT,
> and the child will thus end up stuck (though should be SIGKILLable,

Oh, but if it is killable everything is fine. How does this differ from the
case when, say, you jusr reparent to init and do kill(getpid(), SIGSTOP) ?

> All this because PTRACE_TRACEME is broken by design

Heh. I agree. But we can't fix it now.

Oleg.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] wait/ptrace: always assume __WALL if the child is traced
  2015-10-28 15:43                   ` Pedro Alves
@ 2015-10-28 19:02                     ` Oleg Nesterov
  0 siblings, 0 replies; 21+ messages in thread
From: Oleg Nesterov @ 2015-10-28 19:02 UTC (permalink / raw)
  To: Pedro Alves
  Cc: Denys Vlasenko, Denys Vlasenko, Andrew Morton, Dmitry Vyukov,
	Alexander Potapenko, Eric Dumazet, Jan Kratochvil, Julien Tinnes,
	Kees Cook, Kostya Serebryany, Linus Torvalds,
	Michael Kerrisk (man-pages),
	Robert Swiecki, Roland McGrath, syzkaller,
	Linux Kernel Mailing List, gdb

On 10/28, Pedro Alves wrote:
>
> On 10/28/2015 04:11 PM, Oleg Nesterov wrote:
> > On 10/26, Pedro Alves wrote:
> >>
> >> (Also, in the original test case, if the child gets/raises a signal or execs
> >> before exiting, the bash/init/whatever process won't be issuing PTRACE_CONT,
> >> and the child will thus end up stuck (though should be SIGKILLable,
> >
> > Oh, but if it is killable everything is fine. How does this differ from the
> > case when, say, you jusr reparent to init and do kill(getpid(), SIGSTOP) ?
>
> The difference is that if the child called PTRACE_TRACEME, then it goes
> to ptrace-stop instead and no amount of SIGCONT unstucks it -- the only way
> out is force killing.  I agree it's not a major issue as there's a way out
> (and thus made it a parens), but I wouldn't call it nice either.

IOW, the difference is that it is TASK_TRACED, not TASK_STOPPED. I agree,
this is not nice. But this is not nice simply because PTRACE_TRACEME is
not nice.

> >> All this because PTRACE_TRACEME is broken by design
> >
> > Heh. I agree. But we can't fix it now.
>
> Perhaps the man page could document it as deprecated, suggesting
> PTRACE_ATTACH/PTRACE_SEIZE instead?

I don't know. but I won't mind if you mark PTRACE_ATTACH as deprecated
too ;) PTRACE_SEIZE can be used instead and it doesn't abuse SIGSTOP.

Oleg.


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2015-10-28 18:06 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-20 17:17 [PATCH 0/2] wait/ptrace: always assume __WALL if the child is traced Oleg Nesterov
2015-10-20 17:17 ` [PATCH 1/2] " Oleg Nesterov
2015-10-20 22:31   ` Andrew Morton
2015-10-21  3:27     ` Vasily Averin
2015-10-21 17:41     ` Oleg Nesterov
2015-10-21 19:47       ` Andrew Morton
2015-10-21 20:44         ` Oleg Nesterov
2015-10-21 19:59     ` Denys Vlasenko
2015-10-21 20:31       ` Denys Vlasenko
2015-10-21 21:47         ` Oleg Nesterov
2015-10-21 23:27           ` Denys Vlasenko
2015-10-25 15:54             ` Oleg Nesterov
2015-10-26 12:08               ` Pedro Alves
2015-10-28 16:11                 ` Oleg Nesterov
2015-10-28 15:43                   ` Pedro Alves
2015-10-28 19:02                     ` Oleg Nesterov
2015-10-22 13:51   ` Denys Vlasenko
2015-10-20 17:17 ` [PATCH 2/2] wait: allow sys_waitid() to use __WNOTHREAD/__WCLONE/__WALL Oleg Nesterov
2015-10-20 17:36 ` [PATCH 0/2] wait/ptrace: always assume __WALL if the child is traced Oleg Nesterov
2015-10-22 14:40 ` Pedro Alves
2015-10-25 15:42   ` Oleg Nesterov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).