linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] prctl: add PR_[GS]ET_PDEATHSIG_PROC
@ 2017-09-09  9:40 Jürg Billeter
  2017-09-12 17:05 ` Oleg Nesterov
  2017-09-29 12:30 ` [RESEND PATCH] " Jürg Billeter
  0 siblings, 2 replies; 24+ messages in thread
From: Jürg Billeter @ 2017-09-09  9:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Eric Biederman, linux-kernel, Jürg Billeter

PR_SET_PDEATHSIG sets a parent death signal that the calling process
will get when its parent thread dies, even when the result of getppid()
doesn't change because the calling process is reparented to a different
thread in the same parent process. When managing multiple processes, a
process-based parent death signal is much more useful. E.g., to avoid
stray child processes.

PR_SET_PDEATHSIG_PROC sets a process-based death signal. Unlike
PR_SET_PDEATHSIG, this is inherited across fork to allow killing a whole
subtree without race conditions.

This can be used for sandboxing when combined with a seccomp filter.

There have been previous attempts to support this by changing the
behavior of PR_SET_PDEATHSIG. However, that would break existing
applications. See https://marc.info/?l=linux-kernel&m=117621804801689
and https://bugzilla.kernel.org/show_bug.cgi?id=43300

Signed-off-by: Jürg Billeter <j@bitron.ch>
---
 fs/exec.c                    |  1 +
 include/linux/sched/signal.h |  3 +++
 include/uapi/linux/prctl.h   |  4 ++++
 kernel/cred.c                |  1 +
 kernel/exit.c                |  4 ++++
 kernel/fork.c                |  2 ++
 kernel/sys.c                 | 11 +++++++++++
 security/apparmor/lsm.c      |  1 +
 security/selinux/hooks.c     |  1 +
 9 files changed, 28 insertions(+)

diff --git a/fs/exec.c b/fs/exec.c
index 01a9fb9d8ac3..bb389c3c596d 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1353,6 +1353,7 @@ void setup_new_exec(struct linux_binprm * bprm)
 	if (bprm->secureexec) {
 		/* Make sure parent cannot signal privileged process. */
 		current->pdeath_signal = 0;
+		current->signal->pdeath_signal_proc = 0;
 
 		/*
 		 * For secureexec, reset the stack limit to sane default to
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 2a0dd40b15db..c5c137e5ef39 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -103,6 +103,9 @@ struct signal_struct {
 	int			group_stop_count;
 	unsigned int		flags; /* see SIGNAL_* flags below */
 
+	/* The signal sent when the parent dies: */
+	int			pdeath_signal_proc;
+
 	/*
 	 * PR_SET_CHILD_SUBREAPER marks a process, like a service
 	 * manager, to re-parent orphan (double-forking) child processes
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index a8d0759a9e40..04508e81d4f2 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -197,4 +197,8 @@ struct prctl_mm_map {
 # define PR_CAP_AMBIENT_LOWER		3
 # define PR_CAP_AMBIENT_CLEAR_ALL	4
 
+/* Process-based variant of PDEATHSIG */
+#define PR_SET_PDEATHSIG_PROC		48
+#define PR_GET_PDEATHSIG_PROC		49
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/cred.c b/kernel/cred.c
index ecf03657e71c..0192a94670e1 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -448,6 +448,7 @@ int commit_creds(struct cred *new)
 		if (task->mm)
 			set_dumpable(task->mm, suid_dumpable);
 		task->pdeath_signal = 0;
+		task->signal->pdeath_signal_proc = 0;
 		smp_wmb();
 	}
 
diff --git a/kernel/exit.c b/kernel/exit.c
index a35d8a17e01f..1be0616239e0 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -635,6 +635,10 @@ static void reparent_leader(struct task_struct *father, struct task_struct *p,
 	if (unlikely(p->exit_state == EXIT_DEAD))
 		return;
 
+	if (p->signal->pdeath_signal_proc)
+		group_send_sig_info(p->signal->pdeath_signal_proc,
+				    SEND_SIG_NOINFO, p);
+
 	/* We don't want people slaying init. */
 	p->exit_signal = SIGCHLD;
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 24a4c0be80d5..f6482392ece9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1412,6 +1412,8 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 
 	mutex_init(&sig->cred_guard_mutex);
 
+	sig->pdeath_signal_proc = current->signal->pdeath_signal_proc;
+
 	return 0;
 }
 
diff --git a/kernel/sys.c b/kernel/sys.c
index 2855ee73acd0..c47e92fa5370 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2210,6 +2210,17 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_PDEATHSIG:
 		error = put_user(me->pdeath_signal, (int __user *)arg2);
 		break;
+	case PR_SET_PDEATHSIG_PROC:
+		if (!valid_signal(arg2)) {
+			error = -EINVAL;
+			break;
+		}
+		me->signal->pdeath_signal_proc = arg2;
+		break;
+	case PR_GET_PDEATHSIG_PROC:
+		error = put_user(me->signal->pdeath_signal_proc,
+				 (int __user *)arg2);
+		break;
 	case PR_GET_DUMPABLE:
 		error = get_dumpable(me->mm);
 		break;
diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c
index 7a82c0f61452..c8bd6b1331c1 100644
--- a/security/apparmor/lsm.c
+++ b/security/apparmor/lsm.c
@@ -628,6 +628,7 @@ static void apparmor_bprm_committing_creds(struct linux_binprm *bprm)
 	aa_inherit_files(bprm->cred, current->files);
 
 	current->pdeath_signal = 0;
+	current->signal->pdeath_signal_proc = 0;
 
 	/* reset soft limits and set hard limits for the new label */
 	__aa_transition_rlimits(label, new_ctx->label);
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index ad3b0f53ede0..574d6238f8de 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2527,6 +2527,7 @@ static void selinux_bprm_committing_creds(struct linux_binprm *bprm)
 
 	/* Always clear parent death signal on SID transitions. */
 	current->pdeath_signal = 0;
+	current->signal->pdeath_signal_proc = 0;
 
 	/* Check whether the new SID can inherit resource limits from the old
 	 * SID.  If not, reset all soft limits to the lower of the current
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] prctl: add PR_[GS]ET_PDEATHSIG_PROC
  2017-09-09  9:40 [PATCH] prctl: add PR_[GS]ET_PDEATHSIG_PROC Jürg Billeter
@ 2017-09-12 17:05 ` Oleg Nesterov
  2017-09-12 18:54   ` Jürg Billeter
  2017-09-29 12:30 ` [RESEND PATCH] " Jürg Billeter
  1 sibling, 1 reply; 24+ messages in thread
From: Oleg Nesterov @ 2017-09-12 17:05 UTC (permalink / raw)
  To: Jürg Billeter
  Cc: Andrew Morton, Eric Biederman, linux-kernel, Linus Torvalds,
	Michael Kerrisk

On 09/09, Jürg Billeter wrote:
>
> PR_SET_PDEATHSIG_PROC sets a process-based death signal.

I think the patch is technically correct,

> Unlike
> PR_SET_PDEATHSIG, this is inherited across fork to allow killing a whole
> subtree without race conditions.

but I am still not sure this is right... at least I can't understand the
"without race conditions" above.

IOW, the child can do prctl(PR_SET_PDEATHSIG_PROC, SIGKILL) right after fork(),
why this is not enough to kill a whole subtree without race conditions?

OTOH. If you want to kill a whole sub-tree then perhaps the exiting process
should simply send the ->pdeath_signal_proc to the whole sub-tree? Not that
I really think this makes more sense, but if we add the new API we should
discuss everything we can.

Say, CLONE_PARENT. Should it succeed if ->pdeath_signal_proc != 0 ?

Anyway, I think this patch needs more reviewers. Let me add Linus and
Michael. Again, I am not worried about correctness, the patch is simple,
but the new API always needs a thorough discussion.

Oleg.

> This can be used for sandboxing when combined with a seccomp filter.
> 
> There have been previous attempts to support this by changing the
> behavior of PR_SET_PDEATHSIG. However, that would break existing
> applications. See https://marc.info/?l=linux-kernel&m=117621804801689
> and https://bugzilla.kernel.org/show_bug.cgi?id=43300
> 
> Signed-off-by: Jürg Billeter <j@bitron.ch>
> ---
>  fs/exec.c                    |  1 +
>  include/linux/sched/signal.h |  3 +++
>  include/uapi/linux/prctl.h   |  4 ++++
>  kernel/cred.c                |  1 +
>  kernel/exit.c                |  4 ++++
>  kernel/fork.c                |  2 ++
>  kernel/sys.c                 | 11 +++++++++++
>  security/apparmor/lsm.c      |  1 +
>  security/selinux/hooks.c     |  1 +
>  9 files changed, 28 insertions(+)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 01a9fb9d8ac3..bb389c3c596d 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1353,6 +1353,7 @@ void setup_new_exec(struct linux_binprm * bprm)
>  	if (bprm->secureexec) {
>  		/* Make sure parent cannot signal privileged process. */
>  		current->pdeath_signal = 0;
> +		current->signal->pdeath_signal_proc = 0;
>  
>  		/*
>  		 * For secureexec, reset the stack limit to sane default to
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index 2a0dd40b15db..c5c137e5ef39 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -103,6 +103,9 @@ struct signal_struct {
>  	int			group_stop_count;
>  	unsigned int		flags; /* see SIGNAL_* flags below */
>  
> +	/* The signal sent when the parent dies: */
> +	int			pdeath_signal_proc;
> +
>  	/*
>  	 * PR_SET_CHILD_SUBREAPER marks a process, like a service
>  	 * manager, to re-parent orphan (double-forking) child processes
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index a8d0759a9e40..04508e81d4f2 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -197,4 +197,8 @@ struct prctl_mm_map {
>  # define PR_CAP_AMBIENT_LOWER		3
>  # define PR_CAP_AMBIENT_CLEAR_ALL	4
>  
> +/* Process-based variant of PDEATHSIG */
> +#define PR_SET_PDEATHSIG_PROC		48
> +#define PR_GET_PDEATHSIG_PROC		49
> +
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/cred.c b/kernel/cred.c
> index ecf03657e71c..0192a94670e1 100644
> --- a/kernel/cred.c
> +++ b/kernel/cred.c
> @@ -448,6 +448,7 @@ int commit_creds(struct cred *new)
>  		if (task->mm)
>  			set_dumpable(task->mm, suid_dumpable);
>  		task->pdeath_signal = 0;
> +		task->signal->pdeath_signal_proc = 0;
>  		smp_wmb();
>  	}
>  
> diff --git a/kernel/exit.c b/kernel/exit.c
> index a35d8a17e01f..1be0616239e0 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -635,6 +635,10 @@ static void reparent_leader(struct task_struct *father, struct task_struct *p,
>  	if (unlikely(p->exit_state == EXIT_DEAD))
>  		return;
>  
> +	if (p->signal->pdeath_signal_proc)
> +		group_send_sig_info(p->signal->pdeath_signal_proc,
> +				    SEND_SIG_NOINFO, p);
> +
>  	/* We don't want people slaying init. */
>  	p->exit_signal = SIGCHLD;
>  
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 24a4c0be80d5..f6482392ece9 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1412,6 +1412,8 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
>  
>  	mutex_init(&sig->cred_guard_mutex);
>  
> +	sig->pdeath_signal_proc = current->signal->pdeath_signal_proc;
> +
>  	return 0;
>  }
>  
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 2855ee73acd0..c47e92fa5370 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2210,6 +2210,17 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  	case PR_GET_PDEATHSIG:
>  		error = put_user(me->pdeath_signal, (int __user *)arg2);
>  		break;
> +	case PR_SET_PDEATHSIG_PROC:
> +		if (!valid_signal(arg2)) {
> +			error = -EINVAL;
> +			break;
> +		}
> +		me->signal->pdeath_signal_proc = arg2;
> +		break;
> +	case PR_GET_PDEATHSIG_PROC:
> +		error = put_user(me->signal->pdeath_signal_proc,
> +				 (int __user *)arg2);
> +		break;
>  	case PR_GET_DUMPABLE:
>  		error = get_dumpable(me->mm);
>  		break;
> diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c
> index 7a82c0f61452..c8bd6b1331c1 100644
> --- a/security/apparmor/lsm.c
> +++ b/security/apparmor/lsm.c
> @@ -628,6 +628,7 @@ static void apparmor_bprm_committing_creds(struct linux_binprm *bprm)
>  	aa_inherit_files(bprm->cred, current->files);
>  
>  	current->pdeath_signal = 0;
> +	current->signal->pdeath_signal_proc = 0;
>  
>  	/* reset soft limits and set hard limits for the new label */
>  	__aa_transition_rlimits(label, new_ctx->label);
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index ad3b0f53ede0..574d6238f8de 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -2527,6 +2527,7 @@ static void selinux_bprm_committing_creds(struct linux_binprm *bprm)
>  
>  	/* Always clear parent death signal on SID transitions. */
>  	current->pdeath_signal = 0;
> +	current->signal->pdeath_signal_proc = 0;
>  
>  	/* Check whether the new SID can inherit resource limits from the old
>  	 * SID.  If not, reset all soft limits to the lower of the current
> -- 
> 2.14.1
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] prctl: add PR_[GS]ET_PDEATHSIG_PROC
  2017-09-12 17:05 ` Oleg Nesterov
@ 2017-09-12 18:54   ` Jürg Billeter
  2017-09-13 17:11     ` Oleg Nesterov
  0 siblings, 1 reply; 24+ messages in thread
From: Jürg Billeter @ 2017-09-12 18:54 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Eric Biederman, linux-kernel, Linus Torvalds,
	Michael Kerrisk

Hi Oleg,

Thanks for the review.

On Tue, 2017-09-12 at 19:05 +0200, Oleg Nesterov wrote:
> On 09/09, Jürg Billeter wrote:
> > Unlike
> > PR_SET_PDEATHSIG, this is inherited across fork to allow killing a whole
> > subtree without race conditions.
> 
> but I am still not sure this is right... at least I can't understand the
> "without race conditions" above.
> 
> IOW, the child can do prctl(PR_SET_PDEATHSIG_PROC, SIGKILL) right after fork(),
> why this is not enough to kill a whole subtree without race conditions?

What if the parent dies between fork() and prctl()? Besides avoiding
this race condition, it also makes it relatively easy to enforce
PDEATHSIG_PROC for all descendants of a process. You simply set
PDEATHSIG_PROC and then block further changes using seccomp (and set
no_new_privs) to avoid runaway children.

> OTOH. If you want to kill a whole sub-tree then perhaps the exiting process
> should simply send the ->pdeath_signal_proc to the whole sub-tree? Not that
> I really think this makes more sense, but if we add the new API we should
> discuss everything we can.

While this would likely work for my use case of avoiding runaway
processes, I don't think it would make sense for non-SIGKILL use cases
of cooperating processes. Inheritance across fork still allows
resetting PDEATHSIG_PROC in the child after fork and I don't expect the
parent death race to be a significant issue in the case of cooperating
processes.

> Say, CLONE_PARENT. Should it succeed if ->pdeath_signal_proc != 0 ?

Yes, I don't see an issue with that. The new process will be a sibling
and inheriting pdeath_signal_proc seems sensible to me for this.

Jürg

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] prctl: add PR_[GS]ET_PDEATHSIG_PROC
  2017-09-12 18:54   ` Jürg Billeter
@ 2017-09-13 17:11     ` Oleg Nesterov
  2017-09-13 17:26       ` Jürg Billeter
  0 siblings, 1 reply; 24+ messages in thread
From: Oleg Nesterov @ 2017-09-13 17:11 UTC (permalink / raw)
  To: Jürg Billeter
  Cc: Andrew Morton, Eric Biederman, linux-kernel, Linus Torvalds,
	Michael Kerrisk

On 09/12, Jürg Billeter wrote:
>
> On Tue, 2017-09-12 at 19:05 +0200, Oleg Nesterov wrote:
> > On 09/09, Jürg Billeter wrote:
> > > Unlike
> > > PR_SET_PDEATHSIG, this is inherited across fork to allow killing a whole
> > > subtree without race conditions.
> > 
> > but I am still not sure this is right... at least I can't understand the
> > "without race conditions" above.
> > 
> > IOW, the child can do prctl(PR_SET_PDEATHSIG_PROC, SIGKILL) right after fork(),
> > why this is not enough to kill a whole subtree without race conditions?
> 
> What if the parent dies between fork() and prctl()?

The child will be killed? Sorry, can't understand...

> it also makes it relatively easy to enforce
> PDEATHSIG_PROC for all descendants of a process.

this is clear,

> > Say, CLONE_PARENT. Should it succeed if ->pdeath_signal_proc != 0 ?
>
> Yes, I don't see an issue with that. The new process will be a sibling
> and inheriting pdeath_signal_proc seems sensible to me for this.

I meant, the process created by clone(CLONE_PARENT) won't be killed by
pdeath_signal if the creator process exits, exactly because it won't be
its child. Not that I think this is wrong.

Oleg.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] prctl: add PR_[GS]ET_PDEATHSIG_PROC
  2017-09-13 17:11     ` Oleg Nesterov
@ 2017-09-13 17:26       ` Jürg Billeter
  2017-09-13 17:48         ` Oleg Nesterov
  0 siblings, 1 reply; 24+ messages in thread
From: Jürg Billeter @ 2017-09-13 17:26 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Eric Biederman, linux-kernel, Linus Torvalds,
	Michael Kerrisk

On Wed, 2017-09-13 at 19:11 +0200, Oleg Nesterov wrote:
> On 09/12, Jürg Billeter wrote:
> > 
> > On Tue, 2017-09-12 at 19:05 +0200, Oleg Nesterov wrote:
> > > On 09/09, Jürg Billeter wrote:
> > > > Unlike
> > > > PR_SET_PDEATHSIG, this is inherited across fork to allow killing a whole
> > > > subtree without race conditions.
> > > 
> > > but I am still not sure this is right... at least I can't understand the
> > > "without race conditions" above.
> > > 
> > > IOW, the child can do prctl(PR_SET_PDEATHSIG_PROC, SIGKILL) right after fork(),
> > > why this is not enough to kill a whole subtree without race conditions?
> > 
> > What if the parent dies between fork() and prctl()?
> 
> The child will be killed? Sorry, can't understand...

If PR_SET_PDEATHSIG_PROC was not inherited across fork and the parent
died between fork() and prctl(PR_SET_PDEATHSIG_PROC, SIGKILL) in the
child, the child would not be killed. It would be reparented to init(1)
or a subreaper, i.e., you end up with a runaway process. It would be
possible to safe guard against this race condition in other ways but
inheriting the setting avoids it nicely, and makes it easy to
apply/enforce PDEATHSIG_PROC for all descendants.

> > > Say, CLONE_PARENT. Should it succeed if ->pdeath_signal_proc != 0 ?
> > 
> > Yes, I don't see an issue with that. The new process will be a sibling
> > and inheriting pdeath_signal_proc seems sensible to me for this.
> 
> I meant, the process created by clone(CLONE_PARENT) won't be killed by
> pdeath_signal if the creator process exits, exactly because it won't be
> its child. Not that I think this is wrong.

Right, creator and parent won't be the same.

Jürg

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] prctl: add PR_[GS]ET_PDEATHSIG_PROC
  2017-09-13 17:26       ` Jürg Billeter
@ 2017-09-13 17:48         ` Oleg Nesterov
  0 siblings, 0 replies; 24+ messages in thread
From: Oleg Nesterov @ 2017-09-13 17:48 UTC (permalink / raw)
  To: Jürg Billeter
  Cc: Andrew Morton, Eric Biederman, linux-kernel, Linus Torvalds,
	Michael Kerrisk

On 09/13, Jürg Billeter wrote:
>
> On Wed, 2017-09-13 at 19:11 +0200, Oleg Nesterov wrote:
> > On 09/12, Jürg Billeter wrote:
> > >
> > > On Tue, 2017-09-12 at 19:05 +0200, Oleg Nesterov wrote:
> > > > On 09/09, Jürg Billeter wrote:
> > > > > Unlike
> > > > > PR_SET_PDEATHSIG, this is inherited across fork to allow killing a whole
> > > > > subtree without race conditions.
> > > >
> > > > but I am still not sure this is right... at least I can't understand the
> > > > "without race conditions" above.
> > > >
> > > > IOW, the child can do prctl(PR_SET_PDEATHSIG_PROC, SIGKILL) right after fork(),
> > > > why this is not enough to kill a whole subtree without race conditions?
> > >
> > > What if the parent dies between fork() and prctl()?
> >
> > The child will be killed? Sorry, can't understand...
>
> If PR_SET_PDEATHSIG_PROC was not inherited across fork and the parent
> died between fork() and prctl(PR_SET_PDEATHSIG_PROC, SIGKILL) in the
> child, the child would not be killed.

Aah, sorry. I forgot about another oddity of pdeath_signal API...

Somehow I misread this patch as if reparent_leader() looks at
current->signal->pdeath_signal_proc, not child->signal->pdeath_signal_proc.
And to me the former makes more sense. But I won't insist.

Oleg.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RESEND PATCH] prctl: add PR_[GS]ET_PDEATHSIG_PROC
  2017-09-09  9:40 [PATCH] prctl: add PR_[GS]ET_PDEATHSIG_PROC Jürg Billeter
  2017-09-12 17:05 ` Oleg Nesterov
@ 2017-09-29 12:30 ` Jürg Billeter
  2017-10-02 23:20   ` Andrew Morton
  1 sibling, 1 reply; 24+ messages in thread
From: Jürg Billeter @ 2017-09-29 12:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Oleg Nesterov, Linus Torvalds, Eric Biederman, Michael Kerrisk,
	Filipe Brandenburger, David Wilcox, Adam H . Peterson, hansecke,
	linux-kernel, Jürg Billeter

PR_SET_PDEATHSIG sets a parent death signal that the calling process
will get when its parent thread dies, even when the result of getppid()
doesn't change because the calling process is reparented to a different
thread in the same parent process. When managing multiple processes, a
process-based parent death signal is much more useful. E.g., to avoid
stray child processes.

PR_SET_PDEATHSIG_PROC sets a process-based death signal. Unlike
PR_SET_PDEATHSIG, this is inherited across fork to allow killing a whole
subtree without race conditions.

This can be used for sandboxing when combined with a seccomp filter.

There have been previous attempts to support this by changing the
behavior of PR_SET_PDEATHSIG. However, that would break existing
applications. See https://marc.info/?l=linux-kernel&m=117621804801689
and https://bugzilla.kernel.org/show_bug.cgi?id=43300

Signed-off-by: Jürg Billeter <j@bitron.ch>
---

Previous discussion: https://patchwork.kernel.org/patch/9945315/

 fs/exec.c                    |  1 +
 include/linux/sched/signal.h |  3 +++
 include/uapi/linux/prctl.h   |  4 ++++
 kernel/cred.c                |  1 +
 kernel/exit.c                |  4 ++++
 kernel/fork.c                |  2 ++
 kernel/sys.c                 | 11 +++++++++++
 security/apparmor/lsm.c      |  1 +
 security/selinux/hooks.c     |  1 +
 9 files changed, 28 insertions(+)

diff --git a/fs/exec.c b/fs/exec.c
index ac34d9724684..7045f0223140 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1334,6 +1334,7 @@ void setup_new_exec(struct linux_binprm * bprm)
 	if (bprm->secureexec) {
 		/* Make sure parent cannot signal privileged process. */
 		current->pdeath_signal = 0;
+		current->signal->pdeath_signal_proc = 0;
 
 		/*
 		 * For secureexec, reset the stack limit to sane default to
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 2a0dd40b15db..c5c137e5ef39 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -103,6 +103,9 @@ struct signal_struct {
 	int			group_stop_count;
 	unsigned int		flags; /* see SIGNAL_* flags below */
 
+	/* The signal sent when the parent dies: */
+	int			pdeath_signal_proc;
+
 	/*
 	 * PR_SET_CHILD_SUBREAPER marks a process, like a service
 	 * manager, to re-parent orphan (double-forking) child processes
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index a8d0759a9e40..04508e81d4f2 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -197,4 +197,8 @@ struct prctl_mm_map {
 # define PR_CAP_AMBIENT_LOWER		3
 # define PR_CAP_AMBIENT_CLEAR_ALL	4
 
+/* Process-based variant of PDEATHSIG */
+#define PR_SET_PDEATHSIG_PROC		48
+#define PR_GET_PDEATHSIG_PROC		49
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/cred.c b/kernel/cred.c
index ecf03657e71c..0192a94670e1 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -448,6 +448,7 @@ int commit_creds(struct cred *new)
 		if (task->mm)
 			set_dumpable(task->mm, suid_dumpable);
 		task->pdeath_signal = 0;
+		task->signal->pdeath_signal_proc = 0;
 		smp_wmb();
 	}
 
diff --git a/kernel/exit.c b/kernel/exit.c
index 3481ababd06a..9b6fbb0128d7 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -635,6 +635,10 @@ static void reparent_leader(struct task_struct *father, struct task_struct *p,
 	if (unlikely(p->exit_state == EXIT_DEAD))
 		return;
 
+	if (p->signal->pdeath_signal_proc)
+		group_send_sig_info(p->signal->pdeath_signal_proc,
+				    SEND_SIG_NOINFO, p);
+
 	/* We don't want people slaying init. */
 	p->exit_signal = SIGCHLD;
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 10646182440f..264936c367e3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1415,6 +1415,8 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 
 	mutex_init(&sig->cred_guard_mutex);
 
+	sig->pdeath_signal_proc = current->signal->pdeath_signal_proc;
+
 	return 0;
 }
 
diff --git a/kernel/sys.c b/kernel/sys.c
index 9aebc2935013..dcb9a535404e 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2206,6 +2206,17 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_PDEATHSIG:
 		error = put_user(me->pdeath_signal, (int __user *)arg2);
 		break;
+	case PR_SET_PDEATHSIG_PROC:
+		if (!valid_signal(arg2)) {
+			error = -EINVAL;
+			break;
+		}
+		me->signal->pdeath_signal_proc = arg2;
+		break;
+	case PR_GET_PDEATHSIG_PROC:
+		error = put_user(me->signal->pdeath_signal_proc,
+				 (int __user *)arg2);
+		break;
 	case PR_GET_DUMPABLE:
 		error = get_dumpable(me->mm);
 		break;
diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c
index 72b915dfcaf7..98cd937c337d 100644
--- a/security/apparmor/lsm.c
+++ b/security/apparmor/lsm.c
@@ -689,6 +689,7 @@ static void apparmor_bprm_committing_creds(struct linux_binprm *bprm)
 	aa_inherit_files(bprm->cred, current->files);
 
 	current->pdeath_signal = 0;
+	current->signal->pdeath_signal_proc = 0;
 
 	/* reset soft limits and set hard limits for the new label */
 	__aa_transition_rlimits(label, new_ctx->label);
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index f5d304736852..19d97d5acdb9 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2547,6 +2547,7 @@ static void selinux_bprm_committing_creds(struct linux_binprm *bprm)
 
 	/* Always clear parent death signal on SID transitions. */
 	current->pdeath_signal = 0;
+	current->signal->pdeath_signal_proc = 0;
 
 	/* Check whether the new SID can inherit resource limits from the old
 	 * SID.  If not, reset all soft limits to the lower of the current
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RESEND PATCH] prctl: add PR_[GS]ET_PDEATHSIG_PROC
  2017-09-29 12:30 ` [RESEND PATCH] " Jürg Billeter
@ 2017-10-02 23:20   ` Andrew Morton
  2017-10-03  3:25     ` Eric W. Biederman
  0 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2017-10-02 23:20 UTC (permalink / raw)
  To: Jürg Billeter
  Cc: Oleg Nesterov, Linus Torvalds, Eric Biederman, Michael Kerrisk,
	Filipe Brandenburger, David Wilcox, Adam H . Peterson, hansecke,
	linux-kernel

On Fri, 29 Sep 2017 14:30:58 +0200 Jürg Billeter <j@bitron.ch> wrote:

> PR_SET_PDEATHSIG sets a parent death signal that the calling process
> will get when its parent thread dies, even when the result of getppid()
> doesn't change because the calling process is reparented to a different
> thread in the same parent process. When managing multiple processes, a
> process-based parent death signal is much more useful. E.g., to avoid
> stray child processes.
> 
> PR_SET_PDEATHSIG_PROC sets a process-based death signal. Unlike
> PR_SET_PDEATHSIG, this is inherited across fork to allow killing a whole
> subtree without race conditions.
> 
> This can be used for sandboxing when combined with a seccomp filter.
> 
> There have been previous attempts to support this by changing the
> behavior of PR_SET_PDEATHSIG. However, that would break existing
> applications. See https://marc.info/?l=linux-kernel&m=117621804801689
> and https://bugzilla.kernel.org/show_bug.cgi?id=43300

Are Eric and Oleg OK with this?

A prctl manpage update will be needed, please (cc linux-api).

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RESEND PATCH] prctl: add PR_[GS]ET_PDEATHSIG_PROC
  2017-10-02 23:20   ` Andrew Morton
@ 2017-10-03  3:25     ` Eric W. Biederman
  2017-10-03  6:45       ` Jürg Billeter
  0 siblings, 1 reply; 24+ messages in thread
From: Eric W. Biederman @ 2017-10-03  3:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jürg Billeter, Oleg Nesterov, Linus Torvalds,
	Michael Kerrisk, Filipe Brandenburger, David Wilcox,
	Adam H . Peterson, hansecke, linux-kernel

Andrew Morton <akpm@linux-foundation.org> writes:

> On Fri, 29 Sep 2017 14:30:58 +0200 Jürg Billeter <j@bitron.ch> wrote:
>
>> PR_SET_PDEATHSIG sets a parent death signal that the calling process
>> will get when its parent thread dies, even when the result of getppid()
>> doesn't change because the calling process is reparented to a different
>> thread in the same parent process. When managing multiple processes, a
>> process-based parent death signal is much more useful. E.g., to avoid
>> stray child processes.
>> 
>> PR_SET_PDEATHSIG_PROC sets a process-based death signal. Unlike
>> PR_SET_PDEATHSIG, this is inherited across fork to allow killing a whole
>> subtree without race conditions.
>> 
>> This can be used for sandboxing when combined with a seccomp filter.
>> 
>> There have been previous attempts to support this by changing the
>> behavior of PR_SET_PDEATHSIG. However, that would break existing
>> applications. See https://marc.info/?l=linux-kernel&m=117621804801689
>> and https://bugzilla.kernel.org/show_bug.cgi?id=43300
>
> Are Eric and Oleg OK with this?
>
> A prctl manpage update will be needed, please (cc linux-api).

It makes for an interesting way of killing a process tree.  The domino
effect.

I believe the rational for adding a new prctl.

The code where it calls group_send_sig_info is buggy for pdeath_signal.
And it no less buggy for this new case.  There is no point to check
permissions when sending a signal to yourself.  Especially this signal
gets cleared during exec with a change of permissions.


I would recommend using:
 do_send_sig_info(p->signal->pdeath_signal_proc, SEND_SIG_NOINFO, p, true);

Perhaps with a comment saying that no permission check is needed when
sending a signal to yourself.


I don't know what I think about inherit over fork, and the whole tree
killing thing.  Except when the signal is SIGKILL I don't know if that
code does what is intended.  So I am a little leary of it.

Eric

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RESEND PATCH] prctl: add PR_[GS]ET_PDEATHSIG_PROC
  2017-10-03  3:25     ` Eric W. Biederman
@ 2017-10-03  6:45       ` Jürg Billeter
  2017-10-03 14:46         ` Eric W. Biederman
  0 siblings, 1 reply; 24+ messages in thread
From: Jürg Billeter @ 2017-10-03  6:45 UTC (permalink / raw)
  To: Eric W. Biederman, Andrew Morton
  Cc: Oleg Nesterov, Linus Torvalds, Michael Kerrisk,
	Filipe Brandenburger, David Wilcox, hansecke, linux-kernel

On Mon, 2017-10-02 at 22:25 -0500, Eric W. Biederman wrote:
> The code where it calls group_send_sig_info is buggy for pdeath_signal.
> And it no less buggy for this new case.  There is no point to check
> permissions when sending a signal to yourself.  Especially this signal
> gets cleared during exec with a change of permissions.
> 
> 
> I would recommend using:
>  do_send_sig_info(p->signal->pdeath_signal_proc, SEND_SIG_NOINFO, p, true);
> 
> Perhaps with a comment saying that no permission check is needed when
> sending a signal to yourself.

Depending on how you look at it, one could also argue that the dying
parent sends the signal. However, I'm fine with dropping the permission
check in v2. I'll also send a patch to change this for the existing
pdeath_signal.

> I don't know what I think about inherit over fork, and the whole tree
> killing thing.  Except when the signal is SIGKILL I don't know if that
> code does what is intended.  So I am a little leary of it.

I agree that inheritance across fork is mainly useful for SIGKILL.
While non-SIGKILL users could clear the setting after fork(), another
option would be to allow the caller to specify whether the setting
should be inherited using prctl arg3.

This would allow both, the exact process-based equivalent to
pdeath_signal (no inheritance) as well as the interesting SIGKILL case
for killing a process tree. Does this sound sensible? I'd be happy to
add this to v2.

Jürg

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RESEND PATCH] prctl: add PR_[GS]ET_PDEATHSIG_PROC
  2017-10-03  6:45       ` Jürg Billeter
@ 2017-10-03 14:46         ` Eric W. Biederman
  2017-10-03 16:10           ` Linus Torvalds
  2017-10-03 17:00           ` Jürg Billeter
  0 siblings, 2 replies; 24+ messages in thread
From: Eric W. Biederman @ 2017-10-03 14:46 UTC (permalink / raw)
  To: Jürg Billeter
  Cc: Andrew Morton, Oleg Nesterov, Linus Torvalds, Michael Kerrisk,
	Filipe Brandenburger, David Wilcox, hansecke, linux-kernel

Jürg Billeter <j@bitron.ch> writes:

> On Mon, 2017-10-02 at 22:25 -0500, Eric W. Biederman wrote:
>> The code where it calls group_send_sig_info is buggy for pdeath_signal.
>> And it no less buggy for this new case.  There is no point to check
>> permissions when sending a signal to yourself.  Especially this signal
>> gets cleared during exec with a change of permissions.
>> 
>> 
>> I would recommend using:
>>  do_send_sig_info(p->signal->pdeath_signal_proc, SEND_SIG_NOINFO, p, true);
>> 
>> Perhaps with a comment saying that no permission check is needed when
>> sending a signal to yourself.
>
> Depending on how you look at it, one could also argue that the dying
> parent sends the signal. However, I'm fine with dropping the permission
> check in v2. I'll also send a patch to change this for the existing
> pdeath_signal.

The process that requests the signal be sent is the process that is
receiving the signal.  I can see a theoretical need for a permission
check in there somewhere (especially as this persists over fork).

That is what matters for a permission check.

I tried fixing this once along with a bunch of other changes, and
apparently I did not do it obviously enough.  So I think this needs to
happen but let's make it very clear.

>> I don't know what I think about inherit over fork, and the whole tree
>> killing thing.  Except when the signal is SIGKILL I don't know if that
>> code does what is intended.  So I am a little leary of it.
>
> I agree that inheritance across fork is mainly useful for SIGKILL.
> While non-SIGKILL users could clear the setting after fork(), another
> option would be to allow the caller to specify whether the setting
> should be inherited using prctl arg3.

>
> This would allow both, the exact process-based equivalent to
> pdeath_signal (no inheritance) as well as the interesting SIGKILL case
> for killing a process tree. Does this sound sensible? I'd be happy to
> add this to v2.

Possibly.

There is a general need to find out about the death of other processes,
if you are not the parent of the process.   I would be inclined to call
it waitfd.  Something that you give a pid.  It performs a permission
check and the pid becomes readable when the process dies.  With poll
working on the fd, and the fd returning wstatus of the dead child.

Support SIGIO on the fd and you have a signal delivery mechanism,
if you want it.

For the kill all children when the parent dies the mechanism you are
proposing is escapable.  We already have an inescapable version of it
with init in a pid namespace.  We already have an escapable version of
it with orphaned process groups and SIGHUP.

So I would really appreciate a very clear use case for what we are
building here.  As it appears the killing of children can already be
done another way, and that the waiting for the parent can be done better
another way.

At the end of the day I would prefer to support something that has a
good solid definition and a clear use case.

I may be missing something and there is a good reason to extend pdeath
signal, but I don't see it right now.

Eric

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RESEND PATCH] prctl: add PR_[GS]ET_PDEATHSIG_PROC
  2017-10-03 14:46         ` Eric W. Biederman
@ 2017-10-03 16:10           ` Linus Torvalds
  2017-10-03 16:36             ` Eric W. Biederman
  2017-10-03 17:00           ` Jürg Billeter
  1 sibling, 1 reply; 24+ messages in thread
From: Linus Torvalds @ 2017-10-03 16:10 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jürg Billeter, Andrew Morton, Oleg Nesterov,
	Michael Kerrisk, Filipe Brandenburger, David Wilcox, hansecke,
	Linux Kernel Mailing List

On Tue, Oct 3, 2017 at 7:46 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> The process that requests the signal be sent is the process that is
> receiving the signal.  I can see a theoretical need for a permission
> check in there somewhere (especially as this persists over fork).

Note that it also persists over not just fork, but execve() too.

Yes, the signal is cleared if the e[ug]id/fs[ug]id is changed by exec,
but not (for example) if just uid is changed.

Does that matter? Probably not. But signal handling does actually
check uid, so it does actually affect signal permission checks across
execve.

I'm not entirely convinced about this patch. The parent death signal
has been problematic before, I'm not sure we want to add to the whole
situation without very strong arguments.

                    Linus

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RESEND PATCH] prctl: add PR_[GS]ET_PDEATHSIG_PROC
  2017-10-03 16:10           ` Linus Torvalds
@ 2017-10-03 16:36             ` Eric W. Biederman
  2017-10-03 17:02               ` Linus Torvalds
  0 siblings, 1 reply; 24+ messages in thread
From: Eric W. Biederman @ 2017-10-03 16:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jürg Billeter, Andrew Morton, Oleg Nesterov,
	Michael Kerrisk, Filipe Brandenburger, David Wilcox, hansecke,
	Linux Kernel Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, Oct 3, 2017 at 7:46 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> The process that requests the signal be sent is the process that is
>> receiving the signal.  I can see a theoretical need for a permission
>> check in there somewhere (especially as this persists over fork).
>
> Note that it also persists over not just fork, but execve() too.
>
> Yes, the signal is cleared if the e[ug]id/fs[ug]id is changed by exec,
> but not (for example) if just uid is changed.

*Scratches head*
pdeath_signal is cleared during exec if bprm->cap_elevated.

bprm->cap_elevated is set if we are not root and we gain caps during the exec.
bprm->cap_elevated is set if is_setid is true.
is_setid is set if the uid != eid or gid != egid.

So looking at that I am not exactly wild about the name cap_elevated,
but it seems to clear pdeath_signal if the jus the uid is changed during
exec.

> Does that matter? Probably not. But signal handling does actually
> check uid, so it does actually affect signal permission checks across
> execve.

I don't think there is anything in exec in this case to worry about.

Of course there is the completely bizarre case that if the parent execs
or calls setresuid it is possible that the signal won't send because it
is the parent's permission that are checked.   I think that is probably
a bug.

I can understand not sending to our future self if our future self has
different credentials than our present self.  But not sending to our
future self because someone else changed seems completely bizarre to me.

Eric

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RESEND PATCH] prctl: add PR_[GS]ET_PDEATHSIG_PROC
  2017-10-03 14:46         ` Eric W. Biederman
  2017-10-03 16:10           ` Linus Torvalds
@ 2017-10-03 17:00           ` Jürg Billeter
  2017-10-03 17:40             ` Eric W. Biederman
  2017-10-05 16:27             ` Oleg Nesterov
  1 sibling, 2 replies; 24+ messages in thread
From: Jürg Billeter @ 2017-10-03 17:00 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Oleg Nesterov, Linus Torvalds, Michael Kerrisk,
	Filipe Brandenburger, David Wilcox, hansecke, linux-kernel

On Tue, 2017-10-03 at 09:46 -0500, Eric W. Biederman wrote:
> There is a general need to find out about the death of other processes,
> if you are not the parent of the process.   I would be inclined to call
> it waitfd.  Something that you give a pid.  It performs a permission
> check and the pid becomes readable when the process dies.  With poll
> working on the fd, and the fd returning wstatus of the dead child.
> 
> Support SIGIO on the fd and you have a signal delivery mechanism,
> if you want it.

File descriptors for processes (waitfd/clonefd) are definitely
interesting.  Especially if reaping the process (and reparenting its
children) is delayed until the last process file descriptor is closed. 
However, this would be a much larger addition and also less intuitive
to use if all you want is killing the process tree.

> For the kill all children when the parent dies the mechanism you are
> proposing is escapable.  We already have an inescapable version of it
> with init in a pid namespace.  We already have an escapable version of
> it with orphaned process groups and SIGHUP.
> 
> So I would really appreciate a very clear use case for what we are
> building here.  As it appears the killing of children can already be
> done another way, and that the waiting for the parent can be done better
> another way.

My use case is to provide a way for a process to spawn a child and
ensure that no descendants survive when that child dies.  Avoiding
runaway processes is desirable in many situations.  My motivation is
very lightweight (nested) sandboxing (every process is potentially
sandboxed).

I.e., pid namespaces would be a pretty good fit (assuming they are
sufficiently lightweight) but CLONE_NEWPID requires CAP_SYS_ADMIN. 
User namespaces can help here, but creating tons of user namespaces
just for this doesn't sound sensible. MAX_PID_NS_LEVEL could be an
issue as well at some point but 32 levels are likely fine in practice.

For my particular scenario I may actually be able to create a single
user namespace, run all processes with (namespaced) CAP_SYS_ADMIN and
use CLONE_NEWPID for every process.  However, I would prefer not
requiring CAP_SYS_ADMIN and a regular application that wants to avoid
runaway processes for a spawned helper process cannot rely on
CAP_SYS_ADMIN.

My plan was to use PR_SET_PDEATHSIG_PROC with PR_NO_NEW_PRIVS and a
suitable seccomp filter to prevent changes to pdeath_signal_proc.  For
my SIGKILL use case it would be even better to simply require
PR_NO_NEW_PRIVS and make pdeath_signal_proc sticky, avoiding the need
for seccomp.  I wanted to keep the differences to the existing
PR_SET_PDEATHSIG minimal but if we argue that the non-SIGKILL use case
is better solved with waitfd (or maybe the process events connector),
we could tailor the prctl for the SIGKILL use case (or support both via
prctl arg3).

I have another small patch locally that adds a prctl that restricts
kill(2) to direct children of the current thread group for lightweight
sandboxing.  That would also be redundant if it was possible to use
CLONE_NEWPID for every process.

What's actually the reason that CLONE_NEWPID requires CAP_SYS_ADMIN? 
Does CLONE_NEWPID pose any risks that don't exist for
CLONE_NEWUSER|CLONE_NEWPID?  Assuming we can't simply drop the
CAP_SYS_ADMIN requirement, do you see a better solution for this use
case?

Jürg

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RESEND PATCH] prctl: add PR_[GS]ET_PDEATHSIG_PROC
  2017-10-03 16:36             ` Eric W. Biederman
@ 2017-10-03 17:02               ` Linus Torvalds
  2017-10-03 19:30                 ` Eric W. Biederman
  0 siblings, 1 reply; 24+ messages in thread
From: Linus Torvalds @ 2017-10-03 17:02 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jürg Billeter, Andrew Morton, Oleg Nesterov,
	Michael Kerrisk, Filipe Brandenburger, David Wilcox, hansecke,
	Linux Kernel Mailing List

On Tue, Oct 3, 2017 at 9:36 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> *Scratches head*
> pdeath_signal is cleared during exec if bprm->cap_elevated.

It's not cleared if we are *releasing* capabilities, which is exactly
what might cause a "we can no longer send a signal"

> is_setid is set if the uid != eid or gid != egid.

Again, that may be exactly what changes - the original process may
have uid != euid, and now we're going from an "we still had a root
uid/suid" to "dropping everything to euid".

IOW, we're _dropping_ capabilities, not adding them. Maybe we don't
want to allow signaling the original parent any more.

That said, as mentioned, I actually don't think it's a real problem.
The real problem is entirely conceptual: yet more complexity in an
area that we've already had problems in before.

             Linus

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RESEND PATCH] prctl: add PR_[GS]ET_PDEATHSIG_PROC
  2017-10-03 17:00           ` Jürg Billeter
@ 2017-10-03 17:40             ` Eric W. Biederman
  2017-10-03 17:47               ` Jürg Billeter
  2017-10-05 16:27             ` Oleg Nesterov
  1 sibling, 1 reply; 24+ messages in thread
From: Eric W. Biederman @ 2017-10-03 17:40 UTC (permalink / raw)
  To: Jürg Billeter
  Cc: Andrew Morton, Oleg Nesterov, Linus Torvalds, Michael Kerrisk,
	Filipe Brandenburger, David Wilcox, hansecke, linux-kernel

Jürg Billeter <j@bitron.ch> writes:

> On Tue, 2017-10-03 at 09:46 -0500, Eric W. Biederman wrote:
>> There is a general need to find out about the death of other processes,
>> if you are not the parent of the process.   I would be inclined to call
>> it waitfd.  Something that you give a pid.  It performs a permission
>> check and the pid becomes readable when the process dies.  With poll
>> working on the fd, and the fd returning wstatus of the dead child.
>> 
>> Support SIGIO on the fd and you have a signal delivery mechanism,
>> if you want it.
>
> File descriptors for processes (waitfd/clonefd) are definitely
> interesting.  Especially if reaping the process (and reparenting its
> children) is delayed until the last process file descriptor is closed. 
> However, this would be a much larger addition and also less intuitive
> to use if all you want is killing the process tree.
>
>> For the kill all children when the parent dies the mechanism you are
>> proposing is escapable.  We already have an inescapable version of it
>> with init in a pid namespace.  We already have an escapable version of
>> it with orphaned process groups and SIGHUP.
>> 
>> So I would really appreciate a very clear use case for what we are
>> building here.  As it appears the killing of children can already be
>> done another way, and that the waiting for the parent can be done better
>> another way.
>
> My use case is to provide a way for a process to spawn a child and
> ensure that no descendants survive when that child dies.  Avoiding
> runaway processes is desirable in many situations.  My motivation is
> very lightweight (nested) sandboxing (every process is potentially
> sandboxed).
>
> I.e., pid namespaces would be a pretty good fit (assuming they are
> sufficiently lightweight) but CLONE_NEWPID requires CAP_SYS_ADMIN. 
> User namespaces can help here, but creating tons of user namespaces
> just for this doesn't sound sensible. MAX_PID_NS_LEVEL could be an
> issue as well at some point but 32 levels are likely fine in practice.
>
> For my particular scenario I may actually be able to create a single
> user namespace, run all processes with (namespaced) CAP_SYS_ADMIN and
> use CLONE_NEWPID for every process.  However, I would prefer not
> requiring CAP_SYS_ADMIN and a regular application that wants to avoid
> runaway processes for a spawned helper process cannot rely on
> CAP_SYS_ADMIN.
>
> My plan was to use PR_SET_PDEATHSIG_PROC with PR_NO_NEW_PRIVS and a
> suitable seccomp filter to prevent changes to pdeath_signal_proc.  For
> my SIGKILL use case it would be even better to simply require
> PR_NO_NEW_PRIVS and make pdeath_signal_proc sticky, avoiding the need
> for seccomp.  I wanted to keep the differences to the existing
> PR_SET_PDEATHSIG minimal but if we argue that the non-SIGKILL use case
> is better solved with waitfd (or maybe the process events connector),
> we could tailor the prctl for the SIGKILL use case (or support both via
> prctl arg3).
>
> I have another small patch locally that adds a prctl that restricts
> kill(2) to direct children of the current thread group for lightweight
> sandboxing.  That would also be redundant if it was possible to use
> CLONE_NEWPID for every process.

I believe the current default limits allow using CLONE_NEWPID for every
process.  The data structures seem light enough as well.

> What's actually the reason that CLONE_NEWPID requires CAP_SYS_ADMIN? 
> Does CLONE_NEWPID pose any risks that don't exist for
> CLONE_NEWUSER|CLONE_NEWPID?  Assuming we can't simply drop the
> CAP_SYS_ADMIN requirement, do you see a better solution for this use
> case?

CLONE_NEWPID without a permission check would allow runing a setuid root
application in a pid namespace.  Off the top of my head I can't think of
a really good exploit.  But when you mess up pid files, and hide
information from a privileged application I can completely imagine
forcing that application to misbehave in ways the attacker can control.
Leading to bad things.

Eric

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RESEND PATCH] prctl: add PR_[GS]ET_PDEATHSIG_PROC
  2017-10-03 17:40             ` Eric W. Biederman
@ 2017-10-03 17:47               ` Jürg Billeter
  2017-10-03 19:05                 ` Eric W. Biederman
  0 siblings, 1 reply; 24+ messages in thread
From: Jürg Billeter @ 2017-10-03 17:47 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Oleg Nesterov, Linus Torvalds, Michael Kerrisk,
	Filipe Brandenburger, David Wilcox, hansecke, linux-kernel

On Tue, 2017-10-03 at 12:40 -0500, Eric W. Biederman wrote:
> Jürg Billeter <j@bitron.ch> writes:
> > What's actually the reason that CLONE_NEWPID requires CAP_SYS_ADMIN? 
> > Does CLONE_NEWPID pose any risks that don't exist for
> > CLONE_NEWUSER|CLONE_NEWPID?  Assuming we can't simply drop the
> > CAP_SYS_ADMIN requirement, do you see a better solution for this use
> > case?
> 
> CLONE_NEWPID without a permission check would allow runing a setuid root
> application in a pid namespace.  Off the top of my head I can't think of
> a really good exploit.  But when you mess up pid files, and hide
> information from a privileged application I can completely imagine
> forcing that application to misbehave in ways the attacker can control.
> Leading to bad things.

Could we allow unprivileged CLONE_NEWPID if the no_new_privs bit is
set?

Jürg

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RESEND PATCH] prctl: add PR_[GS]ET_PDEATHSIG_PROC
  2017-10-03 17:47               ` Jürg Billeter
@ 2017-10-03 19:05                 ` Eric W. Biederman
  0 siblings, 0 replies; 24+ messages in thread
From: Eric W. Biederman @ 2017-10-03 19:05 UTC (permalink / raw)
  To: Jürg Billeter
  Cc: Andrew Morton, Oleg Nesterov, Linus Torvalds, Michael Kerrisk,
	Filipe Brandenburger, David Wilcox, hansecke, linux-kernel

Jürg Billeter <j@bitron.ch> writes:

> On Tue, 2017-10-03 at 12:40 -0500, Eric W. Biederman wrote:
>> Jürg Billeter <j@bitron.ch> writes:
>> > What's actually the reason that CLONE_NEWPID requires CAP_SYS_ADMIN? 
>> > Does CLONE_NEWPID pose any risks that don't exist for
>> > CLONE_NEWUSER|CLONE_NEWPID?  Assuming we can't simply drop the
>> > CAP_SYS_ADMIN requirement, do you see a better solution for this use
>> > case?
>> 
>> CLONE_NEWPID without a permission check would allow runing a setuid root
>> application in a pid namespace.  Off the top of my head I can't think of
>> a really good exploit.  But when you mess up pid files, and hide
>> information from a privileged application I can completely imagine
>> forcing that application to misbehave in ways the attacker can control.
>> Leading to bad things.
>
> Could we allow unprivileged CLONE_NEWPID if the no_new_privs bit is
> set?

We discussed this early on, and the decision was that no_new_privs would
be kept simple and the user namespace would be what enabled additional
functionality.

Given how much of a challenge dealing with the additional attack surface
of enabling additional functionality in the kernel I think that was the
right call.  That has been the difference between no_new_privs being
done and user namespaces interesting since they have been merged.

Eric

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RESEND PATCH] prctl: add PR_[GS]ET_PDEATHSIG_PROC
  2017-10-03 17:02               ` Linus Torvalds
@ 2017-10-03 19:30                 ` Eric W. Biederman
  2017-10-03 20:02                   ` Linus Torvalds
  0 siblings, 1 reply; 24+ messages in thread
From: Eric W. Biederman @ 2017-10-03 19:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jürg Billeter, Andrew Morton, Oleg Nesterov,
	Michael Kerrisk, Filipe Brandenburger, David Wilcox, hansecke,
	Linux Kernel Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, Oct 3, 2017 at 9:36 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> *Scratches head*
>> pdeath_signal is cleared during exec if bprm->cap_elevated.
>
> It's not cleared if we are *releasing* capabilities, which is exactly
> what might cause a "we can no longer send a signal"

*Doh*  Now I see where you are coming from.

>> is_setid is set if the uid != eid or gid != egid.
>
> Again, that may be exactly what changes - the original process may
> have uid != euid, and now we're going from an "we still had a root
> uid/suid" to "dropping everything to euid".
>
> IOW, we're _dropping_ capabilities, not adding them. Maybe we don't
> want to allow signaling the original parent any more.

We never signal the orignal parent.  We signal the child that
requested the pdeath_signal when the original parent dies.

The permission check is:
   Does the parent have permission to signal the child that requested
   the signal.

So the child dropping permissions won't change the result of that
permission check one iota.

The parent dropping permissions may change the result of that permission
check.

I care about this case because we already have the pdeath_signal and the
permission check doesn't make any sense to me, and appears not to be at all
useful.

Plus it is the only caller of group_send_sig_info outside of
kernel/signal.c so I think the kernel would be more maintainable if we
could simplify this piece of code.

> That said, as mentioned, I actually don't think it's a real problem.
> The real problem is entirely conceptual: yet more complexity in an
> area that we've already had problems in before.

>From the conversation so far it does appear we can do better.

Eric

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RESEND PATCH] prctl: add PR_[GS]ET_PDEATHSIG_PROC
  2017-10-03 19:30                 ` Eric W. Biederman
@ 2017-10-03 20:02                   ` Linus Torvalds
  2017-10-03 20:32                     ` Eric W. Biederman
  0 siblings, 1 reply; 24+ messages in thread
From: Linus Torvalds @ 2017-10-03 20:02 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jürg Billeter, Andrew Morton, Oleg Nesterov,
	Michael Kerrisk, Filipe Brandenburger, David Wilcox, hansecke,
	Linux Kernel Mailing List

On Tue, Oct 3, 2017 at 12:30 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> We never signal the orignal parent.  We signal the child that
> requested the pdeath_signal when the original parent dies.

Yeah, I keep making that mistake, because I always confuse this with
the exit_signal handling.

Just mentally kick me next time I do that: "Christ, Linus, not
*again*! Take your damn meds"

Anyway, it's more the "another confusing and fragile special case that
will probably not be used very widely and cause confusion because it
lacks any test coverage" thing I worry about most.

                   Linus

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RESEND PATCH] prctl: add PR_[GS]ET_PDEATHSIG_PROC
  2017-10-03 20:02                   ` Linus Torvalds
@ 2017-10-03 20:32                     ` Eric W. Biederman
  0 siblings, 0 replies; 24+ messages in thread
From: Eric W. Biederman @ 2017-10-03 20:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jürg Billeter, Andrew Morton, Oleg Nesterov,
	Michael Kerrisk, Filipe Brandenburger, David Wilcox, hansecke,
	Linux Kernel Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, Oct 3, 2017 at 12:30 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>> We never signal the orignal parent.  We signal the child that
>> requested the pdeath_signal when the original parent dies.
>
> Yeah, I keep making that mistake, because I always confuse this with
> the exit_signal handling.
>
> Just mentally kick me next time I do that: "Christ, Linus, not
> *again*! Take your damn meds"
>
> Anyway, it's more the "another confusing and fragile special case that
> will probably not be used very widely and cause confusion because it
> lacks any test coverage" thing I worry about most.

Agreed.  That makes a more general solution perferable.

Eric

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RESEND PATCH] prctl: add PR_[GS]ET_PDEATHSIG_PROC
  2017-10-03 17:00           ` Jürg Billeter
  2017-10-03 17:40             ` Eric W. Biederman
@ 2017-10-05 16:27             ` Oleg Nesterov
  2017-10-08 17:47               ` Jürg Billeter
  1 sibling, 1 reply; 24+ messages in thread
From: Oleg Nesterov @ 2017-10-05 16:27 UTC (permalink / raw)
  To: Jürg Billeter
  Cc: Eric W. Biederman, Andrew Morton, Linus Torvalds,
	Michael Kerrisk, Filipe Brandenburger, David Wilcox, hansecke,
	linux-kernel

On 10/03, Jürg Billeter wrote:
>
> My use case is to provide a way for a process to spawn a child and
> ensure that no descendants survive when that child dies.  Avoiding
> runaway processes is desirable in many situations.  My motivation is
> very lightweight (nested) sandboxing (every process is potentially
> sandboxed).
>
> I.e., pid namespaces would be a pretty good fit (assuming they are
> sufficiently lightweight) but CLONE_NEWPID

sorry if this was already discussed, I didn't read this thread yet...

if CLONE_NEWPID is not suitable for any reason. We already have
PR_SET_CHILD_SUBREAPER. Perhaps we can simply add another
PR_SET_KILL_ALL_DESCEDANTS_ON_EXIT? we can use walk_process_tree()
to send SIGKILL.

Oleg.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RESEND PATCH] prctl: add PR_[GS]ET_PDEATHSIG_PROC
  2017-10-05 16:27             ` Oleg Nesterov
@ 2017-10-08 17:47               ` Jürg Billeter
  2017-10-09 16:32                 ` Eric W. Biederman
  0 siblings, 1 reply; 24+ messages in thread
From: Jürg Billeter @ 2017-10-08 17:47 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eric W. Biederman, Andy Lutomirski, Andrew Morton,
	Linus Torvalds, Michael Kerrisk, Filipe Brandenburger,
	David Wilcox, hansecke, linux-kernel

On Thu, 2017-10-05 at 18:27 +0200, Oleg Nesterov wrote:
> On 10/03, Jürg Billeter wrote:
> > 
> > My use case is to provide a way for a process to spawn a child and
> > ensure that no descendants survive when that child dies.  Avoiding
> > runaway processes is desirable in many situations.  My motivation is
> > very lightweight (nested) sandboxing (every process is potentially
> > sandboxed).
> > 
> > I.e., pid namespaces would be a pretty good fit (assuming they are
> > sufficiently lightweight) but CLONE_NEWPID
> 
> sorry if this was already discussed, I didn't read this thread yet...
> 
> if CLONE_NEWPID is not suitable for any reason. We already have
> PR_SET_CHILD_SUBREAPER. Perhaps we can simply add another
> PR_SET_KILL_ALL_DESCEDANTS_ON_EXIT? we can use walk_process_tree()
> to send SIGKILL.

Yes, this is an option.  However, after the discussion in this thread I
believe it would be better to drop the CAP_SYS_ADMIN requirement for
CLONE_NEWPID (when no_new_privs is set) as this would avoid adding
another API and code path for a similar effect.  I'm interested in
possible security concerns about such a change.  Adding Andy Lutomirski
to cc.

Jürg

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RESEND PATCH] prctl: add PR_[GS]ET_PDEATHSIG_PROC
  2017-10-08 17:47               ` Jürg Billeter
@ 2017-10-09 16:32                 ` Eric W. Biederman
  0 siblings, 0 replies; 24+ messages in thread
From: Eric W. Biederman @ 2017-10-09 16:32 UTC (permalink / raw)
  To: Jürg Billeter
  Cc: Oleg Nesterov, Andy Lutomirski, Andrew Morton, Linus Torvalds,
	Michael Kerrisk, Filipe Brandenburger, David Wilcox, hansecke,
	linux-kernel

Jürg Billeter <j@bitron.ch> writes:

> On Thu, 2017-10-05 at 18:27 +0200, Oleg Nesterov wrote:
>> On 10/03, Jürg Billeter wrote:
>> > 
>> > My use case is to provide a way for a process to spawn a child and
>> > ensure that no descendants survive when that child dies.  Avoiding
>> > runaway processes is desirable in many situations.  My motivation is
>> > very lightweight (nested) sandboxing (every process is potentially
>> > sandboxed).
>> > 
>> > I.e., pid namespaces would be a pretty good fit (assuming they are
>> > sufficiently lightweight) but CLONE_NEWPID
>> 
>> sorry if this was already discussed, I didn't read this thread yet...
>> 
>> if CLONE_NEWPID is not suitable for any reason. We already have
>> PR_SET_CHILD_SUBREAPER. Perhaps we can simply add another
>> PR_SET_KILL_ALL_DESCEDANTS_ON_EXIT? we can use walk_process_tree()
>> to send SIGKILL.
>
> Yes, this is an option.  However, after the discussion in this thread I
> believe it would be better to drop the CAP_SYS_ADMIN requirement for
> CLONE_NEWPID (when no_new_privs is set) as this would avoid adding
> another API and code path for a similar effect.  I'm interested in
> possible security concerns about such a change.  Adding Andy Lutomirski
> to cc.

Absolutely not.  no_new_privs does not need the headache of being
increasing the kernel attack surface.

User namespaces are cheap, use one.  Let the people using no_new_privs
sleep easy.  We don't need to transform no_new_privs into a user namespace.

Eric

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2017-10-09 16:32 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-09  9:40 [PATCH] prctl: add PR_[GS]ET_PDEATHSIG_PROC Jürg Billeter
2017-09-12 17:05 ` Oleg Nesterov
2017-09-12 18:54   ` Jürg Billeter
2017-09-13 17:11     ` Oleg Nesterov
2017-09-13 17:26       ` Jürg Billeter
2017-09-13 17:48         ` Oleg Nesterov
2017-09-29 12:30 ` [RESEND PATCH] " Jürg Billeter
2017-10-02 23:20   ` Andrew Morton
2017-10-03  3:25     ` Eric W. Biederman
2017-10-03  6:45       ` Jürg Billeter
2017-10-03 14:46         ` Eric W. Biederman
2017-10-03 16:10           ` Linus Torvalds
2017-10-03 16:36             ` Eric W. Biederman
2017-10-03 17:02               ` Linus Torvalds
2017-10-03 19:30                 ` Eric W. Biederman
2017-10-03 20:02                   ` Linus Torvalds
2017-10-03 20:32                     ` Eric W. Biederman
2017-10-03 17:00           ` Jürg Billeter
2017-10-03 17:40             ` Eric W. Biederman
2017-10-03 17:47               ` Jürg Billeter
2017-10-03 19:05                 ` Eric W. Biederman
2017-10-05 16:27             ` Oleg Nesterov
2017-10-08 17:47               ` Jürg Billeter
2017-10-09 16:32                 ` Eric W. Biederman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).