All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH RFC 0/3] c/r: add ability to restore mm attributes in a non-root userns
@ 2014-02-14 14:13 Andrey Vagin
  2014-02-14 14:13 ` [PATCH 1/3] prctl: reduce permissions to change boundaries of data, brk and stack Andrey Vagin
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Andrey Vagin @ 2014-02-14 14:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: criu, Andrey Vagin, Andrew Morton, Oleg Nesterov, Al Viro,
	Kees Cook, Eric W. Biederman, Stephen Rothwell, Pavel Emelyanov,
	Aditya Kali

Currently PR_SET_MM_* require the global CAP_SYS_RESOURCE,
which is absent in a non-root userns.

Here are three groups of attributes:
1. PR_SET_MM_START_*_DATA, PR_SET_MM_*BRK, PR_SET_MM_*_STACK
These attributes can affect resource limits, so here is no sense
to restrict them if a proper limit is unlimited.
2. PR_MM_SET_EXE_FILE. We have not found other way than add
a secure bit. This bit is set from a root userns and inhereted by
children. Thanks to Pavel Emelyanov for the idea.
3. All other attributes don't affect other tasks or limits, so
can be changed without special permissions.

Andrey Vagin (3):
  prctl: reduce permissions to change boundaries of data, brk and stack
  capabilities: add a secure bit to allow changing a task exe link
  prctl: allow to use PR_MM_SET_* which affect only a current task

 include/uapi/linux/securebits.h |  9 ++++++++-
 kernel/sys.c                    | 21 +++++++++++++++++++--
 kernel/user_namespace.c         |  3 ++-
 security/commoncap.c            |  7 +++++++
 4 files changed, 36 insertions(+), 4 deletions(-)

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Aditya Kali <adityakali@google.com>

-- 
1.8.5.3


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/3] prctl: reduce permissions to change boundaries of data, brk and stack
  2014-02-14 14:13 [PATCH RFC 0/3] c/r: add ability to restore mm attributes in a non-root userns Andrey Vagin
@ 2014-02-14 14:13 ` Andrey Vagin
  2014-02-14 16:05   ` Eric W. Biederman
  2014-02-14 14:13 ` [PATCH 2/3] capabilities: add a secure bit to allow changing a task exe link Andrey Vagin
  2014-02-14 14:13 ` [PATCH 3/3] prctl: allow to use PR_MM_SET_* which affect only a current task Andrey Vagin
  2 siblings, 1 reply; 21+ messages in thread
From: Andrey Vagin @ 2014-02-14 14:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: criu, Andrey Vagin, Andrew Morton, Oleg Nesterov, Al Viro,
	Kees Cook, Eric W. Biederman, Stephen Rothwell, Pavel Emelyanov,
	Aditya Kali

Currently this operation requires the global CAP_SYS_RESOURCE.
It's required, because a task can exceed limits (RLIMIT_DATA,
RLIMIT_STACK).

So let's allow task to change these parameters if a proper limit is
unlimited.

When we restore a task we need to set up text, data and data heap sizes
from userspace to the values a task had at checkpoint time.

Currently we can not restore these parameters, if a task lives in
a non-root user name space, because it has no capabilities in the
parent namespace.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Aditya Kali <adityakali@google.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
---
 kernel/sys.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index c0a58be..939370c 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1701,8 +1701,23 @@ static int prctl_set_mm(int opt, unsigned long addr,
 	if (arg5 || (arg4 && opt != PR_SET_MM_AUXV))
 		return -EINVAL;
 
-	if (!capable(CAP_SYS_RESOURCE))
-		return -EPERM;
+	if (!capable(CAP_SYS_RESOURCE)) {
+		switch (opt) {
+		case PR_SET_MM_START_DATA:
+		case PR_SET_MM_END_DATA:
+		case PR_SET_MM_START_BRK:
+		case PR_SET_MM_BRK:
+			if (rlim < RLIM_INFINITY)
+				return -EPERM;
+			break;
+		case PR_SET_MM_START_STACK:
+			if (rlimit(RLIMIT_STACK) < RLIM_INFINITY)
+				return -EPERM;
+			break;
+		default:
+			return -EPERM;
+		}
+	}
 
 	if (opt == PR_SET_MM_EXE_FILE)
 		return prctl_set_mm_exe_file(mm, (unsigned int)addr);
-- 
1.8.5.3


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/3] capabilities: add a secure bit to allow changing a task exe link
  2014-02-14 14:13 [PATCH RFC 0/3] c/r: add ability to restore mm attributes in a non-root userns Andrey Vagin
  2014-02-14 14:13 ` [PATCH 1/3] prctl: reduce permissions to change boundaries of data, brk and stack Andrey Vagin
@ 2014-02-14 14:13 ` Andrey Vagin
  2014-02-18  4:53   ` Serge E. Hallyn
  2014-02-14 14:13 ` [PATCH 3/3] prctl: allow to use PR_MM_SET_* which affect only a current task Andrey Vagin
  2 siblings, 1 reply; 21+ messages in thread
From: Andrey Vagin @ 2014-02-14 14:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: criu, Andrey Vagin, Andrew Morton, Oleg Nesterov, Al Viro,
	Kees Cook, Eric W. Biederman, Stephen Rothwell, Pavel Emelyanov,
	Aditya Kali

When we restore a task we need to restore its exe link from userspace to
the values the task had at checkpoint time.

Currently this operations required the global CAP_SYS_RESOURCE, which is
always absent in a non-root user namespace.

So this patch introduces a new security bit which:
* can be set only if a task has the global CAP_SYS_RESOURCE
* inherited  by  child  processes
* is saved when a task moves in another userns
* allows to change a task exe link even if a task doesn't have CAP_SYS_RESOURCE

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Aditya Kali <adityakali@google.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
---
 include/uapi/linux/securebits.h | 12 +++++++++++-
 kernel/sys.c                    |  5 +++++
 kernel/user_namespace.c         |  3 ++-
 security/commoncap.c            |  7 +++++++
 4 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/securebits.h b/include/uapi/linux/securebits.h
index 985aac9..c99803b 100644
--- a/include/uapi/linux/securebits.h
+++ b/include/uapi/linux/securebits.h
@@ -43,9 +43,19 @@
 #define SECBIT_KEEP_CAPS	(issecure_mask(SECURE_KEEP_CAPS))
 #define SECBIT_KEEP_CAPS_LOCKED (issecure_mask(SECURE_KEEP_CAPS_LOCKED))
 
+/* When set, a process can do PR_SET_MM_EXE_FILE even if it doesn't
+ * have CAP_SYS_RESOURCE. Setting of this bit requires CAP_SYS_RESOURCE.
+ * This bit is not dropped when a task moves in another userns. */
+#define SECURE_SET_EXE_FILE		6
+#define SECURE_SET_EXE_FILE_LOCKED	7  /* make bit-6 immutable */
+
+#define SECBIT_SET_EXE_FILE	   (issecure_mask(SECURE_SET_EXE_FILE))
+#define SECBIT_SET_EXE_FILE_LOCKED (issecure_mask(SECURE_SET_EXE_FILE_LOCKED))
+
 #define SECURE_ALL_BITS		(issecure_mask(SECURE_NOROOT) | \
 				 issecure_mask(SECURE_NO_SETUID_FIXUP) | \
-				 issecure_mask(SECURE_KEEP_CAPS))
+				 issecure_mask(SECURE_KEEP_CAPS) | \
+				 issecure_mask(SECURE_SET_EXE_FILE))
 #define SECURE_ALL_LOCKS	(SECURE_ALL_BITS << 1)
 
 #endif /* _UAPI_LINUX_SECUREBITS_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 939370c..2f0925d 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -18,6 +18,7 @@
 #include <linux/kernel.h>
 #include <linux/workqueue.h>
 #include <linux/capability.h>
+#include <linux/securebits.h>
 #include <linux/device.h>
 #include <linux/key.h>
 #include <linux/times.h>
@@ -1714,6 +1715,10 @@ static int prctl_set_mm(int opt, unsigned long addr,
 			if (rlimit(RLIMIT_STACK) < RLIM_INFINITY)
 				return -EPERM;
 			break;
+		case PR_SET_MM_EXE_FILE:
+			if (!issecure(SECURE_SET_EXE_FILE))
+				return -EPERM;
+			break;
 		default:
 			return -EPERM;
 		}
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 240fb62..59584fe 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -34,7 +34,8 @@ static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns)
 	/* Start with the same capabilities as init but useless for doing
 	 * anything as the capabilities are bound to the new user namespace.
 	 */
-	cred->securebits = SECUREBITS_DEFAULT;
+	cred->securebits = SECUREBITS_DEFAULT |
+				(cred->securebits & SECBIT_SET_EXE_FILE);
 	cred->cap_inheritable = CAP_EMPTY_SET;
 	cred->cap_permitted = CAP_FULL_SET;
 	cred->cap_effective = CAP_FULL_SET;
diff --git a/security/commoncap.c b/security/commoncap.c
index b9d613e..eda1eb8 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -907,6 +907,13 @@ int cap_task_prctl(int option, unsigned long arg2, unsigned long arg3,
 		    )
 			/* cannot change a locked bit */
 			goto error;
+
+		/* Setting SECURE_SET_EXE_FILE requires CAP_SYS_RESOURCE */
+		if ((arg2 & SECBIT_SET_EXE_FILE) &&
+		    !(new->securebits & SECBIT_SET_EXE_FILE) &&
+		    !capable(CAP_SYS_RESOURCE))
+			goto error;
+
 		new->securebits = arg2;
 		goto changed;
 
-- 
1.8.5.3


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 3/3] prctl: allow to use PR_MM_SET_* which affect only a current task
  2014-02-14 14:13 [PATCH RFC 0/3] c/r: add ability to restore mm attributes in a non-root userns Andrey Vagin
  2014-02-14 14:13 ` [PATCH 1/3] prctl: reduce permissions to change boundaries of data, brk and stack Andrey Vagin
  2014-02-14 14:13 ` [PATCH 2/3] capabilities: add a secure bit to allow changing a task exe link Andrey Vagin
@ 2014-02-14 14:13 ` Andrey Vagin
  2 siblings, 0 replies; 21+ messages in thread
From: Andrey Vagin @ 2014-02-14 14:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: criu, Andrey Vagin, Andrew Morton, Oleg Nesterov, Al Viro,
	Kees Cook, Eric W. Biederman, Stephen Rothwell, Pavel Emelyanov,
	Aditya Kali

Here are PR_SET_MM_*_CODE, PR_SET_MM_ARG_*, PR_SET_MM_ENV_*,
PR_SET_MM_AUXV.

Looks like all these parameters can break only the current task and
can not affect other tasks or limits.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Aditya Kali <adityakali@google.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
---
 kernel/sys.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 2f0925d..154cf47 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1719,8 +1719,7 @@ static int prctl_set_mm(int opt, unsigned long addr,
 			if (!issecure(SECURE_SET_EXE_FILE))
 				return -EPERM;
 			break;
-		default:
-			return -EPERM;
+		/* Other options don't require special capabilities */
 		}
 	}
 
-- 
1.8.5.3


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/3] prctl: reduce permissions to change boundaries of data, brk and stack
  2014-02-14 14:13 ` [PATCH 1/3] prctl: reduce permissions to change boundaries of data, brk and stack Andrey Vagin
@ 2014-02-14 16:05   ` Eric W. Biederman
  2014-02-14 17:43     ` Andrew Vagin
  0 siblings, 1 reply; 21+ messages in thread
From: Eric W. Biederman @ 2014-02-14 16:05 UTC (permalink / raw)
  To: Andrey Vagin
  Cc: linux-kernel, criu, Andrew Morton, Oleg Nesterov, Al Viro,
	Kees Cook, Stephen Rothwell, Pavel Emelyanov, Aditya Kali

Andrey Vagin <avagin@openvz.org> writes:

> Currently this operation requires the global CAP_SYS_RESOURCE.
> It's required, because a task can exceed limits (RLIMIT_DATA,
> RLIMIT_STACK).
>
> So let's allow task to change these parameters if a proper limit is
> unlimited.
>
> When we restore a task we need to set up text, data and data heap sizes
> from userspace to the values a task had at checkpoint time.
>
> Currently we can not restore these parameters, if a task lives in
> a non-root user name space, because it has no capabilities in the
> parent namespace.

My brain hurts just looking at this patch and how you are justifying it.

For the resources you are mucking with below all you have to do is to
verify that you are below the appropriate rlimit at all times and no
CAP_SYS_RESOURCE check is needed.  You only need CAP_SYS_RESOURCE
to exceed your per process limits.

All you have to do is to fix the current code to properly enforce the
limits.  This half-assed code that forgets the permission checks if
rlimit is set to rlimit_inifinity is wrong.

Eric


> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Stephen Rothwell <sfr@canb.auug.org.au>
> Cc: Pavel Emelyanov <xemul@parallels.com>
> Cc: Aditya Kali <adityakali@google.com>
> Signed-off-by: Andrey Vagin <avagin@openvz.org>
> ---
>  kernel/sys.c | 19 +++++++++++++++++--
>  1 file changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sys.c b/kernel/sys.c
> index c0a58be..939370c 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1701,8 +1701,23 @@ static int prctl_set_mm(int opt, unsigned long addr,
>  	if (arg5 || (arg4 && opt != PR_SET_MM_AUXV))
>  		return -EINVAL;
>  
> -	if (!capable(CAP_SYS_RESOURCE))
> -		return -EPERM;
> +	if (!capable(CAP_SYS_RESOURCE)) {
> +		switch (opt) {
> +		case PR_SET_MM_START_DATA:
> +		case PR_SET_MM_END_DATA:
> +		case PR_SET_MM_START_BRK:
> +		case PR_SET_MM_BRK:
> +			if (rlim < RLIM_INFINITY)
> +				return -EPERM;
> +			break;
> +		case PR_SET_MM_START_STACK:
> +			if (rlimit(RLIMIT_STACK) < RLIM_INFINITY)
> +				return -EPERM;
> +			break;
> +		default:
> +			return -EPERM;
> +		}
> +	}
>  
>  	if (opt == PR_SET_MM_EXE_FILE)
>  		return prctl_set_mm_exe_file(mm, (unsigned int)addr);

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/3] prctl: reduce permissions to change boundaries of data, brk and stack
  2014-02-14 16:05   ` Eric W. Biederman
@ 2014-02-14 17:43     ` Andrew Vagin
  2014-02-14 18:01       ` [CRIU] " Cyrill Gorcunov
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew Vagin @ 2014-02-14 17:43 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrey Vagin, linux-kernel, criu, Andrew Morton, Oleg Nesterov,
	Al Viro, Kees Cook, Stephen Rothwell, Pavel Emelyanov,
	Aditya Kali

On Fri, Feb 14, 2014 at 08:05:42AM -0800, Eric W. Biederman wrote:
> Andrey Vagin <avagin@openvz.org> writes:
> 
> > Currently this operation requires the global CAP_SYS_RESOURCE.
> > It's required, because a task can exceed limits (RLIMIT_DATA,
> > RLIMIT_STACK).
> >
> > So let's allow task to change these parameters if a proper limit is
> > unlimited.
> >
> > When we restore a task we need to set up text, data and data heap sizes
> > from userspace to the values a task had at checkpoint time.
> >
> > Currently we can not restore these parameters, if a task lives in
> > a non-root user name space, because it has no capabilities in the
> > parent namespace.
> 
> My brain hurts just looking at this patch and how you are justifying it.
> 
> For the resources you are mucking with below all you have to do is to
> verify that you are below the appropriate rlimit at all times and no
> CAP_SYS_RESOURCE check is needed.  You only need CAP_SYS_RESOURCE
> to exceed your per process limits.
> 
> All you have to do is to fix the current code to properly enforce the
> limits.

I'm afraid what you are suggesting doesn't work.

The first reason is that we can not change both boundaries in one call.
But when we are restoring these attributes, we may need to move their
too far.

Another problem is that the limits will not work at all in this case. We
will able to move start_brk forward before calling brk() and brk() will
never fail.

Sorry if I miss something.

> This half-assed code that forgets the permission checks if
> rlimit is set to rlimit_inifinity is wrong.
> 
> Eric
> 
> 
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Oleg Nesterov <oleg@redhat.com>
> > Cc: Al Viro <viro@zeniv.linux.org.uk>
> > Cc: Kees Cook <keescook@chromium.org>
> > Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> > Cc: Stephen Rothwell <sfr@canb.auug.org.au>
> > Cc: Pavel Emelyanov <xemul@parallels.com>
> > Cc: Aditya Kali <adityakali@google.com>
> > Signed-off-by: Andrey Vagin <avagin@openvz.org>
> > ---
> >  kernel/sys.c | 19 +++++++++++++++++--
> >  1 file changed, 17 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index c0a58be..939370c 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -1701,8 +1701,23 @@ static int prctl_set_mm(int opt, unsigned long addr,
> >  	if (arg5 || (arg4 && opt != PR_SET_MM_AUXV))
> >  		return -EINVAL;
> >  
> > -	if (!capable(CAP_SYS_RESOURCE))
> > -		return -EPERM;
> > +	if (!capable(CAP_SYS_RESOURCE)) {
> > +		switch (opt) {
> > +		case PR_SET_MM_START_DATA:
> > +		case PR_SET_MM_END_DATA:
> > +		case PR_SET_MM_START_BRK:
> > +		case PR_SET_MM_BRK:
> > +			if (rlim < RLIM_INFINITY)
> > +				return -EPERM;
> > +			break;
> > +		case PR_SET_MM_START_STACK:
> > +			if (rlimit(RLIMIT_STACK) < RLIM_INFINITY)
> > +				return -EPERM;
> > +			break;
> > +		default:
> > +			return -EPERM;
> > +		}
> > +	}
> >  
> >  	if (opt == PR_SET_MM_EXE_FILE)
> >  		return prctl_set_mm_exe_file(mm, (unsigned int)addr);

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [CRIU] [PATCH 1/3] prctl: reduce permissions to change boundaries of data, brk and stack
  2014-02-14 17:43     ` Andrew Vagin
@ 2014-02-14 18:01       ` Cyrill Gorcunov
  2014-02-14 19:16         ` Eric W. Biederman
  0 siblings, 1 reply; 21+ messages in thread
From: Cyrill Gorcunov @ 2014-02-14 18:01 UTC (permalink / raw)
  To: Andrew Vagin
  Cc: Eric W. Biederman, Aditya Kali, Stephen Rothwell,
	Pavel Emelyanov, Oleg Nesterov, linux-kernel, criu, Al Viro,
	Andrew Morton, Kees Cook

On Fri, Feb 14, 2014 at 09:43:14PM +0400, Andrew Vagin wrote:
> > My brain hurts just looking at this patch and how you are justifying it.
> > 
> > For the resources you are mucking with below all you have to do is to
> > verify that you are below the appropriate rlimit at all times and no
> > CAP_SYS_RESOURCE check is needed.  You only need CAP_SYS_RESOURCE
> > to exceed your per process limits.
> > 
> > All you have to do is to fix the current code to properly enforce the
> > limits.
> 
> I'm afraid what you are suggesting doesn't work.
> 
> The first reason is that we can not change both boundaries in one call.
> But when we are restoring these attributes, we may need to move their
> too far.

When this code was introduced, there were no user-namespace implementation,
if I remember correctly, so CAP_SYS_RESOURCE was enough barrier point
to prevent modifying this values by anyone. Now user-ns brings a limit --
we need somehow to provide a way to modify these mm fields having no
CAP_SYS_RESOURCE set. "Verifying rlimit" not an option here because
we're modifying members one by one (looking back I think this was not
a good idea to modify the fields in this manner).

Maybe we could improve this api and provide argument as a pointer
to a structure, which would have all the fields we're going to
modify, which in turn would allow us to verify that all new values
are sane and fit rlimits, then we could (probably) deprecate old
api if noone except c/r camp is using it (I actually can't imagine
who else might need this api). Then CAP_SYS_RESOURCE requirement
could be ripped off. Hm? (sure touching api is always "no-no"
case, but maybe...)

> 
> Another problem is that the limits will not work at all in this case. We
> will able to move start_brk forward before calling brk() and brk() will
> never fail.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [CRIU] [PATCH 1/3] prctl: reduce permissions to change boundaries of data, brk and stack
  2014-02-14 18:01       ` [CRIU] " Cyrill Gorcunov
@ 2014-02-14 19:16         ` Eric W. Biederman
  2014-02-14 19:47           ` Pavel Emelyanov
  2014-02-14 20:44           ` Andrey Wagin
  0 siblings, 2 replies; 21+ messages in thread
From: Eric W. Biederman @ 2014-02-14 19:16 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Andrew Vagin, Aditya Kali, Stephen Rothwell, Pavel Emelyanov,
	Oleg Nesterov, linux-kernel, criu, Al Viro, Andrew Morton,
	Kees Cook

Cyrill Gorcunov <gorcunov@gmail.com> writes:

> On Fri, Feb 14, 2014 at 09:43:14PM +0400, Andrew Vagin wrote:
>> > My brain hurts just looking at this patch and how you are justifying it.
>> > 
>> > For the resources you are mucking with below all you have to do is to
>> > verify that you are below the appropriate rlimit at all times and no
>> > CAP_SYS_RESOURCE check is needed.  You only need CAP_SYS_RESOURCE
>> > to exceed your per process limits.
>> > 
>> > All you have to do is to fix the current code to properly enforce the
>> > limits.
>> 
>> I'm afraid what you are suggesting doesn't work.
>> 
>> The first reason is that we can not change both boundaries in one call.
>> But when we are restoring these attributes, we may need to move their
>> too far.
>
> When this code was introduced, there were no user-namespace implementation,
> if I remember correctly, so CAP_SYS_RESOURCE was enough barrier point
> to prevent modifying this values by anyone. Now user-ns brings a limit --
> we need somehow to provide a way to modify these mm fields having no
> CAP_SYS_RESOURCE set. "Verifying rlimit" not an option here because
> we're modifying members one by one (looking back I think this was not
> a good idea to modify the fields in this manner).
>
> Maybe we could improve this api and provide argument as a pointer
> to a structure, which would have all the fields we're going to
> modify, which in turn would allow us to verify that all new values
> are sane and fit rlimits, then we could (probably) deprecate old
> api if noone except c/r camp is using it (I actually can't imagine
> who else might need this api). Then CAP_SYS_RESOURCE requirement
> could be ripped off. Hm? (sure touching api is always "no-no"
> case, but maybe...)

Hmm.  Let me rewind this a little bit.

I want to be very stupid and ask the following.

Why can't you have the process of interest do:
	ptrace(PTRACE_ATTACHME);
	execve(executable, args, ...);
        
        /* Have the ptracer inject the recovery/fixup code */
	/* Fix up the mostly correct process to look like it has been
         * executing for a while.
         */

That should work, set all of the interesting fields, and works as
non-root today.  My gut feel says do that and we can just
deprecate/remove prctl_set_mm.

I am hoping we can move this conversation what makes sense from oh ick
checkpoint/restort does not work with user namespaces.

Eric

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [CRIU] [PATCH 1/3] prctl: reduce permissions to change boundaries of data, brk and stack
  2014-02-14 19:16         ` Eric W. Biederman
@ 2014-02-14 19:47           ` Pavel Emelyanov
  2014-02-14 20:06             ` Cyrill Gorcunov
  2014-02-14 20:09             ` Eric W. Biederman
  2014-02-14 20:44           ` Andrey Wagin
  1 sibling, 2 replies; 21+ messages in thread
From: Pavel Emelyanov @ 2014-02-14 19:47 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Cyrill Gorcunov, Andrew Vagin, Aditya Kali, Stephen Rothwell,
	Oleg Nesterov, linux-kernel, criu, Al Viro, Andrew Morton,
	Kees Cook

On 02/14/2014 11:16 PM, Eric W. Biederman wrote:
> Cyrill Gorcunov <gorcunov@gmail.com> writes:
> 
>> On Fri, Feb 14, 2014 at 09:43:14PM +0400, Andrew Vagin wrote:
>>>> My brain hurts just looking at this patch and how you are justifying it.
>>>>
>>>> For the resources you are mucking with below all you have to do is to
>>>> verify that you are below the appropriate rlimit at all times and no
>>>> CAP_SYS_RESOURCE check is needed.  You only need CAP_SYS_RESOURCE
>>>> to exceed your per process limits.
>>>>
>>>> All you have to do is to fix the current code to properly enforce the
>>>> limits.
>>>
>>> I'm afraid what you are suggesting doesn't work.
>>>
>>> The first reason is that we can not change both boundaries in one call.
>>> But when we are restoring these attributes, we may need to move their
>>> too far.
>>
>> When this code was introduced, there were no user-namespace implementation,
>> if I remember correctly, so CAP_SYS_RESOURCE was enough barrier point
>> to prevent modifying this values by anyone. Now user-ns brings a limit --
>> we need somehow to provide a way to modify these mm fields having no
>> CAP_SYS_RESOURCE set. "Verifying rlimit" not an option here because
>> we're modifying members one by one (looking back I think this was not
>> a good idea to modify the fields in this manner).
>>
>> Maybe we could improve this api and provide argument as a pointer
>> to a structure, which would have all the fields we're going to
>> modify, which in turn would allow us to verify that all new values
>> are sane and fit rlimits, then we could (probably) deprecate old
>> api if noone except c/r camp is using it (I actually can't imagine
>> who else might need this api). Then CAP_SYS_RESOURCE requirement
>> could be ripped off. Hm? (sure touching api is always "no-no"
>> case, but maybe...)
> 
> Hmm.  Let me rewind this a little bit.
> 
> I want to be very stupid and ask the following.
> 
> Why can't you have the process of interest do:
> 	ptrace(PTRACE_ATTACHME);
> 	execve(executable, args, ...);
>         
>         /* Have the ptracer inject the recovery/fixup code */
> 	/* Fix up the mostly correct process to look like it has been
>          * executing for a while.
>          */

Let's imagine we do that.

This means, that the whole memory contents should be restored _after_
the execve() call, since the execve() flushes old mappings. In
that case we lose the ability to preserve any shared memory regions
between any two processes. This "shared" can be either regular
MAP_SHARED mappings or MAP_ANONYMOUS but still not COW-ed ones.

> That should work, set all of the interesting fields, and works as
> non-root today.  My gut feel says do that and we can just
> deprecate/remove prctl_set_mm.
> 
> I am hoping we can move this conversation what makes sense from oh ick
> checkpoint/restort does not work with user namespaces.
> 
> Eric
> .

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [CRIU] [PATCH 1/3] prctl: reduce permissions to change boundaries of data, brk and stack
  2014-02-14 19:47           ` Pavel Emelyanov
@ 2014-02-14 20:06             ` Cyrill Gorcunov
  2014-02-14 20:18               ` Eric W. Biederman
  2014-02-14 20:09             ` Eric W. Biederman
  1 sibling, 1 reply; 21+ messages in thread
From: Cyrill Gorcunov @ 2014-02-14 20:06 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Eric W. Biederman, Andrew Vagin, Aditya Kali, Stephen Rothwell,
	Oleg Nesterov, linux-kernel, criu, Al Viro, Andrew Morton,
	Kees Cook

On Fri, Feb 14, 2014 at 11:47:13PM +0400, Pavel Emelyanov wrote:
> >> Maybe we could improve this api and provide argument as a pointer
> >> to a structure, which would have all the fields we're going to
> >> modify, which in turn would allow us to verify that all new values
> >> are sane and fit rlimits, then we could (probably) deprecate old
> >> api if noone except c/r camp is using it (I actually can't imagine
> >> who else might need this api). Then CAP_SYS_RESOURCE requirement
> >> could be ripped off. Hm? (sure touching api is always "no-no"
> >> case, but maybe...)
> > 
> > Hmm.  Let me rewind this a little bit.
> > 
> > I want to be very stupid and ask the following.
> > 
> > Why can't you have the process of interest do:
> > 	ptrace(PTRACE_ATTACHME);
> > 	execve(executable, args, ...);
> >         
> >         /* Have the ptracer inject the recovery/fixup code */
> > 	    /* Fix up the mostly correct process to look like it has been
> >          * executing for a while.
> >          */

Erik, it seems I don't understand how it will help us to restore
the mm fields mentioned above?

> Let's imagine we do that.
> 
> This means, that the whole memory contents should be restored _after_
> the execve() call, since the execve() flushes old mappings. In
> that case we lose the ability to preserve any shared memory regions
> between any two processes. This "shared" can be either regular
> MAP_SHARED mappings or MAP_ANONYMOUS but still not COW-ed ones.
> 
> > That should work, set all of the interesting fields, and works as
> > non-root today.  My gut feel says do that and we can just
> > deprecate/remove prctl_set_mm.
> > 
> > I am hoping we can move this conversation what makes sense from oh ick
> > checkpoint/restort does not work with user namespaces.

I fear you've got a wrong impression that we're "ick'ing" about user-ns ;)
Actually it's "must have" feature for containers thus we would _really_
love to be able to c/r them.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [CRIU] [PATCH 1/3] prctl: reduce permissions to change boundaries of data, brk and stack
  2014-02-14 19:47           ` Pavel Emelyanov
  2014-02-14 20:06             ` Cyrill Gorcunov
@ 2014-02-14 20:09             ` Eric W. Biederman
  2014-02-17  8:34               ` Pavel Emelyanov
  1 sibling, 1 reply; 21+ messages in thread
From: Eric W. Biederman @ 2014-02-14 20:09 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Cyrill Gorcunov, Andrew Vagin, Aditya Kali, Stephen Rothwell,
	Oleg Nesterov, linux-kernel, criu, Al Viro, Andrew Morton,
	Kees Cook

Pavel Emelyanov <xemul@parallels.com> writes:

> On 02/14/2014 11:16 PM, Eric W. Biederman wrote:
>> Cyrill Gorcunov <gorcunov@gmail.com> writes:
>> 
>>> On Fri, Feb 14, 2014 at 09:43:14PM +0400, Andrew Vagin wrote:
>>>>> My brain hurts just looking at this patch and how you are justifying it.
>>>>>
>>>>> For the resources you are mucking with below all you have to do is to
>>>>> verify that you are below the appropriate rlimit at all times and no
>>>>> CAP_SYS_RESOURCE check is needed.  You only need CAP_SYS_RESOURCE
>>>>> to exceed your per process limits.
>>>>>
>>>>> All you have to do is to fix the current code to properly enforce the
>>>>> limits.
>>>>
>>>> I'm afraid what you are suggesting doesn't work.
>>>>
>>>> The first reason is that we can not change both boundaries in one call.
>>>> But when we are restoring these attributes, we may need to move their
>>>> too far.
>>>
>>> When this code was introduced, there were no user-namespace implementation,
>>> if I remember correctly, so CAP_SYS_RESOURCE was enough barrier point
>>> to prevent modifying this values by anyone. Now user-ns brings a limit --
>>> we need somehow to provide a way to modify these mm fields having no
>>> CAP_SYS_RESOURCE set. "Verifying rlimit" not an option here because
>>> we're modifying members one by one (looking back I think this was not
>>> a good idea to modify the fields in this manner).
>>>
>>> Maybe we could improve this api and provide argument as a pointer
>>> to a structure, which would have all the fields we're going to
>>> modify, which in turn would allow us to verify that all new values
>>> are sane and fit rlimits, then we could (probably) deprecate old
>>> api if noone except c/r camp is using it (I actually can't imagine
>>> who else might need this api). Then CAP_SYS_RESOURCE requirement
>>> could be ripped off. Hm? (sure touching api is always "no-no"
>>> case, but maybe...)
>> 
>> Hmm.  Let me rewind this a little bit.
>> 
>> I want to be very stupid and ask the following.
>> 
>> Why can't you have the process of interest do:
>> 	ptrace(PTRACE_ATTACHME);
>> 	execve(executable, args, ...);
>>         
>>         /* Have the ptracer inject the recovery/fixup code */
>> 	/* Fix up the mostly correct process to look like it has been
>>          * executing for a while.
>>          */
>
> Let's imagine we do that.
>
> This means, that the whole memory contents should be restored _after_
> the execve() call, since the execve() flushes old mappings. In
> that case we lose the ability to preserve any shared memory regions
> between any two processes. This "shared" can be either regular
> MAP_SHARED mappings or MAP_ANONYMOUS but still not COW-ed ones.

If we have MAP_ANONYMOUS but not COW-ed mappings we have the correct
executable, which implies we have everything else correct except for the
brk and the stack addresses, because the process was started with fork.

So while that sounds like an interesting case to handle it does not seem
to invalidate the idea of using exec to set all of the other fields when
we need to set them.

Eric


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [CRIU] [PATCH 1/3] prctl: reduce permissions to change boundaries of data, brk and stack
  2014-02-14 20:06             ` Cyrill Gorcunov
@ 2014-02-14 20:18               ` Eric W. Biederman
  2014-02-15  6:29                 ` Cyrill Gorcunov
  0 siblings, 1 reply; 21+ messages in thread
From: Eric W. Biederman @ 2014-02-14 20:18 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Pavel Emelyanov, Andrew Vagin, Aditya Kali, Stephen Rothwell,
	Oleg Nesterov, linux-kernel, criu, Al Viro, Andrew Morton,
	Kees Cook

Cyrill Gorcunov <gorcunov@gmail.com> writes:

> On Fri, Feb 14, 2014 at 11:47:13PM +0400, Pavel Emelyanov wrote:
>> >> Maybe we could improve this api and provide argument as a pointer
>> >> to a structure, which would have all the fields we're going to
>> >> modify, which in turn would allow us to verify that all new values
>> >> are sane and fit rlimits, then we could (probably) deprecate old
>> >> api if noone except c/r camp is using it (I actually can't imagine
>> >> who else might need this api). Then CAP_SYS_RESOURCE requirement
>> >> could be ripped off. Hm? (sure touching api is always "no-no"
>> >> case, but maybe...)
>> > 
>> > Hmm.  Let me rewind this a little bit.
>> > 
>> > I want to be very stupid and ask the following.
>> > 
>> > Why can't you have the process of interest do:
>> > 	ptrace(PTRACE_ATTACHME);
>> > 	execve(executable, args, ...);
>> >         
>> >         /* Have the ptracer inject the recovery/fixup code */
>> > 	    /* Fix up the mostly correct process to look like it has been
>> >          * executing for a while.
>> >          */
>
> Erik, it seems I don't understand how it will help us to restore
> the mm fields mentioned above?

Because exec is how those mm fields are set when you don't use
prctl_set_mm.  So execpt for the stack and the brk limits that
will simply result in the values being set to what the usually
would be set to.

>> Let's imagine we do that.
>> 
>> This means, that the whole memory contents should be restored _after_
>> the execve() call, since the execve() flushes old mappings. In
>> that case we lose the ability to preserve any shared memory regions
>> between any two processes. This "shared" can be either regular
>> MAP_SHARED mappings or MAP_ANONYMOUS but still not COW-ed ones.
>> 
>> > That should work, set all of the interesting fields, and works as
>> > non-root today.  My gut feel says do that and we can just
>> > deprecate/remove prctl_set_mm.
>> > 
>> > I am hoping we can move this conversation what makes sense from oh ick
>> > checkpoint/restort does not work with user namespaces.
>
> I fear you've got a wrong impression that we're "ick'ing" about user-ns ;)
> Actually it's "must have" feature for containers thus we would _really_
> love to be able to c/r them.

What I meant is that the analysis of how to deal with prctl_set_mm seems
to be knee jerk shallow analysis based upon the fact that things are not
working, and not asking the question what really makes sense here? and
Why are people concerned with these changes and these values?

Eric


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [CRIU] [PATCH 1/3] prctl: reduce permissions to change boundaries of data, brk and stack
  2014-02-14 19:16         ` Eric W. Biederman
  2014-02-14 19:47           ` Pavel Emelyanov
@ 2014-02-14 20:44           ` Andrey Wagin
  2014-02-15 23:05             ` Eric W. Biederman
  1 sibling, 1 reply; 21+ messages in thread
From: Andrey Wagin @ 2014-02-14 20:44 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Cyrill Gorcunov, Aditya Kali, Stephen Rothwell, Pavel Emelyanov,
	Oleg Nesterov, LKML, criu, Al Viro, Andrew Morton, Kees Cook

2014-02-14 23:16 GMT+04:00 Eric W. Biederman <ebiederm@xmission.com>:
> Cyrill Gorcunov <gorcunov@gmail.com> writes:
>
>> On Fri, Feb 14, 2014 at 09:43:14PM +0400, Andrew Vagin wrote:
>>> > My brain hurts just looking at this patch and how you are justifying it.
>>> >
>>> > For the resources you are mucking with below all you have to do is to
>>> > verify that you are below the appropriate rlimit at all times and no
>>> > CAP_SYS_RESOURCE check is needed.  You only need CAP_SYS_RESOURCE
>>> > to exceed your per process limits.
>>> >
>>> > All you have to do is to fix the current code to properly enforce the
>>> > limits.
>>>
>>> I'm afraid what you are suggesting doesn't work.
>>>
>>> The first reason is that we can not change both boundaries in one call.
>>> But when we are restoring these attributes, we may need to move their
>>> too far.
>>
>> When this code was introduced, there were no user-namespace implementation,
>> if I remember correctly, so CAP_SYS_RESOURCE was enough barrier point
>> to prevent modifying this values by anyone. Now user-ns brings a limit --
>> we need somehow to provide a way to modify these mm fields having no
>> CAP_SYS_RESOURCE set. "Verifying rlimit" not an option here because
>> we're modifying members one by one (looking back I think this was not
>> a good idea to modify the fields in this manner).
>>
>> Maybe we could improve this api and provide argument as a pointer
>> to a structure, which would have all the fields we're going to
>> modify, which in turn would allow us to verify that all new values
>> are sane and fit rlimits, then we could (probably) deprecate old
>> api if noone except c/r camp is using it (I actually can't imagine
>> who else might need this api). Then CAP_SYS_RESOURCE requirement
>> could be ripped off. Hm? (sure touching api is always "no-no"
>> case, but maybe...)
>
> Hmm.  Let me rewind this a little bit.
>
> I want to be very stupid and ask the following.
>
> Why can't you have the process of interest do:
>         ptrace(PTRACE_ATTACHME);
>         execve(executable, args, ...);
>
>         /* Have the ptracer inject the recovery/fixup code */
>         /* Fix up the mostly correct process to look like it has been
>          * executing for a while.
>          */
>
> That should work, set all of the interesting fields, and works as
> non-root today.  My gut feel says do that and we can just
> deprecate/remove prctl_set_mm.

start_brk and start_stack are randomized each time. I don't understand
how execve() can restore the origin values of attributes.

>
> I am hoping we can move this conversation what makes sense from oh ick
> checkpoint/restort does not work with user namespaces.
>
> Eric

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [CRIU] [PATCH 1/3] prctl: reduce permissions to change boundaries of data, brk and stack
  2014-02-14 20:18               ` Eric W. Biederman
@ 2014-02-15  6:29                 ` Cyrill Gorcunov
  2014-02-15 23:01                   ` Eric W. Biederman
  0 siblings, 1 reply; 21+ messages in thread
From: Cyrill Gorcunov @ 2014-02-15  6:29 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Pavel Emelyanov, Andrew Vagin, Aditya Kali, Stephen Rothwell,
	Oleg Nesterov, linux-kernel, criu, Al Viro, Andrew Morton,
	Kees Cook

On Fri, Feb 14, 2014 at 12:18:46PM -0800, Eric W. Biederman wrote:
> >> > 
> >> > Why can't you have the process of interest do:
> >> > 	ptrace(PTRACE_ATTACHME);
> >> > 	execve(executable, args, ...);
> >> >         
> >> >         /* Have the ptracer inject the recovery/fixup code */
> >> > 	    /* Fix up the mostly correct process to look like it has been
> >> >          * executing for a while.
> >> >          */
> >
> > Erik, it seems I don't understand how it will help us to restore
> > the mm fields mentioned above?
> 
> Because exec is how those mm fields are set when you don't use
> prctl_set_mm.  So execpt for the stack and the brk limits that
> will simply result in the values being set to what the usually
> would be set to.

Yes, all these fields are set up by kernel's elf loader but this
routine is a way more time consuming than a clone call. But gimme
some time to examine all possible problems we might have with such
approach and if there a way to solve them.

> >> Let's imagine we do that.
> >> 
> >> This means, that the whole memory contents should be restored _after_
> >> the execve() call, since the execve() flushes old mappings. In
> >> that case we lose the ability to preserve any shared memory regions
> >> between any two processes. This "shared" can be either regular
> >> MAP_SHARED mappings or MAP_ANONYMOUS but still not COW-ed ones.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [CRIU] [PATCH 1/3] prctl: reduce permissions to change boundaries of data, brk and stack
  2014-02-15  6:29                 ` Cyrill Gorcunov
@ 2014-02-15 23:01                   ` Eric W. Biederman
  0 siblings, 0 replies; 21+ messages in thread
From: Eric W. Biederman @ 2014-02-15 23:01 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Pavel Emelyanov, Andrew Vagin, Aditya Kali, Stephen Rothwell,
	Oleg Nesterov, linux-kernel, criu, Al Viro, Andrew Morton,
	Kees Cook

Cyrill Gorcunov <gorcunov@gmail.com> writes:

> On Fri, Feb 14, 2014 at 12:18:46PM -0800, Eric W. Biederman wrote:
>> >> > 
>> >> > Why can't you have the process of interest do:
>> >> > 	ptrace(PTRACE_ATTACHME);
>> >> > 	execve(executable, args, ...);
>> >> >         
>> >> >         /* Have the ptracer inject the recovery/fixup code */
>> >> > 	    /* Fix up the mostly correct process to look like it has been
>> >> >          * executing for a while.
>> >> >          */
>> >
>> > Erik, it seems I don't understand how it will help us to restore
>> > the mm fields mentioned above?
>> 
>> Because exec is how those mm fields are set when you don't use
>> prctl_set_mm.  So execpt for the stack and the brk limits that
>> will simply result in the values being set to what the usually
>> would be set to.
>
> Yes, all these fields are set up by kernel's elf loader but this
> routine is a way more time consuming than a clone call. But gimme
> some time to examine all possible problems we might have with such
> approach and if there a way to solve them.

Sure.

The really useful observation in all of this is that with exec we have
methods where we allow unprivileged setting of these fields already.  So
it is essentially concerns about applictions being stupid (resource
control) and applications being compromised with evil code and the trace
evidence being hidden that we are trying to protect by limiting changes
to these fields.

So if we can come up with a method that doesn't violate those
invariants, and doesn't lead to massive code maintenance we should be
good.  Reusing exec is just the easiest way to get there.

Eric

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [CRIU] [PATCH 1/3] prctl: reduce permissions to change boundaries of data, brk and stack
  2014-02-14 20:44           ` Andrey Wagin
@ 2014-02-15 23:05             ` Eric W. Biederman
  0 siblings, 0 replies; 21+ messages in thread
From: Eric W. Biederman @ 2014-02-15 23:05 UTC (permalink / raw)
  To: Andrey Wagin
  Cc: Cyrill Gorcunov, Aditya Kali, Stephen Rothwell, Pavel Emelyanov,
	Oleg Nesterov, LKML, criu, Al Viro, Andrew Morton, Kees Cook

Andrey Wagin <avagin@gmail.com> writes:

> 2014-02-14 23:16 GMT+04:00 Eric W. Biederman <ebiederm@xmission.com>:
>>
>> Hmm.  Let me rewind this a little bit.
>>
>> I want to be very stupid and ask the following.
>>
>> Why can't you have the process of interest do:
>>         ptrace(PTRACE_ATTACHME);
>>         execve(executable, args, ...);
>>
>>         /* Have the ptracer inject the recovery/fixup code */
>>         /* Fix up the mostly correct process to look like it has been
>>          * executing for a while.
>>          */
>>
>> That should work, set all of the interesting fields, and works as
>> non-root today.  My gut feel says do that and we can just
>> deprecate/remove prctl_set_mm.
>
> start_brk and start_stack are randomized each time. I don't understand
> how execve() can restore the origin values of attributes.

As is the location of the vdso and there isn't a way to set that.

So perhaps what we want to do is to change the randomization with
mremap(old_addr, size, size, MREMAP_MAYMOVE | MREMAP_FIXED, new_addr)
and just have the kernel update all of the addresses in bulk when we
move the location.

I don't know what the folks who are worried about losing tampering
evidence will think but as a targeted special case it may not be at all
crazy.

Eric

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [CRIU] [PATCH 1/3] prctl: reduce permissions to change boundaries of data, brk and stack
  2014-02-14 20:09             ` Eric W. Biederman
@ 2014-02-17  8:34               ` Pavel Emelyanov
  2014-02-17  8:52                 ` Cyrill Gorcunov
  2014-03-07 13:51                 ` Pavel Emelyanov
  0 siblings, 2 replies; 21+ messages in thread
From: Pavel Emelyanov @ 2014-02-17  8:34 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Cyrill Gorcunov, Andrew Vagin, Aditya Kali, Stephen Rothwell,
	Oleg Nesterov, linux-kernel, criu, Al Viro, Andrew Morton,
	Kees Cook

On 02/15/2014 12:09 AM, Eric W. Biederman wrote:
> Pavel Emelyanov <xemul@parallels.com> writes:
> 
>> On 02/14/2014 11:16 PM, Eric W. Biederman wrote:
>>> Cyrill Gorcunov <gorcunov@gmail.com> writes:
>>>
>>>> On Fri, Feb 14, 2014 at 09:43:14PM +0400, Andrew Vagin wrote:
>>>>>> My brain hurts just looking at this patch and how you are justifying it.
>>>>>>
>>>>>> For the resources you are mucking with below all you have to do is to
>>>>>> verify that you are below the appropriate rlimit at all times and no
>>>>>> CAP_SYS_RESOURCE check is needed.  You only need CAP_SYS_RESOURCE
>>>>>> to exceed your per process limits.
>>>>>>
>>>>>> All you have to do is to fix the current code to properly enforce the
>>>>>> limits.
>>>>>
>>>>> I'm afraid what you are suggesting doesn't work.
>>>>>
>>>>> The first reason is that we can not change both boundaries in one call.
>>>>> But when we are restoring these attributes, we may need to move their
>>>>> too far.
>>>>
>>>> When this code was introduced, there were no user-namespace implementation,
>>>> if I remember correctly, so CAP_SYS_RESOURCE was enough barrier point
>>>> to prevent modifying this values by anyone. Now user-ns brings a limit --
>>>> we need somehow to provide a way to modify these mm fields having no
>>>> CAP_SYS_RESOURCE set. "Verifying rlimit" not an option here because
>>>> we're modifying members one by one (looking back I think this was not
>>>> a good idea to modify the fields in this manner).
>>>>
>>>> Maybe we could improve this api and provide argument as a pointer
>>>> to a structure, which would have all the fields we're going to
>>>> modify, which in turn would allow us to verify that all new values
>>>> are sane and fit rlimits, then we could (probably) deprecate old
>>>> api if noone except c/r camp is using it (I actually can't imagine
>>>> who else might need this api). Then CAP_SYS_RESOURCE requirement
>>>> could be ripped off. Hm? (sure touching api is always "no-no"
>>>> case, but maybe...)
>>>
>>> Hmm.  Let me rewind this a little bit.
>>>
>>> I want to be very stupid and ask the following.
>>>
>>> Why can't you have the process of interest do:
>>> 	ptrace(PTRACE_ATTACHME);
>>> 	execve(executable, args, ...);
>>>         
>>>         /* Have the ptracer inject the recovery/fixup code */
>>> 	/* Fix up the mostly correct process to look like it has been
>>>          * executing for a while.
>>>          */
>>
>> Let's imagine we do that.
>>
>> This means, that the whole memory contents should be restored _after_
>> the execve() call, since the execve() flushes old mappings. In
>> that case we lose the ability to preserve any shared memory regions
>> between any two processes. This "shared" can be either regular
>> MAP_SHARED mappings or MAP_ANONYMOUS but still not COW-ed ones.
> 
> If we have MAP_ANONYMOUS but not COW-ed mappings we have the correct
> executable, which implies we have everything else correct except for the
> brk and the stack addresses, because the process was started with fork.
> 
> So while that sounds like an interesting case to handle it does not seem
> to invalidate the idea of using exec to set all of the other fields when
> we need to set them.

Well, yes, what you propose we call "inheritable resources". These are, e.g.
SIDs or shared FD-table/MM-s. That's OK to restore them at fork(), but I'd like
to draw your attention to two concerns I have with this approach.

1. Inheritable resources can be potentially restored more than one time. Consider
   you have tasks tree look like this:

task-A  -[exe]-> A
 `- task-C1 -[exe]-> C
 `- task-C2 -[exe]-> C

IOW -- task A has executable A and two kids C1 and C2 that share executable C.

In that case the restore sequence should look like this

* Task A calls execve() on C
* Task A forks C1
* Task A forks C2
* Task A calls execve() on A

This does work, I agree, but task A has to call execve() two times. And even more, if
we had e.g. D1 and D2 kids with different exe D. Now, why I think that's a problem?
Please, see concern #2 :)


2. What you propose means we have to effectively strace and execve-ing task. As
compared with plain prlctl this is up to ~600 times slower. I've made such an experiment.

* Idle node with plenty of free RAM
* Simple proggie doing execve() on self for 1000 times, compiled statically to avoid
  ld.so spoiling the times, run under strace
* Another proggie doing open() + prlctl() 1000 times.

The first task took ~12 sec to complete. The second -- ~0.02 seconds.

If we take an average container of 100 tasks, even with all different exe links, your
approach would give us ~1 sec more to restore, while existing one would be almost
no op. And this hits us even without the inheritance scenario I demonstrated above.

Please, keep in mind, that checkpoint-restore in not only live-migration, we have
use cases where restore cannot be pre-restored for better down-time. It _must_ be
as fast as possible.



That said, Eric, I do agree with your concern about security, I _am_ ready to rework
this stuff and kill the whole bunch of prctls we have. But please! Very please! Can
we come up with mm->foo-s and ->exe_link restoration API that is at most ... 5 times
slower than existing prlctl? It's really-really important for us!

Maybe we can make prlctl() do lite-execve()? It will open the executable, read the
required amount of headers and just put data red from there onto mm-struct? This 
should be MUCH better, that full execve() with loading all binary data plus strace
and flushing old mm-s.

> Eric

Thanks,
Pavel


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [CRIU] [PATCH 1/3] prctl: reduce permissions to change boundaries of data, brk and stack
  2014-02-17  8:34               ` Pavel Emelyanov
@ 2014-02-17  8:52                 ` Cyrill Gorcunov
  2014-02-17 16:57                   ` Pavel Emelyanov
  2014-03-07 13:51                 ` Pavel Emelyanov
  1 sibling, 1 reply; 21+ messages in thread
From: Cyrill Gorcunov @ 2014-02-17  8:52 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Eric W. Biederman, Andrew Vagin, Aditya Kali, Stephen Rothwell,
	Oleg Nesterov, linux-kernel, criu, Al Viro, Andrew Morton,
	Kees Cook

On Mon, Feb 17, 2014 at 12:34:12PM +0400, Pavel Emelyanov wrote:
...
> Maybe we can make prlctl() do lite-execve()? It will open the executable, read the
> required amount of headers and just put data red from there onto mm-struct? This 
> should be MUCH better, that full execve() with loading all binary data plus strace
> and flushing old mm-s.

Well, this would be good, except I don't know how would we deal with executables
which are running but deleted, where would we fetch these headers from? (Note the
program can map new executable region, jump there and unmap own text section).

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [CRIU] [PATCH 1/3] prctl: reduce permissions to change boundaries of data, brk and stack
  2014-02-17  8:52                 ` Cyrill Gorcunov
@ 2014-02-17 16:57                   ` Pavel Emelyanov
  0 siblings, 0 replies; 21+ messages in thread
From: Pavel Emelyanov @ 2014-02-17 16:57 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Eric W. Biederman, Andrew Vagin, Aditya Kali, Stephen Rothwell,
	Oleg Nesterov, linux-kernel, criu, Al Viro, Andrew Morton,
	Kees Cook

On 02/17/2014 12:52 PM, Cyrill Gorcunov wrote:
> On Mon, Feb 17, 2014 at 12:34:12PM +0400, Pavel Emelyanov wrote:
> ...
>> Maybe we can make prlctl() do lite-execve()? It will open the executable, read the
>> required amount of headers and just put data red from there onto mm-struct? This 
>> should be MUCH better, that full execve() with loading all binary data plus strace
>> and flushing old mm-s.
> 
> Well, this would be good, except I don't know how would we deal with executables
> which are running but deleted, where would we fetch these headers from? (Note the
> program can map new executable region, jump there and unmap own text section).

If we meet opened but unlinked file we handle this by either carrying the file
with us, or by creating a hard-link when possible. Anyway, this is not a problem,
we can handle this.

As far as unmapped .text section is concerned. It also doesn't matter much. I we
agree on API that doesn't flush old mappings (I really hope we do), we will be
able to remap them as needed.

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/3] capabilities: add a secure bit to allow changing a task exe link
  2014-02-14 14:13 ` [PATCH 2/3] capabilities: add a secure bit to allow changing a task exe link Andrey Vagin
@ 2014-02-18  4:53   ` Serge E. Hallyn
  0 siblings, 0 replies; 21+ messages in thread
From: Serge E. Hallyn @ 2014-02-18  4:53 UTC (permalink / raw)
  To: Andrey Vagin
  Cc: linux-kernel, criu, Andrew Morton, Oleg Nesterov, Al Viro,
	Kees Cook, Eric W. Biederman, Stephen Rothwell, Pavel Emelyanov,
	Aditya Kali

Quoting Andrey Vagin (avagin@openvz.org):
> When we restore a task we need to restore its exe link from userspace to
> the values the task had at checkpoint time.
> 
> Currently this operations required the global CAP_SYS_RESOURCE, which is
> always absent in a non-root user namespace.
> 
> So this patch introduces a new security bit which:
> * can be set only if a task has the global CAP_SYS_RESOURCE
> * inherited  by  child  processes
> * is saved when a task moves in another userns
> * allows to change a task exe link even if a task doesn't have CAP_SYS_RESOURCE

I'm late to this party anyway, but fwiw I don't like this use
of securebits.  It also seems to prevent c/r in a nested
container anyway so wouldn't seem to suffice.

But I assume I don't really need to argue it as it appears Pavel
and Eric are looking into a better all-around design.

> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Stephen Rothwell <sfr@canb.auug.org.au>
> Cc: Pavel Emelyanov <xemul@parallels.com>
> Cc: Aditya Kali <adityakali@google.com>
> Signed-off-by: Andrey Vagin <avagin@openvz.org>
> ---
>  include/uapi/linux/securebits.h | 12 +++++++++++-
>  kernel/sys.c                    |  5 +++++
>  kernel/user_namespace.c         |  3 ++-
>  security/commoncap.c            |  7 +++++++
>  4 files changed, 25 insertions(+), 2 deletions(-)
> 
> diff --git a/include/uapi/linux/securebits.h b/include/uapi/linux/securebits.h
> index 985aac9..c99803b 100644
> --- a/include/uapi/linux/securebits.h
> +++ b/include/uapi/linux/securebits.h
> @@ -43,9 +43,19 @@
>  #define SECBIT_KEEP_CAPS	(issecure_mask(SECURE_KEEP_CAPS))
>  #define SECBIT_KEEP_CAPS_LOCKED (issecure_mask(SECURE_KEEP_CAPS_LOCKED))
>  
> +/* When set, a process can do PR_SET_MM_EXE_FILE even if it doesn't
> + * have CAP_SYS_RESOURCE. Setting of this bit requires CAP_SYS_RESOURCE.
> + * This bit is not dropped when a task moves in another userns. */
> +#define SECURE_SET_EXE_FILE		6
> +#define SECURE_SET_EXE_FILE_LOCKED	7  /* make bit-6 immutable */
> +
> +#define SECBIT_SET_EXE_FILE	   (issecure_mask(SECURE_SET_EXE_FILE))
> +#define SECBIT_SET_EXE_FILE_LOCKED (issecure_mask(SECURE_SET_EXE_FILE_LOCKED))
> +
>  #define SECURE_ALL_BITS		(issecure_mask(SECURE_NOROOT) | \
>  				 issecure_mask(SECURE_NO_SETUID_FIXUP) | \
> -				 issecure_mask(SECURE_KEEP_CAPS))
> +				 issecure_mask(SECURE_KEEP_CAPS) | \
> +				 issecure_mask(SECURE_SET_EXE_FILE))
>  #define SECURE_ALL_LOCKS	(SECURE_ALL_BITS << 1)
>  
>  #endif /* _UAPI_LINUX_SECUREBITS_H */
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 939370c..2f0925d 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -18,6 +18,7 @@
>  #include <linux/kernel.h>
>  #include <linux/workqueue.h>
>  #include <linux/capability.h>
> +#include <linux/securebits.h>
>  #include <linux/device.h>
>  #include <linux/key.h>
>  #include <linux/times.h>
> @@ -1714,6 +1715,10 @@ static int prctl_set_mm(int opt, unsigned long addr,
>  			if (rlimit(RLIMIT_STACK) < RLIM_INFINITY)
>  				return -EPERM;
>  			break;
> +		case PR_SET_MM_EXE_FILE:
> +			if (!issecure(SECURE_SET_EXE_FILE))
> +				return -EPERM;
> +			break;
>  		default:
>  			return -EPERM;
>  		}
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 240fb62..59584fe 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -34,7 +34,8 @@ static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns)
>  	/* Start with the same capabilities as init but useless for doing
>  	 * anything as the capabilities are bound to the new user namespace.
>  	 */
> -	cred->securebits = SECUREBITS_DEFAULT;
> +	cred->securebits = SECUREBITS_DEFAULT |
> +				(cred->securebits & SECBIT_SET_EXE_FILE);
>  	cred->cap_inheritable = CAP_EMPTY_SET;
>  	cred->cap_permitted = CAP_FULL_SET;
>  	cred->cap_effective = CAP_FULL_SET;
> diff --git a/security/commoncap.c b/security/commoncap.c
> index b9d613e..eda1eb8 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -907,6 +907,13 @@ int cap_task_prctl(int option, unsigned long arg2, unsigned long arg3,
>  		    )
>  			/* cannot change a locked bit */
>  			goto error;
> +
> +		/* Setting SECURE_SET_EXE_FILE requires CAP_SYS_RESOURCE */
> +		if ((arg2 & SECBIT_SET_EXE_FILE) &&
> +		    !(new->securebits & SECBIT_SET_EXE_FILE) &&
> +		    !capable(CAP_SYS_RESOURCE))
> +			goto error;
> +
>  		new->securebits = arg2;
>  		goto changed;
>  
> -- 
> 1.8.5.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [CRIU] [PATCH 1/3] prctl: reduce permissions to change boundaries of data, brk and stack
  2014-02-17  8:34               ` Pavel Emelyanov
  2014-02-17  8:52                 ` Cyrill Gorcunov
@ 2014-03-07 13:51                 ` Pavel Emelyanov
  1 sibling, 0 replies; 21+ messages in thread
From: Pavel Emelyanov @ 2014-03-07 13:51 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Cyrill Gorcunov, Andrew Vagin, Aditya Kali, Stephen Rothwell,
	Oleg Nesterov, linux-kernel, criu, Al Viro, Andrew Morton,
	Kees Cook

Hi, Eric,

>>>> Why can't you have the process of interest do:
>>>> 	ptrace(PTRACE_ATTACHME);
>>>> 	execve(executable, args, ...);
>>>>         
>>>>         /* Have the ptracer inject the recovery/fixup code */
>>>> 	/* Fix up the mostly correct process to look like it has been
>>>>          * executing for a while.
>>>>          */

> 2. What you propose means we have to effectively strace and execve-ing task. As
> compared with plain prlctl this is up to ~600 times slower. I've made such an experiment.

Have you had time to think on the issue? If the prctl restrictions do not work,
what else can it be?

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2014-03-07 13:51 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-14 14:13 [PATCH RFC 0/3] c/r: add ability to restore mm attributes in a non-root userns Andrey Vagin
2014-02-14 14:13 ` [PATCH 1/3] prctl: reduce permissions to change boundaries of data, brk and stack Andrey Vagin
2014-02-14 16:05   ` Eric W. Biederman
2014-02-14 17:43     ` Andrew Vagin
2014-02-14 18:01       ` [CRIU] " Cyrill Gorcunov
2014-02-14 19:16         ` Eric W. Biederman
2014-02-14 19:47           ` Pavel Emelyanov
2014-02-14 20:06             ` Cyrill Gorcunov
2014-02-14 20:18               ` Eric W. Biederman
2014-02-15  6:29                 ` Cyrill Gorcunov
2014-02-15 23:01                   ` Eric W. Biederman
2014-02-14 20:09             ` Eric W. Biederman
2014-02-17  8:34               ` Pavel Emelyanov
2014-02-17  8:52                 ` Cyrill Gorcunov
2014-02-17 16:57                   ` Pavel Emelyanov
2014-03-07 13:51                 ` Pavel Emelyanov
2014-02-14 20:44           ` Andrey Wagin
2014-02-15 23:05             ` Eric W. Biederman
2014-02-14 14:13 ` [PATCH 2/3] capabilities: add a secure bit to allow changing a task exe link Andrey Vagin
2014-02-18  4:53   ` Serge E. Hallyn
2014-02-14 14:13 ` [PATCH 3/3] prctl: allow to use PR_MM_SET_* which affect only a current task Andrey Vagin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.