* [manpages PATCH] capabilities.7: describe namespaced file capabilities
@ 2018-01-09 18:52 ` Serge E. Hallyn
0 siblings, 0 replies; 22+ messages in thread
From: Serge E. Hallyn @ 2018-01-09 18:52 UTC (permalink / raw)
To: mtk.manpages@gmail.com, Eric W. Biederman
Cc: linux-man, Seth Forshee, linux-api, linux-security-module,
Kees Cook, Andreas Gruenbacher, Andy Lutomirski, Andrew G. Morgan
Update the capabilities(7) manpage with a description of the
new-ish namespaced file capability support.
A note on userspace tools: since the kernel will automatically
convert between v2 and v3 xattrs, and translate nsroot between
v3 xattrs, we can make do with the current getcap(8) and setcap(8)
tools. I.e. a user on the host can create a transient user namespace
with the appropriate mappings and run setcap(8) there. The kernel
will automatically write a v3 xattr with the transient namespace's
root user as nsroot.
Signed-off-by: Serge Hallyn <shallyn@cisco.com>
---
man7/capabilities.7 | 44 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 44 insertions(+)
diff --git a/man7/capabilities.7 b/man7/capabilities.7
index 166eaaf..76e7e02 100644
--- a/man7/capabilities.7
+++ b/man7/capabilities.7
@@ -936,6 +936,50 @@ if we specify the effective flag as being enabled for any capability,
then the effective flag must also be specified as enabled
for all other capabilities for which the corresponding permitted or
inheritable flags is enabled.
+.PP
+Until 4.13, only VFS_CAP_REVISION_2 xattrs were supported. These store only
+the capabilities to be applied to the file, with no record of the writer's
+credentials. Therefore only privileged users can be trusted to write them, and
+.BR CAP_SETFCAP
+over the user namespace which mounted the filesystem (usually the initial user
+namespace) is required. This makes it impossible to write file capabilities
+from a user namespaced container, which causes some package updates to fail.
+.PP
+In order to support setting file capabilities in containers, the
+kernel must be able to identify whether the task executing the
+file will be constrained to a subset of the resources over which
+the writer of the file capabilities has privilege. To this end,
+since 4.13, VFS_CAP_REVISION_3 capabilities store the user ID
+of the root user in the writer's namespace ("nsroot"). Hence the writer only
+requires
+.IP 1.
+.BR CAP_SETFCAP
+over the file inode, meaning the writing task must have
+.BR CAP_SETFCAP
+over a user namespace into which the inode's owning user ID is mapped.
+.PP
+and
+.IP 2.
+.BR CAP_SETFCAP
+over the writer's own user namespace.
+.PP
+A VFS_CAP_REVISION_3 file capability will take effect only when run in a user namespace
+whose UID 0 maps to the saved "nsroot", or a descendant of such a namespace.
+.PP
+Users with the required privilege may use
+.BR setxattr(2)
+to request either a VFS_CAP_REVISION_2 or VFS_CAP_REVISION_3 write.
+The kernel will automatically convert a VFS_CAP_REVISION_2 to a
+VFS_CAP_REVISION_3 extended attribute with the "nsroot"
+set to the root user in the writer's user namespace, or, if a VFS_CAP_REVISION_3
+extended attribute is specified, then the kernel will map the
+specified root user ID (which must be a valid user ID mapped in the caller's
+user namespace) into the initial user namespace. Likewise,
+.BR getxattr(2)
+results will be converted and simplified to show a VFS_CAP_REVISION_2
+extended attribute, if a VFS_CAP_REVISION_3 applies to the caller's
+namespace, or to map the VFS_CAP_REVISION_3 root user ID into the
+caller's namespace.
.\"
.SS Transformation of capabilities during execve()
.PP
--
1.9.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [manpages PATCH] capabilities.7: describe namespaced file capabilities
2018-01-09 18:52 ` Serge E. Hallyn
@ 2018-01-14 9:40 ` Michael Kerrisk (man-pages)
-1 siblings, 0 replies; 22+ messages in thread
From: Michael Kerrisk (man-pages) @ 2018-01-14 9:40 UTC (permalink / raw)
To: Serge E. Hallyn, Eric W. Biederman
Cc: mtk.manpages@gmail.com, linux-man, Seth Forshee, linux-api,
linux-security-module, Kees Cook, Andreas Gruenbacher,
Andy Lutomirski, Andrew G. Morgan
Hello Serge,
On 01/09/2018 07:52 PM, Serge E. Hallyn wrote:
> Update the capabilities(7) manpage with a description of the
> new-ish namespaced file capability support.
Thanks for this patch. I'm trying to craft a modified version
based on your text, so no need to send a new version at this
stage, but I do have some questions below.
> A note on userspace tools: since the kernel will automatically
> convert between v2 and v3 xattrs, and translate nsroot between
> v3 xattrs, we can make do with the current getcap(8) and setcap(8)
> tools. I.e. a user on the host can create a transient user namespace
> with the appropriate mappings and run setcap(8) there. The kernel
> will automatically write a v3 xattr with the transient namespace's
> root user as nsroot.
>
> Signed-off-by: Serge Hallyn <shallyn@cisco.com>
> ---
> man7/capabilities.7 | 44 ++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 44 insertions(+)
>
> diff --git a/man7/capabilities.7 b/man7/capabilities.7
> index 166eaaf..76e7e02 100644
> --- a/man7/capabilities.7
> +++ b/man7/capabilities.7
> @@ -936,6 +936,50 @@ if we specify the effective flag as being enabled for any capability,
> then the effective flag must also be specified as enabled
> for all other capabilities for which the corresponding permitted or
> inheritable flags is enabled.
> +.PP
> +Until 4.13, only VFS_CAP_REVISION_2 xattrs were supported. These store only
> +the capabilities to be applied to the file, with no record of the writer's
> +credentials. Therefore only privileged users can be trusted to write them, and
> +.BR CAP_SETFCAP
> +over the user namespace which mounted the filesystem (usually the initial user
> +namespace) is required. This makes it impossible to write file capabilities
> +from a user namespaced container, which causes some package updates to fail.
> +.PP
> +In order to support setting file capabilities in containers, the
> +kernel must be able to identify whether the task executing the
> +file will be constrained to a subset of the resources over which
> +the writer of the file capabilities has privilege. To this end,
> +since 4.13, VFS_CAP_REVISION_3 capabilities store the user ID
> +of the root user in the writer's namespace ("nsroot").
Here, "nsroot" means the UID 0 in the namespace as it would be mapped
into the initial userns, right?
> Hence the writer only
> +requires
> +.IP 1.
> +.BR CAP_SETFCAP
> +over the file inode, meaning the writing task must have
> +.BR CAP_SETFCAP
> +over a user namespace into which the inode's owning user ID is mapped.
I don't understand the above line. Could you explain with an example?
Cheers,
Michael
> +.PP
> +and
> +.IP 2.
> +.BR CAP_SETFCAP
> +over the writer's own user namespace.
> +.PP
> +A VFS_CAP_REVISION_3 file capability will take effect only when run in a user namespace
> +whose UID 0 maps to the saved "nsroot", or a descendant of such a namespace.
> +.PP
> +Users with the required privilege may use
> +.BR setxattr(2)
> +to request either a VFS_CAP_REVISION_2 or VFS_CAP_REVISION_3 write.
> +The kernel will automatically convert a VFS_CAP_REVISION_2 to a
> +VFS_CAP_REVISION_3 extended attribute with the "nsroot"
> +set to the root user in the writer's user namespace, or, if a VFS_CAP_REVISION_3
> +extended attribute is specified, then the kernel will map the
> +specified root user ID (which must be a valid user ID mapped in the caller's
> +user namespace) into the initial user namespace. Likewise,
> +.BR getxattr(2)
> +results will be converted and simplified to show a VFS_CAP_REVISION_2
> +extended attribute, if a VFS_CAP_REVISION_3 applies to the caller's
> +namespace, or to map the VFS_CAP_REVISION_3 root user ID into the
> +caller's namespace.
> .\"
> .SS Transformation of capabilities during execve()
> .PP
>
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [manpages PATCH] capabilities.7: describe namespaced file capabilities
2018-01-14 9:40 ` Michael Kerrisk (man-pages)
@ 2018-01-15 4:31 ` Serge E. Hallyn
-1 siblings, 0 replies; 22+ messages in thread
From: Serge E. Hallyn @ 2018-01-15 4:31 UTC (permalink / raw)
To: Michael Kerrisk (man-pages)
Cc: Serge E. Hallyn, Eric W. Biederman, linux-man, Seth Forshee,
linux-api, linux-security-module, Kees Cook, Andreas Gruenbacher,
Andy Lutomirski, Andrew G. Morgan
Quoting Michael Kerrisk (man-pages) (mtk.manpages@gmail.com):
> Hello Serge,
>
> On 01/09/2018 07:52 PM, Serge E. Hallyn wrote:
> > Update the capabilities(7) manpage with a description of the
> > new-ish namespaced file capability support.
>
> Thanks for this patch. I'm trying to craft a modified version
> based on your text, so no need to send a new version at this
> stage, but I do have some questions below.
Awesome, thanks.
> > A note on userspace tools: since the kernel will automatically
> > convert between v2 and v3 xattrs, and translate nsroot between
> > v3 xattrs, we can make do with the current getcap(8) and setcap(8)
> > tools. I.e. a user on the host can create a transient user namespace
> > with the appropriate mappings and run setcap(8) there. The kernel
> > will automatically write a v3 xattr with the transient namespace's
> > root user as nsroot.
> >
> > Signed-off-by: Serge Hallyn <shallyn@cisco.com>
> > ---
> > man7/capabilities.7 | 44 ++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 44 insertions(+)
> >
> > diff --git a/man7/capabilities.7 b/man7/capabilities.7
> > index 166eaaf..76e7e02 100644
> > --- a/man7/capabilities.7
> > +++ b/man7/capabilities.7
> > @@ -936,6 +936,50 @@ if we specify the effective flag as being enabled for any capability,
> > then the effective flag must also be specified as enabled
> > for all other capabilities for which the corresponding permitted or
> > inheritable flags is enabled.
> > +.PP
> > +Until 4.13, only VFS_CAP_REVISION_2 xattrs were supported. These store only
> > +the capabilities to be applied to the file, with no record of the writer's
> > +credentials. Therefore only privileged users can be trusted to write them, and
> > +.BR CAP_SETFCAP
> > +over the user namespace which mounted the filesystem (usually the initial user
> > +namespace) is required. This makes it impossible to write file capabilities
> > +from a user namespaced container, which causes some package updates to fail.
> > +.PP
> > +In order to support setting file capabilities in containers, the
> > +kernel must be able to identify whether the task executing the
> > +file will be constrained to a subset of the resources over which
> > +the writer of the file capabilities has privilege. To this end,
> > +since 4.13, VFS_CAP_REVISION_3 capabilities store the user ID
> > +of the root user in the writer's namespace ("nsroot").
>
> Here, "nsroot" means the UID 0 in the namespace as it would be mapped
> into the initial userns, right?
Right. If we can come up with a better name that would be great.
> > Hence the writer only
> > +requires
> > +.IP 1.
> > +.BR CAP_SETFCAP
> > +over the file inode, meaning the writing task must have
> > +.BR CAP_SETFCAP
> > +over a user namespace into which the inode's owning user ID is mapped.
>
> I don't understand the above line. Could you explain with an example?
If the file is owned by uid 1000, then uid 1000 can create a new user
ns in which 1000 is mapped to 0. In this namespace, the new task has
CAP_SETFCAP over the user ns, and 1000 is mapped into the userns (as
0), so the write is allowed.
In the above example, if the xattr being written was v2, then the
actual written xattr will be v3 with nsroot=1000.
If the xattr was v3, with nsroot=0, then nsroot=1000 will be written.
If the xattr was v3, with nsroot=500, where 500 is not mapped from
the userns, then the write will be forbidden.
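
A minimal sketch of the three cases above, not the kernel code. It
assumes a uid map can be modelled as a list of
(ns_start, host_start, count) extents, the same three columns as a line
of /proc/PID/uid_map; the function names to_host() and stored_nsroot()
are hypothetical and exist only for this illustration. The stored
nsroot is expressed as a host uid, as in the examples above.

```python
# Hypothetical model: one tuple per extent, (ns_start, host_start, count).
def to_host(uid_map, ns_uid):
    """Translate a uid inside the namespace to a host uid, or None if unmapped."""
    for ns_start, host_start, count in uid_map:
        if ns_start <= ns_uid < ns_start + count:
            return host_start + (ns_uid - ns_start)
    return None


def stored_nsroot(uid_map, version, requested_nsroot=0):
    """Return the nsroot that ends up in the xattr for a write from this namespace."""
    # A v2 request carries no nsroot, so the writer's own uid 0 is used.
    ns_uid = 0 if version == 2 else requested_nsroot
    host_uid = to_host(uid_map, ns_uid)
    if host_uid is None:
        # Corresponds to the "write will be forbidden" case.
        raise PermissionError(f"nsroot {ns_uid} is not mapped in the writer's namespace")
    return host_uid


# uid 1000 on the host creates a userns in which it is uid 0.
userns = [(0, 1000, 1)]

print(stored_nsroot(userns, version=2))                      # 1000
print(stored_nsroot(userns, version=3, requested_nsroot=0))  # 1000
```

Requesting version=3 with requested_nsroot=500 raises PermissionError
in this model, since 500 has no mapping in the namespace.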
As another allowed case, suppose I'm uid 1000 and setting up a container
where host uid 100005 is mapped to uid 5. I create a userns where host
uids 100000-165535 map to namespace uids 0-65535. Then as root in the
namespace I have CAP_SETFCAP over the namespace, and 100005 is
mapped in the namespace, so I can write to the file.
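
The check in this case can be sketched the same way. This is only an
illustration under the assumption that a uid map is a list of
(ns_start, host_start, count) extents; from_host() and
may_write_filecap() are hypothetical names, and the real kernel check
involves more than the two conditions modelled here.

```python
# Hypothetical model: one tuple per extent, (ns_start, host_start, count).
def from_host(uid_map, host_uid):
    """Translate a host uid to its uid inside the namespace, or None if unmapped."""
    for ns_start, host_start, count in uid_map:
        if host_start <= host_uid < host_start + count:
            return ns_start + (host_uid - host_start)
    return None


def may_write_filecap(uid_map, inode_owner_host_uid, has_cap_setfcap):
    """Writer needs CAP_SETFCAP in its userns and the inode owner mapped there."""
    return has_cap_setfcap and from_host(uid_map, inode_owner_host_uid) is not None


# Host uids 100000-165535 map to namespace uids 0-65535.
container = [(0, 100000, 65536)]

print(from_host(container, 100005))                 # 5
print(may_write_filecap(container, 100005, True))   # True
print(may_write_filecap(container, 99999, True))    # False: owner not mapped
```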
As a final, nested example: I'm uid 1000 and have uids 100000-300000
as my delegated subuids. I create a container with that full range,
and am running as root there (100000). Now I create a nested container
where 100000-165535 (which are really 200000-265535 on the host) will
be mapped to 0-65535. In its rootfs I write /bin/ping with cap_net_raw=pe
and just for fun make it owned by nested uid 5.
So /bin/ping is owned by
hostuid 200005 = c1 uid 100005 = c2 uid 5
As root in the container I have CAP_SETFCAP over a userns where c2 uid 5
is mapped, so I'm allowed to write a filecap.
If I write it as v2 xattr, then the actual written xattr will be v3 with
nsroot=100000, if I simply write it as root in c1, or nsroot=200000 if
I enter the nested container before writing it.
There are several more options, but let's just pick one - and assume that
as root in the first container (hostuid 100000) I request a v3 xattr
with nsroot=100000. The actual written xattr will have nsroot=200000.
Now when uid 1000 in the nested container runs /bin/ping, the kernel will
see that that task's user namespace has uid 0 mapped to 200000, and so
the fscap will be honored.
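
The arithmetic of the nested example can be checked with a short
sketch. It assumes each namespace's uid map is a list of
(ns_start, parent_start, count) extents relative to its parent, and that
the delegated range 100000-300000 is inclusive (200001 uids). The names
to_parent(), to_host(), c1 and c2 are hypothetical, and the "or a
descendant of such a namespace" rule is not modelled.

```python
# Hypothetical model: one tuple per extent, (ns_start, parent_start, count).
def to_parent(uid_map, ns_uid):
    """Translate a uid one level up, or None if it is not mapped."""
    for ns_start, parent_start, count in uid_map:
        if ns_start <= ns_uid < ns_start + count:
            return parent_start + (ns_uid - ns_start)
    return None


def to_host(chain, ns_uid):
    """Translate through a chain of maps, ordered innermost namespace first."""
    uid = ns_uid
    for uid_map in chain:
        uid = to_parent(uid_map, uid)
        if uid is None:
            return None
    return uid


c1 = [(0, 100000, 200001)]  # first container: c1 uid 0 is host uid 100000
c2 = [(0, 100000, 65536)]   # nested container: c2 uid 0 is c1 uid 100000

print(to_host([c2, c1], 5))   # 200005, host owner of /bin/ping
print(to_host([c1], 0))       # 100000, nsroot for a v2 write as root in c1
print(to_host([c2, c1], 0))   # 200000, nsroot for a v2 write as root in c2

# v3 write requested from c1 with nsroot=100000 (a c1 uid):
stored = to_host([c1], 100000)
print(stored)                 # 200000

# Exec in c2: honored when c2's uid 0 maps to the stored nsroot.
print(to_host([c2, c1], 0) == stored)  # True
```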
-serge
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [manpages PATCH] capabilities.7: describe namespaced file capabilities
2018-01-09 18:52 ` Serge E. Hallyn
@ 2018-01-16 17:26 ` Jann Horn
-1 siblings, 0 replies; 22+ messages in thread
From: Jann Horn @ 2018-01-16 17:26 UTC (permalink / raw)
To: Serge E. Hallyn
Cc: Michael Kerrisk-manpages, Eric W. Biederman, linux-man,
Seth Forshee, Linux API, linux-security-module, Kees Cook,
Andreas Gruenbacher, Andy Lutomirski, Andrew G. Morgan
On Tue, Jan 9, 2018 at 7:52 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> Update the capabilities(7) manpage with a description of the
> new-ish namespaced file capability support.
>
> A note on userspace tools: since the kernel will automatically
> convert between v2 and v3 xattrs, and translate nsroot between
> v3 xattrs, we can make do with the current getcap(8) and setcap(8)
> tools. I.e. a user on the host can create a transient user namespace
> with the appropriate mappings and run setcap(8) there. The kernel
> will automatically write a v3 xattr with the transient namespace's
> root user as nsroot.
>
> Signed-off-by: Serge Hallyn <shallyn@cisco.com>
> ---
> man7/capabilities.7 | 44 ++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 44 insertions(+)
>
> diff --git a/man7/capabilities.7 b/man7/capabilities.7
> index 166eaaf..76e7e02 100644
> --- a/man7/capabilities.7
> +++ b/man7/capabilities.7
> @@ -936,6 +936,50 @@ if we specify the effective flag as being enabled for any capability,
> then the effective flag must also be specified as enabled
> for all other capabilities for which the corresponding permitted or
> inheritable flags is enabled.
> +.PP
> +Until 4.13, only VFS_CAP_REVISION_2 xattrs were supported. These store only
> +the capabilities to be applied to the file, with no record of the writer's
> +credentials. Therefore only privileged users can be trusted to write them, and
> +.BR CAP_SETFCAP
> +over the user namespace which mounted the filesystem (usually the initial user
> +namespace) is required. This makes it impossible to write file capabilities
> +from a user namespaced container, which causes some package updates to fail.
> +.PP
> +In order to support setting file capabilities in containers, the
> +kernel must be able to identify whether the task executing the
> +file will be constrained to a subset of the resources over which
> +the writer of the file capabilities has privilege. To this end,
> +since 4.13, VFS_CAP_REVISION_3 capabilities store the user ID
> +of the root user in the writer's namespace ("nsroot"). Hence the writer only
> +requires
> +.IP 1.
> +.BR CAP_SETFCAP
> +over the file inode, meaning the writing task must have
> +.BR CAP_SETFCAP
> +over a user namespace into which the inode's owning user ID is mapped.
> +.PP
> +and
> +.IP 2.
> +.BR CAP_SETFCAP
> +over the writer's own user namespace.
I think that the following would be clearer (but technically
equivalent): "Hence the writer only requires CAP_SETFCAP over the file
inode, meaning that the writing task must have CAP_SETFCAP in its own
user namespace and the UID and GID of the file inode must be mapped in
the writing task's user namespace.".
> +A VFS_CAP_REVISION_3 file capability will take effect only when run in a user namespace
> +whose UID 0 maps to the saved "nsroot", or a descendant of such a namespace.
> +.PP
> +Users with the required privilege may use
> +.BR setxattr(2)
> +to request either a VFS_CAP_REVISION_2 or VFS_CAP_REVISION_3 write.
> +The kernel will automatically convert a VFS_CAP_REVISION_2 to a
> +VFS_CAP_REVISION_3 extended attribute with the "nsroot"
> +set to the root user in the writer's user namespace, or, if a VFS_CAP_REVISION_3
> +extended attribute is specified, then the kernel will map the
> +specified root user ID (which must be a valid user ID mapped in the caller's
> +user namespace) into the initial user namespace.
Really, "into the initial user namespace"? That may be true for the
kernel-internal representation, but the on-disk representation is the
mapping into the user namespace that contains the mount namespace into
which the file system was mounted, right? This would become observable
when a file system is mounted in a different namespace than before, or
when working with FUSE in a namespace.
> Likewise,
> +.BR getxattr(2)
> +results will be converted and simplified to show a VFS_CAP_REVISION_2
> +extended attribute, if a VFS_CAP_REVISION_3 applies to the caller's
> +namespace, or to map the VFS_CAP_REVISION_3 root user ID into the
> +caller's namespace.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [manpages PATCH] capabilities.7: describe namespaced file capabilities
2018-01-16 17:26 ` Jann Horn
@ 2018-01-16 17:38 ` Serge E. Hallyn
-1 siblings, 0 replies; 22+ messages in thread
From: Serge E. Hallyn @ 2018-01-16 17:38 UTC (permalink / raw)
To: Jann Horn
Cc: Serge E. Hallyn, Michael Kerrisk-manpages, Eric W. Biederman,
linux-man, Seth Forshee, Linux API, linux-security-module,
Kees Cook, Andreas Gruenbacher, Andy Lutomirski,
Andrew G. Morgan
Quoting Jann Horn (jannh@google.com):
> On Tue, Jan 9, 2018 at 7:52 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> > Update the capabilities(7) manpage with a description of the
> > new-ish namespaced file capability support.
> >
> > A note on userspace tools: since the kernel will automatically
> > convert between v2 and v3 xattrs, and translate nsroot between
> > v3 xattrs, we can make do with the current getcap(8) and setcap(8)
> > tools. I.e. a user on the host can create a transient user namespace
> > with the appropriate mappings and run setcap(8) there. The kernel
> > will automatically write a v3 xattr with the transient namespace's
> > root user as nsroot.
> >
> > Signed-off-by: Serge Hallyn <shallyn@cisco.com>
> > ---
> > man7/capabilities.7 | 44 ++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 44 insertions(+)
> >
> > diff --git a/man7/capabilities.7 b/man7/capabilities.7
> > index 166eaaf..76e7e02 100644
> > --- a/man7/capabilities.7
> > +++ b/man7/capabilities.7
> > @@ -936,6 +936,50 @@ if we specify the effective flag as being enabled for any capability,
> > then the effective flag must also be specified as enabled
> > for all other capabilities for which the corresponding permitted or
> > inheritable flags is enabled.
> > +.PP
> > +Until 4.13, only VFS_CAP_REVISION_2 xattrs were supported. These store only
> > +the capabilities to be applied to the file, with no record of the writer's
> > +credentials. Therefore only privileged users can be trusted to write them, and
> > +.BR CAP_SETFCAP
> > +over the user namespace which mounted the filesystem (usually the initial user
> > +namespace) is required. This makes it impossible to write file capabilities
> > +from a user namespaced container, which causes some package updates to fail.
> > +.PP
> > +In order to support setting file capabilities in containers, the
> > +kernel must be able to identify whether the task executing the
> > +file will be constrained to a subset of the resources over which
> > +the writer of the file capabilities has privilege. To this end,
> > +since 4.13, VFS_CAP_REVISION_3 capabilities store the user ID
> > +of the root user in the writer's namespace ("nsroot"). Hence the writer only
> > +requires
> > +.IP 1.
> > +.BR CAP_SETFCAP
> > +over the file inode, meaning the writing task must have
> > +.BR CAP_SETFCAP
> > +over a user namespace into which the inode's owning user ID is mapped.
> > +.PP
> > +and
> > +.IP 2.
> > +.BR CAP_SETFCAP
> > +over the writer's own user namespace.
>
> I think that the following would be clearer (but technically
> equivalent): "Hence the writer only requires CAP_SETFCAP over the file
> inode, meaning that the writing task must have CAP_SETFCAP in its own
> user namespace and the UID and GID of the file inode must be mapped in
> the writing task's user namespace.".
Looks good to me.
> > +A VFS_CAP_REVISION_3 file capability will take effect only when run in a user namespace
> > +whose UID 0 maps to the saved "nsroot", or a descendant of such a namespace.
> > +.PP
> > +Users with the required privilege may use
> > +.BR setxattr(2)
> > +to request either a VFS_CAP_REVISION_2 or VFS_CAP_REVISION_3 write.
> > +The kernel will automatically convert a VFS_CAP_REVISION_2 to a
> > +VFS_CAP_REVISION_3 extended attribute with the "nsroot"
> > +set to the root user in the writer's user namespace, or, if a VFS_CAP_REVISION_3
> > +extended attribute is specified, then the kernel will map the
> > +specified root user ID (which must be a valid user ID mapped in the caller's
> > +user namespace) into the initial user namespace.
>
> Really, "into the initial user namespace"? That may be true for the
> kernel-internal representation, but the on-disk representation is the
> mapping into the user namespace that contains the mount namespace into
> which the file system was mounted, right?
Ah, yes, it is.
> This would become observable
> when a file system is mounted in a different namespace than before, or
> when working with FUSE in a namespace.
Yes it would.
Michael, you said you were reworking it, do you mind working this into
it as well?
thanks Jann,
-serge
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [manpages PATCH] capabilities.7: describe namespaced file capabilities
2018-01-16 17:38 ` Serge E. Hallyn
@ 2018-01-17 23:44 ` Michael Kerrisk (man-pages)
-1 siblings, 0 replies; 22+ messages in thread
From: Michael Kerrisk (man-pages) @ 2018-01-17 23:44 UTC (permalink / raw)
To: Serge E. Hallyn
Cc: Jann Horn, Eric W. Biederman, linux-man, Seth Forshee, Linux API,
linux-security-module-u79uwXL29TY76Z2rM5mHXA, Kees Cook,
Andreas Gruenbacher, Andy Lutomirski, Andrew G. Morgan
On 16 January 2018 at 18:38, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
> Quoting Jann Horn (jannh-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
>> On Tue, Jan 9, 2018 at 7:52 PM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
>> > Update the capabilities(7) manpage with a description of the
>> > new-ish namespaced file capability support.
>> >
>> > A note on userspace tools: since the kernel will automatically
>> > convert between v2 and v3 xattrs, and translate nsroot between
>> > v3 xattrs, we can make do with the current getcap(8) and setcap(8)
>> > tools. I.e. a user on the host can create a transient user namespace
>> > with the appropriate mappings and run setcap(8) there. The kernel
>> > will automatically write a v3 xattr with the transient namespace's
>> > root user as nsroot.
>> >
>> > Signed-off-by: Serge Hallyn <shallyn-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
>> > ---
>> > man7/capabilities.7 | 44 ++++++++++++++++++++++++++++++++++++++++++++
>> > 1 file changed, 44 insertions(+)
>> >
>> > diff --git a/man7/capabilities.7 b/man7/capabilities.7
>> > index 166eaaf..76e7e02 100644
>> > --- a/man7/capabilities.7
>> > +++ b/man7/capabilities.7
>> > @@ -936,6 +936,50 @@ if we specify the effective flag as being enabled for any capability,
>> > then the effective flag must also be specified as enabled
>> > for all other capabilities for which the corresponding permitted or
>> > inheritable flags is enabled.
>> > +.PP
>> > +Until 4.13, only VFS_CAP_REVISION_2 xattrs were supported. These store only
>> > +the capabilities to be applied to the file, with no record of the writer's
>> > +credentials. Therefore only privileged users can be trusted to write them, and
>> > +.BR CAP_SETFCAP
>> > +over the user namespace which mounted the filesystem (usually the initial user
>> > +namespace) is required. This makes it impossible to write file capabilities
>> > +from a user namespaced container, which causes some package updates to fail.
>> > +.PP
>> > +In order to support setting file capabilities in containers, the
>> > +kernel must be able to identify whether the task executing the
>> > +file will be constrained to a subset of the resources over which
>> > +the writer of the file capabilities has privilege. To this end,
>> > +since 4.13, VFS_CAP_REVISION_3 capabilities store the user ID
>> > +of the root user in the writer's namespace ("nsroot"). Hence the writer only
>> > +requires
>> > +.IP 1.
>> > +.BR CAP_SETFCAP
>> > +over the file inode, meaning the writing task must have
>> > +.BR CAP_SETFCAP
>> > +over a user namespace into which the inode's owning user ID is mapped.
>> > +.PP
>> > +and
>> > +.IP 2.
>> > +.BR CAP_SETFCAP
>> > +over the writer's own user namespace.
>>
>> I think that the following would be clearer (but technically
>> equivalent): "Hence the writer only requires CAP_SETFCAP over the file
>> inode, meaning that the writing task must have CAP_SETFCAP in its own
>> user namespace and the UID and GID of the file inode must be mapped in
>> the writing task's user namespace.".
>
> Looks good to me.
>
>> > +A VFS_CAP_REVISION_3 file capability will take effect only when run in a user namespace
>> > +whose UID 0 maps to the saved "nsroot", or a descendant of such a namespace.
>> > +.PP
>> > +Users with the required privilege may use
>> > +.BR setxattr(2)
>> > +to request either a VFS_CAP_REVISION_2 or VFS_CAP_REVISION_3 write.
>> > +The kernel will automatically convert a VFS_CAP_REVISION_2 to a
>> > +VFS_CAP_REVISION_3 extended attribute with the "nsroot"
>> > +set to the root user in the writer's user namespace, or, if a VFS_CAP_REVISION_3
>> > +extended attribute is specified, then the kernel will map the
>> > +specified root user ID (which must be a valid user ID mapped in the caller's
>> > +user namespace) into the initial user namespace.
>>
>> Really, "into the initial user namespace"? That may be true for the
>> kernel-internal representation, but the on-disk representation is the
>> mapping into the user namespace that contains the mount namespace into
>> which the file system was mounted, right?
>
> Ah, yes, it is.
>
>> This would become observable
>> when a file system is mounted in a different namespace than before, or
>> when working with FUSE in a namespace.
>
> Yes it would.
>
> Michael, you said you were reworking it, do you mind working this into
> it as well?
Yes, I'll do that. It may be a couple of weeks before I get some more
cycles for this, however.
Thanks,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
^ permalink raw reply [flat|nested] 22+ messages in thread
* [manpages PATCH] capabilities.7: describe namespaced file capabilities
2018-01-16 17:26 ` Jann Horn
(?)
(?)
@ 2018-04-13 19:26 ` Michael Kerrisk (man-pages)
2018-04-16 14:10 ` Jann Horn
2018-04-20 0:04 ` Serge E. Hallyn
-1 siblings, 2 replies; 22+ messages in thread
From: Michael Kerrisk (man-pages) @ 2018-04-13 19:26 UTC (permalink / raw)
To: linux-security-module
Hello Serge, Jann,
On 01/16/2018 06:26 PM, Jann Horn wrote:
> On Tue, Jan 9, 2018 at 7:52 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
>> Update the capabilities(7) manpage with a description of the
>> new-ish namespaced file capability support.
>>
>> A note on userspace tools: since the kernel will automatically
>> convert between v2 and v3 xattrs, and translate nsroot between
>> v3 xattrs, we can make do with the current getcap(8) and setcap(8)
>> tools. I.e. a user on the host can create a transient user namespace
>> with the appropriate mappings and run setcap(8) there. The kernel
>> will automatically write a v3 xattr with the transient namespace's
>> root user as nsroot.
After a long gap, I have come back to the task of working up
some text to describe file capability versioning and namespaced file
capabilities.
I'm still not convinced I've captured things correctly, and I still
have a few questions (see below). But first, here's the text that
I have so far (suggestions for improvements welcome). These changes
have already been pushed to the Git repo.
File capability mask versioning
To allow extensibility, the kernel supports a scheme to encode
a version number inside the security.capability extended
attribute that is used to implement file capabilities. These
version numbers are internal to the implementation, and not
directly visible to user-space applications. To date, the following
versions are supported:
VFS_CAP_REVISION_1
This was the original file capability implementation,
which supported 32-bit masks for file capabilities.
VFS_CAP_REVISION_2 (since Linux 2.6.25)
This version allows for file capability masks that are
64 bits in size, and was necessary as the number of supported
capabilities grew beyond 32. The kernel transparently
continues to support the execution of files
that have 32-bit version 1 capability masks, but when
adding capabilities to files that did not previously
have capabilities, or modifying the capabilities of
existing files, it automatically uses the version 2
scheme (or possibly the version 3 scheme, as described
below).
VFS_CAP_REVISION_3 (since Linux 4.14)
Version 3 file capabilities are provided to support
namespaced file capabilities (described below).
As with version 2 file capabilities, version 3 capability
masks are 64 bits in size. But in addition, the
root user ID of the namespace is encoded in the security.capability
extended attribute. (A namespace's root
user ID is the value that user ID 0 inside that namespace
maps to in the initial user namespace.)
["namespace root user ID" is my term for what Serge called nsroot.
I think it's a little more meaningful, but I am also open to suggestions
for a better term.]
Version 3 file capabilities are designed to coexist with
version 2 capabilities; that is, on a modern Linux system,
there may be some files with version 2 capabilities
while others have version 3 capabilities.
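For readers who want to see how the three versions differ in the raw attribute value, here is a minimal Python sketch; it is not part of the proposed man page text. The constants and field order follow the kernel's include/uapi/linux/capability.h (struct vfs_cap_data and struct vfs_ns_cap_data). The function name parse_security_capability and the dictionary it returns are made up for illustration.

```python
import struct

# Constants as defined in include/uapi/linux/capability.h.
VFS_CAP_REVISION_MASK = 0xFF000000
VFS_CAP_FLAGS_EFFECTIVE = 0x000001
VFS_CAP_REVISION_1 = 0x01000000
VFS_CAP_REVISION_2 = 0x02000000
VFS_CAP_REVISION_3 = 0x03000000

# Number of 32-bit words per capability set, keyed by revision.
_WORDS = {VFS_CAP_REVISION_1: 1, VFS_CAP_REVISION_2: 2, VFS_CAP_REVISION_3: 2}


def parse_security_capability(blob):
    """Decode a raw security.capability value (hypothetical helper)."""
    (magic_etc,) = struct.unpack_from("<I", blob, 0)
    revision = magic_etc & VFS_CAP_REVISION_MASK
    words = _WORDS.get(revision)
    if words is None:
        raise ValueError("unknown revision 0x%08x" % revision)

    is_v3 = revision == VFS_CAP_REVISION_3
    # Header word, then (permitted, inheritable) pairs, then rootid for v3 only.
    expected = 4 + 8 * words + (4 if is_v3 else 0)
    if len(blob) != expected:
        raise ValueError("expected %d bytes, got %d" % (expected, len(blob)))

    permitted = inheritable = 0
    for i in range(words):
        low_p, low_i = struct.unpack_from("<II", blob, 4 + 8 * i)
        permitted |= low_p << (32 * i)
        inheritable |= low_i << (32 * i)

    info = {
        "version": revision >> 24,
        "effective": bool(magic_etc & VFS_CAP_FLAGS_EFFECTIVE),
        "permitted": permitted,
        "inheritable": inheritable,
    }
    if is_v3:
        # Only version 3 carries the namespace root user ID.
        (info["rootid"],) = struct.unpack_from("<I", blob, 4 + 8 * words)
    return info
```

On Linux the bytes could come from os.getxattr(path, "security.capability"). Note that, going by Serge's patch text, the kernel may convert what getxattr(2) returns depending on the caller's user namespace, so the value a process sees is not necessarily the on-disk form.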
Before Linux 4.14, the only kind of capability mask that could
be attached to a file was a VFS_CAP_REVISION_2 mask. Since
Linux 4.14, the version of the capability mask that is attached
to a file depends on the circumstances in which the security.capability
extended attribute was created.
Starting with Linux 4.14, a security.capability extended
attribute is automatically created as (or converted to) a version 3
(VFS_CAP_REVISION_3) attribute if both of the following
are true:
(1) The thread writing the attribute resides in a noninitial
namespace. (More precisely: the thread resides in a user
namespace other than the one from which the underlying
filesystem was mounted.)
(2) The thread has the CAP_SETFCAP capability over the file
inode, meaning that (a) the thread has the CAP_SETFCAP
capability in its own user namespace; and (b) the UID and
GID of the file inode have mappings in the writer's user
namespace.
FIXME: Does there also need to be some kind of credential
match between the file and the namespace creator UID?
When a VFS_CAP_REVISION_3 security.capability extended
attribute is created, the root user ID of the creating thread's
user namespace is saved in the extended attribute.
By contrast, creating a security.capability extended attribute
from a privileged (CAP_SETFCAP) thread that resides in the
namespace where the underlying filesystem was mounted (this
normally means the initial user namespace) automatically
results in a version 2 (VFS_CAP_REVISION_2) attribute.
Note that a file can have either a version 2 or a version 3
security.capability extended attribute associated with it, but
not both: creation or modification of the security.capability
extended attribute will automatically modify the version
according to the circumstances in which the extended attribute
is created or modified.
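The selection rule in the draft text above can be restated as a small decision function. This is only a toy model of the wording in this mail, written in Python; it is not kernel code, the function and parameter names are invented, and it deliberately ignores the open FIXME question about a credential match.

```python
def xattr_version_for_write(in_fs_userns, has_setfcap, uid_mapped, gid_mapped):
    """Return 2 or 3, or None where the draft describes no successful write.

    in_fs_userns: writer resides in the user namespace from which the
                  filesystem was mounted (normally the initial one).
    has_setfcap:  writer has CAP_SETFCAP in its own user namespace.
    uid_mapped / gid_mapped: the inode's UID and GID have mappings in
                  the writer's user namespace.
    """
    if in_fs_userns:
        # Privileged writer in the mounting namespace: version 2.
        return 2 if has_setfcap else None
    # Condition (2) of the draft: CAP_SETFCAP "over the file inode".
    cap_over_inode = has_setfcap and uid_mapped and gid_mapped
    return 3 if cap_over_inode else None
```

Returning None for the remaining cases is an assumption made here for the sake of the example; the draft does not say what happens in them.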
[...]
Namespaced file capabilities
Traditional (i.e., version 2) file capabilities associate only
a set of capability masks with a binary executable file. When
a process executes a binary with such capabilities, it gains
the associated capabilities (within its user namespace) as per
the rules described above in "Transformation of capabilities
during execve()".
Because version 2 file capabilities confer capabilities to the
executing process regardless of which user namespace it resides
in, only privileged processes are permitted to associate capabilities
with a file. Here, "privileged" means a process that
has the CAP_SETFCAP capability in the user namespace where the
filesystem was mounted (normally the initial user namespace).
This limitation renders file capabilities useless for certain
use cases. For example, in user-namespaced containers, it can
be desirable to be able to create a binary that confers capabilities
only to processes executed inside that container, but
not to processes that are executed outside the container.
Linux 4.14 added so-called namespaced file capabilities to support
such use cases. Namespaced file capabilities are recorded
as version 3 (i.e., VFS_CAP_REVISION_3) security.capability
extended attributes. Such an attribute is automatically created
when a process that resides in a noninitial user namespace
associates (setxattr(2)) file capabilities with a file whose
user ID matches the user ID of the creator of the namespace.
In this case, the kernel records not just the capability masks
in the extended attribute, but also the namespace root user ID.
For further details, see File capability mask versioning,
above.
As with a binary that has VFS_CAP_REVISION_2 file capabilities,
a binary with VFS_CAP_REVISION_3 file capabilities confers
capabilities to a process during execve(). However, capabilities
are conferred only if the binary is executed by a process
that resides in a user namespace whose UID 0 maps to the root
user ID that is saved in the extended attribute, or when executed
by a process that resides in a descendant of such a namespace.
The following is Serge's original patch, with some questions
from me.
>> Signed-off-by: Serge Hallyn <shallyn@cisco.com>
>> ---
>> man7/capabilities.7 | 44 ++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 44 insertions(+)
>>
>> diff --git a/man7/capabilities.7 b/man7/capabilities.7
>> index 166eaaf..76e7e02 100644
>> --- a/man7/capabilities.7
>> +++ b/man7/capabilities.7
>> @@ -936,6 +936,50 @@ if we specify the effective flag as being enabled for any capability,
>> then the effective flag must also be specified as enabled
>> for all other capabilities for which the corresponding permitted or
>> inheritable flags is enabled.
>> +.PP
>> +Until 4.13, only VFS_CAP_REVISION_2 xattrs were supported. These store only
>> +the capabilities to be applied to the file, with no record of the writer's
>> +credentials. Therefore only privileged users can be trusted to write them, and
>> +.BR CAP_SETFCAP
>> +over the user namespace which mounted the filesystem (usually the initial user
>> +namespace) is required. This makes it impossible to write file capabilities
>> +from a user namespaced container, which causes some package updates to fail.
>> +.PP
>> +In order to support setting file capabilities in containers, the
>> +kernel must be able to identify whether the task executing the
>> +file will be constrained to a subset of the resources over which
>> +the writer of the file capabilities has privilege. To this end,
>> +since 4.13, VFS_CAP_REVISION_3 capabilities store the user ID
>> +of the root user in the writer's namespace ("nsroot"). Hence the writer only
>> +requires
>> +.IP 1.
>> +.BR CAP_SETFCAP
>> +over the file inode, meaning the writing task must have
>> +.BR CAP_SETFCAP
>> +over a user namespace into which the inode's owning user ID is mapped.
>> +.PP
>> +and
>> +.IP 2.
>> +.BR CAP_SETFCAP
>> +over the writer's own user namespace.
>
> I think that the following would be clearer (but technically
> equivalent): "Hence the writer only requires CAP_SETFCAP over the file
> inode, meaning that the writing task must have CAP_SETFCAP in its own
> user namespace and the UID and GID of the file inode must be mapped in
> the writing task's user namespace.".
I've tried to capture that idea in my text above. Was I successful?
>> +A VFS_CAP_REVISION_3 file capability will take effect only when run in a user namespace
>> +whose UID 0 maps to the saved "nsroot", or a descendant of such a namespace.
>> +.PP
>> +Users with the required privilege may use
>> +.BR setxattr(2)
>> +to request either a VFS_CAP_REVISION_2 or VFS_CAP_REVISION_3 write.
>> +The kernel will automatically convert a VFS_CAP_REVISION_2 to a
>> +VFS_CAP_REVISION_3 extended attribute with the "nsroot"
>> +set to the root user in the writer's user namespace, or, if a VFS_CAP_REVISION_3
>> +extended attribute is specified, then the kernel will map the
>> +specified root user ID (which must be a valid user ID mapped in the caller's
>> +user namespace) into the initial user namespace.
>
> Really, "into the initial user namespace"? That may be true for the
> kernel-internal representation, but the on-disk representation is the
> mapping into the user namespace that contains the mount namespace into
> which the file system was mounted, right? This would become observable
> when a file system is mounted in a different namespace than before, or
> when working with FUSE in a namespace.
>
>> Likewise,
>> +.BR getxattr(2)
>> +results will be converted and simplified to show a VFS_CAP_REVISION_2
>> +extended attribute, if a VFS_CAP_REVISION_3 applies to the caller's
>> +namespace, or to map the VFS_CAP_REVISION_3 root user ID into the
>> +caller's namespace.
I haven't captured that last paragraph in my text. I'm not sure I
understand the idea being presented. Serge, could you elaborate?
Thanks,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
--
To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
the body of a message to majordomo at vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 22+ messages in thread
* [manpages PATCH] capabilities.7: describe namespaced file capabilities
2018-01-16 17:38 ` Serge E. Hallyn
(?)
(?)
@ 2018-04-13 19:29 ` Michael Kerrisk (man-pages)
2018-04-15 19:22 ` Serge E. Hallyn
-1 siblings, 1 reply; 22+ messages in thread
From: Michael Kerrisk (man-pages) @ 2018-04-13 19:29 UTC (permalink / raw)
To: linux-security-module
On 01/16/2018 06:38 PM, Serge E. Hallyn wrote:
> Quoting Jann Horn (jannh at google.com):
>> On Tue, Jan 9, 2018 at 7:52 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
[...]
>>> +A VFS_CAP_REVISION_3 file capability will take effect only when run in a user namespace
>>> +whose UID 0 maps to the saved "nsroot", or a descendant of such a namespace.
>>> +.PP
>>> +Users with the required privilege may use
>>> +.BR setxattr(2)
>>> +to request either a VFS_CAP_REVISION_2 or VFS_CAP_REVISION_3 write.
>>> +The kernel will automatically convert a VFS_CAP_REVISION_2 to a
>>> +VFS_CAP_REVISION_3 extended attribute with the "nsroot"
>>> +set to the root user in the writer's user namespace, or, if a VFS_CAP_REVISION_3
>>> +extended attribute is specified, then the kernel will map the
>>> +specified root user ID (which must be a valid user ID mapped in the caller's
>>> +user namespace) into the initial user namespace.
>>
>> Really, "into the initial user namespace"? That may be true for the
>> kernel-internal representation, but the on-disk representation is the
>> mapping into the user namespace that contains the mount namespace into
>> which the file system was mounted, right?
>
> Ah, yes, it is.
>
>> This would become observable
>> when a file system is mounted in a different namespace than before, or
>> when working with FUSE in a namespace.
>
> Yes it would.
>
> Michael, you said you were reworking it, do you mind working this into
> it as well?
So, I must confess that I don't really understand this piece of the
conversation--neither Jann's comments nor Serge's response (Serge, are
you saying Jann is right or wrong in his comments?). Perhaps this can
be clarified as a response to the man page text in the other mail I
just sent?
Cheers,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
^ permalink raw reply [flat|nested] 22+ messages in thread
* [manpages PATCH] capabilities.7: describe namespaced file capabilities
2018-04-13 19:29 ` Michael Kerrisk (man-pages)
@ 2018-04-15 19:22 ` Serge E. Hallyn
2018-04-22 16:46 ` Michael Kerrisk (man-pages)
0 siblings, 1 reply; 22+ messages in thread
From: Serge E. Hallyn @ 2018-04-15 19:22 UTC (permalink / raw)
To: linux-security-module
Quoting Michael Kerrisk (man-pages) (mtk.manpages at gmail.com):
> On 01/16/2018 06:38 PM, Serge E. Hallyn wrote:
> > Quoting Jann Horn (jannh at google.com):
> >> On Tue, Jan 9, 2018 at 7:52 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
>
> [...]
>
> >>> +A VFS_CAP_REVISION_3 file capability will take effect only when run in a user namespace
> >>> +whose UID 0 maps to the saved "nsroot", or a descendant of such a namespace.
> >>> +.PP
> >>> +Users with the required privilege may use
> >>> +.BR setxattr(2)
> >>> +to request either a VFS_CAP_REVISION_2 or VFS_CAP_REVISION_3 write.
> >>> +The kernel will automatically convert a VFS_CAP_REVISION_2 to a
> >>> +VFS_CAP_REVISION_3 extended attribute with the "nsroot"
> >>> +set to the root user in the writer's user namespace, or, if a VFS_CAP_REVISION_3
> >>> +extended attribute is specified, then the kernel will map the
> >>> +specified root user ID (which must be a valid user ID mapped in the caller's
> >>> +user namespace) into the initial user namespace.
> >>
> >> Really, "into the initial user namespace"? That may be true for the
> >> kernel-internal representation, but the on-disk representation is the
> >> mapping into the user namespace that contains the mount namespace into
> >> which the file system was mounted, right?
> >
> > Ah, yes, it is.
> >
> >> This would become observable
> >> when a file system is mounted in a different namespace than before, or
> >> when working with FUSE in a namespace.
> >
> > Yes it would.
> >
> > Michael, you said you were reworking it, do you mind working this into
> > it as well?
>
> So, I must confess that I don't really understand this piece of the
> conversation--neither Jann's comments nor Serge's response (Serge, are
> you saying Jann is right or wrong in his comments?). Perhaps this can
He's right. The point is that if a filesystem is mounted by a user in
a non-init user namespace, then the kernel will map the specified root user ID
into sb->sb_user_ns, not &init_user_ns.
> be clarified as a response to the man page text in the other mail I
> just sent?
Yes, I'll try to do that.
thanks,
Serge
^ permalink raw reply [flat|nested] 22+ messages in thread
* [manpages PATCH] capabilities.7: describe namespaced file capabilities
2018-04-13 19:26 ` Michael Kerrisk (man-pages)
@ 2018-04-16 14:10 ` Jann Horn
2018-04-19 23:57 ` Serge E. Hallyn
2018-05-04 15:10 ` Michael Kerrisk (man-pages)
2018-04-20 0:04 ` Serge E. Hallyn
1 sibling, 2 replies; 22+ messages in thread
From: Jann Horn @ 2018-04-16 14:10 UTC (permalink / raw)
To: linux-security-module
On Fri, Apr 13, 2018 at 9:26 PM, Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
> Hello Serge, Jann,
>
> On 01/16/2018 06:26 PM, Jann Horn wrote:
>> On Tue, Jan 9, 2018 at 7:52 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
[...]
> Starting with Linux 4.14, a security.capability extended
> attribute is automatically created as (or converted to) a version
> 3 (VFS_CAP_REVISION_3) attribute if both of the following
> are true:
>
> (1) The thread writing the attribute resides in a noninitial
> namespace.
I'm not entirely happy with this - while under most circumstances
(especially nowadays) correct, isn't this going to confuse readers who
want to understand the actual rules?
> (More precisely: the thread resides in a user
> namespace other than the one from which the underlying
> filesystem was mounted.)
I think if you're in a parent namespace of the user namespace that
mounted the filesystem, you actually can write a VFS_CAP_REVISION_2
attribute?
> (2) The thread has the CAP_SETFCAP capability over the file
> inode, meaning that (a) the thread has the CAP_SETFCAP
> capability in its own user namespace; and (b) the UID and
> GID of the file inode have mappings in the writer's user
> namespace.
> +-----------------------------------------------------+
> |FIXME                                                |
> +-----------------------------------------------------+
> |Does there also need to be some kind of credential   |
> |match between the file and the namespace creator     |
> |UID?                                                 |
> +-----------------------------------------------------+
The namespace creator UID (iow, the namespace owner) is irrelevant
here; the capability model is somewhat inconsistent here. Normal
capability checks that go down to cap_capable() (like ns_capable())
grant all privileges to processes in parent namespaces that have an
EUID that matches the owner UID of one of the intermediate namespaces,
including the target namespace. But capable_wrt_inode_uidgid() always
requires the caller to have the specified capability in its own
namespace because, when operating on an inode, the concept of an
implicit "target namespace" doesn't really exist. (For a properly
consistent model, you'd probably need to let the caller explicitly
specify the target namespace, but then that would somewhat break the
transparency of namespaces.) cap_convert_nscap() starts by checking
for capable_wrt_inode_uidgid().
[...]
> As with a binary that has VFS_CAP_REVISION_2 file capabilities,
> a binary with VFS_CAP_REVISION_3 file capabilities confers
> capabilities to a process during execve(). However, capabilities
> are conferred only if the binary is executed by a process
> that resides in a user namespace whose UID 0 maps to the root
> user ID that is saved in the extended attribute, or when executed
> by a process that resides in descendant of such a namespace.
Nit: "in a descendant"?
[...]
>>> Likewise,
>>> +.BR getxattr(2)
>>> +results will be converted and simplified to show a VFS_CAP_REVISION_2
>>> +extended attribute, if a VFS_CAP_REVISION_3 applies to the caller's
>>> +namespace, or to map the VFS_CAP_REVISION_3 root user ID into the
>>> +caller's namespace.
>
> I haven't captured that last paragraph in my text. I'm not sure I
> understand the idea being presented. Serge, could you elaborate?
Summary: When you read a capability attribute with getxattr(), the
kernel will rewrite the returned value such that it looks the way it
would have to look if the filesystem was mounted in your user
namespace; just like how, when the attribute is written, the caller
provides an attribute value written as if the filesystem was mounted
in the caller's user namespace.
Conceptually, this is mostly the same as the UID conversions applied
by chown() and stat().
^ permalink raw reply [flat|nested] 22+ messages in thread
* [manpages PATCH] capabilities.7: describe namespaced file capabilities
2018-04-16 14:10 ` Jann Horn
@ 2018-04-19 23:57 ` Serge E. Hallyn
2018-05-04 15:10 ` Michael Kerrisk (man-pages)
1 sibling, 0 replies; 22+ messages in thread
From: Serge E. Hallyn @ 2018-04-19 23:57 UTC (permalink / raw)
To: linux-security-module
Quoting Jann Horn (jannh at google.com):
> On Fri, Apr 13, 2018 at 9:26 PM, Michael Kerrisk (man-pages)
> <mtk.manpages@gmail.com> wrote:
> > Hello Serge, Jann,
> [...]
> >>> Likewise,
> >>> +.BR getxattr(2)
> >>> +results will be converted and simplified to show a VFS_CAP_REVISION_2
> >>> +extended attribute, if a VFS_CAP_REVISION_3 applies to the caller's
> >>> +namespace, or to map the VFS_CAP_REVISION_3 root user ID into the
> >>> +caller's namespace.
> >
> > I haven't captured that last paragraph in my text. I'm not sure I
> > understand the idea being presented. Serge, could you elaborate?
>
> Summary: When you read a capability attribute with getxattr(), the
> kernel will rewrite the returned value such that it looks the way it
> would have to look if the filesystem was mounted in your user
> namespace; just like how, when the attribute is written, the caller
> provides an attribute value written as if the filesystem was mounted
> in the caller's user namespace.
> Conceptually, this is mostly the same as the UID conversions applied
> by chown() and stat().
Right. If it is a V3, and the .rootid maps to a valid uid in your
namespace besides 0, then .rootid will be mapped to the valid user in your
namespace; if it is 0, then a V2 capability xattr will be presented.
If the real xattr is a V2, then a V2 is presented.
^ permalink raw reply [flat|nested] 22+ messages in thread
* [manpages PATCH] capabilities.7: describe namespaced file capabilities
2018-04-13 19:26 ` Michael Kerrisk (man-pages)
2018-04-16 14:10 ` Jann Horn
@ 2018-04-20 0:04 ` Serge E. Hallyn
1 sibling, 0 replies; 22+ messages in thread
From: Serge E. Hallyn @ 2018-04-20 0:04 UTC (permalink / raw)
To: linux-security-module
Quoting Michael Kerrisk (man-pages) (mtk.manpages at gmail.com):
> Hello Serge, Jann,
>
> On 01/16/2018 06:26 PM, Jann Horn wrote:
> > On Tue, Jan 9, 2018 at 7:52 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> >> Update the capabilities(7) manpage with a description of the
> >> new-ish namespaced file capability support.
> >>
> >> A note on userspace tools: since the kernel will automatically
> >> convert between v2 and v3 xattrs, and translate nsroot between
> >> v3 xattrs, we can make do with the current getcap(8) and setcap(8)
> >> tools. I.e. a user on the host can create a transient user namespace
> >> with the appropriate mappings and run setcap(8) there. The kernel
> >> will automatically write a v3 xattr with the transient namespace's
> >> root user as nsroot.
>
> After a long gap, I have come back to the task of working up
> some text to describe file capability versioning and namespaced file
> capabilities.
>
> I am still not convinced I've captured things correctly, and I still
> have a few questions (see below). But first, here's the text that
> I have so far (suggestions for improvements welcome). These changes
> have already been pushed to the Git repo.
>
> File capability mask versioning
> To allow extensibility, the kernel supports a scheme to encode
> a version number inside the security.capability extended
> attribute that is used to implement file capabilities. These
> version numbers are internal to the implementation, and not
> directly visible to user-space applications. To date, the following
> versions are supported:
>
> VFS_CAP_REVISION_1
> This was the original file capability implementation,
> which supported 32-bit masks for file capabilities.
>
> VFS_CAP_REVISION_2 (since Linux 2.6.25)
> This version allows for file capability masks that are
> 64 bits in size, and was necessary as the number of supported
> capabilities grew beyond 32. The kernel transparently
> continues to support the execution of files
> that have 32-bit version 1 capability masks, but when
> adding capabilities to files that did not previously
> have capabilities, or modifying the capabilities of
> existing files, it automatically uses the version 2
> scheme (or possibly the version 3 scheme, as described
> below).
>
> VFS_CAP_REVISION_3 (since Linux 4.14)
> Version 3 file capabilities are provided to support
> namespaced file capabilities (described below).
>
> As with version 2 file capabilities, version 3 capability
> masks are 64 bits in size. But in addition, the
> root user ID of the namespace is encoded in the security.capability
> extended attribute. (A namespace's root
> user ID is the value that user ID 0 inside that namespace
> maps to in the initial user namespace.)
>
> ["namespace root user ID" is my term for what Serge called nsroot.
> I think it's a little more meaningful, but I am also open to suggestions
> for a better term.]
"mapped root ID" maybe?
>
> Version 3 file capabilities are designed to coexist with
> version 2 capabilities; that is, on a modern Linux system,
> there may be some files with version 2 capabilities
> while others have version 3 capabilities.
>
> Before Linux 4.14, the only kind of capability mask that could
> be attached to a file was a VFS_CAP_REVISION_2 mask. Since
> Linux 4.14, the version of the capability mask that is attached
> to a file depends on the circumstances in which the
> security.capability extended attribute was created.
>
> Starting with Linux 4.14, a security.capability extended
> attribute is automatically created as (or converted to) a version
> 3 (VFS_CAP_REVISION_3) attribute if both of the following
> are true:
>
> (1) The thread writing the attribute resides in a noninitial
> namespace. (More precisely: the thread resides in a user
> namespace other than the one from which the underlying
> filesystem was mounted.)
>
> (2) The thread has the CAP_SETFCAP capability over the file
> inode, meaning that (a) the thread has the CAP_SETFCAP
> capability in its own user namespace; and (b) the UID and
> GID of the file inode have mappings in the writer's user
> namespace.
>
> +-----------------------------------------------------+
> |FIXME                                                |
> +-----------------------------------------------------+
> |Does there also need to be some kind of credential   |
> |match between the file and the namespace creator     |
> |UID?                                                 |
> +-----------------------------------------------------+
>
> When a VFS_CAP_REVISION_3 security.capability extended
> attribute is created, the root user ID of the creating thread's
Importantly, that is only when a V3 is *automatically* created to replace
a V2. When a V3 is written, then the .rootid in the V3 is (mapped and)
written as specified.
For instance, root in a namespace can write a V3 xattr that only holds true
in a child namespace where its uid 100k (which could be 200k in the initial
userns) is mapped to root.
^ permalink raw reply [flat|nested] 22+ messages in thread
* [manpages PATCH] capabilities.7: describe namespaced file capabilities
2018-04-15 19:22 ` Serge E. Hallyn
@ 2018-04-22 16:46 ` Michael Kerrisk (man-pages)
2018-04-23 17:57 ` Serge E. Hallyn
2018-04-24 15:13 ` Eric W. Biederman
0 siblings, 2 replies; 22+ messages in thread
From: Michael Kerrisk (man-pages) @ 2018-04-22 16:46 UTC (permalink / raw)
To: linux-security-module
On 04/15/2018 09:22 PM, Serge E. Hallyn wrote:
> Quoting Michael Kerrisk (man-pages) (mtk.manpages at gmail.com):
>> On 01/16/2018 06:38 PM, Serge E. Hallyn wrote:
>>> Quoting Jann Horn (jannh at google.com):
>>>> On Tue, Jan 9, 2018 at 7:52 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
>>
>> [...]
>>
>>>>> +A VFS_CAP_REVISION_3 file capability will take effect only when run in a user namespace
>>>>> +whose UID 0 maps to the saved "nsroot", or a descendant of such a namespace.
>>>>> +.PP
>>>>> +Users with the required privilege may use
>>>>> +.BR setxattr(2)
>>>>> +to request either a VFS_CAP_REVISION_2 or VFS_CAP_REVISION_3 write.
>>>>> +The kernel will automatically convert a VFS_CAP_REVISION_2 to a
>>>>> +VFS_CAP_REVISION_3 extended attribute with the "nsroot"
>>>>> +set to the root user in the writer's user namespace, or, if a VFS_CAP_REVISION_3
>>>>> +extended attribute is specified, then the kernel will map the
>>>>> +specified root user ID (which must be a valid user ID mapped in the caller's
>>>>> +user namespace) into the initial user namespace.
>>>>
>>>> Really, "into the initial user namespace"? That may be true for the
>>>> kernel-internal representation, but the on-disk representation is the
>>>> mapping into the user namespace that contains the mount namespace into
>>>> which the file system was mounted, right?
>>>
>>> Ah, yes, it is.
>>>
>>>> This would become observable
>>>> when a file system is mounted in a different namespace than before, or
>>>> when working with FUSE in a namespace.
>>>
>>> Yes it would.
>>>
>>> Michael, you said you were reworking it, do you mind working this into
>>> it as well?
>>
>> So, I must confess that I don't really understand this piece of the
>> conversation--neither Jann's comments nor Serge's response (Serge, are
>> you saying Jann is right or wrong in his comments?). Perhaps this can
>
> He's right. The point is that if a filesystem is mounted by a user in
> a non-init user namespace, then the kernel will map the specified root user ID
> into sb->sb_user_ns, not &init_user_ns.
>
>> be clarified as a response to the man page text in the other mail I
>> just sent?
>
> Yes, I'll try to do that.
So, I think that I am possibly missing some background knowledge here.
Here, it sounds to me like you are talking about mounting a block
filesystem in a non-initial user namespace. (Have I misunderstood?)
But, as I understood it, it is not possible to mount a physical
block-based filesystem from a non-init user namespace. Is that not
correct? The only types of filesystems that I'm aware of that can be
mounted are those listed in user_namespaces(7):
Holding CAP_SYS_ADMIN within the user namespace associated with a
process's mount namespace allows that process to create bind
mounts and mount the following types of filesystems:
* /proc (since Linux 3.8)
* /sys (since Linux 3.8)
* devpts (since Linux 3.9)
* tmpfs(5) (since Linux 3.9)
* ramfs (since Linux 3.9)
* mqueue (since Linux 3.9)
* bpf (since Linux 4.4)
Holding CAP_SYS_ADMIN within the user namespace associated with a
process's cgroup namespace allows (since Linux 4.6) that process
to mount the cgroup version 2 filesystem and cgroup version 1
named hierarchies (i.e., cgroup filesystems mounted with the
"none,name=" option).
Do I misunderstand something?
Thanks,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
^ permalink raw reply [flat|nested] 22+ messages in thread
* [manpages PATCH] capabilities.7: describe namespaced file capabilities
2018-04-22 16:46 ` Michael Kerrisk (man-pages)
@ 2018-04-23 17:57 ` Serge E. Hallyn
2018-04-24 15:13 ` Eric W. Biederman
1 sibling, 0 replies; 22+ messages in thread
From: Serge E. Hallyn @ 2018-04-23 17:57 UTC (permalink / raw)
To: linux-security-module
Quoting Michael Kerrisk (man-pages) (mtk.manpages at gmail.com):
> On 04/15/2018 09:22 PM, Serge E. Hallyn wrote:
> > Quoting Michael Kerrisk (man-pages) (mtk.manpages at gmail.com):
> >> On 01/16/2018 06:38 PM, Serge E. Hallyn wrote:
> >>> Quoting Jann Horn (jannh at google.com):
> >>>> On Tue, Jan 9, 2018 at 7:52 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> >>
> >> [...]
> >>
> >>>>> +A VFS_CAP_REVISION_3 file capability will take effect only when run in a user namespace
> >>>>> +whose UID 0 maps to the saved "nsroot", or a descendant of such a namespace.
> >>>>> +.PP
> >>>>> +Users with the required privilege may use
> >>>>> +.BR setxattr(2)
> >>>>> +to request either a VFS_CAP_REVISION_2 or VFS_CAP_REVISION_3 write.
> >>>>> +The kernel will automatically convert a VFS_CAP_REVISION_2 to a
> >>>>> +VFS_CAP_REVISION_3 extended attribute with the "nsroot"
> >>>>> +set to the root user in the writer's user namespace, or, if a VFS_CAP_REVISION_3
> >>>>> +extended attribute is specified, then the kernel will map the
> >>>>> +specified root user ID (which must be a valid user ID mapped in the caller's
> >>>>> +user namespace) into the initial user namespace.
> >>>>
> >>>> Really, "into the initial user namespace"? That may be true for the
> >>>> kernel-internal representation, but the on-disk representation is the
> >>>> mapping into the user namespace that contains the mount namespace into
> >>>> which the file system was mounted, right?
> >>>
> >>> Ah, yes, it is.
> >>>
> >>>> This would become observable
> >>>> when a file system is mounted in a different namespace than before, or
> >>>> when working with FUSE in a namespace.
> >>>
> >>> Yes it would.
> >>>
> >>> Michael, you said you were reworking it, do you mind working this into
> >>> it as well?
> >>
> >> So, I must confess that I don't really understand this piece of the
> >> conversation--neither Jann's comments nor Serge's response (Serge, are
> >> you saying Jann is right or wrong in his comments?). Perhaps this can
> >
> > He's right. The point is that if a filesystem is mounted by a user in
> > a non-init user namespace, then the kernel will map the specified root user ID
> > into sb->sb_user_ns, not &init_user_ns.
> >
> >> be clarified as a response to the man page text in the other mail I
> >> just sent?
> >
> > Yes, I'll try to do that.
>
> So, I think that I am possibly missing some background knowledge here.
> Here, it sounds to me like you are talking about mounting a block
> filesystem in a non-initial user namespace. (Have I misunderstood?)
Correct,
> But, as I understood it, it is not possible to mount a physical
> block-based filesystem from a non-init user namespace. Is that not
> correct? The only types of filesystems that I'm aware of that can be
> mounted are those listed in user_namespaces(7):
>
> Holding CAP_SYS_ADMIN within the user namespace associated with a
> process's mount namespace allows that process to create bind
> mounts and mount the following types of filesystems:
>
> * /proc (since Linux 3.8)
> * /sys (since Linux 3.8)
> * devpts (since Linux 3.9)
> * tmpfs(5) (since Linux 3.9)
> * ramfs (since Linux 3.9)
> * mqueue (since Linux 3.9)
> * bpf (since Linux 4.4)
>
> Holding CAP_SYS_ADMIN within the user namespace associated with a
> process's cgroup namespace allows (since Linux 4.6) that process
> to mount the cgroup version 2 filesystem and cgroup version 1
> named hierarchies (i.e., cgroup filesystems mounted with the
> "none,name=" option).
>
> Do I misunderstand something?
The work is under way to make it possible to mount fuse filesystems
from a non-initial user namespace, and those patches are already
enabled in the default Ubuntu kernel. That's where this becomes
relevant.
thanks,
-serge
^ permalink raw reply [flat|nested] 22+ messages in thread
* [manpages PATCH] capabilities.7: describe namespaced file capabilities
2018-04-22 16:46 ` Michael Kerrisk (man-pages)
2018-04-23 17:57 ` Serge E. Hallyn
@ 2018-04-24 15:13 ` Eric W. Biederman
1 sibling, 0 replies; 22+ messages in thread
From: Eric W. Biederman @ 2018-04-24 15:13 UTC (permalink / raw)
To: linux-security-module
"Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:
> On 04/15/2018 09:22 PM, Serge E. Hallyn wrote:
>> Quoting Michael Kerrisk (man-pages) (mtk.manpages at gmail.com):
>>> On 01/16/2018 06:38 PM, Serge E. Hallyn wrote:
>>>> Quoting Jann Horn (jannh at google.com):
>>>>> On Tue, Jan 9, 2018 at 7:52 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
>>>
>>> [...]
>>>
>>>>>> +A VFS_CAP_REVISION_3 file capability will take effect only when run in a user namespace
>>>>>> +whose UID 0 maps to the saved "nsroot", or a descendant of such a namespace.
>>>>>> +.PP
>>>>>> +Users with the required privilege may use
>>>>>> +.BR setxattr(2)
>>>>>> +to request either a VFS_CAP_REVISION_2 or VFS_CAP_REVISION_3 write.
>>>>>> +The kernel will automatically convert a VFS_CAP_REVISION_2 to a
>>>>>> +VFS_CAP_REVISION_3 extended attribute with the "nsroot"
>>>>>> +set to the root user in the writer's user namespace, or, if a VFS_CAP_REVISION_3
>>>>>> +extended attribute is specified, then the kernel will map the
>>>>>> +specified root user ID (which must be a valid user ID mapped in the caller's
>>>>>> +user namespace) into the initial user namespace.
>>>>>
>>>>> Really, "into the initial user namespace"? That may be true for the
>>>>> kernel-internal representation, but the on-disk representation is the
>>>>> mapping into the user namespace that contains the mount namespace into
>>>>> which the file system was mounted, right?
>>>>
>>>> Ah, yes, it is.
>>>>
>>>>> This would become observable
>>>>> when a file system is mounted in a different namespace than before, or
>>>>> when working with FUSE in a namespace.
>>>>
>>>> Yes it would.
>>>>
>>>> Michael, you said you were reworking it, do you mind working this into
>>>> it as well?
>>>
>>> So, I must confess that I don't really understand this piece of the
>>> conversation--neither Jann's comments nor Serge's response (Serge, are
>>> you saying Jann is right or wrong in his comments?). Perhaps this can
>>
>> He's right. The point is that if a filesystem is mounted by a user in
>> a non-init user namespace, then the kernel will map the specified root user ID
>> into sb->s_user_ns, not &init_user_ns.
>>
>>> be clarified as a response to the man page text in the other mail I
>>> just sent?
>>
>> Yes, I'll try to do that.
>
> So, I think that I am possibly missing some background knowledge here.
> Here, it sounds to me like you are talking about mounting a block
> filesystem in a non-initial user namespace. (Have I misunderstood?)
A filesystem with backing store certainly.
> But, as I understood it, it is not possible to mount a physical
> block-based filesystem from a non-init user namespace. Is that not
> correct? The only types of filesystems that I'm aware of that can be
> mounted are those listed in user_namespaces(7):
With a little luck we will have completed the work to mount fuse
filesystems by the next merge window. Currently we are short roughly
two patches needed to make that safe.
There are fuse adaptors for just about everything. Further the design
of the vfs work is to allow block based filesystems.
Hardening a block based in-kernel filesystem to the point where it is
safe to allow it to be mounted is an entirely different matter. But
with the completion of the fuse work it becomes a filesystem by
filesystem question.
Network filesystems, which already need to be skeptical of their
networking peers, look like they will be less of a challenge, and we may
see those filesystems change first.
Eric
* [manpages PATCH] capabilities.7: describe namespaced file capabilities
2018-04-16 14:10 ` Jann Horn
2018-04-19 23:57 ` Serge E. Hallyn
@ 2018-05-04 15:10 ` Michael Kerrisk (man-pages)
1 sibling, 0 replies; 22+ messages in thread
From: Michael Kerrisk (man-pages) @ 2018-05-04 15:10 UTC (permalink / raw)
To: linux-security-module
Hello Jann,
Thanks for your comments. Sorry for the delayed follow-up...
On 04/16/2018 04:10 PM, Jann Horn wrote:
> On Fri, Apr 13, 2018 at 9:26 PM, Michael Kerrisk (man-pages)
> <mtk.manpages@gmail.com> wrote:
>> Hello Serge, Jann,
>>
>> On 01/16/2018 06:26 PM, Jann Horn wrote:
>>> On Tue, Jan 9, 2018 at 7:52 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> [...]
>> Starting with Linux 4.14, a security.capability extended
>> attribute is automatically created as (or converted to) a
>> version 3 (VFS_CAP_REVISION_3) attribute if both of the following
>> are true:
>>
>> (1) The thread writing the attribute resides in a noninitial
>> namespace.
>
> I'm not entirely happy with this - while under most circumstances
> (especially nowadays) correct, isn't this going to confuse readers who
> want to understand the actual rules?
So, you mean that the text should read more like the parenthesized
part that follows:
>> (More precisely: the thread resides in a user
>> namespace other than the one from which the underlying
>> filesystem was mounted.)
?
> I think if you're in a parent namespace of the user namespace that
> mounted the filesystem, you actually can write a VFS_CAP_REVISION_2
> attribute?
I'm not quite clear. Do you mean this as some correction to my text?
Let me see if I grasp your meaning:
(0) First of all, as things currently stand, filesystems can be
mounted only in the initial user NS (which has no parent). But,
this will change in the future, according to current work on FUSE.
Your comment here relates to that future. (Right?)
(1) You mean that a process in the parent user NS could write
(setxattr(2)) a VFS_CAP_REVISION_2 attribute, but what would
actually be recorded is a VFS_CAP_REVISION_3 attribute?
>> (2) The thread has the CAP_SETFCAP capability over the file
>> inode, meaning that (a) the thread has the CAP_SETFCAP
>> capability in its own user namespace; and (b) the UID and
>> GID of the file inode have mappings in the writer's user
>> namespace.
>
>
>> +-----------------------------------------------------+
>> |FIXME                                                |
>> +-----------------------------------------------------+
>> |Does there also need to be some kind of credential   |
>> |match between the file and the namespace creator     |
>> |UID?                                                 |
>> +-----------------------------------------------------+
>
> The namespace creator UID (iow, the namespace owner) is irrelevant
> here; the capability model is somewhat inconsistent here. Normal
> capability checks that go down to cap_capable() (like ns_capable())
> grant all privileges to processes in parent namespaces that have an
> EUID that matches the owner UID of one of the intermediate namespaces,
> including the target namespace. But capable_wrt_inode_uidgid() always
> requires the caller to have the specified capability in its own
> namespace because, when operating on an inode, the concept of an
> implicit "target namespace" doesn't really exist. (For a properly
> consistent model, you'd probably need to let the caller explicitly
> specify the target namespace, but then that would somewhat break the
> transparency of namespaces.) cap_convert_nscap() starts by checking
> for capable_wrt_inode_uidgid().
Okay -- I think I got this a little twisted. The point here, as far
as I can see, is that there is a credential check involved. The rule
is that from inside the user NS, you can set a VFS_CAP_REVISION_3
only on a file whose (mapped) UID matches the UID 0 of the namespace.
Have I got that right?
> [...]
>> As with a binary that has VFS_CAP_REVISION_2 file capabilities,
>> a binary with VFS_CAP_REVISION_3 file capabilities confers
>> capabilities to a process during execve(). However, capabilities
>> are conferred only if the binary is executed by a process
>> that resides in a user namespace whose UID 0 maps to the root
>> user ID that is saved in the extended attribute, or when executed
>> by a process that resides in descendant of such a names-
>
> Nit: "in a descendant"?
Thanks. Fixed.
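To make the rule in the quoted text concrete, here is a rough toy model of it, offered only as a sketch. The names UserNs, kuid_of and rootid_applies are made up for illustration, as are all the UID numbers. It also assumes the filesystem was mounted from the initial user namespace, so that the saved root user ID can be treated as a UID in the initial namespace. The check walks from the executing process's namespace up through its ancestors and succeeds if any of them has UID 0 mapping to the saved root user ID.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class UserNs:
    """Toy user namespace (hypothetical model, not a kernel structure).

    uid_map rows are (inside_start, parent_start, count), loosely
    following the columns of /proc/PID/uid_map.
    """
    parent: Optional["UserNs"] = None
    uid_map: List[Tuple[int, int, int]] = field(default_factory=list)

    def to_parent(self, uid: int) -> Optional[int]:
        """Translate a UID in this namespace to the parent namespace."""
        for inside, outside, count in self.uid_map:
            if inside <= uid < inside + count:
                return outside + (uid - inside)
        return None  # unmapped


def kuid_of(ns: UserNs, uid: int) -> Optional[int]:
    """Translate uid, as seen inside ns, into the initial namespace."""
    while ns.parent is not None:
        mapped = ns.to_parent(uid)
        if mapped is None:
            return None
        uid, ns = mapped, ns.parent
    return uid


def rootid_applies(ns: Optional[UserNs], root_kuid: int) -> bool:
    """True if ns, or one of its ancestors, has UID 0 mapping to root_kuid."""
    while ns is not None:
        if kuid_of(ns, 0) == root_kuid:
            return True
        ns = ns.parent
    return False


# Illustrative namespaces; every number here is an assumption.
init_ns = UserNs()
container = UserNs(parent=init_ns, uid_map=[(0, 100000, 65536)])
nested = UserNs(parent=container, uid_map=[(0, 1000, 1)])
unrelated = UserNs(parent=init_ns, uid_map=[(0, 200000, 65536)])
```

With a saved root user ID of 100000, the check succeeds for container and for nested (a descendant), and fails for unrelated and for init_ns.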
> [...]
>>>> Likewise,
>>>> +.BR getxattr(2)
>>>> +results will be converted and simplified to show a VFS_CAP_REVISION_2
>>>> +extended attribute, if a VFS_CAP_REVISION_3 applies to the caller's
>>>> +namespace, or to map the VFS_CAP_REVISION_3 root user ID into the
>>>> +caller's namespace.
>>
>> I haven't captured that last paragraph in my text. I'm not sure I
>> understand the idea being presented. Serge, could you elaborate?
>
> Summary: When you read a capability attribute with getxattr(), the
> kernel will rewrite the returned value such that it looks the way it
> would have to look if the filesystem was mounted in your user
> namespace; just like how, when the attribute is written, the caller
> provides an attribute value written as if the filesystem was mounted
> in the caller's user namespace.
> Conceptually, this is mostly the same as the UID conversions applied
> by chown() and stat().
Okay -- thanks. I got this now. I'll work some text into the page.
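For anyone who wants to look at what getxattr(2) actually hands back, here is a minimal sketch of a decoder for the security.capability value. It assumes the layout from <linux/capability.h>: a little-endian 32-bit magic_etc word, two (permitted, inheritable) pairs of 32-bit words, and, for VFS_CAP_REVISION_3 only, a trailing 32-bit root user ID. The function name parse_vfs_cap and the returned dictionary keys are made up for illustration.

```python
import struct

# Constants as defined in <linux/capability.h>.
VFS_CAP_REVISION_MASK = 0xFF000000
VFS_CAP_REVISION_2 = 0x02000000
VFS_CAP_REVISION_3 = 0x03000000
VFS_CAP_FLAGS_EFFECTIVE = 0x000001

V2_SIZE = 4 + 2 * 8      # magic_etc plus two (permitted, inheritable) pairs
V3_SIZE = V2_SIZE + 4    # plus the trailing root user ID


def parse_vfs_cap(raw: bytes) -> dict:
    """Decode a raw security.capability value (hypothetical helper)."""
    if len(raw) < 4:
        raise ValueError("value too short")
    (magic_etc,) = struct.unpack_from("<I", raw, 0)
    revision = magic_etc & VFS_CAP_REVISION_MASK

    if revision == VFS_CAP_REVISION_2 and len(raw) == V2_SIZE:
        rootid = None                       # v2 carries no root user ID
    elif revision == VFS_CAP_REVISION_3 and len(raw) == V3_SIZE:
        (rootid,) = struct.unpack_from("<I", raw, V2_SIZE)
    else:
        raise ValueError("unsupported revision or size")

    # data[0] holds the low 32 bits, data[1] the high 32 bits.
    p_lo, i_lo, p_hi, i_hi = struct.unpack_from("<4I", raw, 4)
    return {
        "revision": revision >> 24,
        "effective": bool(magic_etc & VFS_CAP_FLAGS_EFFECTIVE),
        "permitted": p_lo | (p_hi << 32),
        "inheritable": i_lo | (i_hi << 32),
        "rootid": rootid,
    }
```

On Linux, something like parse_vfs_cap(os.getxattr(path, "security.capability")) should show the value as the kernel presents it to the caller's user namespace, which, per the summary above, may be a version 2 view of an attribute that is stored as version 3.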
Cheers,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
Thread overview: 22+ messages
2018-01-09 18:52 [manpages PATCH] capabilities.7: describe namespaced file capabilities Serge E. Hallyn
2018-01-09 18:52 ` Serge E. Hallyn
[not found] ` <20180109185218.GA21753-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2018-01-14 9:40 ` Michael Kerrisk (man-pages)
2018-01-14 9:40 ` Michael Kerrisk (man-pages)
2018-01-15 4:31 ` Serge E. Hallyn
2018-01-15 4:31 ` Serge E. Hallyn
2018-01-16 17:26 ` Jann Horn
2018-01-16 17:26 ` Jann Horn
2018-01-16 17:38 ` Serge E. Hallyn
2018-01-16 17:38 ` Serge E. Hallyn
[not found] ` <20180116173803.GA15538-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2018-01-17 23:44 ` Michael Kerrisk (man-pages)
2018-01-17 23:44 ` Michael Kerrisk (man-pages)
2018-04-13 19:29 ` Michael Kerrisk (man-pages)
2018-04-15 19:22 ` Serge E. Hallyn
2018-04-22 16:46 ` Michael Kerrisk (man-pages)
2018-04-23 17:57 ` Serge E. Hallyn
2018-04-24 15:13 ` Eric W. Biederman
2018-04-13 19:26 ` Michael Kerrisk (man-pages)
2018-04-16 14:10 ` Jann Horn
2018-04-19 23:57 ` Serge E. Hallyn
2018-05-04 15:10 ` Michael Kerrisk (man-pages)
2018-04-20 0:04 ` Serge E. Hallyn