linux-kernel.vger.kernel.org archive mirror
* Re: [PATCH 0/7] Initial support for user namespace owned mounts
@ 2015-07-30  4:24 Amir Goldstein
  2015-07-30 13:55 ` Seth Forshee
  2015-07-30 13:57 ` Serge Hallyn
  0 siblings, 2 replies; 69+ messages in thread
From: Amir Goldstein @ 2015-07-30  4:24 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Casey Schaufler, Stephen Smalley, Andy Lutomirski,
	Eric W. Biederman, Alexander Viro, Linux FS Devel, LSM List,
	SELinux-NSA, Serge Hallyn, linux-kernel

On Tue, Jul 28, 2015 at 11:40 PM, Seth Forshee
<seth.forshee@canonical.com> wrote:
>
> On Wed, Jul 22, 2015 at 05:05:17PM -0700, Casey Schaufler wrote:
> > > This is what I currently think you want for user ns mounts:
> > >
> > >  1. smk_root and smk_default are assigned the label of the backing
> > >     device.

Seth,

There were 2 main concerns discussed in this thread:
1. trusting LSM labels outside the namespace
2. trusting the content of the image file/loopdev

While your approach addresses the first concern, I suspect it may be placing
an obstacle in the way of resolving the second concern.

A viable security policy to mitigate the second concern could be:
- Allow only trusted programs (e.g. mkfs, fsck) to write to 'Loopback' images
- Allow mount only of 'Loopback' images

This should allow the system as a whole to trust unprivileged mounts based on
the trust of the entities that had raw access to the fs layout.

Alas, if you choose to propagate the backing dev label to contained files,
they would all share the designated 'Loopback' label and render the policy above
useless.

Any thoughts on how to reconcile this conflict?

Amir.


> > >  2. s_root is assigned the transmute property.
> > >  3. For existing files:
> > >     a. Files with the same label as the backing device are accessible.
> > >     b. Files with any other label are not accessible.
> >
> > That's right. Accept correct data, reject anything that's not right.
> >
> > > If this is right, there are a couple lingering questions in my mind.
> > >
> > > First, what happens with files created in directories with the same
> > > label as the backing device but without the transmute property set? The
> > > inode for the new file will initially be labeled with smk_of_current(),
> > > but then during d_instantiate it will get smk_default and thus end up
> > > with the label we want. So that seems okay.
> >
> > Yes.
> >
> > > The second is whether files with the SMACK64EXEC attribute are still a
> > > problem. It seems it is, for files with the same label as the backing
> > > store at least. I think we can simply skip the code that reads out this
> > > xattr and sets smk_task for user ns mounts, or else skip assigning the
> > > label to the new task in bprm_set_creds. The latter seems more
> > > consistent with the approach you've suggested for dealing with labels
> > > from disk.
> >
> > Yes, I think that skipping the smk_fetch(XATTR_NAME_SMACKEXEC, ...) in
> > smack_d_instantiate for unprivileged mounts would do the trick.
> >
> > > So I guess all of that seems okay, though perhaps a bit restrictive
> > > given that the user who mounted the filesystem already has full access
> > > to the backing store.
> >
> > In truth, there is no reason to expect that the "user" who did the
> > mount will ever have a Smack label that differs from the label of
> > the backing store. If what we've got here seems restrictive, it's
> > because you've got access from someone other than the "user".
> >
> > > Please let me know whether or not this matches up with what you are
> > > thinking, then I can proceed with the implementation.
> >
> > My current mindset is that, if you're going to allow unprivileged
> > mounts of user defined backing stores, this is as safe as we can
> > make it.
>
> All right, I've got a patch which I think does this, and I've managed to
> do some testing to confirm that it behaves like I expect. How does this
> look?
>
> What's missing is getting the label from the block device inode; as
> Stephen discovered the inode that I thought we could get the label from
> turned out to be the wrong one. Afaict we would need a new hook in order
> to do that, so for now I'm using the label of the process calling
> mount.
>
> ---
>
> diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
> index a143328f75eb..8e631a66b03c 100644
> --- a/security/smack/smack_lsm.c
> +++ b/security/smack/smack_lsm.c
> @@ -662,6 +662,8 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
>                 skp = smk_of_current();
>                 sp->smk_root = skp;
>                 sp->smk_default = skp;
> +               if (sb_in_userns(sb))
> +                       transmute = 1;
>         }
>         /*
>          * Initialize the root inode.
> @@ -1023,6 +1025,12 @@ static int smack_inode_permission(struct inode *inode, int mask)
>         if (mask == 0)
>                 return 0;
>
> +       if (sb_in_userns(inode->i_sb)) {
> +               struct superblock_smack *sbsp = inode->i_sb->s_security;
> +               if (smk_of_inode(inode) != sbsp->smk_root)
> +                       return -EACCES;
> +       }
> +
>         /* May be droppable after audit */
>         if (no_block)
>                 return -ECHILD;
> @@ -3220,14 +3228,16 @@ static void smack_d_instantiate(struct dentry *opt_dentry, struct inode *inode)
>                         if (rc >= 0)
>                                 transflag = SMK_INODE_TRANSMUTE;
>                 }
> -               /*
> -                * Don't let the exec or mmap label be "*" or "@".
> -                */
> -               skp = smk_fetch(XATTR_NAME_SMACKEXEC, inode, dp);
> -               if (IS_ERR(skp) || skp == &smack_known_star ||
> -                   skp == &smack_known_web)
> -                       skp = NULL;
> -               isp->smk_task = skp;
> +               if (!sb_in_userns(inode->i_sb)) {
> +                       /*
> +                        * Don't let the exec or mmap label be "*" or "@".
> +                        */
> +                       skp = smk_fetch(XATTR_NAME_SMACKEXEC, inode, dp);
> +                       if (IS_ERR(skp) || skp == &smack_known_star ||
> +                           skp == &smack_known_web)
> +                               skp = NULL;
> +                       isp->smk_task = skp;
> +               }
>
>                 skp = smk_fetch(XATTR_NAME_SMACKMMAP, inode, dp);
>                 if (IS_ERR(skp) || skp == &smack_known_star ||
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
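
For reference, the check that the quoted patch adds to
smack_inode_permission() can be modelled in userspace roughly as below.
This is only a sketch under stated assumptions: the model_* types and
model_userns_label_check() are hypothetical stand-ins for the kernel
structures, not Smack code, and the regular Smack rule evaluation that
follows the check in the kernel is not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's label, super_block and inode. */
struct model_label { const char *name; };

struct model_sb {
	bool in_userns;                     /* models sb_in_userns(sb) */
	const struct model_label *smk_root; /* label assigned to the mount */
};

struct model_inode {
	const struct model_sb *sb;
	const struct model_label *label;    /* models smk_of_inode(inode) */
};

/*
 * Models only the early check from the quoted patch: on a mount owned by
 * a user namespace, an inode whose label differs from the superblock's
 * root label is refused. Returns 0 when the check passes and -EACCES
 * when it does not.
 */
static int model_userns_label_check(const struct model_inode *inode, int mask)
{
	assert(inode != NULL && inode->sb != NULL);

	if (mask == 0)
		return 0;
	/* Labels are compared by pointer, as in the quoted patch. */
	if (inode->sb->in_userns && inode->label != inode->sb->smk_root)
		return -EACCES;
	return 0;
}
```

In this model a file labeled like the mount passes the check, a file
with any other label on a user ns mount gets -EACCES, and mounts that
are not owned by a user namespace are unaffected.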


* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-30  4:24 [PATCH 0/7] Initial support for user namespace owned mounts Amir Goldstein
@ 2015-07-30 13:55 ` Seth Forshee
  2015-07-30 14:47   ` Amir Goldstein
  2015-07-30 13:57 ` Serge Hallyn
  1 sibling, 1 reply; 69+ messages in thread
From: Seth Forshee @ 2015-07-30 13:55 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Theodore Ts'o, Casey Schaufler, Stephen Smalley,
	Andy Lutomirski, Eric W. Biederman, Alexander Viro,
	Linux FS Devel, LSM List, SELinux-NSA, Serge Hallyn,
	linux-kernel

On Thu, Jul 30, 2015 at 07:24:11AM +0300, Amir Goldstein wrote:
> On Tue, Jul 28, 2015 at 11:40 PM, Seth Forshee
> <seth.forshee@canonical.com> wrote:
> >
> > On Wed, Jul 22, 2015 at 05:05:17PM -0700, Casey Schaufler wrote:
> > > > This is what I currently think you want for user ns mounts:
> > > >
> > > >  1. smk_root and smk_default are assigned the label of the backing
> > > >     device.
> 
> Seth,
> 
> There were 2 main concerns discussed in this thread:
> 1. trusting LSM labels outside the namespace
> 2. trusting the content of the image file/loopdev
> 
> While your approach addresses the first concern, I suspect it may be placing
> an obstacle in the way of resolving the second concern.
> 
> A viable security policy to mitigate the second concern could be:
> - Allow only trusted programs (e.g. mkfs, fsck) to write to 'Loopback' images
> - Allow mount only of 'Loopback' images
> 
> This should allow the system as a whole to trust unprivileged mounts based on
> the trust of the entities that had raw access to the fs layout.

You don't really say what you mean by "trusted" programs. In a container
context I'd have to assume that you mean suid-root or similar programs
shared into the container by the host. In that case is any new kernel
functionality even required?

That also doesn't work for some of our use cases, where we'd like to be
able to do something like "mount -o loop foo.img /mnt/foo" in an
unprivileged container where foo.img is not created on the local machine
and not fully under control of the host environment.

Agreed though that the "attack from below" problem for untrusted
filesystems is still an open question. At minimum we have fuse, which
has been designed to protect against this threat. Others have mentioned
on this thread that Ted had said something at kernel summit last year
about being willing to support ext4 mounts from unprivileged user
namespaces as well. I've added Ted to the Cc in case he wants to confirm
or deny this rumor.

> Alas, if you choose to propagate the backing dev label to contained files,
> they would all share the designated 'Loopback' label and render the policy above
> useless.
> 
> Any thoughts on how to reconcile this conflict?

I'm not seeing what the conflict is here - nothing you proposed says
anything about security labels in the filesystem, and nothing would
prevent a "trusted" program with CAP_MAC_ADMIN from setting whatever
label was desired on the backing device. Care to elaborate?

Seth


* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-30  4:24 [PATCH 0/7] Initial support for user namespace owned mounts Amir Goldstein
  2015-07-30 13:55 ` Seth Forshee
@ 2015-07-30 13:57 ` Serge Hallyn
  2015-07-30 15:09   ` Amir Goldstein
  1 sibling, 1 reply; 69+ messages in thread
From: Serge Hallyn @ 2015-07-30 13:57 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Seth Forshee, Casey Schaufler, Stephen Smalley, Andy Lutomirski,
	Eric W. Biederman, Alexander Viro, Linux FS Devel, LSM List,
	SELinux-NSA, Serge Hallyn, linux-kernel

Quoting Amir Goldstein (amir@cellrox.com):
> On Tue, Jul 28, 2015 at 11:40 PM, Seth Forshee
> <seth.forshee@canonical.com> wrote:
> >
> > On Wed, Jul 22, 2015 at 05:05:17PM -0700, Casey Schaufler wrote:
> > > > This is what I currently think you want for user ns mounts:
> > > >
> > > >  1. smk_root and smk_default are assigned the label of the backing
> > > >     device.
> 
> Seth,
> 
> There were 2 main concerns discussed in this thread:
> 1. trusting LSM labels outside the namespace
> 2. trusting the content of the image file/loopdev
> 
> While your approach addresses the first concern, I suspect it may be placing
> an obstacle in the way of resolving the second concern.
> 
> A viable security policy to mitigate the second concern could be:
> - Allow only trusted programs (e.g. mkfs, fsck) to write to 'Loopback' images
> - Allow mount only of 'Loopback' images
> 
> This should allow the system as a whole to trust unprivileged mounts based on
> the trust of the entities that had raw access to the fs layout.

Just to be sure I understand right, you're looking for a way to let
the host admin trust that the kernel's superblock parsers aren't being
fed trash or an exploit?

> Alas, if you choose to propagate the backing dev label to contained files,
> they would all share the designated 'Loopback' label and render the policy above
> useless.
> 
> Any thoughts on how to reconcile this conflict?
> 
> Amir.


* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-30 13:55 ` Seth Forshee
@ 2015-07-30 14:47   ` Amir Goldstein
  2015-07-30 15:33     ` Casey Schaufler
  0 siblings, 1 reply; 69+ messages in thread
From: Amir Goldstein @ 2015-07-30 14:47 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Theodore Ts'o, Casey Schaufler, Stephen Smalley,
	Andy Lutomirski, Eric W. Biederman, Alexander Viro,
	Linux FS Devel, LSM List, SELinux-NSA, Serge Hallyn,
	linux-kernel

On Thu, Jul 30, 2015 at 4:55 PM, Seth Forshee
<seth.forshee@canonical.com> wrote:
>
> On Thu, Jul 30, 2015 at 07:24:11AM +0300, Amir Goldstein wrote:
> > On Tue, Jul 28, 2015 at 11:40 PM, Seth Forshee
> > <seth.forshee@canonical.com> wrote:
> > >
> > > On Wed, Jul 22, 2015 at 05:05:17PM -0700, Casey Schaufler wrote:
> > > > > This is what I currently think you want for user ns mounts:
> > > > >
> > > > >  1. smk_root and smk_default are assigned the label of the backing
> > > > >     device.
> >
> > Seth,
> >
> > There were 2 main concerns discussed in this thread:
> > 1. trusting LSM labels outside the namespace
> > 2. trusting the content of the image file/loopdev
> >
> > While your approach addresses the first concern, I suspect it may be placing
> > an obstacle in the way of resolving the second concern.
> >
> > A viable security policy to mitigate the second concern could be:
> > - Allow only trusted programs (e.g. mkfs, fsck) to write to 'Loopback' images
> > - Allow mount only of 'Loopback' images
> >
> > This should allow the system as a whole to trust unprivileged mounts based on
> > the trust of the entities that had raw access to the fs layout.
>
> You don't really say what you mean by "trusted" programs. In a container
> context I'd have to assume that you mean suid-root or similar programs
> shared into the container by the host. In that case is any new kernel
> functionality even required?

Sorry I was not clear. I will try to explain better.
I meant that the programs are "trusted" by the LSM security policy.
I envisioned a system where an unprivileged user is allowed to spawn
a container which contains "trusted" programs (e.g. mkfs) that are labeled
as 'FileSystemTools' by the admin of the host.
FileSystemTools are allowed to write into Loopback labeled files.
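
As a rough illustration of this policy, the two rules could be modelled
in userspace as below. This is a sketch only, not Smack code: the
function names are made up for the example, and the two label strings
are simply the ones used in this mail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Label names taken from the discussion; purely illustrative. */
#define LABEL_TOOLS    "FileSystemTools"
#define LABEL_LOOPBACK "Loopback"

/*
 * Rule 1: only subjects labeled FileSystemTools (e.g. mkfs, fsck) may
 * write to images labeled Loopback. Objects with other labels are
 * outside this sketch, so the function does not object to them.
 */
static bool model_may_write_image(const char *subject, const char *image)
{
	assert(subject != NULL && image != NULL);

	if (strcmp(image, LABEL_LOOPBACK) != 0)
		return true;
	return strcmp(subject, LABEL_TOOLS) == 0;
}

/* Rule 2: an unprivileged mount is allowed only for Loopback images. */
static bool model_may_mount_image(const char *image)
{
	assert(image != NULL);

	return strcmp(image, LABEL_LOOPBACK) == 0;
}
```

The point of the two rules together is that anything mountable can only
have been written by the labeled tools, which is where the trust in the
fs layout would come from.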

>
> That also doesn't work for some of our use cases, where we'd like to be
> able to do something like "mount -o loop foo.img /mnt/foo" in an
> unprivileged container where foo.img is not created on the local machine
> and not fully under control of the host environment.

That use case will not be addressed by the policy I suggested,
but the more common case of:
- create a loopback file
- mkfs
- mount
will be addressed.

So if the (host) admin of the system trusts that an unprivileged user cannot
create a malicious fs layout using mkfs and fsck alone, then the system is
relatively safe mounting (non-fuse) file systems from loopback files.
IMHO, this statement is going to be easier for Ted to sign.

>
> Agreed though that the "attack from below" problem for untrusted
> filesystems is still an open question. At minimum we have fuse, which
> has been designed to protect against this threat. Others have mentioned
> on this thread that Ted had said something at kernel summit last year
> about being willing to support ext4 mounts from unprivileged user
> namespaces as well. I've added Ted to the Cc in case he wants to confirm
> or deny this rumor.
>
> > Alas, if you choose to propagate the backing dev label to contained files,
> > they would all share the designated 'Loopback' label and render the policy above
> > useless.
> >
> > Any thoughts on how to reconcile this conflict?
>
> I'm not seeing what the conflict is here - nothing you proposed says
> anything about security labels in the filesystem, and nothing would
> prevent a "trusted" program with CAP_MAC_ADMIN from setting whatever
> label was desired on the backing device. Care to elaborate?
>
> Seth


* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-30 13:57 ` Serge Hallyn
@ 2015-07-30 15:09   ` Amir Goldstein
  0 siblings, 0 replies; 69+ messages in thread
From: Amir Goldstein @ 2015-07-30 15:09 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: Seth Forshee, Casey Schaufler, Stephen Smalley, Andy Lutomirski,
	Eric W. Biederman, Alexander Viro, Linux FS Devel, LSM List,
	SELinux-NSA, Serge Hallyn, linux-kernel

On Thu, Jul 30, 2015 at 4:57 PM, Serge Hallyn <serge.hallyn@ubuntu.com> wrote:
> Quoting Amir Goldstein (amir@cellrox.com):
>> On Tue, Jul 28, 2015 at 11:40 PM, Seth Forshee
>> <seth.forshee@canonical.com> wrote:
>> >
>> > On Wed, Jul 22, 2015 at 05:05:17PM -0700, Casey Schaufler wrote:
>> > > > This is what I currently think you want for user ns mounts:
>> > > >
>> > > >  1. smk_root and smk_default are assigned the label of the backing
>> > > >     device.
>>
>> Seth,
>>
>> There were 2 main concerns discussed in this thread:
>> 1. trusting LSM labels outside the namespace
>> 2. trusting the content of the image file/loopdev
>>
>> While your approach addresses the first concern, I suspect it may be placing
>> an obstacle in the way of resolving the second concern.
>>
>> A viable security policy to mitigate the second concern could be:
>> - Allow only trusted programs (e.g. mkfs, fsck) to write to 'Loopback' images
>> - Allow mount only of 'Loopback' images
>>
>> This should allow the system as a whole to trust unprivileged mounts based on
>> the trust of the entities that had raw access to the fs layout.
>
> Just to be sure I understand right, you're looking for a way to let
> the host admin trust that the kernel's superblock parsers aren't being
> fed trash or an exploit?

Correct.
I do not believe in trying to audit file system code to a vulnerability-free
level, nor do I think that cryptographically signed file system metadata is
the only way to ensure an exploit-free unprivileged mount.


>
>> Alas, if you choose to propagate the backing dev label to contained files,
>> they would all share the designated 'Loopback' label and render the policy above
>> useless.
>>
>> Any thoughts on how to reconcile this conflict?
>>
>> Amir.


* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-30 14:47   ` Amir Goldstein
@ 2015-07-30 15:33     ` Casey Schaufler
  2015-07-30 15:52       ` Colin Walters
  0 siblings, 1 reply; 69+ messages in thread
From: Casey Schaufler @ 2015-07-30 15:33 UTC (permalink / raw)
  To: Amir Goldstein, Seth Forshee
  Cc: Theodore Ts'o, Stephen Smalley, Andy Lutomirski,
	Eric W. Biederman, Alexander Viro, Linux FS Devel, LSM List,
	SELinux-NSA, Serge Hallyn, linux-kernel, Casey Schaufler

On 7/30/2015 7:47 AM, Amir Goldstein wrote:
> On Thu, Jul 30, 2015 at 4:55 PM, Seth Forshee
> <seth.forshee@canonical.com> wrote:
>> On Thu, Jul 30, 2015 at 07:24:11AM +0300, Amir Goldstein wrote:
>>> On Tue, Jul 28, 2015 at 11:40 PM, Seth Forshee
>>> <seth.forshee@canonical.com> wrote:
>>>> On Wed, Jul 22, 2015 at 05:05:17PM -0700, Casey Schaufler wrote:
>>>>>> This is what I currently think you want for user ns mounts:
>>>>>>
>>>>>>  1. smk_root and smk_default are assigned the label of the backing
>>>>>>     device.
>>> Seth,
>>>
>>> There were 2 main concerns discussed in this thread:
>>> 1. trusting LSM labels outside the namespace
>>> 2. trusting the content of the image file/loopdev
>>>
>>> While your approach addresses the first concern, I suspect it may be placing
>>> an obstacle in the way of resolving the second concern.
>>>
>>> A viable security policy to mitigate the second concern could be:
>>> - Allow only trusted programs (e.g. mkfs, fsck) to write to 'Loopback' images
>>> - Allow mount only of 'Loopback' images
>>>
>>> This should allow the system as a whole to trust unprivileged mounts based on
>>> the trust of the entities that had raw access to the fs layout.
>> You don't really say what you mean by "trusted" programs. In a container
>> context I'd have to assume that you mean suid-root or similar programs
>> shared into the container by the host. In that case is any new kernel
>> functionality even required?
> Sorry I was not clear. I will try to explain better.
> I meant that the programs are "trusted" by the LSM security policy.
> I envisioned a system where an unprivileged user is allowed to spawn
> a container which contains "trusted" programs (e.g. mkfs) that are labeled
> as 'FileSystemTools' by the admin of the host.
> FileSystemTools are allowed to write into Loopback labeled files.

You could do this on a Smack based system. It would require
CAP_MAC_ADMIN and CAP_MAC_OVERRIDE to set up. You would need
to set some SMACK64EXEC labels on your FileSystemTools, and
they would have to be written as carefully as they would if they
had "more" privilege. You'd need to designate a repository for
your loopback files. On the whole, it would be unattractive.
I will pass on providing the details for fear someone will like
it well enough to implement.

>> That also doesn't work for some of our use cases, where we'd like to be
>> able to do something like "mount -o loop foo.img /mnt/foo" in an
>> unprivileged container where foo.img is not created on the local machine
>> and not fully under control of the host environment.
> That use case will not be addressed by the policy I suggested,
> but the more common case of:
> - create a loopback file
> - mkfs
> - mount
> will be addressed.
>
> So if the (host) admin of the system trusts that unprivileged user cannot create
> a malicious fs layout using mkfs and fsck alone, then the system is
> relatively safe
> mounting (non fuse) file systems from loopback files.
> IMHO, this statement is going to be easier for Ted to sign.

But that sort of defeats the purpose of unprivileged mounts.
Or rather, you're trying to place restrictions on what an
unprivileged user can do without calling the ability to
violate those restrictions "privilege". 

>
>> Agreed though that the "attack from below" problem for untrusted
>> filesystems is still an open question. At minimum we have fuse, which
>> has been designed to protect against this threat. Others have mentioned
>> on this thread that Ted had said something at kernel summit last year
>> about being willing to support ext4 mounts from unprivileged user
>> namespaces as well. I've added Ted to the Cc in case he wants to confirm
>> or deny this rumor.
>>
>>> Alas, if you choose to propagate the backing dev label to contained files,
>>> they would all share the designated 'Loopback' label and render the policy above
>>> useless.
>>>
>>> Any thoughts on how to reconcile this conflict?
>> I'm not seeing what the conflict is here - nothing you proposed says
>> anything about security labels in the filesystem, and nothing would
>> prevent a "trusted" program with CAP_MAC_ADMIN from setting whatever
>> label was desired on the backing device. Care to elaborate?
>>
>> Seth


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-30 15:33     ` Casey Schaufler
@ 2015-07-30 15:52       ` Colin Walters
  2015-07-30 16:15         ` Eric W. Biederman
  0 siblings, 1 reply; 69+ messages in thread
From: Colin Walters @ 2015-07-30 15:52 UTC (permalink / raw)
  To: Casey Schaufler, Amir Goldstein, Seth Forshee
  Cc: Theodore Ts'o, Stephen Smalley, Andy Lutomirski,
	Eric W. Biederman, Alexander Viro, Linux FS Devel, LSM List,
	SELinux-NSA, Serge Hallyn, linux-kernel

It's worth noting here that I think a lot of the use cases
for unprivileged mounts are testing/development type things,
and these are pretty well covered by:

http://libguestfs.org/

Basically it just runs the host kernel in a VM, and the userspace
is a minimal agent that you can talk to over virtio.  You can use
the API, or `guestmount` exposes it via FUSE.

It doesn't magically make the kernel filesystems robust against
untrusted input, but in the case of compromise, it's an
"unprivileged" VM.  I've used it for several projects and been
quite happy.
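
As a rough illustration of the guestmount route (a sketch only, in Python):
the helpers below just build the command lines and do not run them. The
image and mount point names are the ones used as an example elsewhere in
this thread; the helper names are made up, and /dev/sda assumes an
unpartitioned image that is the only drive attached to the appliance.

```python
import shlex


def guestmount_argv(image, mountpoint, device="/dev/sda", read_only=True):
    """Build a guestmount command line for a disk image.

    -a adds the image to the libguestfs appliance, -m names the device
    inside the appliance to mount, --ro mounts it read-only.
    """
    argv = ["guestmount", "-a", image, "-m", device]
    if read_only:
        argv.append("--ro")
    argv.append(mountpoint)
    return argv


def guestunmount_argv(mountpoint):
    """Build the matching guestunmount command line."""
    return ["guestunmount", mountpoint]


if __name__ == "__main__":
    # Print the commands instead of running them.
    print(shlex.join(guestmount_argv("foo.img", "/mnt/foo")))
    print(shlex.join(guestunmount_argv("/mnt/foo")))
```

Nothing here needs privilege on the host: the filesystem driver runs in
the VM's kernel and the host only sees a FUSE mount.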


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-30 15:52       ` Colin Walters
@ 2015-07-30 16:15         ` Eric W. Biederman
  0 siblings, 0 replies; 69+ messages in thread
From: Eric W. Biederman @ 2015-07-30 16:15 UTC (permalink / raw)
  To: Colin Walters
  Cc: Casey Schaufler, Amir Goldstein, Seth Forshee, Theodore Ts'o,
	Stephen Smalley, Andy Lutomirski, Alexander Viro, Linux FS Devel,
	LSM List, SELinux-NSA, Serge Hallyn, linux-kernel

Colin Walters <walters@verbum.org> writes:

> It's worth noting here that I think a lot of the use cases
> for unprivileged mounts are testing/development type things,
> and these are pretty well covered by:
>
> http://libguestfs.org/
>
> Basically it just runs the host kernel in a VM, and the userspace
> is a minimal agent that you can talk to over virtio.  You can use
> the API, or `guestmount` exposes it via FUSE.
>
> It doesn't magically make the kernel filesystems robust against
> untrusted input, but in the case of compromise, it's an
> "unprivileged" VM.  I've used it for several projects and been
> quite happy.

Thanks for pointing this out.  That makes it clear we only have to get
as far as making fuse work for this work to be useful in practice.

Eric


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-31 19:56 ` Casey Schaufler
@ 2015-08-01 17:01   ` Amir Goldstein
  0 siblings, 0 replies; 69+ messages in thread
From: Amir Goldstein @ 2015-08-01 17:01 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Seth Forshee, Theodore Ts'o, Stephen Smalley,
	Andy Lutomirski, Eric W. Biederman, Alexander Viro,
	Linux FS Devel, LSM List, SELinux-NSA, Serge Hallyn,
	linux-kernel

On Fri, Jul 31, 2015 at 10:56 PM, Casey Schaufler
<casey@schaufler-ca.com> wrote:
> On 7/31/2015 1:11 AM, Amir Goldstein wrote:
>> On Thu, Jul 30, 2015 at 6:33 PM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>> On 7/30/2015 7:47 AM, Amir Goldstein wrote:
>>>> On Thu, Jul 30, 2015 at 4:55 PM, Seth Forshee
>>>> <seth.forshee@canonical.com> wrote:
>>>>> On Thu, Jul 30, 2015 at 07:24:11AM +0300, Amir Goldstein wrote:
>>>>>> On Tue, Jul 28, 2015 at 11:40 PM, Seth Forshee
>>>>>> <seth.forshee@canonical.com> wrote:
>>>>>>> On Wed, Jul 22, 2015 at 05:05:17PM -0700, Casey Schaufler wrote:
>>>>>>>>> This is what I currently think you want for user ns mounts:
>>>>>>>>>
>>>>>>>>>  1. smk_root and smk_default are assigned the label of the backing
>>>>>>>>>     device.
>>>>>> Seth,
>>>>>>
>>>>>> There were 2 main concerns discussed in this thread:
>>>>>> 1. trusting LSM labels outside the namespace
>>>>>> 2. trusting the content of the image file/loopdev
>>>>>>
>>>>>> While your approach addresses the first concern, I suspect it may be placing
>>>>>> an obstacle in a way for resolving the second concern.
>>>>>>
>>>>>> A viable security policy to mitigate the second concern could be:
>>>>>> - Allow only trusted programs (e.g. mkfs, fsck) to write to 'Loopback' images
>>>>>> - Allow mount only of 'Loopback' images
>>>>>>
>>>>>> This should allow the system as a whole to trust unprivileged mounts based on
>>>>>> the trust of the entities that had raw access to the fs layout.
>>>>> You don't really say what you mean by "trusted" programs. In a container
>>>>> context I'd have to assume that you mean suid-root or similar programs
>>>>> shared into the container by the host. In that case is any new kernel
>>>>> functionality even required?
>>>> Sorry I was not clear. I will try to explain better.
>>>> I meant that the programs are "trusted" by the LSM security policy.
>>>> I envisioned a system where unprivileged user is allowed to spawn
>>>> a container which contains "trusted" programs (e.g. mkfs) that are labeled
>>>> as 'FileSystemTools' by the admin of the host.
>>>> FileSystemTools are allowed to write into Loopback labeled files.
>>> You could do this on a Smack based system. It would require
>>> CAP_MAC_ADMIN and CAP_MAC_OVERRIDE to set up. You would need
>>> to set some SMACK64EXEC labels on your FileSystemTools, and
>>> they would have to be written as carefully as they would if they
>>> had "more" privilege. You'd need to designate a repository for
>>> your loopback files. On the whole, it would be unattractive.
>>> I will pass on providing the details for fear someone will like
>>> it well enough to implement.
>>>
>>>>> That also doesn't work for some of our use cases, where we'd like to be
>>>>> able to do something like "mount -o loop foo.img /mnt/foo" in an
>>>>> unprivileged container where foo.img is not created on the local machine
>>>>> and not fully under control of the host environment.
>>>> That use case will not be addressed by the policy I suggested,
>>>> but the more common case of:
>>>> - create a loopback file
>>>> - mkfs
>>>> - mount
>>>> will be addressed.
>>>>
>>>> So if the (host) admin of the system trusts that unprivileged user cannot create
>>>> a malicious fs layout using mkfs and fsck alone, then the system is
>>>> relatively safe
>>>> mounting (non fuse) file systems from loopback files.
>>>> IMHO, this statement is going to be easier for Ted to sign.
>>> But that sort of defeats the purpose of unprivileged mounts.
>>> Or rather, you're trying to place restrictions on what an
>>> unprivileged user can do without calling the ability to
>>> violate those restrictions "privilege".
>> I don't understand your concern.
>
> My concern is that you're playing a shell game. Allow unprivileged
> mounts, but only of things that were created using privilege. How
> is that better than requiring privilege to do the mount?

To me, the ability of an admin to delegate permissions to an unprivileged
user to mkfs/fsck/mount "trusted" loopdevs sounds very useful.
But I am not going to argue that use case any further.

I do agree that it would have been much better if user namespace
could allow unprivileged mounts of certain non FUSE file systems
without relying on specially crafted security policies, but I do not
see how that can happen.


>
>> I am saying that LSM can come to the rescue, in a use case that
>> many have been considering as unsolvable (i.e. the loopback tampering).
>>
>> Yes, I am trying to place restrictions on what an unprivileged user can do.
>> As it stands right now, user is about to gain the ability to mount FUSE.
>> With some extra care on crafting the policy and without any extra code,
>> user can gain the ability to mount "trusted loopback files".
>> It does not solve all use cases, but it does solve a handful.
>
> As I said, you can do this, but it will be ugly, and people won't
> understand how to use it correctly. The distance between the "trusted"
> creation of the filesystem and the "untrusted" mount is too great.
> Plus, there are too many ways to circumvent the integrity of your
> "trusted" filesystem.
>
>> Anyway, the concern I was raising was about the fact that if files inside
>> the loopback mount inherit the label of the loopback file, this policy is
>> going to be impossible to write.
>> But Stephen has already proposed an alternative to this implicit inherit rule
>> on [PATCH 6/7] thread, so I withdraw my concern.
>
> What Stephen has proposed is dandy for SELinux.
>
>>
>>
>>>>> Agreed though that the "attack from below" problem for untrusted
>>>>> filesystems is still an open question. At minimum we have fuse, which
>>>>> has been designed to protect against this threat. Others have mentioned
>>>>> on this thread that Ted had said something at kernel summit last year
>>>>> about being willing to support ext4 mounts from unprivileged user
>>>>> namespaces as well. I've added Ted to the Cc in case he wants to confirm
>>>>> or deny this rumor.
>>>>>
>>>>>> Alas, if you choose to propagate the backing dev label to contained files,
>>>>>> they would all share the designated 'Loopback' label and render the policy above
>>>>>> useless.
>>>>>>
>>>>>> Any thoughts on how to reconcile this conflict?
>>>>> I'm not seeing what the conflict is here - nothing you proposed says
>>>>> anything about security labels in the filesystem, and nothing would
>>>>> prevent a "trusted" program with CAP_MAC_ADMIN from setting whatever
>>>>> label was desired on the backing device. Care to elaborate?
>>>>>
>>>>> Seth
>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-31  8:11 Amir Goldstein
@ 2015-07-31 19:56 ` Casey Schaufler
  2015-08-01 17:01   ` Amir Goldstein
  0 siblings, 1 reply; 69+ messages in thread
From: Casey Schaufler @ 2015-07-31 19:56 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Seth Forshee, Theodore Ts'o, Stephen Smalley,
	Andy Lutomirski, Eric W. Biederman, Alexander Viro,
	Linux FS Devel, LSM List, SELinux-NSA, Serge Hallyn,
	linux-kernel, Casey Schaufler

On 7/31/2015 1:11 AM, Amir Goldstein wrote:
> On Thu, Jul 30, 2015 at 6:33 PM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>> On 7/30/2015 7:47 AM, Amir Goldstein wrote:
>>> On Thu, Jul 30, 2015 at 4:55 PM, Seth Forshee
>>> <seth.forshee@canonical.com> wrote:
>>>> On Thu, Jul 30, 2015 at 07:24:11AM +0300, Amir Goldstein wrote:
>>>>> On Tue, Jul 28, 2015 at 11:40 PM, Seth Forshee
>>>>> <seth.forshee@canonical.com> wrote:
>>>>>> On Wed, Jul 22, 2015 at 05:05:17PM -0700, Casey Schaufler wrote:
>>>>>>>> This is what I currently think you want for user ns mounts:
>>>>>>>>
>>>>>>>>  1. smk_root and smk_default are assigned the label of the backing
>>>>>>>>     device.
>>>>> Seth,
>>>>>
>>>>> There were 2 main concerns discussed in this thread:
>>>>> 1. trusting LSM labels outside the namespace
>>>>> 2. trusting the content of the image file/loopdev
>>>>>
>>>>> While your approach addresses the first concern, I suspect it may be placing
>>>>> an obstacle in a way for resolving the second concern.
>>>>>
>>>>> A viable security policy to mitigate the second concern could be:
>>>>> - Allow only trusted programs (e.g. mkfs, fsck) to write to 'Loopback' images
>>>>> - Allow mount only of 'Loopback' images
>>>>>
>>>>> This should allow the system as a whole to trust unprivileged mounts based on
>>>>> the trust of the entities that had raw access to the fs layout.
>>>> You don't really say what you mean by "trusted" programs. In a container
>>>> context I'd have to assume that you mean suid-root or similar programs
>>>> shared into the container by the host. In that case is any new kernel
>>>> functionality even required?
>>> Sorry I was not clear. I will try to explain better.
>>> I meant that the programs are "trusted" by the LSM security policy.
>>> I envisioned a system where unprivileged user is allowed to spawn
>>> a container which contains "trusted" programs (e.g. mkfs) that are labeled
>>> as 'FileSystemTools' by the admin of the host.
>>> FileSystemTools are allowed to write into Loopback labeled files.
>> You could do this on a Smack based system. It would require
>> CAP_MAC_ADMIN and CAP_MAC_OVERRIDE to set up. You would need
>> to set some SMACK64EXEC labels on your FileSystemTools, and
>> they would have to be written as carefully as they would if they
>> had "more" privilege. You'd need to designate a repository for
>> your loopback files. On the whole, it would be unattractive.
>> I will pass on providing the details for fear someone will like
>> it well enough to implement.
>>
>>>> That also doesn't work for some of our use cases, where we'd like to be
>>>> able to do something like "mount -o loop foo.img /mnt/foo" in an
>>>> unprivileged container where foo.img is not created on the local machine
>>>> and not fully under control of the host environment.
>>> That use case will not be addressed by the policy I suggested,
>>> but the more common case of:
>>> - create a loopback file
>>> - mkfs
>>> - mount
>>> will be addressed.
>>>
>>> So if the (host) admin of the system trusts that unprivileged user cannot create
>>> a malicious fs layout using mkfs and fsck alone, then the system is
>>> relatively safe
>>> mounting (non fuse) file systems from loopback files.
>>> IMHO, this statement is going to be easier for Ted to sign.
>> But that sort of defeats the purpose of unprivileged mounts.
>> Or rather, you're trying to place restrictions on what an
>> unprivileged user can do without calling the ability to
>> violate those restrictions "privilege".
> I don't understand your concern.

My concern is that you're playing a shell game. Allow unprivileged
mounts, but only of things that were created using privilege. How
is that better than requiring privilege to do the mount?

> I am saying that LSM can come to the rescue, in a use case that
> many have been considering as unsolvable (i.e. the loopback tampering).
>
> Yes, I am trying to place restrictions on what an unprivileged user can do.
> As it stands right now, user is about to gain the ability to mount FUSE.
> With some extra care on crafting the policy and without any extra code,
> user can gain the ability to mount "trusted loopback files".
> It does not solve all use cases, but it does solve a handful.

As I said, you can do this, but it will be ugly, and people won't
understand how to use it correctly. The distance between the "trusted"
creation of the filesystem and the "untrusted" mount is too great.
Plus, there are too many ways to circumvent the integrity of your
"trusted" filesystem.

> Anyway, the concern I was raising was about the fact that if files inside
> the loopback mount inherit the label of the loopback file, this policy is
> going to be impossible to write.
> But Stephen has already proposed an alternative to this implicit inherit rule
> on [PATCH 6/7] thread, so I withdraw my concern.

What Stephen has proposed is dandy for SELinux.

>
>
>>>> Agreed though that the "attack from below" problem for untrusted
>>>> filesystems is still an open question. At minimum we have fuse, which
>>>> has been designed to protect against this threat. Others have mentioned
>>>> on this thread that Ted had said something at kernel summit last year
>>>> about being willing to support ext4 mounts from unprivileged user
>>>> namespaces as well. I've added Ted to the Cc in case he wants to confirm
>>>> or deny this rumor.
>>>>
>>>>> Alas, if you choose to propagate the backing dev label to contained files,
>>>>> they would all share the designated 'Loopback' label and render the policy above
>>>>> useless.
>>>>>
>>>>> Any thoughts on how to reconcile this conflict?
>>>> I'm not seeing what the conflict is here - nothing you proposed says
>>>> anything about security labels in the filesystem, and nothing would
>>>> prevent a "trusted" program with CAP_MAC_ADMIN from setting whatever
>>>> label was desired on the backing device. Care to elaborate?
>>>>
>>>> Seth


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
@ 2015-07-31  8:11 Amir Goldstein
  2015-07-31 19:56 ` Casey Schaufler
  0 siblings, 1 reply; 69+ messages in thread
From: Amir Goldstein @ 2015-07-31  8:11 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Seth Forshee, Theodore Ts'o, Stephen Smalley,
	Andy Lutomirski, Eric W. Biederman, Alexander Viro,
	Linux FS Devel, LSM List, SELinux-NSA, Serge Hallyn,
	linux-kernel

On Thu, Jul 30, 2015 at 6:33 PM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> On 7/30/2015 7:47 AM, Amir Goldstein wrote:
>> On Thu, Jul 30, 2015 at 4:55 PM, Seth Forshee
>> <seth.forshee@canonical.com> wrote:
>>> On Thu, Jul 30, 2015 at 07:24:11AM +0300, Amir Goldstein wrote:
>>>> On Tue, Jul 28, 2015 at 11:40 PM, Seth Forshee
>>>> <seth.forshee@canonical.com> wrote:
>>>>> On Wed, Jul 22, 2015 at 05:05:17PM -0700, Casey Schaufler wrote:
>>>>>>> This is what I currently think you want for user ns mounts:
>>>>>>>
>>>>>>>  1. smk_root and smk_default are assigned the label of the backing
>>>>>>>     device.
>>>> Seth,
>>>>
>>>> There were 2 main concerns discussed in this thread:
>>>> 1. trusting LSM labels outside the namespace
>>>> 2. trusting the content of the image file/loopdev
>>>>
>>>> While your approach addresses the first concern, I suspect it may be placing
>>>> an obstacle in a way for resolving the second concern.
>>>>
>>>> A viable security policy to mitigate the second concern could be:
>>>> - Allow only trusted programs (e.g. mkfs, fsck) to write to 'Loopback' images
>>>> - Allow mount only of 'Loopback' images
>>>>
>>>> This should allow the system as a whole to trust unprivileged mounts based on
>>>> the trust of the entities that had raw access to the fs layout.
>>> You don't really say what you mean by "trusted" programs. In a container
>>> context I'd have to assume that you mean suid-root or similar programs
>>> shared into the container by the host. In that case is any new kernel
>>> functionality even required?
>> Sorry I was not clear. I will try to explain better.
>> I meant that the programs are "trusted" by the LSM security policy.
>> I envisioned a system where unprivileged user is allowed to spawn
>> a container which contains "trusted" programs (e.g. mkfs) that are labeled
>> as 'FileSystemTools' by the admin of the host.
>> FileSystemTools are allowed to write into Loopback labeled files.
>
> You could do this on a Smack based system. It would require
> CAP_MAC_ADMIN and CAP_MAC_OVERRIDE to set up. You would need
> to set some SMACK64EXEC labels on your FileSystemTools, and
> they would have to be written as carefully as they would if they
> had "more" privilege. You'd need to designate a repository for
> your loopback files. On the whole, it would be unattractive.
> I will pass on providing the details for fear someone will like
> it well enough to implement.
>
>>> That also doesn't work for some of our use cases, where we'd like to be
>>> able to do something like "mount -o loop foo.img /mnt/foo" in an
>>> unprivileged container where foo.img is not created on the local machine
>>> and not fully under control of the host environment.
>> That use case will not be addressed by the policy I suggested,
>> but the more common case of:
>> - create a loopback file
>> - mkfs
>> - mount
>> will be addressed.
>>
>> So if the (host) admin of the system trusts that unprivileged user cannot create
>> a malicious fs layout using mkfs and fsck alone, then the system is
>> relatively safe
>> mounting (non fuse) file systems from loopback files.
>> IMHO, this statement is going to be easier for Ted to sign.
>
> But that sort of defeats the purpose of unprivileged mounts.
> Or rather, you're trying to place restrictions on what an
> unprivileged user can do without calling the ability to
> violate those restrictions "privilege".

I don't understand your concern.
I am saying that LSM can come to the rescue, in a use case that
many have been considering as unsolvable (i.e. the loopback tampering).

Yes, I am trying to place restrictions on what an unprivileged user can do.
As it stands right now, the user is about to gain the ability to mount FUSE.
With some extra care in crafting the policy and without any extra code,
the user can gain the ability to mount "trusted loopback files".
It does not solve all use cases, but it does solve a handful.

Anyway, the concern I was raising was about the fact that if files inside
the loopback mount inherit the label of the loopback file, this policy is
going to be impossible to write.
But Stephen has already proposed an alternative to this implicit inherit rule
on the [PATCH 6/7] thread, so I withdraw my concern.
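
For illustration only, the policy and one reading of the conflict with
implicit label inheritance can be written down as a toy model in Python.
The 'Loopback' and 'FileSystemTools' labels are the ones used in this
thread; the 'User' label and all function names are made up, and none of
this is Smack or SELinux policy syntax.

```python
# Toy model only. The label names come from this thread; the functions are
# hypothetical and are not Smack or SELinux policy syntax.
LOOPBACK = "Loopback"
FS_TOOLS = "FileSystemTools"


def may_write(subject_label, object_label):
    """Only trusted tools may write to objects labeled 'Loopback'."""
    if object_label == LOOPBACK:
        return subject_label == FS_TOOLS
    # Outside the toy policy: fall back to a plain label match.
    return subject_label == object_label


def may_mount(image_label):
    """Only images labeled 'Loopback' may be mounted."""
    return image_label == LOOPBACK


def label_inside_mount(image_label, on_disk_label, inherit):
    """Label a file inside the mount ends up with."""
    return image_label if inherit else on_disk_label


if __name__ == "__main__":
    user = "User"
    inherited = label_inside_mount(LOOPBACK, user, inherit=True)
    separate = label_inside_mount(LOOPBACK, user, inherit=False)
    print(may_write(user, LOOPBACK))   # raw image write by the user
    print(may_write(user, inherited))  # file in the mount, inherited label
    print(may_write(user, separate))   # file in the mount, its own label
```

In this model, once files inside the mount inherit 'Loopback', the rule
that protects the raw image also locks the user out of the mounted files,
and relaxing it would reopen raw writes to the image. With separate labels
the two cases can be told apart.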


>
>>
>>> Agreed though that the "attack from below" problem for untrusted
>>> filesystems is still an open question. At minimum we have fuse, which
>>> has been designed to protect against this threat. Others have mentioned
>>> on this thread that Ted had said something at kernel summit last year
>>> about being willing to support ext4 mounts from unprivileged user
>>> namespaces as well. I've added Ted to the Cc in case he wants to confirm
>>> or deny this rumor.
>>>
>>>> Alas, if you choose to propagate the backing dev label to contained files,
>>>> they would all share the designated 'Loopback' label and render the policy above
>>>> useless.
>>>>
>>>> Any thoughts on how to reconcile this conflict?
>>> I'm not seeing what the conflict is here - nothing you proposed says
>>> anything about security labels in the filesystem, and nothing would
>>> prevent a "trusted" program with CAP_MAC_ADMIN from setting whatever
>>> label was desired on the backing device. Care to elaborate?
>>>
>>> Seth
>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-30 17:25                                       ` Seth Forshee
@ 2015-07-30 17:33                                         ` Eric W. Biederman
  0 siblings, 0 replies; 69+ messages in thread
From: Eric W. Biederman @ 2015-07-30 17:33 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Casey Schaufler, Stephen Smalley, Andy Lutomirski,
	Alexander Viro, Linux FS Devel, LSM List, SELinux-NSA,
	Serge Hallyn, linux-kernel

Seth Forshee <seth.forshee@canonical.com> writes:

> On Thu, Jul 30, 2015 at 12:05:27PM -0500, Eric W. Biederman wrote:
>> Casey Schaufler <casey@schaufler-ca.com> writes:
>> 
>> > On 7/28/2015 1:40 PM, Seth Forshee wrote:
>> >> On Wed, Jul 22, 2015 at 05:05:17PM -0700, Casey Schaufler wrote:
>> >>>> This is what I currently think you want for user ns mounts:
>> >>>>
>> >>>>  1. smk_root and smk_default are assigned the label of the backing
>> >>>>     device.
>> >>>>  2. s_root is assigned the transmute property.
>> >>>>  3. For existing files:
>> >>>>     a. Files with the same label as the backing device are accessible.
>> >>>>     b. Files with any other label are not accessible.
>> >>> That's right. Accept correct data, reject anything that's not right.
>> >>>
>> >>>> If this is right, there are a couple lingering questions in my mind.
>> >>>>
>> >>>> First, what happens with files created in directories with the same
>> >>>> label as the backing device but without the transmute property set? The
>> >>>> inode for the new file will initially be labeled with smk_of_current(),
>> >>>> but then during d_instantiate it will get smk_default and thus end up
>> >>>> with the label we want. So that seems okay.
>> >>> Yes.
>> >>>
>> >>>> The second is whether files with the SMACK64EXEC attribute are still a
>> >>>> problem. It seems it is, for files with the same label as the backing
>> >>>> store at least. I think we can simply skip the code that reads out this
>> >>>> xattr and sets smk_task for user ns mounts, or else skip assigning the
>> >>>> label to the new task in bprm_set_creds. The latter seems more
>> >>>> consistent with the approach you've suggested for dealing with labels
>> >>>> from disk.
>> >>> Yes, I think that skipping the smk_fetch(XATTR_NAME_SMACKEXEC, ...) in
>> >>> smack_d_instantiate for unprivileged mounts would do the trick.
>> >>>
>> >>>> So I guess all of that seems okay, though perhaps a bit restrictive
>> >>>> given that the user who mounted the filesystem already has full access
>> >>>> to the backing store.
>> >>> In truth, there is no reason to expect that the "user" who did the
>> >>> mount will ever have a Smack label that differs from the label of
>> >>> the backing store. If what we've got here seems restrictive, it's
>> >>> because you've got access from someone other than the "user".
>> >>>
>> >>>> Please let me know whether or not this matches up with what you are
>> >>>> thinking, then I can proceed with the implementation.
>> >>> My current mindset is that, if you're going to allow unprivileged
>> >>> mounts of user defined backing stores, this is as safe as we can
>> >>> make it.
>> >> All right, I've got a patch which I think does this, and I've managed to
>> >> do some testing to confirm that it behaves like I expect. How does this
>> >> look?
>> >>
>> >> What's missing is getting the label from the block device inode; as
>> >> Stephen discovered the inode that I thought we could get the label from
>> >> turned out to be the wrong one. Afaict we would need a new hook in order
>> >> to do that, so for now I'm using the label of the process calling
>> >> mount.
>> >
>> > That will be OK if the mount processing checks for write access to
>> > the backing store. I haven't looked to see if it does. If it doesn't
>> > the problems should be pretty obvious.
>> 
>> 
>> do_new_mount
>>   vfs_kern_mount
>>     mount_fs
>>       ...
>>         mount_bdev
>>           blkdev_get_by_path(...,FMODE_READ| FMODE_WRITE | FMODE_EXCL,...)
>>             lookup_bdev
>>               kern_path
>>                 filename_lookup
>>                   path_lookupat
>>                     lookup_last
>>                       walk_component
>>             blkdev_get(...,mode,...)
>>               __blkdev_get(...,mode,...)
>>                 devcgroup_inode_permission(bdev->bd_inode, perm)
>> 
>> *scratches my head*
>> 
>> It looks like we don't actually check the permissions on the block
>> device.  Tomoyo has a hack for it.  nfsd does something.  There is
>> devcgroup silliness.
>> 
>> But overall it looks like we depend on capable(CAP_SYS_ADMIN).
>> 
>> Seth I do believe we have found another area of the vfs we will need to
>> shore up before allowing unprivileged mounts of block device based
>> filesystems.
>> 
>> It looks like there are enough hacks someone with a clue coming through
>> and making the code make more sense seems like a good idea anyway.
>
> Yep, I just came to the same conclusion myself, and I also verified the
> behavior empirically. That's definitely a problem. I'll get to work on
> fixing that.

At a quick glance it looks like lookup_bdev, and most of its callers,
need to be modified to potentially do the additional permission
checking.
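
To make the missing check concrete, here is a userspace toy model in
Python of the kind of test the mount path does not perform today: does
the caller have read and write permission on the block device node? It is
only a sketch. The function names are hypothetical, and it deliberately
ignores capabilities, ACLs, LSM hooks and the device cgroup, all of which
a real kernel check would have to take into account.

```python
import stat

# Same bit values as the r and w bits of a permission triplet.
MAY_READ, MAY_WRITE = 4, 2


def mode_allows(st_mode, owner_uid, owner_gid, uid, gids, want):
    """Classic owner/group/other check on the mode bits only."""
    if uid == owner_uid:
        bits = (st_mode >> 6) & 7
    elif owner_gid in gids:
        bits = (st_mode >> 3) & 7
    else:
        bits = st_mode & 7
    return (bits & want) == want


def may_mount_bdev(st_mode, owner_uid, owner_gid, uid, gids):
    """Hypothetical extra check: need read and write on a block device."""
    return stat.S_ISBLK(st_mode) and mode_allows(
        st_mode, owner_uid, owner_gid, uid, gids, MAY_READ | MAY_WRITE
    )


if __name__ == "__main__":
    # A device node owned by uid 0 and gid 6 with mode 0660.
    disk_mode = stat.S_IFBLK | 0o660
    print(may_mount_bdev(disk_mode, 0, 6, 1000, {1000}))
    print(may_mount_bdev(disk_mode, 0, 6, 1000, {1000, 6}))
```

The first call models a user with no access to the device node and the
second a user who is in the owning group.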

I expect we could move the devcgroup checks into whatever new checks we
wind up adding.

Fun, fun fun.

Eric


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-30 17:05                                     ` Eric W. Biederman
@ 2015-07-30 17:25                                       ` Seth Forshee
  2015-07-30 17:33                                         ` Eric W. Biederman
  0 siblings, 1 reply; 69+ messages in thread
From: Seth Forshee @ 2015-07-30 17:25 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Casey Schaufler, Stephen Smalley, Andy Lutomirski,
	Alexander Viro, Linux FS Devel, LSM List, SELinux-NSA,
	Serge Hallyn, linux-kernel

On Thu, Jul 30, 2015 at 12:05:27PM -0500, Eric W. Biederman wrote:
> Casey Schaufler <casey@schaufler-ca.com> writes:
> 
> > On 7/28/2015 1:40 PM, Seth Forshee wrote:
> >> On Wed, Jul 22, 2015 at 05:05:17PM -0700, Casey Schaufler wrote:
> >>>> This is what I currently think you want for user ns mounts:
> >>>>
> >>>>  1. smk_root and smk_default are assigned the label of the backing
> >>>>     device.
> >>>>  2. s_root is assigned the transmute property.
> >>>>  3. For existing files:
> >>>>     a. Files with the same label as the backing device are accessible.
> >>>>     b. Files with any other label are not accessible.
> >>> That's right. Accept correct data, reject anything that's not right.
> >>>
> >>>> If this is right, there are a couple lingering questions in my mind.
> >>>>
> >>>> First, what happens with files created in directories with the same
> >>>> label as the backing device but without the transmute property set? The
> >>>> inode for the new file will initially be labeled with smk_of_current(),
> >>>> but then during d_instantiate it will get smk_default and thus end up
> >>>> with the label we want. So that seems okay.
> >>> Yes.
> >>>
> >>>> The second is whether files with the SMACK64EXEC attribute are still a
> >>>> problem. It seems it is, for files with the same label as the backing
> >>>> store at least. I think we can simply skip the code that reads out this
> >>>> xattr and sets smk_task for user ns mounts, or else skip assigning the
> >>>> label to the new task in bprm_set_creds. The latter seems more
> >>>> consistent with the approach you've suggested for dealing with labels
> >>>> from disk.
> >>> Yes, I think that skipping the smk_fetch(XATTR_NAME_SMACKEXEC, ...) in
> >>> smack_d_instantiate for unprivileged mounts would do the trick.
> >>>
> >>>> So I guess all of that seems okay, though perhaps a bit restrictive
> >>>> given that the user who mounted the filesystem already has full access
> >>>> to the backing store.
> >>> In truth, there is no reason to expect that the "user" who did the
> >>> mount will ever have a Smack label that differs from the label of
> >>> the backing store. If what we've got here seems restrictive, it's
> >>> because you've got access from someone other than the "user".
> >>>
> >>>> Please let me know whether or not this matches up with what you are
> >>>> thinking, then I can proceed with the implementation.
> >>> My current mindset is that, if you're going to allow unprivileged
> >>> mounts of user defined backing stores, this is as safe as we can
> >>> make it.
> >> All right, I've got a patch which I think does this, and I've managed to
> >> do some testing to confirm that it behaves like I expect. How does this
> >> look?
> >>
> >> What's missing is getting the label from the block device inode; as
> >> Stephen discovered the inode that I thought we could get the label from
> >> turned out to be the wrong one. Afaict we would need a new hook in order
> >> to do that, so for now I'm using the label of the process calling
> >> mount.
> >
> > That will be OK if the mount processing checks for write access to
> > the backing store. I haven't looked to see if it does. If it doesn't
> > the problems should be pretty obvious.
> 
> 
> do_new_mount
>   vfs_kern_mount
>     mount_fs
>       ...
>         mount_bdev
>           blkdev_get_by_path(...,FMODE_READ| FMODE_WRITE | FMODE_EXCL,...)
>             lookup_bdev
>               kern_path
>                 filename_lookup
>                   path_lookupat
>                     lookup_last
>                       walk_component
>             blkdev_get(...,mode,...)
>               __blkdev_get(...,mode,...)
>                 devcgroup_inode_permission(bdev->bd_inode, perm)
> 
> *scratches my head*
> 
> It looks like we don't actually check the permissions on the block
> device.  Tomoyo has a hack for it.  nfsd does something.  There is
> devcgroup silliness.
> 
> But overall it looks like we depend on capable(CAP_SYS_ADMIN).
> 
> Seth I do believe we have found another area of the vfs we will need to
> shore up before allowing unprivileged mounts of block device based
> filesystems.
> 
> It looks like there are enough hacks that someone with a clue coming through
> and making the code make more sense seems like a good idea anyway.

Yep, I just came to the same conclusion myself, and I also verified the
behavior empirically. That's definitely a problem. I'll get to work on
fixing that.

Seth


* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-30 16:18                                   ` Casey Schaufler
@ 2015-07-30 17:05                                     ` Eric W. Biederman
  2015-07-30 17:25                                       ` Seth Forshee
  0 siblings, 1 reply; 69+ messages in thread
From: Eric W. Biederman @ 2015-07-30 17:05 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Seth Forshee, Stephen Smalley, Andy Lutomirski, Alexander Viro,
	Linux FS Devel, LSM List, SELinux-NSA, Serge Hallyn,
	linux-kernel

Casey Schaufler <casey@schaufler-ca.com> writes:

> On 7/28/2015 1:40 PM, Seth Forshee wrote:
>> On Wed, Jul 22, 2015 at 05:05:17PM -0700, Casey Schaufler wrote:
>>>> This is what I currently think you want for user ns mounts:
>>>>
>>>>  1. smk_root and smk_default are assigned the label of the backing
>>>>     device.
>>>>  2. s_root is assigned the transmute property.
>>>>  3. For existing files:
>>>>     a. Files with the same label as the backing device are accessible.
>>>>     b. Files with any other label are not accessible.
>>> That's right. Accept correct data, reject anything that's not right.
>>>
>>>> If this is right, there are a couple lingering questions in my mind.
>>>>
>>>> First, what happens with files created in directories with the same
>>>> label as the backing device but without the transmute property set? The
>>>> inode for the new file will initially be labeled with smk_of_current(),
>>>> but then during d_instantiate it will get smk_default and thus end up
>>>> with the label we want. So that seems okay.
>>> Yes.
>>>
>>>> The second is whether files with the SMACK64EXEC attribute are still a
>>>> problem. It seems it is, for files with the same label as the backing
>>>> store at least. I think we can simply skip the code that reads out this
>>>> xattr and sets smk_task for user ns mounts, or else skip assigning the
>>>> label to the new task in bprm_set_creds. The latter seems more
>>>> consistent with the approach you've suggested for dealing with labels
>>>> from disk.
>>> Yes, I think that skipping the smk_fetch(XATTR_NAME_SMACKEXEC, ...) in
>>> smack_d_instantiate for unprivileged mounts would do the trick.
>>>
>>>> So I guess all of that seems okay, though perhaps a bit restrictive
>>>> given that the user who mounted the filesystem already has full access
>>>> to the backing store.
>>> In truth, there is no reason to expect that the "user" who did the
>>> mount will ever have a Smack label that differs from the label of
>>> the backing store. If what we've got here seems restrictive, it's
>>> because you've got access from someone other than the "user".
>>>
>>>> Please let me know whether or not this matches up with what you are
>>>> thinking, then I can proceed with the implementation.
>>> My current mindset is that, if you're going to allow unprivileged
>>> mounts of user defined backing stores, this is as safe as we can
>>> make it.
>> All right, I've got a patch which I think does this, and I've managed to
>> do some testing to confirm that it behaves like I expect. How does this
>> look?
>>
>> What's missing is getting the label from the block device inode; as
>> Stephen discovered the inode that I thought we could get the label from
>> turned out to be the wrong one. Afaict we would need a new hook in order
>> to do that, so for now I'm using the label of the process calling
>> mount.
>
> That will be OK if the mount processing checks for write access to
> the backing store. I haven't looked to see if it does. If it doesn't
> the problems should be pretty obvious.


do_new_mount
  vfs_kern_mount
    mount_fs
      ...
        mount_bdev
          blkdev_get_by_path(...,FMODE_READ| FMODE_WRITE | FMODE_EXCL,...)
            lookup_bdev
              kern_path
                filename_lookup
                  path_lookupat
                    lookup_last
                      walk_component
            blkdev_get(...,mode,...)
              __blkdev_get(...,mode,...)
                devcgroup_inode_permission(bdev->bd_inode, perm)

*scratches my head*

It looks like we don't actually check the permissions on the block
device.  Tomoyo has a hack for it.  nfsd does something.  There is
devcgroup silliness.

But overall it looks like we depend on capable(CAP_SYS_ADMIN).

Seth I do believe we have found another area of the vfs we will need to
shore up before allowing unprivileged mounts of block device based
filesystems.

It looks like there are enough hacks that someone with a clue coming through
and making the code make more sense seems like a good idea anyway.
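
To make the gap concrete, here is a rough userspace model of the kind
of check that appears to be missing from the mount path above. It is
only a sketch: every name in it (model_cred, model_bdev_inode,
model_bdev_permission, model_mount_bdev) is hypothetical and none of
it is kernel API. It assumes a simple owner/other permission model and
treats cap_sys_admin as a stand-in for capable(CAP_SYS_ADMIN).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical permission bits for the model; not the kernel's MAY_* flags. */
#define MODEL_MAY_WRITE 0x2
#define MODEL_MAY_READ  0x4

/* Hypothetical stand-in for the credentials of the task calling mount. */
struct model_cred {
	unsigned int uid;
	bool cap_sys_admin;	/* models capable(CAP_SYS_ADMIN) */
};

/* Hypothetical stand-in for the inode of the block device node. */
struct model_bdev_inode {
	unsigned int owner_uid;
	unsigned int owner_perms;	/* rw bits granted to the owner */
	unsigned int other_perms;	/* rw bits granted to everyone else */
};

/* Return 0 if the caller holds every bit in mask on the device, else -EACCES. */
static int model_bdev_permission(const struct model_cred *cred,
				 const struct model_bdev_inode *inode,
				 unsigned int mask)
{
	unsigned int granted;

	assert(cred != NULL && inode != NULL);

	/* Today the mount path effectively relies on this alone. */
	if (cred->cap_sys_admin)
		return 0;

	granted = (cred->uid == inode->owner_uid) ?
		inode->owner_perms : inode->other_perms;

	return ((granted & mask) == mask) ? 0 : -EACCES;
}

/* A mount opens the device read-write, so it needs both bits. */
static int model_mount_bdev(const struct model_cred *cred,
			    const struct model_bdev_inode *inode)
{
	return model_bdev_permission(cred, inode,
				     MODEL_MAY_READ | MODEL_MAY_WRITE);
}
```

With a root-owned device of mode rw for the owner only, uid 1000
without the capability gets -EACCES from model_mount_bdev(), while the
same uid succeeds once it owns the device. Where such a check would
actually live in the vfs is a separate question.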

Eric


* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-28 20:40                                 ` Seth Forshee
@ 2015-07-30 16:18                                   ` Casey Schaufler
  2015-07-30 17:05                                     ` Eric W. Biederman
  0 siblings, 1 reply; 69+ messages in thread
From: Casey Schaufler @ 2015-07-30 16:18 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Stephen Smalley, Andy Lutomirski, Eric W. Biederman,
	Alexander Viro, Linux FS Devel, LSM List, SELinux-NSA,
	Serge Hallyn, linux-kernel

On 7/28/2015 1:40 PM, Seth Forshee wrote:
> On Wed, Jul 22, 2015 at 05:05:17PM -0700, Casey Schaufler wrote:
>>> This is what I currently think you want for user ns mounts:
>>>
>>>  1. smk_root and smk_default are assigned the label of the backing
>>>     device.
>>>  2. s_root is assigned the transmute property.
>>>  3. For existing files:
>>>     a. Files with the same label as the backing device are accessible.
>>>     b. Files with any other label are not accessible.
>> That's right. Accept correct data, reject anything that's not right.
>>
>>> If this is right, there are a couple lingering questions in my mind.
>>>
>>> First, what happens with files created in directories with the same
>>> label as the backing device but without the transmute property set? The
>>> inode for the new file will initially be labeled with smk_of_current(),
>>> but then during d_instantiate it will get smk_default and thus end up
>>> with the label we want. So that seems okay.
>> Yes.
>>
>>> The second is whether files with the SMACK64EXEC attribute are still a
>>> problem. It seems it is, for files with the same label as the backing
>>> store at least. I think we can simply skip the code that reads out this
>>> xattr and sets smk_task for user ns mounts, or else skip assigning the
>>> label to the new task in bprm_set_creds. The latter seems more
>>> consistent with the approach you've suggested for dealing with labels
>>> from disk.
>> Yes, I think that skipping the smk_fetch(XATTR_NAME_SMACKEXEC, ...) in
>> smack_d_instantiate for unprivileged mounts would do the trick.
>>
>>> So I guess all of that seems okay, though perhaps a bit restrictive
>>> given that the user who mounted the filesystem already has full access
>>> to the backing store.
>> In truth, there is no reason to expect that the "user" who did the
>> mount will ever have a Smack label that differs from the label of
>> the backing store. If what we've got here seems restrictive, it's
>> because you've got access from someone other than the "user".
>>
>>> Please let me know whether or not this matches up with what you are
>>> thinking, then I can proceed with the implementation.
>> My current mindset is that, if you're going to allow unprivileged
>> mounts of user defined backing stores, this is as safe as we can
>> make it.
> All right, I've got a patch which I think does this, and I've managed to
> do some testing to confirm that it behaves like I expect. How does this
> look?
>
> What's missing is getting the label from the block device inode; as
> Stephen discovered the inode that I thought we could get the label from
> turned out to be the wrong one. Afaict we would need a new hook in order
> to do that, so for now I'm using the label of the process calling
> mount.

That will be OK if the mount processing checks for write access to
the backing store. I haven't looked to see if it does. If it doesn't
the problems should be pretty obvious.

>
> ---
>
> diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
> index a143328f75eb..8e631a66b03c 100644
> --- a/security/smack/smack_lsm.c
> +++ b/security/smack/smack_lsm.c
> @@ -662,6 +662,8 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
>  		skp = smk_of_current();
>  		sp->smk_root = skp;
>  		sp->smk_default = skp;
> +		if (sb_in_userns(sb))
> +			transmute = 1;
>  	}
>  	/*
>  	 * Initialize the root inode.
> @@ -1023,6 +1025,12 @@ static int smack_inode_permission(struct inode *inode, int mask)
>  	if (mask == 0)
>  		return 0;
>  
> +	if (sb_in_userns(inode->i_sb)) {
> +		struct superblock_smack *sbsp = inode->i_sb->s_security;
> +		if (smk_of_inode(inode) != sbsp->smk_root)
> +			return -EACCES;
> +	}
> +
>  	/* May be droppable after audit */
>  	if (no_block)
>  		return -ECHILD;
> @@ -3220,14 +3228,16 @@ static void smack_d_instantiate(struct dentry *opt_dentry, struct inode *inode)
>  			if (rc >= 0)
>  				transflag = SMK_INODE_TRANSMUTE;
>  		}
> -		/*
> -		 * Don't let the exec or mmap label be "*" or "@".
> -		 */
> -		skp = smk_fetch(XATTR_NAME_SMACKEXEC, inode, dp);
> -		if (IS_ERR(skp) || skp == &smack_known_star ||
> -		    skp == &smack_known_web)
> -			skp = NULL;
> -		isp->smk_task = skp;
> +		if (!sb_in_userns(inode->i_sb)) {
> +			/*
> +			 * Don't let the exec or mmap label be "*" or "@".
> +			 */
> +			skp = smk_fetch(XATTR_NAME_SMACKEXEC, inode, dp);
> +			if (IS_ERR(skp) || skp == &smack_known_star ||
> +			    skp == &smack_known_web)
> +				skp = NULL;
> +			isp->smk_task = skp;
> +		}
>  
>  		skp = smk_fetch(XATTR_NAME_SMACKMMAP, inode, dp);
>  		if (IS_ERR(skp) || skp == &smack_known_star ||
>



* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-23  0:05                               ` Casey Schaufler
  2015-07-23  0:15                                 ` Eric W. Biederman
@ 2015-07-28 20:40                                 ` Seth Forshee
  2015-07-30 16:18                                   ` Casey Schaufler
  1 sibling, 1 reply; 69+ messages in thread
From: Seth Forshee @ 2015-07-28 20:40 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Stephen Smalley, Andy Lutomirski, Eric W. Biederman,
	Alexander Viro, Linux FS Devel, LSM List, SELinux-NSA,
	Serge Hallyn, linux-kernel

On Wed, Jul 22, 2015 at 05:05:17PM -0700, Casey Schaufler wrote:
> > This is what I currently think you want for user ns mounts:
> >
> >  1. smk_root and smk_default are assigned the label of the backing
> >     device.
> >  2. s_root is assigned the transmute property.
> >  3. For existing files:
> >     a. Files with the same label as the backing device are accessible.
> >     b. Files with any other label are not accessible.
> 
> That's right. Accept correct data, reject anything that's not right.
> 
> > If this is right, there are a couple lingering questions in my mind.
> >
> > First, what happens with files created in directories with the same
> > label as the backing device but without the transmute property set? The
> > inode for the new file will initially be labeled with smk_of_current(),
> > but then during d_instantiate it will get smk_default and thus end up
> > with the label we want. So that seems okay.
> 
> Yes.
> 
> > The second is whether files with the SMACK64EXEC attribute are still a
> > problem. It seems it is, for files with the same label as the backing
> > store at least. I think we can simply skip the code that reads out this
> > xattr and sets smk_task for user ns mounts, or else skip assigning the
> > label to the new task in bprm_set_creds. The latter seems more
> > consistent with the approach you've suggested for dealing with labels
> > from disk.
> 
> Yes, I think that skipping the smk_fetch(XATTR_NAME_SMACKEXEC, ...) in
> smack_d_instantiate for unprivileged mounts would do the trick.
> 
> > So I guess all of that seems okay, though perhaps a bit restrictive
> > given that the user who mounted the filesystem already has full access
> > to the backing store.
> 
> In truth, there is no reason to expect that the "user" who did the
> mount will ever have a Smack label that differs from the label of
> the backing store. If what we've got here seems restrictive, it's
> because you've got access from someone other than the "user".
> 
> > Please let me know whether or not this matches up with what you are
> > thinking, then I can proceed with the implementation.
> 
> My current mindset is that, if you're going to allow unprivileged
> mounts of user defined backing stores, this is as safe as we can
> make it.

All right, I've got a patch which I think does this, and I've managed to
do some testing to confirm that it behaves like I expect. How does this
look?

What's missing is getting the label from the block device inode; as
Stephen discovered the inode that I thought we could get the label from
turned out to be the wrong one. Afaict we would need a new hook in order
to do that, so for now I'm using the label of the process calling
mount.
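
For reference, the access rule from point 3 above can be modelled in
userspace roughly as follows. This is a sketch under assumptions, not
the patch itself: model_sb, model_inode and model_inode_permission are
hypothetical names, and labels are compared as strings here, whereas
the Smack code compares struct smack_known pointers.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a superblock and its Smack root label. */
struct model_sb {
	bool in_userns;		/* models sb_in_userns(sb) */
	const char *smk_root;	/* label taken from the mounter or backing store */
};

/* Hypothetical stand-in for an inode carrying a Smack label. */
struct model_inode {
	const struct model_sb *sb;
	const char *label;	/* models smk_of_inode(inode) */
};

/*
 * Models only the early part of the permission hook: for user ns mounts,
 * any inode whose label differs from the superblock root label is rejected
 * before the normal rule lookup. The rule lookup itself is not modelled,
 * so every other case simply returns 0 here.
 */
static int model_inode_permission(const struct model_inode *inode, int mask)
{
	assert(inode != NULL && inode->sb != NULL);

	/* No permission to check, existence test only. */
	if (mask == 0)
		return 0;

	if (inode->sb->in_userns &&
	    strcmp(inode->label, inode->sb->smk_root) != 0)
		return -EACCES;

	return 0;
}
```

So on a user ns mount whose root label is Rubble, a Rubble file passes
this stage and a Slate file is refused, while a mount outside any user
ns is unaffected.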

---

diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index a143328f75eb..8e631a66b03c 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -662,6 +662,8 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
 		skp = smk_of_current();
 		sp->smk_root = skp;
 		sp->smk_default = skp;
+		if (sb_in_userns(sb))
+			transmute = 1;
 	}
 	/*
 	 * Initialize the root inode.
@@ -1023,6 +1025,12 @@ static int smack_inode_permission(struct inode *inode, int mask)
 	if (mask == 0)
 		return 0;
 
+	if (sb_in_userns(inode->i_sb)) {
+		struct superblock_smack *sbsp = inode->i_sb->s_security;
+		if (smk_of_inode(inode) != sbsp->smk_root)
+			return -EACCES;
+	}
+
 	/* May be droppable after audit */
 	if (no_block)
 		return -ECHILD;
@@ -3220,14 +3228,16 @@ static void smack_d_instantiate(struct dentry *opt_dentry, struct inode *inode)
 			if (rc >= 0)
 				transflag = SMK_INODE_TRANSMUTE;
 		}
-		/*
-		 * Don't let the exec or mmap label be "*" or "@".
-		 */
-		skp = smk_fetch(XATTR_NAME_SMACKEXEC, inode, dp);
-		if (IS_ERR(skp) || skp == &smack_known_star ||
-		    skp == &smack_known_web)
-			skp = NULL;
-		isp->smk_task = skp;
+		if (!sb_in_userns(inode->i_sb)) {
+			/*
+			 * Don't let the exec or mmap label be "*" or "@".
+			 */
+			skp = smk_fetch(XATTR_NAME_SMACKEXEC, inode, dp);
+			if (IS_ERR(skp) || skp == &smack_known_star ||
+			    skp == &smack_known_web)
+				skp = NULL;
+			isp->smk_task = skp;
+		}
 
 		skp = smk_fetch(XATTR_NAME_SMACKMMAP, inode, dp);
 		if (IS_ERR(skp) || skp == &smack_known_star ||


* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-23 13:19                               ` J. Bruce Fields
@ 2015-07-23 23:48                                 ` Dave Chinner
  0 siblings, 0 replies; 69+ messages in thread
From: Dave Chinner @ 2015-07-23 23:48 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Austin S Hemmelgarn, Eric W. Biederman, Casey Schaufler,
	Andy Lutomirski, Seth Forshee, Alexander Viro, Linux FS Devel,
	LSM List, SELinux-NSA, Serge Hallyn, linux-kernel

On Thu, Jul 23, 2015 at 09:19:28AM -0400, J. Bruce Fields wrote:
> On Thu, Jul 23, 2015 at 11:51:35AM +1000, Dave Chinner wrote:
> > On Wed, Jul 22, 2015 at 01:41:00PM -0400, J. Bruce Fields wrote:
> > > On Wed, Jul 22, 2015 at 12:52:58PM -0400, Austin S Hemmelgarn wrote:
> > > > On 2015-07-22 10:09, J. Bruce Fields wrote:
> > > > >On Wed, Jul 22, 2015 at 05:56:40PM +1000, Dave Chinner wrote:
> > > > >>On Tue, Jul 21, 2015 at 01:37:21PM -0400, J. Bruce Fields wrote:
> > > > >>>On Fri, Jul 17, 2015 at 12:47:35PM +1000, Dave Chinner wrote:
> > > > >>>So, for example, a screwed up on-disk directory structure shouldn't
> > > > >>>result in creating a cycle in the dcache and then deadlocking.
> > > > >>
> > > > >>Therein lies the problem: how do you detect such structural defects
> > > > >>without doing a full structure validation?
> > > > >
> > > > >You can prevent cycles in a graph if you can prevent adding an edge
> > > > >which would be part of a cycle.
> > > > >
> > > > Except if the user can write to the filesystem's backing storage (be
> > > > it a device or a file), and has sufficient knowledge of the on-disk
> > > > structures, they can create all the cycles they want in the
> > > > metadata. So unless the kernel builds the graph internally by
> > > > parsing the metadata _and_ has some way to detect that the on-disk
> > > > metadata has hit a cycle (which may not just involve 2 items),
> > > 
> > > Understood.  Again, see the d_ancestor call in d_splice_alias, this is
> > > exactly what it checks for.
> > 
> > But that only addresses one type of loop in one specific metadata
> > structure.
> 
> Yep, agreed!
> 
> > There's plenty of other ways you could construct metadata
> > loops that are essentially undetected and result in either deadlock
> > or livelock within the filesystem code itself. e.g. just make btree
> > sibling pointers loop over a range of entries that have the same
> > index key (e.g. free space extents of the same size). If allocation
> > then falls into this loop, the kernel will just spin searching the
> > same blocks for something it will never find.  Such resource
> > consumption attacks are trivial to construct but extremely difficult
> > to detect because they exploit normal behaviour of the structure and
> > algorithms by mangling trusted pointers.
> 
> Interesting example, thanks!  I doubt this particular example would be
> *that* hard to detect?

Yes, it can be detected, but it's not as easy as it sounds because
of abstractions between tree walking and record parsing.

>  But understood that there may be lots of others.

Yeah, that's just one of many, many ways I can think of modifying
on disk structures to screw up the kernel.
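
As a small illustration of the sibling-pointer case only: taken in
isolation, a loop in a singly linked chain can be found with Floyd's
tortoise-and-hare walk, sketched below in userspace C. The names
(model_block, model_sibling_chain_loops) are hypothetical and this is
not how any particular filesystem does it. It also sidesteps the hard
part mentioned above, which is getting such a check through the layers
that separate tree walking from record parsing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a btree block with an untrusted sibling pointer. */
struct model_block {
	struct model_block *right_sibling;	/* as read from disk */
};

/*
 * Floyd's cycle detection: advance one cursor by one block and another by
 * two. If the chain loops, the fast cursor eventually lands on the slow
 * one; if it ends, the fast cursor reaches NULL first.
 */
static bool model_sibling_chain_loops(const struct model_block *start)
{
	const struct model_block *slow = start;
	const struct model_block *fast = start;

	while (fast != NULL && fast->right_sibling != NULL) {
		slow = slow->right_sibling;
		fast = fast->right_sibling->right_sibling;
		if (slow == fast)
			return true;
	}
	return false;
}
```

A chain a -> b -> c that ends in NULL reports no loop; pointing c back
at b makes the same call report one.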

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-23  0:15                                 ` Eric W. Biederman
  2015-07-23  5:15                                   ` Seth Forshee
@ 2015-07-23 21:48                                   ` Casey Schaufler
  1 sibling, 0 replies; 69+ messages in thread
From: Casey Schaufler @ 2015-07-23 21:48 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Seth Forshee, Andy Lutomirski, Alexander Viro, Linux FS Devel,
	LSM List, SELinux-NSA, Serge Hallyn, linux-kernel,
	Casey Schaufler

On 7/22/2015 5:15 PM, Eric W. Biederman wrote:
> Casey Schaufler <casey@schaufler-ca.com> writes:
>
>> On 7/22/2015 12:32 PM, Seth Forshee wrote:
>>> On Wed, Jul 22, 2015 at 11:10:46AM -0700, Casey Schaufler wrote:
>>>> On 7/22/2015 8:56 AM, Seth Forshee wrote:
>>>>> On Tue, Jul 21, 2015 at 06:52:31PM -0700, Casey Schaufler wrote:
>>>>>> On 7/21/2015 1:35 PM, Seth Forshee wrote:
>>>>>>> On Thu, Jul 16, 2015 at 05:59:22PM -0700, Andy Lutomirski wrote:
>>>>>>>> On Thu, Jul 16, 2015 at 5:45 PM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>>>>>>> On 7/16/2015 4:29 PM, Andy Lutomirski wrote:
>>>>>>>>>> I really don't see the benefit of making up extra rules that apply to
>>>>>>>>>> users outside a userns who try to access specifically a filesystem
>>>>>>>>>> with backing store.  They wouldn't make sense for filesystems without
>>>>>>>>>> backing store.
>>>>>>>>> Sure it would. For Smack, it would be the label a file would be
>>>>>>>>> created with, which would be the label of the process creating
>>>>>>>>> the memory based filesystem. For SELinux the rules are a
>>>>>>>>> touch more sophisticated, but I'm sure that Paul or Stephen could
>>>>>>>>> come up with how to determine it.
>>>>>>>>>
>>>>>>>>> The point, looping all the way back to the beginning, where we
>>>>>>>>> were talking about just ignoring the labels on the filesystem,
>>>>>>>>> is that if you use the same Smack label on the files in the
>>>>>>>>> filesystem as the backing store file has, we'll all be happy.
>>>>>>>>> If that label isn't something user can write to, he won't be
>>>>>>>>> able to write to the mounted objects, either. If there is no
>>>>>>>>> backing store then use the label of the process creating the
>>>>>>>>> filesystem, which will be the user, which will mean everything
>>>>>>>>> will work hunky dory.
>>>>>>>>>
>>>>>>>>> Yes, there's work involved, but I doubt there's a lot. Getting
>>>>>>>>> the label from the backing store or the creating process is
>>>>>>>>> simple enough.
>>>>>>>>>
>>>>>>> So something like the diff below (untested)?
>>>>>> I think that this is close, and quite good for someone
>>>>>> who isn't very familiar with Smack. It's definitely headed
>>>>>> in the right direction.
>>>>>>
>>>>>>> All I'm really doing is setting smk_default as you describe above and
>>>>>>> then using it instead of smk_of_current() in
>>>>>>> smack_inode_alloc_security() and instead of the label from the disk in
>>>>>>> smack_d_instantiate().
>>>>>> Let's say your backing store is a file labeled Rubble.
>>>>>>
>>>>>> mount -o smackfsroot=Rubble,smackfsdef=Rubble ...
>>>>>>
>>>>>> It is completely reasonable for a process labeled Flintstone to
>>>>>> have rwxa access to a file labeled Rubble.
>>>>>>
>>>>>> Smack rule: Flintstone Rubble rwxa
>>>>>>
>>>>>> In the case of writing to an existing Rubble file, what you
>>>>>> have looks fine. What's not so great is that if the Flintstone
>>>>>> process creates a file, it should be labeled Flintstone. Your
>>>>>> use of smk_default is going to violate the principle of least
>>>>>> astonishment, and break the Smack policy as well.
>>>>>>
>>>>>> Let's make a minor change. Instead of using smackfsroot let's
>>>>>> use smackfstransmute and a slightly different access rule:
>>>>>>
>>>>>> mount -o smackfstransmute=Rubble,smackfsdef=Rubble ...
>>>>>>
>>>>>> Smack rule: Flintstone Rubble rwxat
>>>>>>
>>>>>> Now the only change we have to make to the Smack code is
>>>>>> that we don't want to create any files unless either the
>>>>>> process is labeled Rubble or the rule allowing the creation
>>>>>> has the "t" for transmute access. That should ensure that
>>>>>> everything is labeled Rubble. If it isn't, someone has mucked
>>>>>> with the metadata in a detectable way.
>>>>> All right, that kind of makes sense, but I'm still missing some pieces.
>>>>> Questions follow.
>>>>>
>>>>>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>>>>>>> index 32f598db0b0d..4597420ab933 100644
>>>>>>> --- a/include/linux/fs.h
>>>>>>> +++ b/include/linux/fs.h
>>>>>>> @@ -1486,6 +1486,10 @@ static inline void sb_start_intwrite(struct super_block *sb)
>>>>>>>  	__sb_start_write(sb, SB_FREEZE_FS, true);
>>>>>>>  }
>>>>>>>  
>>>>>>> +static inline bool sb_in_userns(struct super_block *sb)
>>>>>>> +{
>>>>>>> +	return sb->s_user_ns != &init_user_ns;
>>>>>>> +}
>>>>>>>  
>>>>>>>  extern bool inode_owner_or_capable(const struct inode *inode);
>>>>>>>  
>>>>>>> diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
>>>>>>> index a143328f75eb..591fd19294e7 100644
>>>>>>> --- a/security/smack/smack_lsm.c
>>>>>>> +++ b/security/smack/smack_lsm.c
>>>>>>> @@ -255,6 +255,10 @@ static struct smack_known *smk_fetch(const char *name, struct inode *ip,
>>>>>>>  	char *buffer;
>>>>>>>  	struct smack_known *skp = NULL;
>>>>>>>  
>>>>>>> +	/* Should never fetch xattrs from untrusted mounts */
>>>>>>> +	if (WARN_ON(sb_in_userns(ip->i_sb)))
>>>>>>> +		return ERR_PTR(-EPERM);
>>>>>>> +
>>>>>> Go ahead and fetch it, we'll check to make sure it's viable later.
>>>>>>
>>>>>>>  	if (ip->i_op->getxattr == NULL)
>>>>>>>  		return ERR_PTR(-EOPNOTSUPP);
>>>>>>>  
>>>>>>> @@ -656,10 +660,14 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
>>>>>>>  		 */
>>>>>>>  		if (specified)
>>>>>>>  			return -EPERM;
>>>>>>> +
>>>>>>>  		/*
>>>>>>> -		 * Unprivileged mounts get root and default from the caller.
>>>>>>> +		 * User namespace mounts get root and default from the backing
>>>>>>> +		 * store, if there is one. Other unprivileged mounts get them
>>>>>>> +		 * from the caller.
>>>>>>>  		 */
>>>>>>> -		skp = smk_of_current();
>>>>>>> +		skp = (sb_in_userns(sb) && sb->s_bdev) ?
>>>>>>> +			smk_of_inode(sb->s_bdev->bd_inode) : smk_of_current();
>>>>>>>  		sp->smk_root = skp;
>>>>>>>  		sp->smk_default = skp;
>>>>>> 			sp->smk_flags |= SMK_INODE_TRANSMUTE;
>>>>> I assume that you meant skp and not sp here.
>>>> Actually, neither is correct. You want to set SMK_INODE_TRANSMUTE
>>>> in the smk_flags field of the root inode. That's easy:
>>>>
>>>> 			transmute = 1;
>>>>
>>>> and the code after "Initialize the root inode" will take care of it.
>>> Yeah, that's what I've actually done.
>>>
>>>>>>>  	}
>>>>>>> @@ -792,7 +800,12 @@ static int smack_bprm_secureexec(struct linux_binprm *bprm)
>>>>>>>   */
>>>>>>>  static int smack_inode_alloc_security(struct inode *inode)
>>>>>>>  {
>>>>>>> -	struct smack_known *skp = smk_of_current();
>>>>>>> +	struct smack_known *skp;
>>>>>>> +
>>>>>>> +	if (sb_in_userns(inode->i_sb))
>>>>>>> +		skp = ((struct superblock_smack *)(inode->i_sb->s_security))->smk_default;
>>>>>>> +	else
>>>>>>> +		skp = smk_of_current();
>>>>>> This should be left alone.
>>>>>> smack_inode_init_security is where you could disallow access that doesn't
>>>>>> legitimately result in a Rubble label on the file. It's something like
>>>>>>
>>>>>> 	... after the call may = smk_access_entry(...)
>>>>>> 	if (sb_in_userns(inode->i_sb))
>>>>>> 		if (skp != dsp && (may & MAY_TRANSMUTE) == 0)
>>>>>> 			return -EACCES; 
>>>>> I'm not getting how this covers all cases.
>>>>>
>>>>> So we've set the transmute flag on the root inode. Files and directories
>>>>> created in the root directory get the same label, and directories also
>>>>> get the transmute attribute. That's all fine.
>>>>>
>>>>> What about an existing directory in the filesystem that already has a
>>>>> Slate label? I'm not getting what happens with this directory, or for
>>>>> new files created in this directory, which also relates to my other
>>>>> questions below.
>>>>>
>>>>> Also an aside - smk_access_entry looks weird. may is initialized to
>>>>> -ENOENT, and then rule_list is searched for a rule which matches the
>>>>> object and subject labels. Presumably it's possible that no rule could
>>>>> be found, otherwise the prior initialization of may is pointless. If
>>>>> this happens the following code treats it as though it always contains
>>>>> access flags even though it might contain -ENOENT. Nothing bad actually
>>>>> happens with a two's complement representation of -ENOENT since it will
>>>>> just set a bit that's already set, but it still seems like it should
>>>>> have a may > 0 condition, for clarity if for no other reason.
>>>> My suggested code is just wrong. I wasn't looking at the whole code,
>>>> only the patch, and got myself confused. Apologies.
>>>>
>>>> If we want to go straight for the jugular how about this? I'm assuming
>>>> that inode->i_sb->s_bdev->bd_inode is the inode of the backing store.
>>> Yes.
>>>
>>>> static int smack_inode_permission(struct inode *inode, int mask)
>>>> {
>>>> 	struct smk_audit_info ad;
>>>> 	int no_block = mask & MAY_NOT_BLOCK;
>>>> 	int rc;
>>>>
>>>> 	mask &= (MAY_READ|MAY_WRITE|MAY_EXEC|MAY_APPEND);
>>>> 	/*
>>>> 	 * No permission to check. Existence test. Yup, it's there.
>>>> 	 */
>>>> 	if (mask == 0)
>>>> 		return 0;
>>>>
>>>> +	if (sb_in_userns(inode->i_sb) &&
>>>> +	    smk_of_inode(inode) != smk_of_inode(inode->i_sb->s_bdev->bd_inode))
>>>> +		return -EACCES;
>>>> +
>>>> 	/* May be droppable after audit */
>>>> 	if (no_block)
>>>> 		return -ECHILD;
>>>> 	smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_INODE);
>>>> 	smk_ad_setfield_u_fs_inode(&ad, inode);
>>>> 	rc = smk_curacc(smk_of_inode(inode), mask, &ad);
>>>> 	rc = smk_bu_inode(inode, mask, rc);
>>>> 	return rc;
>>>> }
>>> Hmm, okay. I think I've been a little confused all this time about how
>>> you want to handle these unprivileged mounts.
>> Not your problem. I'm not the most consistent of reviewers.
>>
>>> Originally I thought you wanted all objects in the filesystem to get the
>>> same label as the backing store. That's what I tried to implement
>>> originally, i.e. smk_root=smk_default=smk_of_inode(...->bd_inode), then
>>> assign every object (new and existing) smk_default and completely ignore
>>> the labels on disk.
>> I want everything to have the label of the backing store, but
>> I don't want to ignore it if it somehow got something else. Because
>> the only legitimate label for this example is Rubble, I want to
>> reject anything else that appears. If someone builds a filesystem
>> by hand with Slate labels I want it treated "safely".
>>
>>> This is what I currently think you want for user ns mounts:
>>>
>>>  1. smk_root and smk_default are assigned the label of the backing
>>>     device.
>>>  2. s_root is assigned the transmute property.
>>>  3. For existing files:
>>>     a. Files with the same label as the backing device are accessible.
>>>     b. Files with any other label are not accessible.
>> That's right. Accept correct data, reject anything that's not right.
>>
>>> If this is right, there are a couple lingering questions in my mind.
>>>
>>> First, what happens with files created in directories with the same
>>> label as the backing device but without the transmute property set? The
>>> inode for the new file will initially be labeled with smk_of_current(),
>>> but then during d_instantiate it will get smk_default and thus end up
>>> with the label we want. So that seems okay.
>> Yes.
>>
>>> The second is whether files with the SMACK64EXEC attribute are still a
>>> problem. It seems it is, for files with the same label as the backing
>>> store at least. I think we can simply skip the code that reads out this
>>> xattr and sets smk_task for user ns mounts, or else skip assigning the
>>> label to the new task in bprm_set_creds. The latter seems more
>>> consistent with the approach you've suggested for dealing with labels
>>> from disk.
>> Yes, I think that skipping the smk_fetch(XATTR_NAME_SMACKEXEC, ...) in
>> smack_d_instantiate for unprivileged mounts would do the trick.
>>
>>> So I guess all of that seems okay, though perhaps a bit restrictive
>>> given that the user who mounted the filesystem already has full access
>>> to the backing store.
>> In truth, there is no reason to expect that the "user" who did the
>> mount will ever have a Smack label that differs from the label of
>> the backing store. If what we've got here seems restrictive, it's
>> because you've got access from someone other than the "user".
>>
>>> Please let me know whether or not this matches up with what you are
>>> thinking, then I can proceed with the implementation.
>> My current mindset is that, if you're going to allow unprivileged
>> mounts of user defined backing stores, this is as safe as we can
>> make it.
> That actually sounds very reasonable to me.  It is essentially what we
> do with uid and gids already.  I presume the smack namespace support,
> when integrated with all of this, would allow a set of labels to be
> set.
>
> Have I missed a part of the conversation where you talk about filesystems that
> don't have support for storing labels?  Filesystems like vfat, isofs,
> etc.

They are easier. Set smackfsroot=Rubble,smackfsdef=Rubble and all objects
there will get labeled Rubble. Processes with different labels that can
write there will end up creating Rubble objects. For privileged mounts you
can set the values at will. For unprivileged mounts, you should take the
label values from the backing store.
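
A rough sketch of that rule as a user-space C model, in case it helps. This is
not the Smack code: struct mount_request, struct sb_labels and
choose_sb_labels() are made-up names, and the "_" fallback for a privileged
mount without options is an assumption made only for this model. The refusal
of explicit options on an unprivileged mount mirrors the "if (specified)
return -EPERM" check visible in the quoted diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; these are not kernel types. */
struct mount_request {
	bool privileged;           /* true for a privileged mount */
	const char *opt_root;      /* smackfsroot= value, NULL if absent */
	const char *opt_def;       /* smackfsdef= value, NULL if absent */
	const char *backing_label; /* label of the backing store */
};

struct sb_labels {
	const char *root;          /* label of the root of the filesystem */
	const char *def;           /* label given to unlabeled objects */
};

/*
 * Returns 0 and fills *out on success, or -1 when an unprivileged
 * mount tries to pick its own labels.
 */
static int choose_sb_labels(const struct mount_request *req,
			    struct sb_labels *out)
{
	assert(req != NULL && out != NULL);

	if (!req->privileged) {
		/* Unprivileged mounts may not specify labels at all. */
		if (req->opt_root != NULL || req->opt_def != NULL)
			return -1;
		/* Both values come from the backing store. */
		out->root = req->backing_label;
		out->def = req->backing_label;
		return 0;
	}

	/* Privileged: set the values at will; "_" is an assumed fallback. */
	out->root = req->opt_root != NULL ? req->opt_root : "_";
	out->def = req->opt_def != NULL ? req->opt_def : "_";
	return 0;
}
```

With a backing store labeled Rubble and no options, an unprivileged request
ends up with root and default both set to Rubble, which is the vfat/isofs case
described above.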

>
> Eric
>



* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-23  1:51                             ` Dave Chinner
@ 2015-07-23 13:19                               ` J. Bruce Fields
  2015-07-23 23:48                                 ` Dave Chinner
  0 siblings, 1 reply; 69+ messages in thread
From: J. Bruce Fields @ 2015-07-23 13:19 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Austin S Hemmelgarn, Eric W. Biederman, Casey Schaufler,
	Andy Lutomirski, Seth Forshee, Alexander Viro, Linux FS Devel,
	LSM List, SELinux-NSA, Serge Hallyn, linux-kernel

On Thu, Jul 23, 2015 at 11:51:35AM +1000, Dave Chinner wrote:
> On Wed, Jul 22, 2015 at 01:41:00PM -0400, J. Bruce Fields wrote:
> > On Wed, Jul 22, 2015 at 12:52:58PM -0400, Austin S Hemmelgarn wrote:
> > > On 2015-07-22 10:09, J. Bruce Fields wrote:
> > > >On Wed, Jul 22, 2015 at 05:56:40PM +1000, Dave Chinner wrote:
> > > >>On Tue, Jul 21, 2015 at 01:37:21PM -0400, J. Bruce Fields wrote:
> > > >>>On Fri, Jul 17, 2015 at 12:47:35PM +1000, Dave Chinner wrote:
> > > >>>So, for example, a screwed up on-disk directory structure shouldn't
> > > >>>result in creating a cycle in the dcache and then deadlocking.
> > > >>
> > > >>Therein lies the problem: how do you detect such structural defects
> > > >>without doing a full structure validation?
> > > >
> > > >You can prevent cycles in a graph if you can prevent adding an edge
> > > >which would be part of a cycle.
> > > >
> > > Except if the user can write to the filesystem's backing storage (be
> > > it a device or a file), and has sufficient knowledge of the on-disk
> > > structures, they can create all the cycles they want in the
> > > metadata. So unless the kernel builds the graph internally by
> > > parsing the metadata _and_ has some way to detect that the on-disk
> > > metadata has hit a cycle (which may not just involve 2 items),
> > 
> > Understood.  Again, see the d_ancestor call in d_splice_alias, this is
> > exactly what it checks for.
> 
> But that only addresses one type of loop in one specific metadata
> structure.

Yep, agreed!

> There's plenty of other ways you could construct metadata
> loops that are essentially undetected and result in either deadlock
> or livelock within the filesystem code itself. e.g. just make btree
> sibling pointers loop over a range of entries that have the same
> index key (e.g. free space extents of the same size). If allocation
> then falls into this loop, the kernel will just spin searching the
> same blocks for something it will never find.  Such resource
> consumption attacks are trivial to construct but extremely difficult
> to detect because they exploit normal behaviour of the structure and
> algorithms by mangling trusted pointers.

Interesting example, thanks!  I doubt this particular example would be
*that* hard to detect?  But understood that there may be lots of others.
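
For what it's worth, here is a minimal sketch of one way a looping sibling
chain could be caught, using Floyd's tortoise-and-hare walk over an in-memory
array of sibling pointers. It is a user-space model under stated assumptions:
right_sib[], NO_SIBLING and check_sibling_chain() are hypothetical names, no
real filesystem lays its btree out this way, and a real check would have to
read blocks from disk and cope with many other kinds of corruption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_SIBLING UINT32_MAX	/* marks the last block on a level */

enum chain_status { CHAIN_OK, CHAIN_LOOP, CHAIN_BAD_PTR };

/*
 * right_sib[i] holds the block number of block i's right sibling.
 * Walk the chain from 'start' with a slow and a fast cursor; if the
 * chain loops, the fast cursor eventually lands on the slow one.
 */
static enum chain_status check_sibling_chain(const uint32_t *right_sib,
					     size_t nblocks, uint32_t start)
{
	uint32_t slow = start, fast = start;

	assert(right_sib != NULL);

	while (fast != NO_SIBLING) {
		if (fast >= nblocks)
			return CHAIN_BAD_PTR;	/* pointer outside the device */
		fast = right_sib[fast];		/* first fast step */
		if (fast == NO_SIBLING)
			break;
		if (fast >= nblocks)
			return CHAIN_BAD_PTR;
		fast = right_sib[fast];		/* second fast step */
		/* slow only visits blocks the fast cursor already validated */
		slow = right_sib[slow];
		if (slow == fast)
			return CHAIN_LOOP;
	}
	return CHAIN_OK;
}
```

This only bounds one traversal of one structure. It does not say anything
about the other loops that can be built the same way.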

--b.

> 
> Of course, this sort of attack will eventually deadlock the
> filesystem because it will back up on locks held by the livelocked
> search. Once the filesystem is deadlocked, it can then cause sync()
> calls to get stuck on the filesystem. And because sync() is a global
> operation, a deadlocked filesystem in one container could cause sync
> to hang in a completely unrelated container....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-23  0:15                                 ` Eric W. Biederman
@ 2015-07-23  5:15                                   ` Seth Forshee
  2015-07-23 21:48                                   ` Casey Schaufler
  1 sibling, 0 replies; 69+ messages in thread
From: Seth Forshee @ 2015-07-23  5:15 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Casey Schaufler, Andy Lutomirski, Alexander Viro, Linux FS Devel,
	LSM List, SELinux-NSA, Serge Hallyn, linux-kernel

On Wed, Jul 22, 2015 at 07:15:19PM -0500, Eric W. Biederman wrote:
> Casey Schaufler <casey@schaufler-ca.com> writes:
> 
> > On 7/22/2015 12:32 PM, Seth Forshee wrote:
> >> On Wed, Jul 22, 2015 at 11:10:46AM -0700, Casey Schaufler wrote:
> >>> On 7/22/2015 8:56 AM, Seth Forshee wrote:
> >>>> On Tue, Jul 21, 2015 at 06:52:31PM -0700, Casey Schaufler wrote:
> >>>>> On 7/21/2015 1:35 PM, Seth Forshee wrote:
> >>>>>> On Thu, Jul 16, 2015 at 05:59:22PM -0700, Andy Lutomirski wrote:
> >>>>>>> On Thu, Jul 16, 2015 at 5:45 PM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> >>>>>>>> On 7/16/2015 4:29 PM, Andy Lutomirski wrote:
> >>>>>>>>> I really don't see the benefit of making up extra rules that apply to
> >>>>>>>>> users outside a userns who try to access specifically a filesystem
> >>>>>>>>> with backing store.  They wouldn't make sense for filesystems without
> >>>>>>>>> backing store.
> >>>>>>>> Sure it would. For Smack, it would be the label a file would be
> >>>>>>>> created with, which would be the label of the process creating
> >>>>>>>> the memory based filesystem. For SELinux the rules are a
> >>>>>>>> touch more sophisticated, but I'm sure that Paul or Stephen could
> >>>>>>>> come up with how to determine it.
> >>>>>>>>
> >>>>>>>> The point, looping all the way back to the beginning, where we
> >>>>>>>> were talking about just ignoring the labels on the filesystem,
> >>>>>>>> is that if you use the same Smack label on the files in the
> >>>>>>>> filesystem as the backing store file has, we'll all be happy.
> >>>>>>>> If that label isn't something user can write to, he won't be
> >>>>>>>> able to write to the mounted objects, either. If there is no
> >>>>>>>> backing store then use the label of the process creating the
> >>>>>>>> filesystem, which will be the user, which will mean everything
> >>>>>>>> will work hunky dory.
> >>>>>>>>
> >>>>>>>> Yes, there's work involved, but I doubt there's a lot. Getting
> >>>>>>>> the label from the backing store or the creating process is
> >>>>>>>> simple enough.
> >>>>>>>>
> >>>>>> So something like the diff below (untested)?
> >>>>> I think that this is close, and quite good for someone
> >>>>> who isn't very familiar with Smack. It's definitely headed
> >>>>> in the right direction.
> >>>>>
> >>>>>> All I'm really doing is setting smk_default as you describe above and
> >>>>>> then using it instead of smk_of_current() in
> >>>>>> smack_inode_alloc_security() and instead of the label from the disk in
> >>>>>> smack_d_instantiate().
> >>>>> Let's say your backing store is a file labeled Rubble.
> >>>>>
> >>>>> mount -o smackfsroot=Rubble,smackfsdef=Rubble ...
> >>>>>
> >>>>> It is completely reasonable for a process labeled Flintstone to
> >>>>> have rwxa access to a file labeled Rubble.
> >>>>>
> >>>>> Smack rule: Flintstone Rubble rwxa
> >>>>>
> >>>>> In the case of writing to an existing Rubble file, what you
> >>>>> have looks fine. What's not so great is that if the Flintstone
> >>>>> process creates a file, it should be labeled Flintstone. Your
> >>>>> use of smk_default is going to violate the principle
> >>>>> of least astonishment, and break the Smack policy as well.
> >>>>>
> >>>>> Let's make a minor change. Instead of using smackfsroot let's
> >>>>> use smackfstransmute and a slightly different access rule:
> >>>>>
> >>>>> mount -o smackfstransmute=Rubble,smackfsdef=Rubble ...
> >>>>>
> >>>>> Smack rule: Flintstone Rubble rwxat
> >>>>>
> >>>>> Now the only change we have to make to the Smack code is
> >>>>> that we don't want to create any files unless either the
> >>>>> process is labeled Rubble or the rule allowing the creation
> >>>>> has the "t" for transmute access. That should ensure that
> >>>>> everything is labeled Rubble. If it isn't, someone has mucked
> >>>>> with the metadata in a detectable way.
> >>>> All right, that kind of makes sense, but I'm still missing some pieces.
> >>>> Questions follow.
> >>>>
> >>>>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
> >>>>>> index 32f598db0b0d..4597420ab933 100644
> >>>>>> --- a/include/linux/fs.h
> >>>>>> +++ b/include/linux/fs.h
> >>>>>> @@ -1486,6 +1486,10 @@ static inline void sb_start_intwrite(struct super_block *sb)
> >>>>>>  	__sb_start_write(sb, SB_FREEZE_FS, true);
> >>>>>>  }
> >>>>>>  
> >>>>>> +static inline bool sb_in_userns(struct super_block *sb)
> >>>>>> +{
> >>>>>> +	return sb->s_user_ns != &init_user_ns;
> >>>>>> +}
> >>>>>>  
> >>>>>>  extern bool inode_owner_or_capable(const struct inode *inode);
> >>>>>>  
> >>>>>> diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
> >>>>>> index a143328f75eb..591fd19294e7 100644
> >>>>>> --- a/security/smack/smack_lsm.c
> >>>>>> +++ b/security/smack/smack_lsm.c
> >>>>>> @@ -255,6 +255,10 @@ static struct smack_known *smk_fetch(const char *name, struct inode *ip,
> >>>>>>  	char *buffer;
> >>>>>>  	struct smack_known *skp = NULL;
> >>>>>>  
> >>>>>> +	/* Should never fetch xattrs from untrusted mounts */
> >>>>>> +	if (WARN_ON(sb_in_userns(ip->i_sb)))
> >>>>>> +		return ERR_PTR(-EPERM);
> >>>>>> +
> >>>>> Go ahead and fetch it, we'll check to make sure it's viable later.
> >>>>>
> >>>>>>  	if (ip->i_op->getxattr == NULL)
> >>>>>>  		return ERR_PTR(-EOPNOTSUPP);
> >>>>>>  
> >>>>>> @@ -656,10 +660,14 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
> >>>>>>  		 */
> >>>>>>  		if (specified)
> >>>>>>  			return -EPERM;
> >>>>>> +
> >>>>>>  		/*
> >>>>>> -		 * Unprivileged mounts get root and default from the caller.
> >>>>>> +		 * User namespace mounts get root and default from the backing
> >>>>>> +		 * store, if there is one. Other unprivileged mounts get them
> >>>>>> +		 * from the caller.
> >>>>>>  		 */
> >>>>>> -		skp = smk_of_current();
> >>>>>> +		skp = (sb_in_userns(sb) && sb->s_bdev) ?
> >>>>>> +			smk_of_inode(sb->s_bdev->bd_inode) : smk_of_current();
> >>>>>>  		sp->smk_root = skp;
> >>>>>>  		sp->smk_default = skp;
> >>>>> 			sp->smk_flags |= SMK_INODE_TRANSMUTE;
> >>>> I assume that you meant skp and not sp here.
> >>> Actually, neither is correct. You want to set SMK_INODE_TRANSMUTE
> >>> in the smk_flags field of the root inode. That's easy:
> >>>
> >>> 			transmute = 1;
> >>>
> >>> and the code after "Initialize the root inode" will take care of it.
> >> Yeah, that's what I've actually done.
> >>
> >>>>>>  	}
> >>>>>> @@ -792,7 +800,12 @@ static int smack_bprm_secureexec(struct linux_binprm *bprm)
> >>>>>>   */
> >>>>>>  static int smack_inode_alloc_security(struct inode *inode)
> >>>>>>  {
> >>>>>> -	struct smack_known *skp = smk_of_current();
> >>>>>> +	struct smack_known *skp;
> >>>>>> +
> >>>>>> +	if (sb_in_userns(inode->i_sb))
> >>>>>> +		skp = ((struct superblock_smack *)(inode->i_sb->s_security))->smk_default;
> >>>>>> +	else
> >>>>>> +		skp = smk_of_current();
> >>>>> This should be left alone.
> >>>>> smack_inode_init_security is where you could disallow access that doesn't
> >>>>> legitimately result in a Rubble label on the file. It's something like
> >>>>>
> >>>>> 	... after the call may = smk_access_entry(...)
> >>>>> 	if (sb_in_userns(inode->i_sb))
> >>>>> 		if (skp != dsp && (may & MAY_TRANSMUTE) == 0)
> >>>>> 			return -EACCES; 
> >>>> I'm not getting how this covers all cases.
> >>>>
> >>>> So we've set the transmute flag on the root inode. Files and directories
> >>>> created in the root directory get the same label, and directories also
> >>>> get the transmute attribute. That's all fine.
> >>>>
> >>>> What about an existing directory in the filesystem that already has a
> >>>> Slate label? I'm not getting what happens with this directory, or for
> >>>> new files created in this directory, which also relates to my other
> >>>> questions below.
> >>>>
> >>>> Also an aside - smk_access_entry looks weird. may is initialized to
> >>>> -ENOENT, and then rule_list is searched for a rule which matches the
> >>>> object and subject labels. Presumably it's possible that no rule could
> >>>> be found, otherwise the prior initialization of may is pointless. If
> >>>> this happens the following code treats it as though it always contains
> >>>> access flags even though it might contain -ENOENT. Nothing bad actually
> >>>> happens with a two's complement representation of -ENOENT since it will
> >>>> just set a bit that's already set, but it still seems like it should
> >>>> have a may > 0 condition, for clarity if for no other reason.
> >>> My suggested code is just wrong. I wasn't looking at the whole code,
> >>> only the patch, and got myself confused. Apologies.
> >>>
> >>> If we want to go straight for the jugular how about this? I'm assuming
> >>> that inode->i_sb->s_bdev->bd_inode is the inode of the backing store.
> >> Yes.
> >>
> >>> static int smack_inode_permission(struct inode *inode, int mask)
> >>> {
> >>> 	struct smk_audit_info ad;
> >>> 	int no_block = mask & MAY_NOT_BLOCK;
> >>> 	int rc;
> >>>
> >>> 	mask &= (MAY_READ|MAY_WRITE|MAY_EXEC|MAY_APPEND);
> >>> 	/*
> >>> 	 * No permission to check. Existence test. Yup, it's there.
> >>> 	 */
> >>> 	if (mask == 0)
> >>> 		return 0;
> >>>
> >>> +	if (sb_in_userns(inode->i_sb) &&
> >>> +	    smk_of_inode(inode) != smk_of_inode(inode->i_sb->s_bdev->bd_inode))
> >>> +		return -EACCES;
> >>> +
> >>> 	/* May be droppable after audit */
> >>> 	if (no_block)
> >>> 		return -ECHILD;
> >>> 	smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_INODE);
> >>> 	smk_ad_setfield_u_fs_inode(&ad, inode);
> >>> 	rc = smk_curacc(smk_of_inode(inode), mask, &ad);
> >>> 	rc = smk_bu_inode(inode, mask, rc);
> >>> 	return rc;
> >>> }
> >> Hmm, okay. I think I've been a little confused all this time about how
> >> you want to handle these unprivileged mounts.
> >
> > Not your problem. I'm not the most consistent of reviewers.
> >
> >> Originally I thought you wanted all objects in the filesystem to get the
> >> same label as the backing store. That's what I tried to implement
> >> originally, i.e. smk_root=smk_default=smk_of_inode(...->bd_inode), then
> >> assign every object (new and existing) smk_default and completely ignore
> >> the labels on disk.
> >
> > I want everything to have the label of the backing store, but
> > I don't want to ignore it if it somehow got something else. Because
> > the only legitimate label for this example is Rubble, I want to
> > reject anything else that appears. If someone builds a filesystem
> > by hand with Slate labels I want it treated "safely".
> >
> >> This is what I currently think you want for user ns mounts:
> >>
> >>  1. smk_root and smk_default are assigned the label of the backing
> >>     device.
> >>  2. s_root is assigned the transmute property.
> >>  3. For existing files:
> >>     a. Files with the same label as the backing device are accessible.
> >>     b. Files with any other label are not accessible.
> >
> > That's right. Accept correct data, reject anything that's not right.
> >
> >> If this is right, there are a couple lingering questions in my mind.
> >>
> >> First, what happens with files created in directories with the same
> >> label as the backing device but without the transmute property set? The
> >> inode for the new file will initially be labeled with smk_of_current(),
> >> but then during d_instantiate it will get smk_default and thus end up
> >> with the label we want. So that seems okay.
> >
> > Yes.
> >
> >> The second is whether files with the SMACK64EXEC attribute are still a
> >> problem. It seems they are, for files with the same label as the backing
> >> store at least. I think we can simply skip the code that reads out this
> >> xattr and sets smk_task for user ns mounts, or else skip assigning the
> >> label to the new task in bprm_set_creds. The latter seems more
> >> consistent with the approach you've suggested for dealing with labels
> >> from disk.
> >
> > Yes, I think that skipping the smk_fetch(XATTR_NAME_SMACKEXEC, ...) in
> > smack_d_instantiate for unprivileged mounts would do the trick.
> >
> >> So I guess all of that seems okay, though perhaps a bit restrictive
> >> given that the user who mounted the filesystem already has full access
> >> to the backing store.
> >
> > In truth, there is no reason to expect that the "user" who did the
> > mount will ever have a Smack label that differs from the label of
> > the backing store. If what we've got here seems restrictive, it's
> > because you've got access from someone other than the "user".
> >
> >> Please let me know whether or not this matches up with what you are
> >> thinking, then I can proceed with the implementation.
> >
> > My current mindset is that, if you're going to allow unprivileged
> > mounts of user defined backing stores, this is as safe as we can
> > make it.
> 
> That actually sounds very reasonable to me.  It is essentially what we
> do with uid and gids already.  I presume the smack namespace support,
> when integrated with all of this, would allow a set of labels to be
> set.
> 
> Have I missed a part of the conversation where you talk about filesystems that
> don't have support for storing labels?  Filesystems like vfat, isofs,
> etc.

As I read the code they should all end up with the superblock's
smk_default label for the objects in RAM, i.e. the label of the backing
store. The same would be true for existing files in a filesystem which
does support storing labels but has no labels on the files.
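
To make that concrete, here is a small user-space sketch of the behaviour as I
understand it from this thread, not a copy of the Smack code. The function
names effective_label() and userns_label_ok() are invented for illustration,
labels are modeled as plain strings, and a NULL disk_label stands for "no
label stored on disk".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Label an object ends up with in RAM: the on-disk label when one
 * exists, otherwise the superblock default.
 */
static const char *effective_label(const char *disk_label,
				   const char *sb_default)
{
	assert(sb_default != NULL);
	return disk_label != NULL ? disk_label : sb_default;
}

/*
 * For a user ns mount the superblock default is the backing store's
 * label, and only objects carrying that label are accessible (items
 * 3a and 3b in the quoted list).
 */
static bool userns_label_ok(const char *disk_label,
			    const char *backing_label)
{
	const char *label = effective_label(disk_label, backing_label);

	return strcmp(label, backing_label) == 0;
}
```

So userns_label_ok(NULL, "Rubble") and userns_label_ok("Rubble", "Rubble")
both pass, while a hand-built "Slate" label is rejected. The normal Smack
access check would still apply on top of this.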

Seth


* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-22 17:41                           ` J. Bruce Fields
@ 2015-07-23  1:51                             ` Dave Chinner
  2015-07-23 13:19                               ` J. Bruce Fields
  0 siblings, 1 reply; 69+ messages in thread
From: Dave Chinner @ 2015-07-23  1:51 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Austin S Hemmelgarn, Eric W. Biederman, Casey Schaufler,
	Andy Lutomirski, Seth Forshee, Alexander Viro, Linux FS Devel,
	LSM List, SELinux-NSA, Serge Hallyn, linux-kernel

On Wed, Jul 22, 2015 at 01:41:00PM -0400, J. Bruce Fields wrote:
> On Wed, Jul 22, 2015 at 12:52:58PM -0400, Austin S Hemmelgarn wrote:
> > On 2015-07-22 10:09, J. Bruce Fields wrote:
> > >On Wed, Jul 22, 2015 at 05:56:40PM +1000, Dave Chinner wrote:
> > >>On Tue, Jul 21, 2015 at 01:37:21PM -0400, J. Bruce Fields wrote:
> > >>>On Fri, Jul 17, 2015 at 12:47:35PM +1000, Dave Chinner wrote:
> > >>>So, for example, a screwed up on-disk directory structure shouldn't
> > >>>result in creating a cycle in the dcache and then deadlocking.
> > >>
> > >>Therein lies the problem: how do you detect such structural defects
> > >>without doing a full structure validation?
> > >
> > >You can prevent cycles in a graph if you can prevent adding an edge
> > >which would be part of a cycle.
> > >
> > Except if the user can write to the filesystem's backing storage (be
> > it a device or a file), and has sufficient knowledge of the on-disk
> > structures, they can create all the cycles they want in the
> > metadata. So unless the kernel builds the graph internally by
> > parsing the metadata _and_ has some way to detect that the on-disk
> > metadata has hit a cycle (which may not just involve 2 items),
> 
> Understood.  Again, see the d_ancestor call in d_splice_alias, this is
> exactly what it checks for.

But that only addresses one type of loop in one specific metadata
structure. There's plenty of other ways you could construct metadata
loops that are essentially undetected and result in either deadlock
or livelock within the filesystem code itself. e.g. just make btree
sibling pointers loop over a range of entries that have the same
index key (e.g. free space extents of the same size). If allocation
then falls into this loop, the kernel will just spin searching the
same blocks for something it will never find.  Such resource
consumption attacks are trivial to construct but extremely difficult
to detect because they exploit normal behaviour of the structure and
algorithms by mangling trusted pointers.

Of course, this sort of attack will eventually deadlock the
filesystem because it will back up on locks held by the livelocked
search. Once the filesystem is deadlocked, it can then cause sync()
calls to get stuck on the filesystem. And because sync() is a global
operation, a deadlocked filesystem in one container could cause sync
to hang in a completely unrelated container....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-23  0:05                               ` Casey Schaufler
@ 2015-07-23  0:15                                 ` Eric W. Biederman
  2015-07-23  5:15                                   ` Seth Forshee
  2015-07-23 21:48                                   ` Casey Schaufler
  2015-07-28 20:40                                 ` Seth Forshee
  1 sibling, 2 replies; 69+ messages in thread
From: Eric W. Biederman @ 2015-07-23  0:15 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Seth Forshee, Andy Lutomirski, Alexander Viro, Linux FS Devel,
	LSM List, SELinux-NSA, Serge Hallyn, linux-kernel

Casey Schaufler <casey@schaufler-ca.com> writes:

> On 7/22/2015 12:32 PM, Seth Forshee wrote:
>> On Wed, Jul 22, 2015 at 11:10:46AM -0700, Casey Schaufler wrote:
>>> On 7/22/2015 8:56 AM, Seth Forshee wrote:
>>>> On Tue, Jul 21, 2015 at 06:52:31PM -0700, Casey Schaufler wrote:
>>>>> On 7/21/2015 1:35 PM, Seth Forshee wrote:
>>>>>> On Thu, Jul 16, 2015 at 05:59:22PM -0700, Andy Lutomirski wrote:
>>>>>>> On Thu, Jul 16, 2015 at 5:45 PM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>>>>>> On 7/16/2015 4:29 PM, Andy Lutomirski wrote:
>>>>>>>>> I really don't see the benefit of making up extra rules that apply to
>>>>>>>>> users outside a userns who try to access specifically a filesystem
>>>>>>>>> with backing store.  They wouldn't make sense for filesystems without
>>>>>>>>> backing store.
>>>>>>>> Sure it would. For Smack, it would be the label a file would be
>>>>>>>> created with, which would be the label of the process creating
>>>>>>>> the memory based filesystem. For SELinux the rules are a
>>>>>>>> touch more sophisticated, but I'm sure that Paul or Stephen could
>>>>>>>> come up with how to determine it.
>>>>>>>>
>>>>>>>> The point, looping all the way back to the beginning, where we
>>>>>>>> were talking about just ignoring the labels on the filesystem,
>>>>>>>> is that if you use the same Smack label on the files in the
>>>>>>>> filesystem as the backing store file has, we'll all be happy.
>>>>>>>> If that label isn't something user can write to, he won't be
>>>>>>>> able to write to the mounted objects, either. If there is no
>>>>>>>> backing store then use the label of the process creating the
>>>>>>>> filesystem, which will be the user, which will mean everything
>>>>>>>> will work hunky dory.
>>>>>>>>
>>>>>>>> Yes, there's work involved, but I doubt there's a lot. Getting
>>>>>>>> the label from the backing store or the creating process is
>>>>>>>> simple enough.
>>>>>>>>
>>>>>> So something like the diff below (untested)?
>>>>> I think that this is close, and quite good for someone
>>>>> who isn't very familiar with Smack. It's definitely headed
>>>>> in the right direction.
>>>>>
>>>>>> All I'm really doing is setting smk_default as you describe above and
>>>>>> then using it instead of smk_of_current() in
>>>>>> smack_inode_alloc_security() and instead of the label from the disk in
>>>>>> smack_d_instantiate().
>>>>> Let's say your backing store is a file labeled Rubble.
>>>>>
>>>>> mount -o smackfsroot=Rubble,smackfsdef=Rubble ...
>>>>>
>>>>> It is completely reasonable for a process labeled Flintstone to
>>>>> have rwxa access to a file labeled Rubble.
>>>>>
>>>>> Smack rule: Flintstone Rubble rwxa
>>>>>
>>>>> In the case of writing to an existing Rubble file, what you
>>>>> have looks fine. What's not so great is that if the Flintstone
>>>>> process creates a file, it should be labeled Flintstone. Your
>>>>> use of smk_default is going to violate the principle
>>>>> of least astonishment, and break the Smack policy as well.
>>>>>
>>>>> Let's make a minor change. Instead of using smackfsroot let's
>>>>> use smackfstransmute and a slightly different access rule:
>>>>>
>>>>> mount -o smackfstransmute=Rubble,smackfsdef=Rubble ...
>>>>>
>>>>> Smack rule: Flintstone Rubble rwxat
>>>>>
>>>>> Now the only change we have to make to the Smack code is
>>>>> that we don't want to create any files unless either the
>>>>> process is labeled Rubble or the rule allowing the creation
>>>>> has the "t" for transmute access. That should ensure that
>>>>> everything is labeled Rubble. If it isn't, someone has mucked
>>>>> with the metadata in a detectable way.
>>>> All right, that kind of makes sense, but I'm still missing some pieces.
>>>> Questions follow.
>>>>
>>>>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>>>>>> index 32f598db0b0d..4597420ab933 100644
>>>>>> --- a/include/linux/fs.h
>>>>>> +++ b/include/linux/fs.h
>>>>>> @@ -1486,6 +1486,10 @@ static inline void sb_start_intwrite(struct super_block *sb)
>>>>>>  	__sb_start_write(sb, SB_FREEZE_FS, true);
>>>>>>  }
>>>>>>  
>>>>>> +static inline bool sb_in_userns(struct super_block *sb)
>>>>>> +{
>>>>>> +	return sb->s_user_ns != &init_user_ns;
>>>>>> +}
>>>>>>  
>>>>>>  extern bool inode_owner_or_capable(const struct inode *inode);
>>>>>>  
>>>>>> diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
>>>>>> index a143328f75eb..591fd19294e7 100644
>>>>>> --- a/security/smack/smack_lsm.c
>>>>>> +++ b/security/smack/smack_lsm.c
>>>>>> @@ -255,6 +255,10 @@ static struct smack_known *smk_fetch(const char *name, struct inode *ip,
>>>>>>  	char *buffer;
>>>>>>  	struct smack_known *skp = NULL;
>>>>>>  
>>>>>> +	/* Should never fetch xattrs from untrusted mounts */
>>>>>> +	if (WARN_ON(sb_in_userns(ip->i_sb)))
>>>>>> +		return ERR_PTR(-EPERM);
>>>>>> +
>>>>> Go ahead and fetch it, we'll check to make sure it's viable later.
>>>>>
>>>>>>  	if (ip->i_op->getxattr == NULL)
>>>>>>  		return ERR_PTR(-EOPNOTSUPP);
>>>>>>  
>>>>>> @@ -656,10 +660,14 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
>>>>>>  		 */
>>>>>>  		if (specified)
>>>>>>  			return -EPERM;
>>>>>> +
>>>>>>  		/*
>>>>>> -		 * Unprivileged mounts get root and default from the caller.
>>>>>> +		 * User namespace mounts get root and default from the backing
>>>>>> +		 * store, if there is one. Other unprivileged mounts get them
>>>>>> +		 * from the caller.
>>>>>>  		 */
>>>>>> -		skp = smk_of_current();
>>>>>> +		skp = (sb_in_userns(sb) && sb->s_bdev) ?
>>>>>> +			smk_of_inode(sb->s_bdev->bd_inode) : smk_of_current();
>>>>>>  		sp->smk_root = skp;
>>>>>>  		sp->smk_default = skp;
>>>>> 			sp->smk_flags |= SMK_INODE_TRANSMUTE;
>>>> I assume that you meant skp and not sp here.
>>> Actually, neither is correct. You want to set SMK_INODE_TRANSMUTE
>>> in the smk_flags field of the root inode. That's easy:
>>>
>>> 			transmute = 1;
>>>
>>> and the code after "Initialize the root inode" will take care of it.
>> Yeah, that's what I've actually done.
>>
>>>>>>  	}
>>>>>> @@ -792,7 +800,12 @@ static int smack_bprm_secureexec(struct linux_binprm *bprm)
>>>>>>   */
>>>>>>  static int smack_inode_alloc_security(struct inode *inode)
>>>>>>  {
>>>>>> -	struct smack_known *skp = smk_of_current();
>>>>>> +	struct smack_known *skp;
>>>>>> +
>>>>>> +	if (sb_in_userns(inode->i_sb))
>>>>>> +		skp = ((struct superblock_smack *)(inode->i_sb->s_security))->smk_default;
>>>>>> +	else
>>>>>> +		skp = smk_of_current();
>>>>> This should be left alone.
>>>>> smack_inode_init_security is where you could disallow access that doesn't
>>>>> legitimately result in a Rubble label on the file. It's something like
>>>>>
>>>>> 	... after the call may = smk_access_entry(...)
>>>>> 	if (sb_in_userns(inode->i_sb))
>>>>> 		if (skp != dsp && (may & MAY_TRANSMUTE) == 0)
>>>>> 			return -EACCES; 
>>>> I'm not getting how this covers all cases.
>>>>
>>>> So we've set the transmute flag on the root inode. Files and directories
>>>> created in the root directory get the same label, and directories also
>>>> get the transmute attribute. That's all fine.
>>>>
>>>> What about an existing directory in the filesystem that already has a
>>>> Slate label? I'm not getting what happens with this directory, or for
>>>> new files created in this directory, which also relates to my other
>>>> questions below.
>>>>
>>>> Also an aside - smk_access_entry looks weird. may is initialized to
>>>> -ENOENT, and then rule_list is searched for a rule which matches the
>>>> object and subject labels. Presumably it's possible that no rule could
>>>> be found, otherwise the prior initialization of may is pointless. If
>>>> this happens the following code treats it as though it always contains
>>>> access flags even though it might contain -ENOENT. Nothing bad actually
>>>> happens with a two's complement representation of -ENOENT since it will
>>>> just set a bit that's already set, but it still seems like it should
>>>> have a may > 0 condition, for clarity if for no other reason.
>>> My suggested code is just wrong. I wasn't looking at the whole code,
>>> only the patch, and got myself confused. Apologies.
>>>
>>> If we want to go straight for the jugular how about this? I'm assuming
>>> that inode->i_sb->s_bdev->bd_inode is the inode of the backing store.
>> Yes.
>>
>>> static int smack_inode_permission(struct inode *inode, int mask)
>>> {
>>> 	struct smk_audit_info ad;
>>> 	int no_block = mask & MAY_NOT_BLOCK;
>>> 	int rc;
>>>
>>> 	mask &= (MAY_READ|MAY_WRITE|MAY_EXEC|MAY_APPEND);
>>> 	/*
>>> 	 * No permission to check. Existence test. Yup, it's there.
>>> 	 */
>>> 	if (mask == 0)
>>> 		return 0;
>>>
>>> +	if (sb_in_userns(inode->i_sb) &&
>>> +	    smk_of_inode(inode) != smk_of_inode(inode->i_sb->s_bdev->bd_inode))
>>> +		return -EACCES;
>>> +
>>> 	/* May be droppable after audit */
>>> 	if (no_block)
>>> 		return -ECHILD;
>>> 	smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_INODE);
>>> 	smk_ad_setfield_u_fs_inode(&ad, inode);
>>> 	rc = smk_curacc(smk_of_inode(inode), mask, &ad);
>>> 	rc = smk_bu_inode(inode, mask, rc);
>>> 	return rc;
>>> }
>> Hmm, okay. I think I've been a little confused all this time about how
>> you want to handle these unprivileged mounts.
>
> Not your problem. I'm not the most consistent of reviewers.
>
>> Originally I thought you wanted all objects in the filesystem to get the
>> same label as the backing store. That's what I tried to implement
>> originally, i.e. smk_root=smk_default=smk_of_inode(...->bd_inode), then
>> assign every object (new and existing) smk_default and completely ignore
>> the labels on disk.
>
> I want everything to have the label of the backing store, but
> I don't want to ignore it if it somehow got something else. Because
> the only legitimate label for this example is Rubble, I want to
> reject anything else that appears. If someone builds a filesystem
> by hand with Slate labels I want it treated "safely".
>
>> This is what I currently think you want for user ns mounts:
>>
>>  1. smk_root and smk_default are assigned the label of the backing
>>     device.
>>  2. s_root is assigned the transmute property.
>>  3. For existing files:
>>     a. Files with the same label as the backing device are accessible.
>>     b. Files with any other label are not accessible.
>
> That's right. Accept correct data, reject anything that's not right.
>
>> If this is right, there are a couple lingering questions in my mind.
>>
>> First, what happens with files created in directories with the same
>> label as the backing device but without the transmute property set? The
>> inode for the new file will initially be labeled with smk_of_current(),
>> but then during d_instantiate it will get smk_default and thus end up
>> with the label we want. So that seems okay.
>
> Yes.
>
>> The second is whether files with the SMACK64EXEC attribute are still a
>> problem. It seems it is, for files with the same label as the backing
>> store at least. I think we can simply skip the code that reads out this
>> xattr and sets smk_task for user ns mounts, or else skip assigning the
>> label to the new task in bprm_set_creds. The latter seems more
>> consistent with the approach you've suggested for dealing with labels
>> from disk.
>
> Yes, I think that skipping the smk_fetch(XATTR_NAME_SMACKEXEC, ...) in
> smack_d_instantiate for unprivileged mounts would do the trick.
>
>> So I guess all of that seems okay, though perhaps a bit restrictive
>> given that the user who mounted the filesystem already has full access
>> to the backing store.
>
> In truth, there is no reason to expect that the "user" who did the
> mount will ever have a Smack label that differs from the label of
> the backing store. If what we've got here seems restrictive, it's
> because you've got access from someone other than the "user".
>
>> Please let me know whether or not this matches up with what you are
>> thinking, then I can proceed with the implementation.
>
> My current mindset is that, if you're going to allow unprivileged
> mounts of user defined backing stores, this is as safe as we can
> make it.

That actually sounds very reasonable to me.  It is essentially what we
do with uids and gids already.  I presume the Smack namespace support,
when integrated with all of this, would allow a set of labels to be
set.

Have I missed a part of the conversation where you talk about filesystems
that don't have support for storing labels?  Filesystems like vfat, isofs,
etc.

Eric



* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-22 19:32                             ` Seth Forshee
@ 2015-07-23  0:05                               ` Casey Schaufler
  2015-07-23  0:15                                 ` Eric W. Biederman
  2015-07-28 20:40                                 ` Seth Forshee
  0 siblings, 2 replies; 69+ messages in thread
From: Casey Schaufler @ 2015-07-23  0:05 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Andy Lutomirski, Eric W. Biederman, Alexander Viro,
	Linux FS Devel, LSM List, SELinux-NSA, Serge Hallyn,
	linux-kernel

On 7/22/2015 12:32 PM, Seth Forshee wrote:
> On Wed, Jul 22, 2015 at 11:10:46AM -0700, Casey Schaufler wrote:
>> On 7/22/2015 8:56 AM, Seth Forshee wrote:
>>> On Tue, Jul 21, 2015 at 06:52:31PM -0700, Casey Schaufler wrote:
>>>> On 7/21/2015 1:35 PM, Seth Forshee wrote:
>>>>> On Thu, Jul 16, 2015 at 05:59:22PM -0700, Andy Lutomirski wrote:
>>>>>> On Thu, Jul 16, 2015 at 5:45 PM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>>>>> On 7/16/2015 4:29 PM, Andy Lutomirski wrote:
>>>>>>>> I really don't see the benefit of making up extra rules that apply to
>>>>>>>> users outside a userns who try to access specifically a filesystem
>>>>>>>> with backing store.  They wouldn't make sense for filesystems without
>>>>>>>> backing store.
>>>>>>> Sure it would. For Smack, it would be the label a file would be
>>>>>>> created with, which would be the label of the process creating
>>>>>>> the memory based filesystem. For SELinux the rules are a
>>>>>>> touch more sophisticated, but I'm sure that Paul or Stephen could
>>>>>>> come up with how to determine it.
>>>>>>>
>>>>>>> The point, looping all the way back to the beginning, where we
>>>>>>> were talking about just ignoring the labels on the filesystem,
>>>>>>> is that if you use the same Smack label on the files in the
>>>>>>> filesystem as the backing store file has, we'll all be happy.
>>>>>>> If that label isn't something the user can write to, he won't be
>>>>>>> able to write to the mounted objects, either. If there is no
>>>>>>> backing store then use the label of the process creating the
>>>>>>> filesystem, which will be the user, which will mean everything
>>>>>>> will work hunky dory.
>>>>>>>
>>>>>>> Yes, there's work involved, but I doubt there's a lot. Getting
>>>>>>> the label from the backing store or the creating process is
>>>>>>> simple enough.
>>>>>>>
>>>>> So something like the diff below (untested)?
>>>> I think that this is close, and quite good for someone
>>>> who isn't very familiar with Smack. It's definitely headed
>>>> in the right direction.
>>>>
>>>>> All I'm really doing is setting smk_default as you describe above and
>>>>> then using it instead of smk_of_current() in
>>>>> smack_inode_alloc_security() and instead of the label from the disk in
>>>>> smack_d_instantiate().
>>>> Let's say your backing store is a file labeled Rubble.
>>>>
>>>> mount -o smackfsroot=Rubble,smackfsdef=Rubble ...
>>>>
>>>> It is completely reasonable for a process labeled Flintstone to
>>>> have rwxa access to a file labeled Rubble.
>>>>
>>>> Smack rule: Flintstone Rubble rwxa
>>>>
>>>> In the case of writing to an existing Rubble file, what you
>>>> have looks fine. What's not so great is that if the Flintstone
>>>> process creates a file, it should be labeled Flintstone. Your
>>>> use of smk_default is going to violate the principle
>>>> of least astonishment, and break the Smack policy as well.
>>>>
>>>> Let's make a minor change. Instead of using smackfsroot let's
>>>> use smackfstransmute and a slightly different access rule:
>>>>
>>>> mount -o smackfstransmute=Rubble,smackfsdef=Rubble ...
>>>>
>>>> Smack rule: Flintstone Rubble rwxat
>>>>
>>>> Now the only change we have to make to the Smack code is
>>>> that we don't want to create any files unless either the
>>>> process is labeled Rubble or the rule allowing the creation
>>>> has the "t" for transmute access. That should ensure that
>>>> everything is labeled Rubble. If it isn't, someone has mucked
>>>> with the metadata in a detectable way.
>>> All right, that kind of makes sense, but I'm still missing some pieces.
>>> Questions follow.
>>>
>>>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>>>>> index 32f598db0b0d..4597420ab933 100644
>>>>> --- a/include/linux/fs.h
>>>>> +++ b/include/linux/fs.h
>>>>> @@ -1486,6 +1486,10 @@ static inline void sb_start_intwrite(struct super_block *sb)
>>>>>  	__sb_start_write(sb, SB_FREEZE_FS, true);
>>>>>  }
>>>>>  
>>>>> +static inline bool sb_in_userns(struct super_block *sb)
>>>>> +{
>>>>> +	return sb->s_user_ns != &init_user_ns;
>>>>> +}
>>>>>  
>>>>>  extern bool inode_owner_or_capable(const struct inode *inode);
>>>>>  
>>>>> diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
>>>>> index a143328f75eb..591fd19294e7 100644
>>>>> --- a/security/smack/smack_lsm.c
>>>>> +++ b/security/smack/smack_lsm.c
>>>>> @@ -255,6 +255,10 @@ static struct smack_known *smk_fetch(const char *name, struct inode *ip,
>>>>>  	char *buffer;
>>>>>  	struct smack_known *skp = NULL;
>>>>>  
>>>>> +	/* Should never fetch xattrs from untrusted mounts */
>>>>> +	if (WARN_ON(sb_in_userns(ip->i_sb)))
>>>>> +		return ERR_PTR(-EPERM);
>>>>> +
>>>> Go ahead and fetch it, we'll check to make sure it's viable later.
>>>>
>>>>>  	if (ip->i_op->getxattr == NULL)
>>>>>  		return ERR_PTR(-EOPNOTSUPP);
>>>>>  
>>>>> @@ -656,10 +660,14 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
>>>>>  		 */
>>>>>  		if (specified)
>>>>>  			return -EPERM;
>>>>> +
>>>>>  		/*
>>>>> -		 * Unprivileged mounts get root and default from the caller.
>>>>> +		 * User namespace mounts get root and default from the backing
>>>>> +		 * store, if there is one. Other unprivileged mounts get them
>>>>> +		 * from the caller.
>>>>>  		 */
>>>>> -		skp = smk_of_current();
>>>>> +		skp = (sb_in_userns(sb) && sb->s_bdev) ?
>>>>> +			smk_of_inode(sb->s_bdev->bd_inode) : smk_of_current();
>>>>>  		sp->smk_root = skp;
>>>>>  		sp->smk_default = skp;
>>>> 			sp->smk_flags |= SMK_INODE_TRANSMUTE;
>>> I assume that you meant skp and not sp here.
>> Actually, neither is correct. You want to set SMK_INODE_TRANSMUTE
>> in the smk_flags field of the root inode. That's easy:
>>
>> 			transmute = 1;
>>
>> and the code after "Initialize the root inode" will take care of it.
> Yeah, that's what I've actually done.
>
>>>>>  	}
>>>>> @@ -792,7 +800,12 @@ static int smack_bprm_secureexec(struct linux_binprm *bprm)
>>>>>   */
>>>>>  static int smack_inode_alloc_security(struct inode *inode)
>>>>>  {
>>>>> -	struct smack_known *skp = smk_of_current();
>>>>> +	struct smack_known *skp;
>>>>> +
>>>>> +	if (sb_in_userns(inode->i_sb))
>>>>> +		skp = ((struct superblock_smack *)(inode->i_sb->s_security))->smk_default;
>>>>> +	else
>>>>> +		skp = smk_of_current();
>>>> This should be left alone.
>>>> smack_inode_init_security is where you could disallow access that doesn't
>>>> legitimately result in a Rubble label on the file. It's something like
>>>>
>>>> 	... after the call may = smk_access_entry(...)
>>>> 	if (sb_in_userns(inode->i_sb))
>>>> 		if (skp != dsp && (may & MAY_TRANSMUTE) == 0)
>>>> 			return -EACCES; 
>>> I'm not getting how this covers all cases.
>>>
>>> So we've set the transmute flag on the root inode. Files and directories
>>> created in the root directory get the same label, and directories also
>>> get the transmute attribute. That's all fine.
>>>
>>> What about an existing directory in the filesystem that already has a
>>> Slate label? I'm not getting what happens with this directory, or for
>>> new files created in this directory, which also relates to my other
>>> questions below.
>>>
>>> Also an aside - smk_access_entry looks weird. may is initialized to
>>> -ENOENT, and then rule_list is searched for a rule which matches the
>>> object and subject labels. Presumably it's possible that no rule could
>>> be found, otherwise the prior initialization of may is pointless. If
>>> this happens the following code treats it as though it always contains
>>> access flags even though it might contain -ENOENT. Nothing bad actually
>>> happens with a two's complement representation of -ENOENT since it will
>>> just set a bit that's already set, but it still seems like it should
>>> have a may > 0 condition, for clarity if for no other reason.
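To make the arithmetic in that aside concrete, here is a small userspace
sketch of the tail of smk_access_entry(). It is only a model: the
MODEL_* constants are assumed to mirror MAY_WRITE (0x2) and Smack's
MAY_LOCK (0x2000), and model_access_tail() is a made-up name, not a
kernel function.

```c
#include <assert.h>
#include <errno.h>

/* Assumed to mirror MAY_WRITE and Smack's MAY_LOCK; check the real headers. */
#define MODEL_MAY_WRITE 0x00000002
#define MODEL_MAY_LOCK  0x00002000

/*
 * Hypothetical model of the end of smk_access_entry(): "may" is either a
 * rule's access mask or -ENOENT when no rule matched, and MAY_WRITE
 * implies MAY_LOCK.
 */
static int model_access_tail(int may)
{
	/* The model only expects a mask or the "no rule" sentinel. */
	assert(may >= 0 || may == -ENOENT);

	/* -ENOENT is -2, so bit 1 (MAY_WRITE) is set in its representation. */
	if ((may & MODEL_MAY_WRITE) == MODEL_MAY_WRITE)
		may |= MODEL_MAY_LOCK;	/* bit 13 is already set in -2 */
	return may;
}
```

With -ENOENT as input the OR changes nothing and the sentinel comes back
intact, which matches the observation above. A may > 0 guard would just
make that intent explicit.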
>> My suggested code is just wrong. I wasn't looking at the whole code,
>> only the patch, and got myself confused. Apologies.
>>
>> If we want to go straight for the jugular how about this? I'm assuming
>> that inode->i_sb->s_bdev->bd_inode is the inode of the backing store.
> Yes.
>
>> static int smack_inode_permission(struct inode *inode, int mask)
>> {
>> 	struct smk_audit_info ad;
>> 	int no_block = mask & MAY_NOT_BLOCK;
>> 	int rc;
>>
>> 	mask &= (MAY_READ|MAY_WRITE|MAY_EXEC|MAY_APPEND);
>> 	/*
>> 	 * No permission to check. Existence test. Yup, it's there.
>> 	 */
>> 	if (mask == 0)
>> 		return 0;
>>
>> +	if (sb_in_userns(inode->i_sb) &&
>> +	    smk_of_inode(inode) != smk_of_inode(inode->i_sb->s_bdev->bd_inode))
>> +		return -EACCES;
>> +
>> 	/* May be droppable after audit */
>> 	if (no_block)
>> 		return -ECHILD;
>> 	smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_INODE);
>> 	smk_ad_setfield_u_fs_inode(&ad, inode);
>> 	rc = smk_curacc(smk_of_inode(inode), mask, &ad);
>> 	rc = smk_bu_inode(inode, mask, rc);
>> 	return rc;
>> }
> Hmm, okay. I think I've been a little confused all this time about how
> you want to handle these unprivileged mounts.

Not your problem. I'm not the most consistent of reviewers.

> Originally I thought you wanted all objects in the filesystem to get the
> same label as the backing store. That's what I tried to implement
> originally, i.e. smk_root=smk_default=smk_of_inode(...->bd_inode), then
> assign every object (new and existing) smk_default and completely ignore
> the labels on disk.

I want everything to have the label of the backing store, but
I don't want to ignore it if it somehow got something else. Because
the only legitimate label for this example is Rubble, I want to
reject anything else that appears. If someone builds a filesystem
by hand with Slate labels I want it treated "safely".

> This is what I currently think you want for user ns mounts:
>
>  1. smk_root and smk_default are assigned the label of the backing
>     device.
>  2. s_root is assigned the transmute property.
>  3. For existing files:
>     a. Files with the same label as the backing device are accessible.
>     b. Files with any other label are not accessible.

That's right. Accept correct data, reject anything that's not right.
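For illustration, the accept/reject rule can be modelled in userspace
roughly as below. This is a sketch only: the model_* structs and
function are invented for the example, labels are compared as interned
pointers (as Smack does with smack_known), and passing through a mount
that has no backing device is an assumption, not something settled in
this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel objects; not the real structures. */
struct model_label { const char *name; };

struct model_sb {
	bool in_userns;				/* models sb_in_userns(sb) */
	const struct model_label *backing;	/* backing store label, or NULL */
};

struct model_inode {
	const struct model_sb *sb;
	const struct model_label *label;	/* label read from disk */
};

/*
 * Returns 0 if the inode may go on to the normal Smack access check,
 * -EACCES if its on-disk label is not the one this mount may carry.
 */
static int model_userns_label_check(const struct model_inode *inode)
{
	assert(inode != NULL && inode->sb != NULL);

	if (!inode->sb->in_userns)
		return 0;	/* ordinary mount: on-disk labels are trusted */
	if (inode->sb->backing == NULL)
		return 0;	/* assumption: nothing to compare against */
	if (inode->label != inode->sb->backing)
		return -EACCES;	/* e.g. a hand-built Slate file on a Rubble image */
	return 0;
}
```

A Rubble file on a Rubble-backed user namespace mount passes, a Slate
file on the same mount gets -EACCES, and mounts from init_user_ns are
untouched.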

> If this is right, there are a couple lingering questions in my mind.
>
> First, what happens with files created in directories with the same
> label as the backing device but without the transmute property set? The
> inode for the new file will initially be labeled with smk_of_current(),
> but then during d_instantiate it will get smk_default and thus end up
> with the label we want. So that seems okay.

Yes.
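The two-step labeling described in that question can be sketched in
userspace as follows. The model_* names are hypothetical, and leaving
the label alone on a non user namespace mount is a simplification (the
real d_instantiate path would consult the on-disk xattr there).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's structures. */
struct model_label { const char *name; };

struct model_sb {
	bool in_userns;				/* models sb_in_userns(sb) */
	const struct model_label *smk_default;	/* set at mount time */
};

struct model_inode {
	const struct model_sb *sb;
	const struct model_label *label;
};

/* Step 1: a freshly allocated inode takes the label of the creating task. */
static void model_inode_alloc(struct model_inode *inode,
			      const struct model_sb *sb,
			      const struct model_label *current)
{
	assert(inode != NULL && sb != NULL && current != NULL);
	inode->sb = sb;
	inode->label = current;
}

/* Step 2: on a user namespace mount, instantiate overrides with smk_default. */
static void model_d_instantiate(struct model_inode *inode)
{
	assert(inode != NULL && inode->sb != NULL);
	if (inode->sb->in_userns)
		inode->label = inode->sb->smk_default;
}
```

So a Flintstone task creating a file on a Rubble-backed mount briefly
owns a Flintstone inode, which ends up labeled Rubble once it is
instantiated.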

> The second is whether files with the SMACK64EXEC attribute are still a
> problem. It seems it is, for files with the same label as the backing
> store at least. I think we can simply skip the code that reads out this
> xattr and sets smk_task for user ns mounts, or else skip assigning the
> label to the new task in bprm_set_creds. The latter seems more
> consistent with the approach you've suggested for dealing with labels
> from disk.

Yes, I think that skipping the smk_fetch(XATTR_NAME_SMACKEXEC, ...) in
smack_d_instantiate for unprivileged mounts would do the trick.
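A rough userspace model of what skipping that fetch would buy us is
below. Everything here is illustrative: the model_* names are invented,
and the exec step is reduced to "use smk_task if the file has one,
otherwise keep the current label".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's structures. */
struct model_label { const char *name; };

struct model_inode {
	bool sb_in_userns;			/* models sb_in_userns(inode->i_sb) */
	const struct model_label *exec_xattr;	/* SMACK64EXEC on disk, or NULL */
	const struct model_label *smk_task;	/* label handed to a task on exec */
};

/* Models the part of d_instantiate that would read SMACK64EXEC. */
static void model_instantiate_exec_label(struct model_inode *inode)
{
	assert(inode != NULL);

	if (inode->sb_in_userns)
		inode->smk_task = NULL;	/* skip the fetch: xattr is untrusted */
	else
		inode->smk_task = inode->exec_xattr;
}

/* Models exec: the task keeps its label unless the file carries smk_task. */
static const struct model_label *
model_label_after_exec(const struct model_inode *file,
		       const struct model_label *current)
{
	assert(file != NULL && current != NULL);
	return file->smk_task != NULL ? file->smk_task : current;
}
```

With the fetch skipped, a binary on the unprivileged mount that claims
SMACK64EXEC=Slate cannot move a Flintstone task to Slate, while the same
file on a trusted mount still can.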

> So I guess all of that seems okay, though perhaps a bit restrictive
> given that the user who mounted the filesystem already has full access
> to the backing store.

In truth, there is no reason to expect that the "user" who did the
mount will ever have a Smack label that differs from the label of
the backing store. If what we've got here seems restrictive, it's
because you've got access from someone other than the "user".

> Please let me know whether or not this matches up with what you are
> thinking, then I can proceed with the implementation.

My current mindset is that, if you're going to allow unprivileged
mounts of user defined backing stores, this is as safe as we can
make it.

>
> Thanks,
> Seth
>



* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-22 18:10                           ` Casey Schaufler
@ 2015-07-22 19:32                             ` Seth Forshee
  2015-07-23  0:05                               ` Casey Schaufler
  0 siblings, 1 reply; 69+ messages in thread
From: Seth Forshee @ 2015-07-22 19:32 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Andy Lutomirski, Eric W. Biederman, Alexander Viro,
	Linux FS Devel, LSM List, SELinux-NSA, Serge Hallyn,
	linux-kernel

On Wed, Jul 22, 2015 at 11:10:46AM -0700, Casey Schaufler wrote:
> On 7/22/2015 8:56 AM, Seth Forshee wrote:
> > On Tue, Jul 21, 2015 at 06:52:31PM -0700, Casey Schaufler wrote:
> >> On 7/21/2015 1:35 PM, Seth Forshee wrote:
> >>> On Thu, Jul 16, 2015 at 05:59:22PM -0700, Andy Lutomirski wrote:
> >>>> On Thu, Jul 16, 2015 at 5:45 PM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> >>>>> On 7/16/2015 4:29 PM, Andy Lutomirski wrote:
> >>>>>> I really don't see the benefit of making up extra rules that apply to
> >>>>>> users outside a userns who try to access specifically a filesystem
> >>>>>> with backing store.  They wouldn't make sense for filesystems without
> >>>>>> backing store.
> >>>>> Sure it would. For Smack, it would be the label a file would be
> >>>>> created with, which would be the label of the process creating
> >>>>> the memory based filesystem. For SELinux the rules are a
> >>>>> touch more sophisticated, but I'm sure that Paul or Stephen could
> >>>>> come up with how to determine it.
> >>>>>
> >>>>> The point, looping all the way back to the beginning, where we
> >>>>> were talking about just ignoring the labels on the filesystem,
> >>>>> is that if you use the same Smack label on the files in the
> >>>>> filesystem as the backing store file has, we'll all be happy.
> >>>>> If that label isn't something the user can write to, he won't be
> >>>>> able to write to the mounted objects, either. If there is no
> >>>>> backing store then use the label of the process creating the
> >>>>> filesystem, which will be the user, which will mean everything
> >>>>> will work hunky dory.
> >>>>>
> >>>>> Yes, there's work involved, but I doubt there's a lot. Getting
> >>>>> the label from the backing store or the creating process is
> >>>>> simple enough.
> >>>>>
> >>> So something like the diff below (untested)?
> >> I think that this is close, and quite good for someone
> >> who isn't very familiar with Smack. It's definitely headed
> >> in the right direction.
> >>
> >>> All I'm really doing is setting smk_default as you describe above and
> >>> then using it instead of smk_of_current() in
> >>> smack_inode_alloc_security() and instead of the label from the disk in
> >>> smack_d_instantiate().
> >> Let's say your backing store is a file labeled Rubble.
> >>
> >> mount -o smackfsroot=Rubble,smackfsdef=Rubble ...
> >>
> >> It is completely reasonable for a process labeled Flintstone to
> >> have rwxa access to a file labeled Rubble.
> >>
> >> Smack rule: Flintstone Rubble rwxa
> >>
> >> In the case of writing to an existing Rubble file, what you
> >> have looks fine. What's not so great is that if the Flintstone
> >> process creates a file, it should be labeled Flintstone. Your
> >> use of smk_default is going to violate the principle
> >> of least astonishment, and break the Smack policy as well.
> >>
> >> Let's make a minor change. Instead of using smackfsroot let's
> >> use smackfstransmute and a slightly different access rule:
> >>
> >> mount -o smackfstransmute=Rubble,smackfsdef=Rubble ...
> >>
> >> Smack rule: Flintstone Rubble rwxat
> >>
> >> Now the only change we have to make to the Smack code is
> >> that we don't want to create any files unless either the
> >> process is labeled Rubble or the rule allowing the creation
> >> has the "t" for transmute access. That should ensure that
> >> everything is labeled Rubble. If it isn't, someone has mucked
> >> with the metadata in a detectable way.
> > All right, that kind of makes sense, but I'm still missing some pieces.
> > Questions follow.
> >
> >>> diff --git a/include/linux/fs.h b/include/linux/fs.h
> >>> index 32f598db0b0d..4597420ab933 100644
> >>> --- a/include/linux/fs.h
> >>> +++ b/include/linux/fs.h
> >>> @@ -1486,6 +1486,10 @@ static inline void sb_start_intwrite(struct super_block *sb)
> >>>  	__sb_start_write(sb, SB_FREEZE_FS, true);
> >>>  }
> >>>  
> >>> +static inline bool sb_in_userns(struct super_block *sb)
> >>> +{
> >>> +	return sb->s_user_ns != &init_user_ns;
> >>> +}
> >>>  
> >>>  extern bool inode_owner_or_capable(const struct inode *inode);
> >>>  
> >>> diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
> >>> index a143328f75eb..591fd19294e7 100644
> >>> --- a/security/smack/smack_lsm.c
> >>> +++ b/security/smack/smack_lsm.c
> >>> @@ -255,6 +255,10 @@ static struct smack_known *smk_fetch(const char *name, struct inode *ip,
> >>>  	char *buffer;
> >>>  	struct smack_known *skp = NULL;
> >>>  
> >>> +	/* Should never fetch xattrs from untrusted mounts */
> >>> +	if (WARN_ON(sb_in_userns(ip->i_sb)))
> >>> +		return ERR_PTR(-EPERM);
> >>> +
> >> Go ahead and fetch it, we'll check to make sure it's viable later.
> >>
> >>>  	if (ip->i_op->getxattr == NULL)
> >>>  		return ERR_PTR(-EOPNOTSUPP);
> >>>  
> >>> @@ -656,10 +660,14 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
> >>>  		 */
> >>>  		if (specified)
> >>>  			return -EPERM;
> >>> +
> >>>  		/*
> >>> -		 * Unprivileged mounts get root and default from the caller.
> >>> +		 * User namespace mounts get root and default from the backing
> >>> +		 * store, if there is one. Other unprivileged mounts get them
> >>> +		 * from the caller.
> >>>  		 */
> >>> -		skp = smk_of_current();
> >>> +		skp = (sb_in_userns(sb) && sb->s_bdev) ?
> >>> +			smk_of_inode(sb->s_bdev->bd_inode) : smk_of_current();
> >>>  		sp->smk_root = skp;
> >>>  		sp->smk_default = skp;
> >> 			sp->smk_flags |= SMK_INODE_TRANSMUTE;
> > I assume that you meant skp and not sp here.
> 
> Actually, neither is correct. You want to set SMK_INODE_TRANSMUTE
> in the smk_flags field of the root inode. That's easy:
> 
> 			transmute = 1;
> 
> and the code after "Initialize the root inode" will take care of it.

Yeah, that's what I've actually done.

> >>>  	}
> >>> @@ -792,7 +800,12 @@ static int smack_bprm_secureexec(struct linux_binprm *bprm)
> >>>   */
> >>>  static int smack_inode_alloc_security(struct inode *inode)
> >>>  {
> >>> -	struct smack_known *skp = smk_of_current();
> >>> +	struct smack_known *skp;
> >>> +
> >>> +	if (sb_in_userns(inode->i_sb))
> >>> +		skp = ((struct superblock_smack *)(inode->i_sb->s_security))->smk_default;
> >>> +	else
> >>> +		skp = smk_of_current();
> >> This should be left alone.
> >> smack_inode_init_security is where you could disallow access that doesn't
> >> legitimately result in a Rubble label on the file. It's something like
> >>
> >> 	... after the call may = smk_access_entry(...)
> >> 	if (sb_in_userns(inode->i_sb))
> >> 		if (skp != dsp && (may & MAY_TRANSMUTE) == 0)
> >> 			return -EACCES; 
> > I'm not getting how this covers all cases.
> >
> > So we've set the transmute flag on the root inode. Files and directories
> > created in the root directory get the same label, and directories also
> > get the transmute attribute. That's all fine.
> >
> > What about an existing directory in the filesystem that already has a
> > Slate label? I'm not getting what happens with this directory, or for
> > new files created in this directory, which also relates to my other
> > questions below.
> >
> > Also an aside - smk_access_entry looks weird. may is initialized to
> > -ENOENT, and then rule_list is searched for a rule which matches the
> > object and subject labels. Presumably it's possible that no rule could
> > be found, otherwise the prior initialization of may is pointless. If
> > this happens the following code treats it as though it always contains
> > access flags even though it might contain -ENOENT. Nothing bad actually
> > > happens with a two's complement representation of -ENOENT since it will
> > just set a bit that's already set, but it still seems like it should
> > have a may > 0 condition, for clarity if for no other reason.
> 
> My suggested code is just wrong. I wasn't looking at the whole code,
> only the patch, and got myself confused. Apologies.
> 
> If we want to go straight for the jugular how about this? I'm assuming
> that inode->i_sb->s_bdev->bd_inode is the inode of the backing store.

Yes.

> static int smack_inode_permission(struct inode *inode, int mask)
> {
> 	struct smk_audit_info ad;
> 	int no_block = mask & MAY_NOT_BLOCK;
> 	int rc;
> 
> 	mask &= (MAY_READ|MAY_WRITE|MAY_EXEC|MAY_APPEND);
> 	/*
> 	 * No permission to check. Existence test. Yup, it's there.
> 	 */
> 	if (mask == 0)
> 		return 0;
> 
> +	if (sb_in_userns(inode->i_sb) &&
> +	    smk_of_inode(inode) != smk_of_inode(inode->i_sb->s_bdev->bd_inode))
> +		return -EACCES;
> +
> 	/* May be droppable after audit */
> 	if (no_block)
> 		return -ECHILD;
> 	smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_INODE);
> 	smk_ad_setfield_u_fs_inode(&ad, inode);
> 	rc = smk_curacc(smk_of_inode(inode), mask, &ad);
> 	rc = smk_bu_inode(inode, mask, rc);
> 	return rc;
> }

Hmm, okay. I think I've been a little confused all this time about how
you want to handle these unprivileged mounts.

Originally I thought you wanted all objects in the filesystem to get the
same label as the backing store. That's what I tried to implement
originally, i.e. smk_root=smk_default=smk_of_inode(...->bd_inode), then
assign every object (new and existing) smk_default and completely ignore
the labels on disk.

This is what I currently think you want for user ns mounts:

 1. smk_root and smk_default are assigned the label of the backing
    device.
 2. s_root is assigned the transmute property.
 3. For existing files:
    a. Files with the same label as the backing device are accessible.
    b. Files with any other label are not accessible.
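Items 1 and 2 above, the mount-time half, could be modelled in
userspace along these lines. This is a sketch under assumptions: the
model_* names are made up, the fallback to the caller's label when there
is no backing device follows the untested diff quoted earlier, and
giving the transmute property only to user namespace mounts is my
reading of item 2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's structures. */
struct model_label { const char *name; };

struct model_sb_smack {
	const struct model_label *smk_root;
	const struct model_label *smk_default;
	bool root_transmute;	/* models SMK_INODE_TRANSMUTE on s_root */
};

/*
 * Models the unprivileged branch of mount setup: prefer the label of the
 * backing store for a user namespace mount, else fall back to the caller.
 */
static void model_unprivileged_mount(struct model_sb_smack *sp,
				     bool in_userns,
				     const struct model_label *backing,
				     const struct model_label *current)
{
	const struct model_label *skp;

	assert(sp != NULL && current != NULL);

	skp = (in_userns && backing != NULL) ? backing : current;
	sp->smk_root = skp;		/* item 1 */
	sp->smk_default = skp;		/* item 1 */
	sp->root_transmute = in_userns;	/* item 2, assumed userns-only */
}
```

Item 3 is then a per-inode comparison against the same backing store
label at access time.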

If this is right, there are a couple lingering questions in my mind.

First, what happens with files created in directories with the same
label as the backing device but without the transmute property set? The
inode for the new file will initially be labeled with smk_of_current(),
but then during d_instantiate it will get smk_default and thus end up
with the label we want. So that seems okay.

The second is whether files with the SMACK64EXEC attribute are still a
problem. It seems it is, for files with the same label as the backing
store at least. I think we can simply skip the code that reads out this
xattr and sets smk_task for user ns mounts, or else skip assigning the
label to the new task in bprm_set_creds. The latter seems more
consistent with the approach you've suggested for dealing with labels
from disk.

So I guess all of that seems okay, though perhaps a bit restrictive
given that the user who mounted the filesystem already has full access
to the backing store.

Please let me know whether or not this matches up with what you are
thinking, then I can proceed with the implementation.

Thanks,
Seth



* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-22 15:56                         ` Seth Forshee
@ 2015-07-22 18:10                           ` Casey Schaufler
  2015-07-22 19:32                             ` Seth Forshee
  0 siblings, 1 reply; 69+ messages in thread
From: Casey Schaufler @ 2015-07-22 18:10 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Andy Lutomirski, Eric W. Biederman, Alexander Viro,
	Linux FS Devel, LSM List, SELinux-NSA, Serge Hallyn,
	linux-kernel, Casey Schaufler

On 7/22/2015 8:56 AM, Seth Forshee wrote:
> On Tue, Jul 21, 2015 at 06:52:31PM -0700, Casey Schaufler wrote:
>> On 7/21/2015 1:35 PM, Seth Forshee wrote:
>>> On Thu, Jul 16, 2015 at 05:59:22PM -0700, Andy Lutomirski wrote:
>>>> On Thu, Jul 16, 2015 at 5:45 PM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>>> On 7/16/2015 4:29 PM, Andy Lutomirski wrote:
>>>>>> I really don't see the benefit of making up extra rules that apply to
>>>>>> users outside a userns who try to access specifically a filesystem
>>>>>> with backing store.  They wouldn't make sense for filesystems without
>>>>>> backing store.
>>>>> Sure it would. For Smack, it would be the label a file would be
>>>>> created with, which would be the label of the process creating
>>>>> the memory based filesystem. For SELinux the rules are a
>>>>> touch more sophisticated, but I'm sure that Paul or Stephen could
>>>>> come up with how to determine it.
>>>>>
>>>>> The point, looping all the way back to the beginning, where we
>>>>> were talking about just ignoring the labels on the filesystem,
>>>>> is that if you use the same Smack label on the files in the
>>>>> filesystem as the backing store file has, we'll all be happy.
>>>>> If that label isn't something the user can write to, he won't be
>>>>> able to write to the mounted objects, either. If there is no
>>>>> backing store then use the label of the process creating the
>>>>> filesystem, which will be the user, which will mean everything
>>>>> will work hunky dory.
>>>>>
>>>>> Yes, there's work involved, but I doubt there's a lot. Getting
>>>>> the label from the backing store or the creating process is
>>>>> simple enough.
>>>>>
>>> So something like the diff below (untested)?
>> I think that this is close, and quite good for someone
>> who isn't very familiar with Smack. It's definitely headed
>> in the right direction.
>>
>>> All I'm really doing is setting smk_default as you describe above and
>>> then using it instead of smk_of_current() in
>>> smack_inode_alloc_security() and instead of the label from the disk in
>>> smack_d_instantiate().
>> Let's say your backing store is a file labeled Rubble.
>>
>> mount -o smackfsroot=Rubble,smackfsdef=Rubble ...
>>
>> It is completely reasonable for a process labeled Flintstone to
>> have rwxa access to a file labeled Rubble.
>>
>> Smack rule: Flintstone Rubble rwxa
>>
>> In the case of writing to an existing Rubble file, what you
>> have looks fine. What's not so great is that if the Flintstone
>> process creates a file, it should be labeled Flintstone. Your
>> use of smk_default is going to violate the principle
>> of least astonishment, and break the Smack policy as well.
>>
>> Let's make a minor change. Instead of using smackfsroot let's
>> use smackfstransmute and a slightly different access rule:
>>
>> mount -o smackfstransmute=Rubble,smackfsdef=Rubble ...
>>
>> Smack rule: Flintstone Rubble rwxat
>>
>> Now the only change we have to make to the Smack code is
>> that we don't want to create any files unless either the
>> process is labeled Rubble or the rule allowing the creation
>> has the "t" for transmute access. That should ensure that
>> everything is labeled Rubble. If it isn't, someone has mucked
>> with the metadata in a detectable way.
> All right, that kind of makes sense, but I'm still missing some pieces.
> Questions follow.
>
>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>>> index 32f598db0b0d..4597420ab933 100644
>>> --- a/include/linux/fs.h
>>> +++ b/include/linux/fs.h
>>> @@ -1486,6 +1486,10 @@ static inline void sb_start_intwrite(struct super_block *sb)
>>>  	__sb_start_write(sb, SB_FREEZE_FS, true);
>>>  }
>>>  
>>> +static inline bool sb_in_userns(struct super_block *sb)
>>> +{
>>> +	return sb->s_user_ns != &init_user_ns;
>>> +}
>>>  
>>>  extern bool inode_owner_or_capable(const struct inode *inode);
>>>  
>>> diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
>>> index a143328f75eb..591fd19294e7 100644
>>> --- a/security/smack/smack_lsm.c
>>> +++ b/security/smack/smack_lsm.c
>>> @@ -255,6 +255,10 @@ static struct smack_known *smk_fetch(const char *name, struct inode *ip,
>>>  	char *buffer;
>>>  	struct smack_known *skp = NULL;
>>>  
>>> +	/* Should never fetch xattrs from untrusted mounts */
>>> +	if (WARN_ON(sb_in_userns(ip->i_sb)))
>>> +		return ERR_PTR(-EPERM);
>>> +
>> Go ahead and fetch it, we'll check to make sure it's viable later.
>>
>>>  	if (ip->i_op->getxattr == NULL)
>>>  		return ERR_PTR(-EOPNOTSUPP);
>>>  
>>> @@ -656,10 +660,14 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
>>>  		 */
>>>  		if (specified)
>>>  			return -EPERM;
>>> +
>>>  		/*
>>> -		 * Unprivileged mounts get root and default from the caller.
>>> +		 * User namespace mounts get root and default from the backing
>>> +		 * store, if there is one. Other unprivileged mounts get them
>>> +		 * from the caller.
>>>  		 */
>>> -		skp = smk_of_current();
>>> +		skp = (sb_in_userns(sb) && sb->s_bdev) ?
>>> +			smk_of_inode(sb->s_bdev->bd_inode) : smk_of_current();
>>>  		sp->smk_root = skp;
>>>  		sp->smk_default = skp;
>> 			sp->smk_flags |= SMK_INODE_TRANSMUTE;
> I assume that you meant skp and not sp here.

Actually, neither is correct. You want to set SMK_INODE_TRANSMUTE
in the smk_flags field of the root inode. That's easy:

			transmute = 1;

and the code after "Initialize the root inode" will take care of it.


>>>  	}
>>> @@ -792,7 +800,12 @@ static int smack_bprm_secureexec(struct linux_binprm *bprm)
>>>   */
>>>  static int smack_inode_alloc_security(struct inode *inode)
>>>  {
>>> -	struct smack_known *skp = smk_of_current();
>>> +	struct smack_known *skp;
>>> +
>>> +	if (sb_in_userns(inode->i_sb))
>>> +		skp = ((struct superblock_smack *)(inode->i_sb->s_security))->smk_default;
>>> +	else
>>> +		skp = smk_of_current();
>> This should be left alone.
>> smack_inode_init_security is where you could disallow access that doesn't
>> legitimately result in a Rubble label on the file. It's something like
>>
>> 	... after the call may = smk_access_entry(...)
>> 	if (sb_in_userns(inode->i_sb))
>> 		if (skp != dsp && (may & MAY_TRANSMUTE) == 0)
>> 			return -EACCES; 
> I'm not getting how this covers all cases.
>
> So we've set the transmute flag on the root inode. Files and directories
> created in the root directory get the same label, and directories also
> get the transmute attribute. That's all fine.
>
> What about an existing directory in the filesystem that already has a
> Slate label? I'm not getting what happens with this directory, or for
> new files created in this directory, which also relates to my other
> questions below.
>
> Also an aside - smk_access_entry looks weird. may is initialized to
> -ENOENT, and then rule_list is searched for a rule which matches the
> object and subject labels. Presumably it's possible that no rule could
> be found, otherwise the prior initialization of may is pointless. If
> this happens the following code treats it as though it always contains
> access flags even though it might contain -ENOENT. Nothing bad actually
> happens with a two's complement representation of -ENOENT since it will
> just set a bit that's already set, but it still seems like it should
> have a may > 0 condition, for clarity if for no other reason.

My suggested code is just wrong. I wasn't looking at the whole code,
only the patch, and got myself confused. Apologies.

If we want to go straight for the jugular how about this? I'm assuming
that inode->i_sb->s_bdev->bd_inode is the inode of the backing store.

static int smack_inode_permission(struct inode *inode, int mask)
{
	struct smk_audit_info ad;
	int no_block = mask & MAY_NOT_BLOCK;
	int rc;

	mask &= (MAY_READ|MAY_WRITE|MAY_EXEC|MAY_APPEND);
	/*
	 * No permission to check. Existence test. Yup, it's there.
	 */
	if (mask == 0)
		return 0;

+	if (sb_in_userns(inode->i_sb) &&
+	    smk_of_inode(inode) != smk_of_inode(inode->i_sb->s_bdev->bd_inode))
+		return -EACCES;
+
	/* May be droppable after audit */
	if (no_block)
		return -ECHILD;
	smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_INODE);
	smk_ad_setfield_u_fs_inode(&ad, inode);
	rc = smk_curacc(smk_of_inode(inode), mask, &ad);
	rc = smk_bu_inode(inode, mask, rc);
	return rc;
}
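
As a rough userspace model of the label pre-check in that function (not
kernel code): the struct names, their fields, and the helper
label_mismatch_denies() are all made up for illustration. Labels are
compared by pointer, as the snippet compares smack_known pointers. The
NULL test for a mount with no backing store is an added assumption,
mirroring the sb->s_bdev test in the patch under discussion; the snippet
itself dereferences s_bdev unconditionally.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct label {
	const char *name;
};

struct mount_model {
	bool in_userns;			/* mounted from a non-initial user ns */
	const struct label *backing;	/* backing store label, or NULL */
};

struct inode_model {
	const struct mount_model *mnt;
	const struct label *label;
};

/*
 * Returns true if the pre-check alone rejects the access: the mount
 * belongs to a user namespace, has a backing store, and the inode's
 * label differs from the backing store's label.
 */
static bool label_mismatch_denies(const struct inode_model *inode)
{
	const struct mount_model *mnt = inode->mnt;

	/* Assumption: host mounts and mounts without a backing store skip the check. */
	if (!mnt->in_userns || mnt->backing == NULL)
		return false;
	return inode->label != mnt->backing;
}
```

In this model a Slate-labeled inode on a userns mount backed by a Rubble
file is refused before any rule lookup, while the same inode on a host
mount falls through to the normal access check.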


>
>>>  	inode->i_security = new_inode_smack(skp);
>>>  	if (inode->i_security == NULL)
>>> @@ -3175,6 +3188,11 @@ static void smack_d_instantiate(struct dentry *opt_dentry, struct inode *inode)
>>>  			break;
>>>  		}
>>>  		/*
>>> +		 * Don't use labels from xattrs for unprivileged mounts.
>>> +		 */
>>> +		if (sb_in_userns(inode->i_sb))
>>> +			break;
>>> +		/*
>> Again, use the label. Just check to make sure it's what you expect.
> What happens if it's not what I expect? smack_d_instantiate cannot fail
> ... so just use the default label? In that case why bother reading it at
> all? Or would we actually want to change the on-disk label if it didn't
> match?
>
>>>  		 * No xattr support means, alas, no SMACK label.
>>>  		 * Use the aforeapplied default.
>>>  		 * It would be curious if the label of the task
>> Also untested.
>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> Please read the FAQ at  http://www.tux.org/lkml/
>>>


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-22 16:52                         ` Austin S Hemmelgarn
@ 2015-07-22 17:41                           ` J. Bruce Fields
  2015-07-23  1:51                             ` Dave Chinner
  0 siblings, 1 reply; 69+ messages in thread
From: J. Bruce Fields @ 2015-07-22 17:41 UTC (permalink / raw)
  To: Austin S Hemmelgarn
  Cc: Dave Chinner, Eric W. Biederman, Casey Schaufler,
	Andy Lutomirski, Seth Forshee, Alexander Viro, Linux FS Devel,
	LSM List, SELinux-NSA, Serge Hallyn, linux-kernel

On Wed, Jul 22, 2015 at 12:52:58PM -0400, Austin S Hemmelgarn wrote:
> On 2015-07-22 10:09, J. Bruce Fields wrote:
> >On Wed, Jul 22, 2015 at 05:56:40PM +1000, Dave Chinner wrote:
> >>On Tue, Jul 21, 2015 at 01:37:21PM -0400, J. Bruce Fields wrote:
> >>>On Fri, Jul 17, 2015 at 12:47:35PM +1000, Dave Chinner wrote:
> >>>So, for example, a screwed up on-disk directory structure shouldn't
> >>>result in creating a cycle in the dcache and then deadlocking.
> >>
> >>Therein lies the problem: how do you detect such structural defects
> >>without doing a full structure validation?
> >
> >You can prevent cycles in a graph if you can prevent adding an edge
> >which would be part of a cycle.
> >
> Except if the user can write to the filesystem's backing storage (be
> it a device or a file), and has sufficient knowledge of the on-disk
> structures, they can create all the cycles they want in the
> metadata. So unless the kernel builds the graph internally by
> parsing the metadata _and_ has some way to detect that the on-disk
> metadata has hit a cycle (which may not just involve 2 items),

Understood.  Again, see the d_ancestor call in d_splice_alias, this is
exactly what it checks for.

> then
> you still have the potential for a DoS attack.

> Trust me, I've done this before (quite a while back when I was just
> starting out with programming on Linux) with hard-link cycles in an
> ext4 filesystem in a virtual machine just to see what would happen
> (IIRC, something deadlocked, I can't remember though if it was fsck
> or trying to access the file once the FS was mounted) (and in fact,
> I think I may try this again just to see if anything has changed).

I've also seen bugs caused by loops in corrupted ext4 filesystems.  As
far as I know, they're fixed as of 95ad5c291313b.

(I mentioned the example of dcache loops because it's something I
happened to run across before.  I'm sure there are any number of cases
where we need similar checking to keep internal data structures
consistent in the face of unexpected filesystem content.)

--b.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-22 14:09                       ` J. Bruce Fields
@ 2015-07-22 16:52                         ` Austin S Hemmelgarn
  2015-07-22 17:41                           ` J. Bruce Fields
  0 siblings, 1 reply; 69+ messages in thread
From: Austin S Hemmelgarn @ 2015-07-22 16:52 UTC (permalink / raw)
  To: J. Bruce Fields, Dave Chinner
  Cc: Eric W. Biederman, Casey Schaufler, Andy Lutomirski,
	Seth Forshee, Alexander Viro, Linux FS Devel, LSM List,
	SELinux-NSA, Serge Hallyn, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1497 bytes --]

On 2015-07-22 10:09, J. Bruce Fields wrote:
> On Wed, Jul 22, 2015 at 05:56:40PM +1000, Dave Chinner wrote:
>> On Tue, Jul 21, 2015 at 01:37:21PM -0400, J. Bruce Fields wrote:
>>> On Fri, Jul 17, 2015 at 12:47:35PM +1000, Dave Chinner wrote:
>>> So, for example, a screwed up on-disk directory structure shouldn't
>>> result in creating a cycle in the dcache and then deadlocking.
>>
>> Therein lies the problem: how do you detect such structural defects
>> without doing a full structure validation?
>
> You can prevent cycles in a graph if you can prevent adding an edge
> which would be part of a cycle.
>
Except if the user can write to the filesystem's backing storage (be it 
a device or a file), and has sufficient knowledge of the on-disk 
structures, they can create all the cycles they want in the metadata. 
So unless the kernel builds the graph internally by parsing the metadata 
_and_ has some way to detect that the on-disk metadata has hit a cycle 
(which may not just involve 2 items), then you still have the potential 
for a DoS attack.

Trust me, I've done this before (quite a while back when I was just 
starting out with programming on Linux) with hard-link cycles in an ext4 
filesystem in a virtual machine just to see what would happen (IIRC, 
something deadlocked, I can't remember though if it was fsck or trying 
to access the file once the FS was mounted) (and in fact, I think I may 
try this again just to see if anything has changed).


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3019 bytes --]

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-22  1:52                       ` Casey Schaufler
@ 2015-07-22 15:56                         ` Seth Forshee
  2015-07-22 18:10                           ` Casey Schaufler
  0 siblings, 1 reply; 69+ messages in thread
From: Seth Forshee @ 2015-07-22 15:56 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Andy Lutomirski, Eric W. Biederman, Alexander Viro,
	Linux FS Devel, LSM List, SELinux-NSA, Serge Hallyn,
	linux-kernel

On Tue, Jul 21, 2015 at 06:52:31PM -0700, Casey Schaufler wrote:
> On 7/21/2015 1:35 PM, Seth Forshee wrote:
> > On Thu, Jul 16, 2015 at 05:59:22PM -0700, Andy Lutomirski wrote:
> >> On Thu, Jul 16, 2015 at 5:45 PM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> >>> On 7/16/2015 4:29 PM, Andy Lutomirski wrote:
> >>>> I really don't see the benefit of making up extra rules that apply to
> >>>> users outside a userns who try to access specifically a filesystem
> >>>> with backing store.  They wouldn't make sense for filesystems without
> >>>> backing store.
> >>> Sure it would. For Smack, it would be the label a file would be
> >>> created with, which would be the label of the process creating
> >>> the memory based filesystem. For SELinux the rules are a
> >>> touch more sophisticated, but I'm sure that Paul or Stephen could
> >>> come up with how to determine it.
> >>>
> >>> The point, looping all the way back to the beginning, where we
> >>> were talking about just ignoring the labels on the filesystem,
> >>> is that if you use the same Smack label on the files in the
> >>> filesystem as the backing store file has, we'll all be happy.
> >>> If that label isn't something the user can write to, he won't be
> >>> able to write to the mounted objects, either. If there is no
> >>> backing store then use the label of the process creating the
> >>> filesystem, which will be the user, which will mean everything
> >>> will work hunky dory.
> >>>
> >>> Yes, there's work involved, but I doubt there's a lot. Getting
> >>> the label from the backing store or the creating process is
> >>> simple enough.
> >>>
> > So something like the diff below (untested)?
> 
> I think that this is close, and quite good for someone
> who isn't very familiar with Smack. It's definitely headed
> in the right direction.
> 
> > All I'm really doing is setting smk_default as you describe above and
> > then using it instead of smk_of_current() in
> > smack_inode_alloc_security() and instead of the label from the disk in
> > smack_d_instantiate().
> 
> Let's say your backing store is a file labeled Rubble.
> 
> mount -o smackfsroot=Rubble,smackfsdef=Rubble ...
> 
> It is completely reasonable for a process labeled Flintstone to
> have rwxa access to a file labeled Rubble.
> 
> Smack rule: Flintstone Rubble rwxa
> 
> In the case of writing to an existing Rubble file, what you
> have looks fine. What's not so great is that if the Flintstone
> process creates a file, it should be labeled Flintstone. Your
> use of smk_default is going to violate the principle of least
> astonishment, and break the Smack policy as well.
> 
> Let's make a minor change. Instead of using smackfsroot let's
> use smackfstransmute and a slightly different access rule:
> 
> mount -o smackfstransmute=Rubble,smackfsdef=Rubble ...
> 
> Smack rule: Flintstone Rubble rwxat
> 
> Now the only change we have to make to the Smack code is
> that we don't want to create any files unless either the
> process is labeled Rubble or the rule allowing the creation
> has the "t" for transmute access. That should ensure that
> everything is labeled Rubble. If it isn't, someone has mucked
> with the metadata in a detectable way.

All right, that kind of makes sense, but I'm still missing some pieces.
Questions follow.

> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 32f598db0b0d..4597420ab933 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1486,6 +1486,10 @@ static inline void sb_start_intwrite(struct super_block *sb)
> >  	__sb_start_write(sb, SB_FREEZE_FS, true);
> >  }
> >  
> > +static inline bool sb_in_userns(struct super_block *sb)
> > +{
> > +	return sb->s_user_ns != &init_user_ns;
> > +}
> >  
> >  extern bool inode_owner_or_capable(const struct inode *inode);
> >  
> > diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
> > index a143328f75eb..591fd19294e7 100644
> > --- a/security/smack/smack_lsm.c
> > +++ b/security/smack/smack_lsm.c
> > @@ -255,6 +255,10 @@ static struct smack_known *smk_fetch(const char *name, struct inode *ip,
> >  	char *buffer;
> >  	struct smack_known *skp = NULL;
> >  
> > +	/* Should never fetch xattrs from untrusted mounts */
> > +	if (WARN_ON(sb_in_userns(ip->i_sb)))
> > +		return ERR_PTR(-EPERM);
> > +
> 
> Go ahead and fetch it, we'll check to make sure it's viable later.
> 
> >  	if (ip->i_op->getxattr == NULL)
> >  		return ERR_PTR(-EOPNOTSUPP);
> >  
> > @@ -656,10 +660,14 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
> >  		 */
> >  		if (specified)
> >  			return -EPERM;
> > +
> >  		/*
> > -		 * Unprivileged mounts get root and default from the caller.
> > +		 * User namespace mounts get root and default from the backing
> > +		 * store, if there is one. Other unprivileged mounts get them
> > +		 * from the caller.
> >  		 */
> > -		skp = smk_of_current();
> > +		skp = (sb_in_userns(sb) && sb->s_bdev) ?
> > +			smk_of_inode(sb->s_bdev->bd_inode) : smk_of_current();
> >  		sp->smk_root = skp;
> >  		sp->smk_default = skp;
> 
> 			sp->smk_flags |= SMK_INODE_TRANSMUTE;

I assume that you meant skp and not sp here.

> 
> >  	}
> > @@ -792,7 +800,12 @@ static int smack_bprm_secureexec(struct linux_binprm *bprm)
> >   */
> >  static int smack_inode_alloc_security(struct inode *inode)
> >  {
> > -	struct smack_known *skp = smk_of_current();
> > +	struct smack_known *skp;
> > +
> > +	if (sb_in_userns(inode->i_sb))
> > +		skp = ((struct superblock_smack *)(inode->i_sb->s_security))->smk_default;
> > +	else
> > +		skp = smk_of_current();
> 
> This should be left alone.
> smack_inode_init_security is where you could disallow access that doesn't
> legitimately result in a Rubble label on the file. It's something like
> 
> 	... after the call may = smk_access_entry(...)
> 	if (sb_in_userns(inode->i_sb))
> 		if (skp != dsp && (may & MAY_TRANSMUTE) == 0)
> 			return -EACCES; 

I'm not getting how this covers all cases.

So we've set the transmute flag on the root inode. Files and directories
created in the root directory get the same label, and directories also
get the transmute attribute. That's all fine.

What about an existing directory in the filesystem that already has a
Slate label? I'm not getting what happens with this directory, or for
new files created in this directory, which also relates to my other
questions below.

Also an aside - smk_access_entry looks weird. may is initialized to
-ENOENT, and then rule_list is searched for a rule which matches the
object and subject labels. Presumably it's possible that no rule could
be found, otherwise the prior initialization of may is pointless. If
this happens the following code treats it as though it always contains
access flags even though it might contain -ENOENT. Nothing bad actually
happens with a two's complement representation of -ENOENT since it will
just set a bit that's already set, but it still seems like it should
have a may > 0 condition, for clarity if for no other reason.
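
To make the aside concrete, here is a small userspace sketch. It assumes
the step in question is a "write implies lock" adjustment applied to
whatever may holds, and it uses made-up DEMO_* constants whose values
are assumptions for illustration, not copied from any particular tree.
The helper names are hypothetical too.

```c
#include <stdbool.h>

/* Assumed values for illustration; check the headers of the tree at hand. */
#define DEMO_ENOENT        2
#define DEMO_MAY_WRITE     0x0002
#define DEMO_MAY_TRANSMUTE 0x1000
#define DEMO_MAY_LOCK      0x2000

/* Applies "write implies lock" to any value, error code or not. */
static int apply_write_implies_lock(int may)
{
	if ((may & DEMO_MAY_WRITE) == DEMO_MAY_WRITE)
		may |= DEMO_MAY_LOCK;
	return may;
}

/* Guarded test: only read access bits out of a positive value. */
static bool has_transmute(int may)
{
	return may > 0 && (may & DEMO_MAY_TRANSMUTE) != 0;
}
```

With these values, -DEMO_ENOENT is -2, which in two's complement has
every bit set except bit 0. The write bit therefore tests as set, and
OR-ing in the lock bit changes nothing. The same bit pattern also makes
an unguarded transmute test succeed on the error value, which is why the
may > 0 guard in has_transmute() matters.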

> 
> >  	inode->i_security = new_inode_smack(skp);
> >  	if (inode->i_security == NULL)
> > @@ -3175,6 +3188,11 @@ static void smack_d_instantiate(struct dentry *opt_dentry, struct inode *inode)
> >  			break;
> >  		}
> >  		/*
> > +		 * Don't use labels from xattrs for unprivileged mounts.
> > +		 */
> > +		if (sb_in_userns(inode->i_sb))
> > +			break;
> > +		/*
> 
> Again, use the label. Just check to make sure it's what you expect.

What happens if it's not what I expect? smack_d_instantiate cannot fail
... so just use the default label? In that case why bother reading it at
all? Or would we actually want to change the on-disk label if it didn't
match?

> 
> >  		 * No xattr support means, alas, no SMACK label.
> >  		 * Use the aforeapplied default.
> >  		 * It would be curious if the label of the task
> 
> Also untested.
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-22  7:56                     ` Dave Chinner
@ 2015-07-22 14:09                       ` J. Bruce Fields
  2015-07-22 16:52                         ` Austin S Hemmelgarn
  0 siblings, 1 reply; 69+ messages in thread
From: J. Bruce Fields @ 2015-07-22 14:09 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Eric W. Biederman, Casey Schaufler, Andy Lutomirski,
	Seth Forshee, Alexander Viro, Linux FS Devel, LSM List,
	SELinux-NSA, Serge Hallyn, linux-kernel

On Wed, Jul 22, 2015 at 05:56:40PM +1000, Dave Chinner wrote:
> On Tue, Jul 21, 2015 at 01:37:21PM -0400, J. Bruce Fields wrote:
> > On Fri, Jul 17, 2015 at 12:47:35PM +1000, Dave Chinner wrote:
> > > On Thu, Jul 16, 2015 at 07:42:03PM -0500, Eric W. Biederman wrote:
> > > > Dave Chinner <david@fromorbit.com> writes:
> > > > > The key difference is that desktops only do this when you physically
> > > > > plug in a device. With unprivileged mounts, a hostile attacker
> > > > > doesn't need physical access to the machine to exploit lurking
> > > > > kernel filesystem bugs. i.e. they can just use loopback mounts, and
> > > > > they can keep mounting corrupted images until they find something
> > > > > that works.
> > > > 
> > > > Yep.  That magnifies the problem quite a bit.
> > > > 
> > > > > User namespaces are supposed to provide trust separation.  The
> > > > > kernel filesystems simply aren't hardened against unprivileged
> > > > > attacks from below - there is a trust relationship between root and
> > > > > the filesystem in that they are the only things that can write to
> > > > > the disk. Mounts from within a userns destroy this relationship as
> > > > > the userns root, by definition, is not a trusted actor.
> > > > 
> > > > I talked to Ted Tso a while back and ext4 is at least in principle
> > > > already hardened against that kind of attack.  I am not certain I
> > > > believe it, but if it is true I think it is fantastic.
> > > 
> > > No, it's not. No filesystem is, because to harden against such
> > > attacks requires complete verification of all metadata when it is
> > > read from disk, before it is used, or some method of ensuring the
> > > block was not tampered with. CRCs are not sufficient, because they
> > > can be tampered with, too.
> > > 
> > > The only way a filesystem would be able to trust what it reads from
> > > disk has not been tampered with in a system with untrusted mounts is
> > > if it has some kind of cryptographically secure signature in the
> > > metadata and the attacker is unable to access the key for that
> > > signature.
> > 
> > Preventing tampering is a little different from protecting the kernel
> > from attack, isn't it?  I thought the latter was what people were asking
> > about.
> 
> People might be asking for the latter, but the only attack vector
> that can be made against filesystems from below is via tampering
> with the on-disk structure.
> 
> An untrusted user in an untrusted container can construct arbitrary
> untrusted filesystem structures and get them parsed by a context
> running as $DIETY that assumes the structure is from a trusted
> source.  What can possibly go wrong?
> 
> IOWs, to protect the kernel against attack from untrusted filesystem
> images, we either have to be able to guarantee the image can not be
> modified by untrusted parties (i.e.  needs to be created with
> signed tools, contain only signed filesystem metadata and
> signed/encrypted data),

I don't think that works--who exactly would be the "trusted party"?  It
can't be this kernel or this hardware--users expect to be able to mount
filesystems created by older kernels, on other machines, running other
distributions (even other operating systems).  It can't be the
user--then any user could compromise the kernel by signing a bad
filesystem.

Authenticating the creator of the filesystem might be useful for other
reasons, but it sounds to me like at best only very weak protection
against corrupted filesystems.

As a similar example, browser makers are stuck both implementing SSL and
hardening their code against malicious content.  Those address separate
problems.

> or we have to sandbox the filesystem parsing
> code completely (i.e. fuse).
> 
> > So, for example, a screwed up on-disk directory structure shouldn't
> > result in creating a cycle in the dcache and then deadlocking.
> 
> Therein lies the problem: how do you detect such structural defects
> without doing a full structure validation?

You can prevent cycles in a graph if you can prevent adding an edge
which would be part of a cycle.

For the dcache, it's d_splice_alias that does that (using d_ancestor).

(And I believe the main motivation for that was NFS, where you don't
need a filesystem cycle, just a server-side race that can briefly make
it look like there's one--an example of the changing filesystem problem
that you point out below.)
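
A minimal userspace sketch of that idea, under stated assumptions: this
is not the dcache code, struct node is a made-up stand-in that keeps
only a parent pointer, and is_ancestor() and attach() are hypothetical
helpers. The point is only that refusing an edge whose child is already
an ancestor of the new parent keeps the structure a tree, with no full
validation pass over the on-disk data.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a dentry: only the parent link matters here. */
struct node {
	struct node *parent;	/* NULL for the root */
};

/* True if 'a' is a proper ancestor of 'n', found by walking parent links. */
static bool is_ancestor(const struct node *a, const struct node *n)
{
	for (n = n->parent; n != NULL; n = n->parent)
		if (n == a)
			return true;
	return false;
}

/*
 * Attach 'child' under 'parent' unless that edge would close a cycle,
 * i.e. unless 'child' is 'parent' itself or one of its ancestors.
 * Returns 0 on success, -1 if the edge is refused.
 */
static int attach(struct node *child, struct node *parent)
{
	if (child == parent || is_ancestor(child, parent))
		return -1;
	child->parent = parent;
	return 0;
}
```

The check costs one walk up the parent chain per attach, so its cost is
bounded by the depth of the tree rather than by the size of the image.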

> e.g. cyclic links may
> only manifest when completely unrelated pieces of metadata are linked
> together in a specific way.
>
> Further, the problem is not restricted to validation at mount time -
> if the user can write to the filesystem image file, then they can
> modify it after it has been mounted, too. That means the attacker
> may be someone who has broken into a container, not necessarily the
> user you trusted with unprivileged mounts. That means every cold
> metadata read needs to be treated with suspicion, not just at mount
> time.

Yes.  Agreed that this is difficult.  (I can't actually give an example
of an existing problem of this sort, but I'd be surprised if they don't
exist.)

--b.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-21 17:37                   ` J. Bruce Fields
@ 2015-07-22  7:56                     ` Dave Chinner
  2015-07-22 14:09                       ` J. Bruce Fields
  0 siblings, 1 reply; 69+ messages in thread
From: Dave Chinner @ 2015-07-22  7:56 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Eric W. Biederman, Casey Schaufler, Andy Lutomirski,
	Seth Forshee, Alexander Viro, Linux FS Devel, LSM List,
	SELinux-NSA, Serge Hallyn, linux-kernel

On Tue, Jul 21, 2015 at 01:37:21PM -0400, J. Bruce Fields wrote:
> On Fri, Jul 17, 2015 at 12:47:35PM +1000, Dave Chinner wrote:
> > On Thu, Jul 16, 2015 at 07:42:03PM -0500, Eric W. Biederman wrote:
> > > Dave Chinner <david@fromorbit.com> writes:
> > > > The key difference is that desktops only do this when you physically
> > > > plug in a device. With unprivileged mounts, a hostile attacker
> > > > doesn't need physical access to the machine to exploit lurking
> > > > kernel filesystem bugs. i.e. they can just use loopback mounts, and
> > > > they can keep mounting corrupted images until they find something
> > > > that works.
> > > 
> > > Yep.  That magnifies the problem quite a bit.
> > > 
> > > > User namespaces are supposed to provide trust separation.  The
> > > > kernel filesystems simply aren't hardened against unprivileged
> > > > attacks from below - there is a trust relationship between root and
> > > > the filesystem in that they are the only things that can write to
> > > > the disk. Mounts from within a userns destroy this relationship as
> > > > the userns root, by definition, is not a trusted actor.
> > > 
> > > I talked to Ted Tso a while back and ext4 is at least in principle
> > > already hardened against that kind of attack.  I am not certain I
> > > believe it, but if it is true I think it is fantastic.
> > 
> > No, it's not. No filesystem is, because to harden against such
> > attacks requires complete verification of all metadata when it is
> > read from disk, before it is used, or some method of ensuring the
> > block was not tampered with. CRCs are not sufficient, because they
> > can be tampered with, too.
> > 
> > The only way a filesystem would be able to trust what it reads from
> > disk has not been tampered with in a system with untrusted mounts is
> > if it has some kind of cryptographically secure signature in the
> > metadata and the attacker is unable to access the key for that
> > signature.
> 
> Preventing tampering is a little different from protecting the kernel
> from attack, isn't it?  I thought the latter was what people were asking
> about.

People might be asking for the latter, but the only attack vector
that can be made against filesystems from below is via tampering
with the on-disk structure.

An untrusted user in an untrusted container can construct arbitrary
untrusted filesystem structures and get them parsed by a context
running as $DIETY that assumes the structure is from a trusted
source.  What can possibly go wrong?

IOWs, to protect the kernel against attack from untrusted filesystem
images, we either have to be able to guarantee the image can not be
modified by untrusted parties (i.e.  needs to be created with
signed tools, contain only signed filesystem metadata and
signed/encrypted data), or we have to sandbox the filesystem parsing
code completely (i.e. fuse).

> So, for example, a screwed up on-disk directory structure shouldn't
> result in creating a cycle in the dcache and then deadlocking.

Therein lies the problem: how do you detect such structural defects
without doing a full structure validation? e.g. cyclic links may
only manifest when completely unrelated pieces of metadata are linked
together in a specific way.

Further, the problem is not restricted to validation at mount time -
if the user can write to the filesystem image file, then they can
modify it after it has been mounted, too. That means the attacker
may be someone who has broken into a container, not necessarily the
user you trusted with unprivileged mounts. That means every cold
metadata read needs to be treated with suspicion, not just at mount
time.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-21 20:35                     ` Seth Forshee
@ 2015-07-22  1:52                       ` Casey Schaufler
  2015-07-22 15:56                         ` Seth Forshee
  0 siblings, 1 reply; 69+ messages in thread
From: Casey Schaufler @ 2015-07-22  1:52 UTC (permalink / raw)
  To: Seth Forshee, Andy Lutomirski
  Cc: Eric W. Biederman, Alexander Viro, Linux FS Devel, LSM List,
	SELinux-NSA, Serge Hallyn, linux-kernel, Casey Schaufler

On 7/21/2015 1:35 PM, Seth Forshee wrote:
> On Thu, Jul 16, 2015 at 05:59:22PM -0700, Andy Lutomirski wrote:
>> On Thu, Jul 16, 2015 at 5:45 PM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>> On 7/16/2015 4:29 PM, Andy Lutomirski wrote:
>>>> I really don't see the benefit of making up extra rules that apply to
>>>> users outside a userns who try to access specifically a filesystem
>>>> with backing store.  They wouldn't make sense for filesystems without
>>>> backing store.
>>> Sure it would. For Smack, it would be the label a file would be
>>> created with, which would be the label of the process creating
>>> the memory based filesystem. For SELinux the rules are a
>>> touch more sophisticated, but I'm sure that Paul or Stephen could
>>> come up with how to determine it.
>>>
>>> The point, looping all the way back to the beginning, where we
>>> were talking about just ignoring the labels on the filesystem,
>>> is that if you use the same Smack label on the files in the
>>> filesystem as the backing store file has, we'll all be happy.
>>> If that label isn't something the user can write to, he won't be
>>> able to write to the mounted objects, either. If there is no
>>> backing store then use the label of the process creating the
>>> filesystem, which will be the user, which will mean everything
>>> will work hunky dory.
>>>
>>> Yes, there's work involved, but I doubt there's a lot. Getting
>>> the label from the backing store or the creating process is
>>> simple enough.
>>>
> So something like the diff below (untested)?

I think that this is close, and quite good for someone
who isn't very familiar with Smack. It's definitely headed
in the right direction.

> All I'm really doing is setting smk_default as you describe above and
> then using it instead of smk_of_current() in
> smack_inode_alloc_security() and instead of the label from the disk in
> smack_d_instantiate().

Let's say your backing store is a file labeled Rubble.

mount -o smackfsroot=Rubble,smackfsdef=Rubble ...

It is completely reasonable for a process labeled Flintstone to
have rwxa access to a file labeled Rubble.

Smack rule: Flintstone Rubble rwxa

In the case of writing to an existing Rubble file, what you
have looks fine. What's not so great is that if the Flintstone
process creates a file, it should be labeled Flintstone. Your
use of smk_default labels it Rubble instead, which is going to
violate the principle of least astonishment, and break the
Smack policy as well.

Let's make a minor change. Instead of using smackfsroot let's
use smackfstransmute and a slightly different access rule:

mount -o smackfstransmute=Rubble,smackfsdef=Rubble ...

Smack rule: Flintstone Rubble rwxat

Now the only change we have to make to the Smack code is
to refuse to create any files unless either the process is
labeled Rubble or the rule allowing the creation has the "t"
for transmute access. That should ensure that everything is
labeled Rubble. If it isn't, someone has mucked with the
metadata in a detectable way.
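
To make the rule above concrete, here is a minimal userspace sketch of
the creation check. It is not Smack code: the model_* names and the bit
values are made up for illustration, and the real check would live in the
Smack hooks and use the kernel's own access bits.

```c
#include <stdbool.h>
#include <string.h>

/* Access bits named after the letters in a Smack rule (illustrative values). */
enum {
	MODEL_MAY_READ      = 1 << 0,
	MODEL_MAY_WRITE     = 1 << 1,
	MODEL_MAY_EXEC      = 1 << 2,
	MODEL_MAY_APPEND    = 1 << 3,
	MODEL_MAY_TRANSMUTE = 1 << 4,
};

/*
 * Userspace model, not kernel code: decide whether a task may create a
 * file on a user namespace mount whose files must all carry fs_label.
 * rule_access is the access the policy grants task_label on fs_label.
 */
static bool model_may_create(const char *task_label, const char *fs_label,
			     int rule_access)
{
	/* A task already running with the filesystem label creates matching files. */
	if (strcmp(task_label, fs_label) == 0)
		return true;
	/* Otherwise the new file only ends up labeled fs_label via transmute. */
	return (rule_access & MODEL_MAY_TRANSMUTE) != 0;
}
```

With "Flintstone Rubble rwxat" the Flintstone process may create files,
and they come out labeled Rubble. With "Flintstone Rubble rwxa" the
creation is refused.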
 

> Since a user currently needs CAP_MAC_ADMIN in
> init_user_ns to store security labels it looks like this should be
> sufficient. I'm not even sure that the inode_alloc_security hook changes
> are needed.
>
> We could allow privileged users in s_user_ns to write security labels to
> disk since they already control the backing store, as long as Smack
> didn't subsequently import them. I didn't do that here.
>
>> So what if Smack used the label of the user creating the filesystem
>> even for filesystems with backing store?  IMO this ought to be doable
>> with the LSM hooks -- it certainly seems reasonable for the LSM to be
>> aware of who created a filesystem.  In fact, I'd argue that if Smack
>> can't do this with the proposed LSM hooks, then the hooks are
>> insufficient.
> It would be very simple to use the label of the task instead.
>
> Seth
>
> ---
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 32f598db0b0d..4597420ab933 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1486,6 +1486,10 @@ static inline void sb_start_intwrite(struct super_block *sb)
>  	__sb_start_write(sb, SB_FREEZE_FS, true);
>  }
>  
> +static inline bool sb_in_userns(struct super_block *sb)
> +{
> +	return sb->s_user_ns != &init_user_ns;
> +}
>  
>  extern bool inode_owner_or_capable(const struct inode *inode);
>  
> diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
> index a143328f75eb..591fd19294e7 100644
> --- a/security/smack/smack_lsm.c
> +++ b/security/smack/smack_lsm.c
> @@ -255,6 +255,10 @@ static struct smack_known *smk_fetch(const char *name, struct inode *ip,
>  	char *buffer;
>  	struct smack_known *skp = NULL;
>  
> +	/* Should never fetch xattrs from untrusted mounts */
> +	if (WARN_ON(sb_in_userns(ip->i_sb)))
> +		return ERR_PTR(-EPERM);
> +

Go ahead and fetch it, we'll check to make sure it's viable later.

>  	if (ip->i_op->getxattr == NULL)
>  		return ERR_PTR(-EOPNOTSUPP);
>  
> @@ -656,10 +660,14 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
>  		 */
>  		if (specified)
>  			return -EPERM;
> +
>  		/*
> -		 * Unprivileged mounts get root and default from the caller.
> +		 * User namespace mounts get root and default from the backing
> +		 * store, if there is one. Other unprivileged mounts get them
> +		 * from the caller.
>  		 */
> -		skp = smk_of_current();
> +		skp = (sb_in_userns(sb) && sb->s_bdev) ?
> +			smk_of_inode(sb->s_bdev->bd_inode) : smk_of_current();
>  		sp->smk_root = skp;
>  		sp->smk_default = skp;

			sp->smk_flags |= SMK_INODE_TRANSMUTE;

>  	}
> @@ -792,7 +800,12 @@ static int smack_bprm_secureexec(struct linux_binprm *bprm)
>   */
>  static int smack_inode_alloc_security(struct inode *inode)
>  {
> -	struct smack_known *skp = smk_of_current();
> +	struct smack_known *skp;
> +
> +	if (sb_in_userns(inode->i_sb))
> +		skp = ((struct superblock_smack *)(inode->i_sb->s_security))->smk_default;
> +	else
> +		skp = smk_of_current();

This should be left alone.
smack_inode_init_security is where you could disallow access that doesn't
legitimately result in a Rubble label on the file. It's something like

	... after the call may = smk_access_entry(...)
	if (sb_in_userns(inode->i_sb))
		if (skp != dsp && (may & MAY_TRANSMUTE) == 0)
			return -EACCES; 

>  	inode->i_security = new_inode_smack(skp);
>  	if (inode->i_security == NULL)
> @@ -3175,6 +3188,11 @@ static void smack_d_instantiate(struct dentry *opt_dentry, struct inode *inode)
>  			break;
>  		}
>  		/*
> +		 * Don't use labels from xattrs for unprivileged mounts.
> +		 */
> +		if (sb_in_userns(inode->i_sb))
> +			break;
> +		/*

Again, use the label. Just check to make sure it's what you expect.

>  		 * No xattr support means, alas, no SMACK label.
>  		 * Use the aforeapplied default.
>  		 * It would be curious if the label of the task

Also untested.
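
For the "use the label, just check it" comments above, a rough userspace
sketch of what I mean follows. It is only a model: model_label_from_disk()
is a made-up name, sb_default stands in for the superblock's smk_default,
and returning NULL stands in for whatever "not accessible" ends up
meaning in the real hooks.

```c
#include <stddef.h>
#include <string.h>

/*
 * Userspace model, not kernel code: pick the label for an inode read
 * from a user namespace mount. xattr_label is what was fetched from
 * disk (NULL if the file carries no label xattr).
 * Returns the label to use, or NULL if the on-disk label is not the
 * one expected for this mount.
 */
static const char *model_label_from_disk(const char *xattr_label,
					 const char *sb_default)
{
	/* No label on disk: fall back to the mount's default. */
	if (xattr_label == NULL)
		return sb_default;
	/* The label matches what the mount is allowed to carry: accept it. */
	if (strcmp(xattr_label, sb_default) == 0)
		return xattr_label;
	/* Anything else means the metadata was tampered with: reject. */
	return NULL;
}
```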

> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-17  0:59                   ` Andy Lutomirski
  2015-07-17 14:28                     ` Serge E. Hallyn
@ 2015-07-21 20:35                     ` Seth Forshee
  2015-07-22  1:52                       ` Casey Schaufler
  1 sibling, 1 reply; 69+ messages in thread
From: Seth Forshee @ 2015-07-21 20:35 UTC (permalink / raw)
  To: Casey Schaufler, Andy Lutomirski
  Cc: Eric W. Biederman, Alexander Viro, Linux FS Devel, LSM List,
	SELinux-NSA, Serge Hallyn, linux-kernel

On Thu, Jul 16, 2015 at 05:59:22PM -0700, Andy Lutomirski wrote:
> On Thu, Jul 16, 2015 at 5:45 PM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> > On 7/16/2015 4:29 PM, Andy Lutomirski wrote:
> >> I really don't see the benefit of making up extra rules that apply to
> >> users outside a userns who try to access specifically a filesystem
> >> with backing store.  They wouldn't make sense for filesystems without
> >> backing store.
> >
> > Sure it would. For Smack, it would be the label a file would be
> > created with, which would be the label of the process creating
> > the memory based filesystem. For SELinux the rules are more a
> > touch more sophisticated, but I'm sure that Paul or Stephen could
> > come up with how to determine it.
> >
> > The point, looping all the way back to the beginning, where we
> > were talking about just ignoring the labels on the filesystem,
> > is that if you use the same Smack label on the files in the
> > filesystem as the backing store file has, we'll all be happy.
> > If that label isn't something user can write to, he won't be
> > able to write to the mounted objects, either. If there is no
> > backing store then use the label of the process creating the
> > filesystem, which will be the user, which will mean everything
> > will work hunky dory.
> >
> > Yes, there's work involved, but I doubt there's a lot. Getting
> > the label from the backing store or the creating process is
> > simple enough.
> >

So something like the diff below (untested)?

All I'm really doing is setting smk_default as you describe above and
then using it instead of smk_of_current() in
smack_inode_alloc_security() and instead of the label from the disk in
smack_d_instantiate(). Since a user currently needs CAP_MAC_ADMIN in
init_user_ns to store security labels it looks like this should be
sufficient. I'm not even sure that the inode_alloc_security hook changes
are needed.
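
Stripped of the kernel plumbing, the label selection at mount time
amounts to the following. This is a userspace model with made-up names
(struct model_mount, model_mount_label()), only meant to show the
decision, not the actual hook:

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the bits of the superblock the decision needs. */
struct model_mount {
	bool in_userns;		/* models sb_in_userns(sb) */
	const char *bdev_label;	/* label of the backing device, NULL if none */
};

/*
 * Pick the root/default label for an unprivileged mount: the backing
 * store's label for a user namespace mount that has one, otherwise the
 * label of the mounting task.
 */
static const char *model_mount_label(const struct model_mount *m,
				     const char *task_label)
{
	if (m->in_userns && m->bdev_label != NULL)
		return m->bdev_label;
	return task_label;
}
```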

We could allow privileged users in s_user_ns to write security labels to
disk since they already control the backing store, as long as Smack
didn't subsequently import them. I didn't do that here.

> So what if Smack used the label of the user creating the filesystem
> even for filesystems with backing store?  IMO this ought to be doable
> with the LSM hooks -- it certainly seems reasonable for the LSM to be
> aware of who created a filesystem.  In fact, I'd argue that if Smack
> can't do this with the proposed LSM hooks, then the hooks are
> insufficient.

It would be very simple to use the label of the task instead.

Seth

---

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 32f598db0b0d..4597420ab933 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1486,6 +1486,10 @@ static inline void sb_start_intwrite(struct super_block *sb)
 	__sb_start_write(sb, SB_FREEZE_FS, true);
 }
 
+static inline bool sb_in_userns(struct super_block *sb)
+{
+	return sb->s_user_ns != &init_user_ns;
+}
 
 extern bool inode_owner_or_capable(const struct inode *inode);
 
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index a143328f75eb..591fd19294e7 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -255,6 +255,10 @@ static struct smack_known *smk_fetch(const char *name, struct inode *ip,
 	char *buffer;
 	struct smack_known *skp = NULL;
 
+	/* Should never fetch xattrs from untrusted mounts */
+	if (WARN_ON(sb_in_userns(ip->i_sb)))
+		return ERR_PTR(-EPERM);
+
 	if (ip->i_op->getxattr == NULL)
 		return ERR_PTR(-EOPNOTSUPP);
 
@@ -656,10 +660,14 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
 		 */
 		if (specified)
 			return -EPERM;
+
 		/*
-		 * Unprivileged mounts get root and default from the caller.
+		 * User namespace mounts get root and default from the backing
+		 * store, if there is one. Other unprivileged mounts get them
+		 * from the caller.
 		 */
-		skp = smk_of_current();
+		skp = (sb_in_userns(sb) && sb->s_bdev) ?
+			smk_of_inode(sb->s_bdev->bd_inode) : smk_of_current();
 		sp->smk_root = skp;
 		sp->smk_default = skp;
 	}
@@ -792,7 +800,12 @@ static int smack_bprm_secureexec(struct linux_binprm *bprm)
  */
 static int smack_inode_alloc_security(struct inode *inode)
 {
-	struct smack_known *skp = smk_of_current();
+	struct smack_known *skp;
+
+	if (sb_in_userns(inode->i_sb))
+		skp = ((struct superblock_smack *)(inode->i_sb->s_security))->smk_default;
+	else
+		skp = smk_of_current();
 
 	inode->i_security = new_inode_smack(skp);
 	if (inode->i_security == NULL)
@@ -3175,6 +3188,11 @@ static void smack_d_instantiate(struct dentry *opt_dentry, struct inode *inode)
 			break;
 		}
 		/*
+		 * Don't use labels from xattrs for unprivileged mounts.
+		 */
+		if (sb_in_userns(inode->i_sb))
+			break;
+		/*
 		 * No xattr support means, alas, no SMACK label.
 		 * Use the aforeapplied default.
 		 * It would be curious if the label of the task

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-17  2:47                 ` Dave Chinner
@ 2015-07-21 17:37                   ` J. Bruce Fields
  2015-07-22  7:56                     ` Dave Chinner
  0 siblings, 1 reply; 69+ messages in thread
From: J. Bruce Fields @ 2015-07-21 17:37 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Eric W. Biederman, Casey Schaufler, Andy Lutomirski,
	Seth Forshee, Alexander Viro, Linux FS Devel, LSM List,
	SELinux-NSA, Serge Hallyn, linux-kernel

On Fri, Jul 17, 2015 at 12:47:35PM +1000, Dave Chinner wrote:
> On Thu, Jul 16, 2015 at 07:42:03PM -0500, Eric W. Biederman wrote:
> > Dave Chinner <david@fromorbit.com> writes:
> > 
> > > On Wed, Jul 15, 2015 at 11:47:08PM -0500, Eric W. Biederman wrote:
> > >> Casey Schaufler <casey@schaufler-ca.com> writes:
> > >> > On 7/15/2015 6:08 PM, Andy Lutomirski wrote:
> > >> >> If I mount an unprivileged filesystem, then either the contents were
> > >> >> put there *by me*, in which case letting me access them are fine, or
> > >> >> (with Seth's patches and then some) I control the backing store, in
> > >> >> which case I can do whatever I want regardless of what LSM thinks.
> > >> >>
> > >> >> So I don't see the problem.  Why would Smack or any other LSM care at
> > >> >> all, unless it wants to prevent me from mounting the fs in the first
> > >> >> place?
> > >> >
> > >> > First off, I don't cotton to the notion that you should be able
> > >> > to mount filesystems without privilege. But it seems I'm being
> > >> > outvoted on that. I suspect that there are cases where it might
> > >> > be safe, but I can't think of one off the top of my head.
> > >> 
> > >> There are two fundamental issues mounting filesystems without privilege,
> > >> by which I actually mean mounting filesystems as the root user in a user
> > >> namespace.
> > >> 
> > >> - Are the semantics safe.
> > >> - Is the extra attack surface a problem.
> > >
> > > I think the attack surface this exposes is the biggest problem
> > > facing this proposal.
> > 
> > I completely agree.
> > 
> > >> Figuring out how to make semantics safe is what we are talking about.
> > >> 
> > >> Once we sort out the semantics we can look at the handful of filesystems
> > >> like fuse where the extra attack surface is not a concern.
> > >> 
> > >> With that said desktop environments have for a long time been
> > >> automatically mounting whichever filesystem you place in your computer,
> > >> so in practice what this is really about is trying to align the kernel
> > >> with how people use filesystems.
> > >
> > > The key difference is that desktops only do this when you physically
> > > plug in a device. With unprivileged mounts, a hostile attacker
> > > doesn't need physical access to the machine to exploit lurking
> > > kernel filesystem bugs. i.e. they can just use loopback mounts, and
> > > they can keep mounting corrupted images until they find something
> > > that works.
> > 
> > Yep.  That magnifies the problem quite a bit.
> > 
> > > User namespaces are supposed to provide trust separation.  The
> > > kernel filesystems simply aren't hardened against unprivileged
> > > attacks from below - there is a trust relationship between root and
> > > the filesystem in that they are the only things that can write to
> > > the disk. Mounts from within a userns destroys this relationship as
> > > the userns root, by definition, is not a trusted actor.
> > 
> > I talked to Ted Tso a while back and ext4 is at least in principle
> > already hardened against that kind of attack.  I am not certain I
> > believe it, but if it is true I think it is fantastic.
> 
> No, it's not. No filesystem is, because to harden against such
> attacks requires complete verification of all metadata when it is
> read from disk, before it is used, or some method of ensuring the
> block was not tampered with. CRCs are not sufficient, because they
> can be tampered with, too.
> 
> The only way a filesystem would be able to trust what it reads from
> disk has not been tampered with in a system with untrusted mounts is
> if it has some kind of cryptographically secure signature in the
> metadata and the attacker is unable to access the key for that
> signature.

Preventing tampering is a little different from protecting the kernel
from attack, isn't it?  I thought the latter was what people were asking
about.

So, for example, a screwed up on-disk directory structure shouldn't
result in creating a cycle in the dcache and then deadlocking.
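
As a toy illustration of the kind of defensive check I have in mind
(a userspace sketch only, with a made-up on-disk model where each
directory records the index of its parent; this is not how the dcache
does it), a bounded walk is enough to refuse a cyclic structure instead
of looping on it:

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * parent[i] is the parent directory index recorded "on disk" for
 * directory i; there are n directories in total.
 * Returns true if following parent pointers from start reaches root.
 * A walk that leaves the valid range, or takes more than n hops, can
 * only come from a corrupted structure and is rejected.
 */
static bool model_reaches_root(const size_t *parent, size_t n,
			       size_t start, size_t root)
{
	size_t cur = start;

	for (size_t steps = 0; steps <= n; steps++) {
		if (cur >= n)
			return false;	/* pointer outside the table */
		if (cur == root)
			return true;
		cur = parent[cur];
	}
	return false;			/* more than n hops: a cycle */
}
```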

--b.

> No filesystem we have has that capability and AFAIA there
> are no plans for any filesystem to implement such tamper detection.
> And no, ext4 encryption does not provide this because it only stores
> the values and data in encrypted format and does not protect
> metadata from tampering when it is not mounted.
> 
> If we don't have crypto signatures in metadata, then XFS is probably
> the most robust against tampering as it does a lot more checking of
> the on-disk metadata before it is used than any other filesystem
> (i.e. see the verifier infrastructure that does corruption checks
> after read (in io completion) and before write (in io submission)
> to catch bad metadata before it is used by the kernel, or before it
> is written to disk by the kernel.
> 
> However, these checks are far from comprehensive. we can only check
> internal consistency of the metadata objects in the block, and even
> then we really only can check for values within range rather than
> absolute correctness. e.g. we can check a dirent has a valid name,
> length, ftype and inode number, but we can't validate that the inode
> is actually allocated or not because that requires a lookup in the
> allocated inode btree. We *trust* that inode number to be
> allocated and valid because it is in metadata the filesystem wrote.
> 
> For inode numbers that come from untrusted sources (NFS,
> open-by-handle, etc) we have a flag that does inode number
> validation on lookup (XFS_IGET_UNTRUSTED) to check against trusted
> metadata (i.e. the allocated inode btrees), but that is expensive
> and so not done on inodes that we pull directly from metadata that
> has come from disk. Indeed, we still trust on-disk metadata to be
> correct to validate that other metadata can be trusted, so if one
> structure can be tampered with, so can others.
> 
> IOWs, if we cannot trust one part of the filesystem metadata to be
> correct, then we cannot trust that filesystem *at all*, *for
> anything*. And even running fsck doesn't restore trust - all it does
> is tell us that any modification that was made is not a detectable
> inconsistency that needs fixing.
> 
> > At this point any setting of the FS_USER_MOUNT flag I figure needs to go
> > through the filesystem maintainers tree and they need to be aware of and
> > agree to deal with the attack from below issue.
> > 
> > The one filesystem I truly expect we can make work is fuse.  fuse has
> > been designed to deal with some variation of the attack from below issue
> > since day one.  We looked at what the patches to fuse would look like
> > with the current state of the vfs and it was not pretty.
> > 
> > We very much need to sort through as much as possible at the vfs layer,
> > and in generic code.  Allow everyone to see what is going on and how
> > it works before proceeding forward with enabling any filesystems.
> 
> The VFS protects us from attacks from above the filesystem, not
> below. The VFS plays no part in validating the on-disk structure of
> a filesystem which is what attacks from below will be attempting to
> exploit.
> 
> > I truly hope we can find a small set of block device filesystems that we
> > can harden from attack below.   That would allow linux to have serious
> > defenses against evil usb stick attacks.  I think that is going to take
> > a lot of careful coding, testing and validation and advancing the state
> > of the art to get there.
> 
> Somehow, I just can't see that happening.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-16  4:47           ` Eric W. Biederman
  2015-07-17  0:09             ` Dave Chinner
@ 2015-07-20 17:54             ` Colin Walters
  1 sibling, 0 replies; 69+ messages in thread
From: Colin Walters @ 2015-07-20 17:54 UTC (permalink / raw)
  To: Eric W. Biederman, Casey Schaufler
  Cc: Andy Lutomirski, Seth Forshee, Alexander Viro, Linux FS Devel,
	LSM List, SELinux-NSA, Serge Hallyn, linux-kernel

On Thu, Jul 16, 2015, at 12:47 AM, Eric W. Biederman wrote:

> With that said desktop environments have for a long time been
> automatically mounting whichever filesystem you place in your computer,
> so in practice what this is really about is trying to align the kernel
> with how people use filesystems.

There is a large attack surface difference between mounting a device
that someone physically plugged into the computer (and note typically
it's required that the active console be unlocked as well[1]) versus
allowing any "unprivileged" process at any time to do it.

Many server setups use "unprivileged" uids that otherwise wouldn't
be able to exploit bugs in filesystem code.

[1] https://bugzilla.gnome.org/show_bug.cgi?id=653520
"AutomountManager also keeps track of the current session availability
(using the ConsoleKit and gnome-screensaver DBus interfaces) and
inhibits mounting if the current session is locked, or another session
is in use instead."


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-17  0:42               ` Eric W. Biederman
  2015-07-17  2:47                 ` Dave Chinner
@ 2015-07-18  0:07                 ` Serge E. Hallyn
  1 sibling, 0 replies; 69+ messages in thread
From: Serge E. Hallyn @ 2015-07-18  0:07 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Dave Chinner, Casey Schaufler, Andy Lutomirski, Seth Forshee,
	Alexander Viro, Linux FS Devel, LSM List, SELinux-NSA,
	Serge Hallyn, linux-kernel

On Thu, Jul 16, 2015 at 07:42:03PM -0500, Eric W. Biederman wrote:
> Dave Chinner <david@fromorbit.com> writes:
> 
> > On Wed, Jul 15, 2015 at 11:47:08PM -0500, Eric W. Biederman wrote:
> >> Casey Schaufler <casey@schaufler-ca.com> writes:
> >> > On 7/15/2015 6:08 PM, Andy Lutomirski wrote:
> >> >> If I mount an unprivileged filesystem, then either the contents were
> >> >> put there *by me*, in which case letting me access them are fine, or
> >> >> (with Seth's patches and then some) I control the backing store, in
> >> >> which case I can do whatever I want regardless of what LSM thinks.
> >> >>
> >> >> So I don't see the problem.  Why would Smack or any other LSM care at
> >> >> all, unless it wants to prevent me from mounting the fs in the first
> >> >> place?
> >> >
> >> > First off, I don't cotton to the notion that you should be able
> >> > to mount filesystems without privilege. But it seems I'm being
> >> > outvoted on that. I suspect that there are cases where it might
> >> > be safe, but I can't think of one off the top of my head.
> >> 
> >> There are two fundamental issues mounting filesystems without privilege,
> >> by which I actually mean mounting filesystems as the root user in a user
> >> namespace.
> >> 
> >> - Are the semantics safe.
> >> - Is the extra attack surface a problem.
> >
> > I think the attack surface this exposes is the biggest problem
> > facing this proposal.
> 
> I completely agree.
> 
> >> Figuring out how to make semantics safe is what we are talking about.
> >> 
> >> Once we sort out the semantics we can look at the handful of filesystems
> >> like fuse where the extra attack surface is not a concern.
> >> 
> >> With that said desktop environments have for a long time been
> >> automatically mounting whichever filesystem you place in your computer,
> >> so in practice what this is really about is trying to align the kernel
> >> with how people use filesystems.
> >
> > The key difference is that desktops only do this when you physically
> > plug in a device. With unprivileged mounts, a hostile attacker
> > doesn't need physical access to the machine to exploit lurking
> > kernel filesystem bugs. i.e. they can just use loopback mounts, and
> > they can keep mounting corrupted images until they find something
> > that works.
> 
> Yep.  That magnifies the problem quite a bit.
> 
> > User namespaces are supposed to provide trust separation.  The
> > kernel filesystems simply aren't hardened against unprivileged
> > attacks from below - there is a trust relationship between root and
> > the filesystem in that they are the only things that can write to
> > the disk. Mounts from within a userns destroys this relationship as
> > the userns root, by definition, is not a trusted actor.
> 
> I talked to Ted Tso a while back and ext4 is at least in principle
> already hardened against that kind of attack.  I am not certain I
> believe it, but if it is true I think it is fantastic.

Not sure what he said in private, but at the kernel summit last year
what he said was not that it was "hardened", but that any bug which
results from mounting a garbage image (i.e. an unprivileged user fuzzing)
would be deemed by him a real bug, as opposed to saying "don't do that".

To the best of my knowledge that's so far only the case with Ted/ext4,
which I assume is why Seth started with ext4.

-serge

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-17 13:21           ` Seth Forshee
@ 2015-07-17 17:14             ` Casey Schaufler
  0 siblings, 0 replies; 69+ messages in thread
From: Casey Schaufler @ 2015-07-17 17:14 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Eric W. Biederman, Alexander Viro, linux-fsdevel,
	linux-security-module, selinux, Serge Hallyn, Andy Lutomirski,
	linux-kernel

On 7/17/2015 6:21 AM, Seth Forshee wrote:
> On Thu, Jul 16, 2015 at 02:42:22PM -0700, Casey Schaufler wrote:
>
> <snip>
>
>>> I welcome feedback about anything I've missed, but stating generally
>>> that you think I probably missed something isn't very helpful.
>> True enough. I hope I've explained myself above.
> Thanks, that definitely clarified where we were having a disconnect.
> Andy's done a fantastic job explaining how those concerns are addressed.
>
>>> The LSM issue is thornier than the rest of it though, which is why I
>>> specifically asked for review there in the cover letter. There's a lot
>>> of complexity and nuance, and I still don't have a grasp on all the
>>> subtleties. One such subtlety is the full impact of simply ignoring the
>>> security labels on disk (but I am still confused as to why this is
>>> different from filesystems which don't support xattrs at all).
>> If you can mount a filesystem such that the labels are ignored you
>> are effectively specifying that the Smack label on the files be 
>> determined by the defaulting rules. With CAP_MAC_ADMIN that's fine.
>> Without it, it's not.
>>
>>> I was unaware of Lukasz's patches until yesterday, and I will have a
>>> look at them. But since we don't have the LSM support for user
>>> namespaces yet, I don't see the problem with doing something safe for
>>> LSMs initially and evolving the LSM integration for user ns mounts along
>>> with the rest of the user ns integration.
>> Ignoring the security attributes is not safe!
> Understood. It's surely safe for each LSM to deny such mounts until it
> has a way to handle them safely though.
>
> I'm not trying to completely punt on the issue of security modules, just
> break this down into more manageable chunks. You've given good guidance
> for Smack (thanks very much for that), so I can plan to work on that
> soon.
>
>>> Your point is taken about my less-than-expert opinion about the other
>>> security modules. We should at minimum get acks from the maintainers of
>>> those modules that unprivileged mounts will not compromise MAC.
>> I am the Smack maintainer. Unprivileged mounts as you have
>> described them compromise MAC. They compromise DAC, too.
> It looks like Andy's more or less convinced you that DAC isn't
> (additionally?) compromised. And there's a plan for MAC, that the
> security module can deny mounts from user namespaces until it has a
> solution for allowing them safely.

I wouldn't say that Andy has me convinced on DAC. I would say that
he's taken me deeper into the details of namespaces than I feel
comfortable making arguments about. I don't know that he's right,
I just don't know how to argue that he isn't. Part of what bothers
me is the dependence on namespaces. If you could come up with a
mechanism that wasn't dependent on namespaces it would be much
easier for dinosaurs like me to comprehend.

As far as declaring that MAC and namespace owned mounts are
incompatible goes, I think that I said early on that wasn't
going to fly. Too much of the Linux population (Fedora, Android,
Tizen, ...) uses MAC for the feature to be considered ready
for general consumption without it. And no, I don't believe in
partial implementations. You wouldn't get away with putting this
in if it only worked on s370 processors.

>>> For Smack specifically, I believe my only concern was the SMACK64EXEC
>>> attribute, as all the other attributes only affected subjects' access to
>>> the files. So maybe it would be possible to simply ignore this attribute
>>> in unprivileged mounts and respect the others, even lacking more
>>> complete LSM support for user namespaces.
>> SMACK64EXEC is analogous to the setuid bit, but I would rather see
>> exec() of programs with this attribute refused that for it to be
>> blindly ignored.
> That's fine, it's your call.

I said it, but on reflection the current NOSETUID behavior is
as you described it, so I wouldn't change that.

>
> Thanks,
> Seth


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-17 14:28                     ` Serge E. Hallyn
@ 2015-07-17 14:56                       ` Seth Forshee
  0 siblings, 0 replies; 69+ messages in thread
From: Seth Forshee @ 2015-07-17 14:56 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Andy Lutomirski, Casey Schaufler, Eric W. Biederman,
	Alexander Viro, Linux FS Devel, LSM List, SELinux-NSA,
	Serge Hallyn, linux-kernel

On Fri, Jul 17, 2015 at 09:28:32AM -0500, Serge E. Hallyn wrote:
> > > If you're going to be at LinuxCon in Seattle we should
> > > continue this discussion over the beverage of your choice.
> > 
> > There's a small but not quite zero chance I'll be there.  I'll
> > probably be in Seoul.  It's too bad that LSS and KS are in different
> > places this year.
> 
> FWIW I'll be there and happy to discuss.

I'll also be in Seattle and happy to discuss.

Seth

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-17  0:59                   ` Andy Lutomirski
@ 2015-07-17 14:28                     ` Serge E. Hallyn
  2015-07-17 14:56                       ` Seth Forshee
  2015-07-21 20:35                     ` Seth Forshee
  1 sibling, 1 reply; 69+ messages in thread
From: Serge E. Hallyn @ 2015-07-17 14:28 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Casey Schaufler, Seth Forshee, Eric W. Biederman, Alexander Viro,
	Linux FS Devel, LSM List, SELinux-NSA, Serge Hallyn,
	linux-kernel

On Thu, Jul 16, 2015 at 05:59:22PM -0700, Andy Lutomirski wrote:
> On Thu, Jul 16, 2015 at 5:45 PM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> > On 7/16/2015 4:29 PM, Andy Lutomirski wrote:
> >> I really don't see the benefit of making up extra rules that apply to
> >> users outside a userns who try to access specifically a filesystem
> >> with backing store.  They wouldn't make sense for filesystems without
> >> backing store.
> >
> > Sure it would. For Smack, it would be the label a file would be
> > created with, which would be the label of the process creating
> > the memory based filesystem. For SELinux the rules are more a
> > touch more sophisticated, but I'm sure that Paul or Stephen could
> > come up with how to determine it.
> >
> > The point, looping all the way back to the beginning, where we
> > were talking about just ignoring the labels on the filesystem,
> > is that if you use the same Smack label on the files in the
> > filesystem as the backing store file has, we'll all be happy.
> > If that label isn't something user can write to, he won't be
> > able to write to the mounted objects, either. If there is no
> > backing store then use the label of the process creating the
> > filesystem, which will be the user, which will mean everything
> > will work hunky dory.
> >
> > Yes, there's work involved, but I doubt there's a lot. Getting
> > the label from the backing store or the creating process is
> > simple enough.
> >
> 
> So what if Smack used the label of the user creating the filesystem
> even for filesystems with backing store?  IMO this ought to be doable

The more usual LSM-ish way to handle this would be to ask the LSM, at
mount time, with a new security_mount_bdev_in_userns() hook, passing
it the user's label and the backing store's label (if any), and storing
the label to be used for the files.  Even more LSM-ish (though risking
performance hit) would be to then have the LSM at each inode_init_security
decide whether to use that label or trust what's in the fs anyway (or
do something else).  That could allow the LSM to use policy to decide
that.

Because I don't know that for all LSMs it makes sense for a 'subject'
label to be assigned to an object.
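
As a rough illustration of that two-stage decision, here is a toy
user-space model in Python. None of this is kernel code:
security_mount_bdev_in_userns() is only the hook name proposed above, and
the class, the function names and the trust_on_disk policy switch are
hypothetical, introduced purely to show the shape of the idea.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MountLabelState:
    """Label chosen at mount time and remembered for the mount (hypothetical)."""
    mount_label: str
    trust_on_disk: bool  # policy knob; an assumption made for illustration


def mount_bdev_in_userns(user_label: str,
                         backing_label: Optional[str],
                         trust_on_disk: bool = False) -> MountLabelState:
    """Toy model of the proposed mount-time hook.

    The LSM sees both the mounter's label and the backing store's label
    (None when there is no backing store) and records which one to use.
    """
    chosen = backing_label if backing_label is not None else user_label
    return MountLabelState(mount_label=chosen, trust_on_disk=trust_on_disk)


def inode_init_label(state: MountLabelState,
                     on_disk_label: Optional[str]) -> str:
    """Toy model of the per-inode decision.

    Policy decides whether to believe the label stored in the filesystem
    or to fall back to the label recorded at mount time.
    """
    if state.trust_on_disk and on_disk_label is not None:
        return on_disk_label
    return state.mount_label
```

With trust_on_disk left at False, every inode gets the label recorded at
mount time no matter what the image claims. Turning it on is the "trust
what's in the fs anyway" case, and that is where the per-inode cost would
come from.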

> with the LSM hooks -- it certainly seems reasonable for the LSM to be
> aware of who created a filesystem.  In fact, I'd argue that if Smack
> can't do this with the proposed LSM hooks, then the hooks are
> insufficient.
> 
> Presumably Smack could also figure out what was mounted, but keep in
> mind that there are filesystems like ntfs-3g out there.  While ntfs-3g
> logically has backing store, I don't think the kernel actually knows
> about it.
> 
> >
> >>>>> If you can mount a filesystem such that the labels are ignored you
> >>>>> are effectively specifying that the Smack label on the files be
> >>>>> determined by the defaulting rules. With CAP_MAC_ADMIN that's fine.
> >>>>> Without it, it's not.
> >>>> Can you explain what the threat model is here?  I don't see what it is
> >>>> that you're trying to prevent.
> >>> Um, OK.
> >>> The filesystem has files with a hundred different Smack labels on it.
> >>> I mount it as an unlabeled filesystem and everything is readable by
> >>> everyone. Bad jojo.
> >> I still don't understand.  If it's a filesystem backed by a file that
> >> Seth has RW access to, then Seth can read everything on it, full stop.
> >> The security labels in the filesystem are irrelevant.
> >
> > Well, they can't be trusted, if that's what you mean.
> > That's why I'm saying that the objects exposed by mounting
> > this backing store need to be treated with the same security
> > attributes as the backing store. Fudge it for DAC if you are
> > so inclined, but I think it's the right way to go for MAC.
> >
> >> This is like saying that, if you put restrictive labels in the
> >> filesystem that lives on /dev/sda2 and give Seth ownership of
> >> /dev/sda2, then you expect Seth to be unable to bypass the policy
> > >> specified by your labels.
> >
> > Consider the Smack label on /dev/sda2. Smack does not care
> > who owns it, just what the Smack label is. Just like on
> > ~/seth/myfs. The backing store "object" is /dev/sda2 in the
> > one case, ~/seth/myfs in the other, and something in the ether
> > for a memory based filesystem. So long as the labels of the
> > files exposed on the mount point match those of the backing
> > store "object", Smack is going to be happy. Since you're
> > running without privilege, you can't change the labels on
> > the files.
> >
> > Now Seth, being the sneaky person that he is, could change
> > the Smack labels on the files in the backing store while it's
> > offline. Since he has access to the backing store, he can't
> > give himself more access by changing the labels within the
> > filesystem. He can give himself less, but I'm OK with that.
> >
> >> Or maybe I'm misunderstanding you.
> >
> > Probably, but I'm undoubtedly doing the same.
> >
> > If you're going to be at LinuxCon in Seattle we should
> > continue this discussion over the beverage of your choice.
> 
> There's a small but not quite zero chance I'll be there.  I'll
> probably be in Seoul.  It's too bad that LSS and KS are in different
> places this year.

FWIW I'll be there and happy to discuss.

-serge

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-16 21:42         ` Casey Schaufler
  2015-07-16 22:27           ` Andy Lutomirski
@ 2015-07-17 13:21           ` Seth Forshee
  2015-07-17 17:14             ` Casey Schaufler
  1 sibling, 1 reply; 69+ messages in thread
From: Seth Forshee @ 2015-07-17 13:21 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Eric W. Biederman, Alexander Viro, linux-fsdevel,
	linux-security-module, selinux, Serge Hallyn, Andy Lutomirski,
	linux-kernel

On Thu, Jul 16, 2015 at 02:42:22PM -0700, Casey Schaufler wrote:

<snip>

> > I welcome feedback about anything I've missed, but stating generally
> > that you think I probably missed something isn't very helpful.
> 
> True enough. I hope I've explained myself above.

Thanks, that definitely clarified where we were having a disconnect.
Andy's done a fantastic job explaining how those concerns are addressed.

> > The LSM issue is thornier than the rest of it though, which is why I
> > specifically asked for review there in the cover letter. There's a lot
> > of complexity and nuance, and I still don't have a grasp on all the
> > subtleties. One such subtlety is the full impact of simply ignoring the
> > security labels on disk (but I am still confused as to why this is
> > different from filesystems which don't support xattrs at all).
> 
> If you can mount a filesystem such that the labels are ignored you
> are effectively specifying that the Smack label on the files be 
> determined by the defaulting rules. With CAP_MAC_ADMIN that's fine.
> Without it, it's not.
> 
> > I was unaware of Lukasz's patches until yesterday, and I will have a
> > look at them. But since we don't have the LSM support for user
> > namespaces yet, I don't see the problem with doing something safe for
> > LSMs initially and evolving the LSM integration for user ns mounts along
> > with the rest of the user ns integration.
> 
> Ignoring the security attributes is not safe!

Understood. It's surely safe for each LSM to deny such mounts until it
has a way to handle them safely though.

I'm not trying to completely punt on the issue of security modules, just
break this down into more manageable chunks. You've given good guidance
for Smack (thanks very much for that), so I can plan to work on that
soon.

> > Your point is taken about my less-than-expert opinion about the other
> > security modules. We should at minimum get acks from the maintainers of
> > those modules that unprivileged mounts will not compromise MAC.
> 
> I am the Smack maintainer. Unprivileged mounts as you have
> described them compromise MAC. They compromise DAC, too.

It looks like Andy's more or less convinced you that DAC isn't
(additionally?) compromised. And there's a plan for MAC, that the
security module can deny mounts from user namespaces until it has a
solution for allowing them safely.

> > For Smack specifically, I believe my only concern was the SMACK64EXEC
> > attribute, as all the other attributes only affected subjects' access to
> > the files. So maybe it would be possible to simply ignore this attribute
> > in unprivileged mounts and respect the others, even lacking more
> > complete LSM support for user namespaces.
> 
> SMACK64EXEC is analogous to the setuid bit, but I would rather see
> exec() of programs with this attribute refused than for it to be
> blindly ignored.

That's fine, it's your call.

Thanks,
Seth

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-17  0:10       ` Eric W. Biederman
@ 2015-07-17 10:13         ` Lukasz Pawelczyk
  0 siblings, 0 replies; 69+ messages in thread
From: Lukasz Pawelczyk @ 2015-07-17 10:13 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Casey Schaufler, Seth Forshee, Alexander Viro, linux-fsdevel,
	linux-security-module, selinux, Serge Hallyn, Andy Lutomirski,
	linux-kernel

On czw, 2015-07-16 at 19:10 -0500, Eric W. Biederman wrote:
> Lukasz Pawelczyk <l.pawelczyk@samsung.com> writes:
> > 
> > I fail to see how those 2 are in any conflict. 
> 
> Like I said.  They don't really conflict, and actually to really
> support things well for smack we probably need something like your
> patches.

As far as I can see now from the discussion, the best thing to do would
be to inherit the label from a backing store object, or something along
these lines.

> At the same time a patch written without dealing with s_user_ns is
> going to fail to take a lot of important details into account.

I don't touch anything that would need to deal with s_user_ns. I also
don't change Smack's mounting logic in any way. My patches are
orthogonal to that.

> Right now after fixing the mount namespace issues the top priority is
> to work through the details and get s_user_ns implemented.  By that I
> mean some version of patch 1 of Seth's series.

My priority is to make the Smack namespace work. This is functionality
that has a perfectly valid use case now. Without it, Smack is impossible
to operate in a container.

> s_user_ns fundamentally changes how the concepts are represented in
> the kernel in a way that is easier to secure, and that fundamentally
> better matches things.  And sigh.  This review has shown we don't
> quite have all of the details worked out.
> 
> > If your approach here is to treat user ns mounted filesystem as if
> > they didn't support xattrs at all then my patches don't conflict
> > here any more than Smack itself already does.
> 
> The end game if people developing smack choose to play, is to figure
> out how to store your unmapped labels in a filesystem contained by a
> user namespace and a smack label namespace root.

Storing an unmapped label (read: real label) in Smack namespace is
exactly the same as it is now without the namespace. I always store the
real label.

The problem here is: what real label should be "read" and eventually
stored in that filesystem (see my first comment here). Again, Smack
namespace doesn't touch that logic.

> > If the filesystem will get a default (e.g. by smack* mount options)
> > label then this label will co-work with Smack namespaces.
> 
> A default, but I don't know if it will be smack mount options that
> will give that default.  The devil is in the details and there are a
> lot of details.

Right now Smack gives the default. If someone modifies Smack to give a
different label because of s_user_ns support, the Smack namespace will
not be a hindrance here.

The Smack namespace's main role is only to make it possible to operate
Smack within a container. All the other LSMs can do that already as they
don't require caps to operate normally. Smack does. Hence it had to be
namespaced in some way to give limited capabilities in a container
(user ns).

This really has nothing to do with the way Smack mounts, assigns
labels, decides what is allowed and what is not, etc.

What this discussion is about is how to modify or even bend the way
LSMs work to make unprivileged user ns mounts work under an LSM (or not).
The Smack namespace here is just a utility within Smack itself. Maybe it
can be used to help with this at some point, but beyond that it's
orthogonal to the problem.



-- 
Lukasz Pawelczyk
Samsung R&D Institute Poland
Samsung Electronics




* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-17  0:42               ` Eric W. Biederman
@ 2015-07-17  2:47                 ` Dave Chinner
  2015-07-21 17:37                   ` J. Bruce Fields
  2015-07-18  0:07                 ` Serge E. Hallyn
  1 sibling, 1 reply; 69+ messages in thread
From: Dave Chinner @ 2015-07-17  2:47 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Casey Schaufler, Andy Lutomirski, Seth Forshee, Alexander Viro,
	Linux FS Devel, LSM List, SELinux-NSA, Serge Hallyn,
	linux-kernel

On Thu, Jul 16, 2015 at 07:42:03PM -0500, Eric W. Biederman wrote:
> Dave Chinner <david@fromorbit.com> writes:
> 
> > On Wed, Jul 15, 2015 at 11:47:08PM -0500, Eric W. Biederman wrote:
> >> Casey Schaufler <casey@schaufler-ca.com> writes:
> >> > On 7/15/2015 6:08 PM, Andy Lutomirski wrote:
> >> >> If I mount an unprivileged filesystem, then either the contents were
> >> >> put there *by me*, in which case letting me access them is fine, or
> >> >> (with Seth's patches and then some) I control the backing store, in
> >> >> which case I can do whatever I want regardless of what LSM thinks.
> >> >>
> >> >> So I don't see the problem.  Why would Smack or any other LSM care at
> >> >> all, unless it wants to prevent me from mounting the fs in the first
> >> >> place?
> >> >
> >> > First off, I don't cotton to the notion that you should be able
> >> > to mount filesystems without privilege. But it seems I'm being
> >> > outvoted on that. I suspect that there are cases where it might
> >> > be safe, but I can't think of one off the top of my head.
> >> 
> >> There are two fundamental issues mounting filesystems without privilege,
> >> by which I actually mean mounting filesystems as the root user in a user
> >> namespace.
> >> 
> >> - Are the semantics safe.
> >> - Is the extra attack surface a problem.
> >
> > I think the attack surface this exposes is the biggest problem
> > facing this proposal.
> 
> I completely agree.
> 
> >> Figuring out how to make semantics safe is what we are talking about.
> >> 
> >> Once we sort out the semantics we can look at the handful of filesystems
> >> like fuse where the extra attack surface is not a concern.
> >> 
> >> With that said desktop environments have for a long time been
> >> automatically mounting whichever filesystem you place in your computer,
> >> so in practice what this is really about is trying to align the kernel
> >> with how people use filesystems.
> >
> > The key difference is that desktops only do this when you physically
> > plug in a device. With unprivileged mounts, a hostile attacker
> > doesn't need physical access to the machine to exploit lurking
> > kernel filesystem bugs. i.e. they can just use loopback mounts, and
> > they can keep mounting corrupted images until they find something
> > that works.
> 
> Yep.  That magnifies the problem quite a bit.
> 
> > User namespaces are supposed to provide trust separation.  The
> > kernel filesystems simply aren't hardened against unprivileged
> > attacks from below - there is a trust relationship between root and
> > the filesystem in that they are the only things that can write to
> > the disk. Mounts from within a userns destroys this relationship as
> > the userns root, by definition, is not a trusted actor.
> 
> I talked to Ted Tso a while back and ext4 is at least in principle
> already hardened against that kind of attack.  I am not certain I
> believe it, but if it is true I think it is fantastic.

No, it's not. No filesystem is, because to harden against such
attacks requires complete verification of all metadata when it is
read from disk, before it is used, or some method of ensuring the
block was not tampered with. CRCs are not sufficient, because they
can be tampered with, too.

The only way a filesystem would be able to trust what it reads from
disk has not been tampered with in a system with untrusted mounts is
if it has some kind of cryptographically secure signature in the
metadata and the attacker is unable to access the key for that
signature. No filesystem we have has that capability and AFAIA there
are no plans for any filesystem to implement such tamper detection.
And no, ext4 encryption does not provide this because it only stores
the values and data in encrypted format and does not protect
metadata from tampering when it is not mounted.
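
To illustrate why a CRC on its own does not help here, a small Python
sketch. The block layout (payload followed by a checksum or tag) and all
function names are made up for the example and do not correspond to any
filesystem's on-disk format. The point is only that anyone who can write
the block can recompute a CRC, while a keyed MAC cannot be forged without
the key.

```python
import hashlib
import hmac
import struct
import zlib


def seal_crc(payload: bytes) -> bytes:
    """Append a little-endian CRC32 of the payload (illustrative layout)."""
    return payload + struct.pack("<I", zlib.crc32(payload))


def verify_crc(block: bytes) -> bool:
    """Recompute the CRC over the payload and compare with the stored one."""
    payload, stored = block[:-4], struct.unpack("<I", block[-4:])[0]
    return zlib.crc32(payload) == stored


def seal_hmac(payload: bytes, key: bytes) -> bytes:
    """Append an HMAC-SHA256 tag computed with a secret key."""
    return payload + hmac.new(key, payload, hashlib.sha256).digest()


def verify_hmac(block: bytes, key: bytes) -> bool:
    """Check the 32-byte tag against one recomputed with the verifier's key."""
    payload, tag = block[:-32], block[-32:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    return hmac.compare_digest(tag, expected)
```

An attacker who rewrites the payload and calls seal_crc() again produces a
block that passes verify_crc(). The same attacker, without the key, cannot
produce a block that passes verify_hmac(). That only helps if the key is
kept out of reach of whoever controls the image, which is exactly the part
nobody has today.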

If we don't have crypto signatures in metadata, then XFS is probably
the most robust against tampering as it does a lot more checking of
the on-disk metadata before it is used than any other filesystem
(i.e. see the verifier infrastructure that does corruption checks
after read (in io completion) and before write (in io submission)
to catch bad metadata before it is used by the kernel, or before it
is written to disk by the kernel).

However, these checks are far from comprehensive. We can only check
internal consistency of the metadata objects in the block, and even
then we really only can check for values within range rather than
absolute correctness. e.g. we can check a dirent has a valid name,
length, ftype and inode number, but we can't validate that the inode
is actually allocated or not because that requires a lookup in the
allocated inode btree. We *trust* that inode number to be
allocated and valid because it is in metadata the filesystem wrote.

For inode numbers that come from untrusted sources (NFS,
open-by-handle, etc) we have a flag that does inode number
validation on lookup (XFS_IGET_UNTRUSTED) to check against trusted
metadata (i.e. the allocated inode btrees), but that is expensive
and so not done on inodes that we pull directly from metadata that
has come from disk. Indeed, we still trust on-disk metadata to be
correct to validate that other metadata can be trusted, so if one
structure can be tampered with, so can others.
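
The difference between the two kinds of check can be sketched in a few
lines of Python. This is a toy model, not XFS code: the Dirent fields, the
limits and the allocated-inode set are all hypothetical stand-ins for the
real on-disk structures and the allocated inode btree.

```python
from dataclasses import dataclass

MAX_NAME_LEN = 255          # assumed limit, for illustration only
VALID_FTYPES = {1, 2, 3}    # hypothetical: regular file, directory, symlink
MAX_INODE = 2 ** 32         # hypothetical upper bound on inode numbers


@dataclass
class Dirent:
    name: bytes
    ftype: int
    inumber: int


def verify_dirent(d: Dirent) -> bool:
    """Cheap verifier: every field is within range, and nothing more."""
    return (0 < len(d.name) <= MAX_NAME_LEN
            and b"/" not in d.name
            and b"\0" not in d.name
            and d.ftype in VALID_FTYPES
            and 0 < d.inumber < MAX_INODE)


def lookup_untrusted(d: Dirent, allocated: set) -> bool:
    """Expensive check: also confirm the inode number is really allocated.

    'allocated' stands in for trusted allocation metadata; if that can be
    tampered with as well, this check proves nothing either.
    """
    return verify_dirent(d) and d.inumber in allocated
```

A crafted dirent that points at an unallocated (or someone else's) inode
number passes verify_dirent() without complaint, and only the second
lookup notices. Even that lookup is only as good as the metadata it
consults.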

IOWs, if we cannot trust one part of the filesystem metadata to be
correct, then we cannot trust that filesystem *at all*, *for
anything*. And even running fsck doesn't restore trust - all it does
is tell us that any modification that was made is not a detectable
inconsistency that needs fixing.

> At this point any setting of the FS_USER_MOUNT flag I figure needs to go
> through the filesystem maintainers tree and they need to be aware of and
> agree to deal with the attack from below issue.
> 
> The one filesystem I truly expect we can make work is fuse.  fuse has
> been designed to deal with some variation of the attack from below issue
> since day one.  We looked at what the patches to fuse would look like
> with the current state of the vfs and it was not pretty.
> 
> We very much need to sort through as much as possible at the vfs layer,
> and in generic code.  Allow everyone to see what is going on and how
> it works before proceeding forward with enabling any filesystems.

The VFS protects us from attacks from above the filesystem, not
below. The VFS plays no part in validating the on-disk structure of
a filesystem which is what attacks from below will be attempting to
exploit.

> I truly hope we can find a small set of block device filesystems that we
> can harden from attack below.   That would allow linux to have serious
> defenses against evil usb stick attacks.  I think that is going to take
> a lot of careful coding, testing and validation and advancing the state
> of the art to get there.

Somehow, I just can't see that happening.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-17  0:45                 ` Casey Schaufler
@ 2015-07-17  0:59                   ` Andy Lutomirski
  2015-07-17 14:28                     ` Serge E. Hallyn
  2015-07-21 20:35                     ` Seth Forshee
  0 siblings, 2 replies; 69+ messages in thread
From: Andy Lutomirski @ 2015-07-17  0:59 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Seth Forshee, Eric W. Biederman, Alexander Viro, Linux FS Devel,
	LSM List, SELinux-NSA, Serge Hallyn, linux-kernel

On Thu, Jul 16, 2015 at 5:45 PM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> On 7/16/2015 4:29 PM, Andy Lutomirski wrote:
>> I really don't see the benefit of making up extra rules that apply to
>> users outside a userns who try to access specifically a filesystem
>> with backing store.  They wouldn't make sense for filesystems without
>> backing store.
>
> Sure it would. For Smack, it would be the label a file would be
> created with, which would be the label of the process creating
> the memory based filesystem. For SELinux the rules are a
> touch more sophisticated, but I'm sure that Paul or Stephen could
> come up with how to determine it.
>
> The point, looping all the way back to the beginning, where we
> were talking about just ignoring the labels on the filesystem,
> is that if you use the same Smack label on the files in the
> filesystem as the backing store file has, we'll all be happy.
> If that label isn't something user can write to, he won't be
> able to write to the mounted objects, either. If there is no
> backing store then use the label of the process creating the
> filesystem, which will be the user, which will mean everything
> will work hunky dory.
>
> Yes, there's work involved, but I doubt there's a lot. Getting
> the label from the backing store or the creating process is
> simple enough.
>

So what if Smack used the label of the user creating the filesystem
even for filesystems with backing store?  IMO this ought to be doable
with the LSM hooks -- it certainly seems reasonable for the LSM to be
aware of who created a filesystem.  In fact, I'd argue that if Smack
can't do this with the proposed LSM hooks, then the hooks are
insufficient.

Presumably Smack could also figure out what was mounted, but keep in
mind that there are filesystems like ntfs-3g out there.  While ntfs-3g
logically has backing store, I don't think the kernel actually knows
about it.

>
>>>>> If you can mount a filesystem such that the labels are ignored you
>>>>> are effectively specifying that the Smack label on the files be
>>>>> determined by the defaulting rules. With CAP_MAC_ADMIN that's fine.
>>>>> Without it, it's not.
>>>> Can you explain what the threat model is here?  I don't see what it is
>>>> that you're trying to prevent.
>>> Um, OK.
>>> The filesystem has files with a hundred different Smack labels on it.
>>> I mount it as an unlabeled filesystem and everything is readable by
>>> everyone. Bad jojo.
>> I still don't understand.  If it's a filesystem backed by a file that
>> Seth has RW access to, then Seth can read everything on it, full stop.
>> The security labels in the filesystem are irrelevant.
>
> Well, they can't be trusted, if that's what you mean.
> That's why I'm saying that the objects exposed by mounting
> this backing store need to be treated with the same security
> attributes as the backing store. Fudge it for DAC if you are
> so inclined, but I think it's the right way to go for MAC.
>
>> This is like saying that, if you put restrictive labels in the
>> filesystem that lives on /dev/sda2 and give Seth ownership of
>> /dev/sda2, then you expect Seth to be unable to bypass the policy
>> specified by your labels.
>
> Consider the Smack label on /dev/sda2. Smack does not care
> who owns it, just what the Smack label is. Just like on
> ~/seth/myfs. The backing store "object" is /dev/sda2 in the
> one case, ~/seth/myfs in the other, and something in the ether
> for a memory based filesystem. So long as the labels of the
> files exposed on the mount point match those of the backing
> store "object", Smack is going to be happy. Since you're
> running without privilege, you can't change the labels on
> the files.
>
> Now Seth, being the sneaky person that he is, could change
> the Smack labels on the files in the backing store while it's
> offline. Since he has access to the backing store, he can't
> give himself more access by changing the labels within the
> filesystem. He can give himself less, but I'm OK with that.
>
>> Or maybe I'm misunderstanding you.
>
> Probably, but I'm undoubtedly doing the same.
>
> If you're going to be at LinuxCon in Seattle we should
> continue this discussion over the beverage of your choice.

There's a small but not quite zero chance I'll be there.  I'll
probably be in Seoul.  It's too bad that LSS and KS are in different
places this year.

--Andy

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-16 23:29               ` Andy Lutomirski
@ 2015-07-17  0:45                 ` Casey Schaufler
  2015-07-17  0:59                   ` Andy Lutomirski
  0 siblings, 1 reply; 69+ messages in thread
From: Casey Schaufler @ 2015-07-17  0:45 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Seth Forshee, Eric W. Biederman, Alexander Viro, Linux FS Devel,
	LSM List, SELinux-NSA, Serge Hallyn, linux-kernel

On 7/16/2015 4:29 PM, Andy Lutomirski wrote:
> On Thu, Jul 16, 2015 at 4:08 PM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>> On 7/16/2015 3:27 PM, Andy Lutomirski wrote:
>>> On Thu, Jul 16, 2015 at 2:42 PM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>> You want to provide a mechanism whereby an unprivileged user (Seth)
>>>> can mount a filesystem for his own use. You want full filesystem
>>>> semantics, but you're willing to accept restrictions on certain
>>>> filesystem features to avoid opening security holes. You are not
>>>> willing to accept restrictions that make the filesystem unusable,
>>>> such as making it read-only.
>>>>
>>>> I am going to present a suggestion. Feel free to correct my
>>>> assumptions and my reasoning. For simplicity let's use loop-back
>>>> mounting of a filesystem contained in a file as an example. The
>>>> principles should apply to newly created memory based filesystems
>>>> or disk partitions "owned" by Seth.
>>>>
>>>> Seth wants to mount a file (~seth/myfs) which contains an ext4
>>>> filesystem. There is already a filesystem object, with security
>>>> attributes, that the system knows how to deal with. If Seth mounts
>>>> this as a filesystem he, and potentially other people, will be
>>>> able to access the content of this object without accessing the
>>>> object itself.
>>>>
>>>>         seth$ mount --justforme -t ext4 ~seth/myfs /tmp/seth
>>>>         seth$ chmod 777 /tmp/seth
>>>>         seth$ ls -la /tmp/seth
>>>>         drwxrwxrwx.  3  seth     seth   260 Jul 16 12:59 .
>>>>         drwxrwxrwt. 18  root     root  4069 Jul 16 11:13 ..
>>>>         seth$
>>>>
>>>> Everything's fine at this point. Wilma is also using the system,
>>>> being the sort who likes to hide things in out of the way places
>>>>
>>>>         wilma$ cp ~/scandals /tmp/seth
>>>>         wilma$ chmod 600 /tmp/seth/scandals
>>> This is already impossible as described.  Seth can only mount the
>>> filesystem in a private mount namespace inside a user namespace that
>>> he created.  Wilma can't see it unless Seth passes an fd to Wilma and
>>> Wilma accepts and uses it.
>> But you do have multiple UIDs within your user namespace, right?
>> There are processes running as someone other than seth, right?
>>
> Only if root set it up that way.  For example, root could set up
> "subuids" (this is a userspace concept) that belong to Seth.  These
> would be uids that Seth controls and that represent subsets of Seth's
> authority. Wilma wouldn't be one of these subuids unless she was
> somehow part of Seth (or if root completely screwed up).

Or if root had some really unexpected and inappropriate ideas
on what qualifies as "clever". But I'll back off. It looks like
this particular objection of mine is covered.

>
>>>> puts her list of scandals on the unsuspecting filesystem, and changes
>>>> the mode to ensure that no one can find out what went on after the
>>>> office party.
>>>>
>>>> Seth unmounts /tmp/seth. He looks in ~seth/myfs, finds out what really
>>>> happened at the office party, and the story goes from there.
>>>>
>>>> Wilma did everything correctly according to the system security policy,
>>>> but the system security policy did not protect her as advertised. The
>>>> system was tricked into behaving as if it was in control of the content
>>>> of the filesystem when in fact it was not.
>>> I would argue that, if Wilma writes to some place described by an fd
>>> and doesn't verify where she's writing to, then she has no expectation
>>> of privacy.  After all, she could just *tell* Seth directly whatever
>>> she wants (assuming she can communicate with Seth in the first place).
>> Don't ascribe either wisdom or good intentions to Wilma.
> In that case, I'll mention the futility of solving the problem, even
> without user namespaces.  If Wilma tells Seth something, he's going to
> find out.  If Wilma pokes it (in whatever form) into an fd provided by
> Seth, then Seth is extremely likely to find out, regardless of what
> root or the MAC owner tries to do.

I'll buy that, too. I still get queasy every time someone
tells me that passing file descriptors is a security feature.

> If Wilma writes to a path that's mounted in her namespace, then, sure,
> overall policy associated with her namespace (which, in your example,
> is the root namespace) must apply.  But Seth can't mount things into
> Wilma's namespace without having CAP_SYS_ADMIN in that namespace and,
> if he has CAP_SYS_ADMIN, it's already game over.

And so long as it's restricted to the namespace ...
I'm starting to get it now.

>>>> One way to fix this problem is for unprivileged mounts to recognize the
>>>> attributes of the object mounted and to propagate those attributes to all
>>>> the objects they present. All files on /tmp/seth would be owned by seth
>>>> and protected by the mode bits, ACL and LSM requirements of ~/seth/myfs.
>>> This is impossible to enforce, because Seth could use FUSE instead of ext4.
>> I never said that things aren't already broken. And, if you want
>> to ignore the potential DAC issues (read, negative groups) just
>> do it for the LSM xattrs.
>>
> Negative groups are a solved problem, I believe.

My position is that there's a workaround but that the
design is still fundamentally flawed. 

>
>>>> opening a file on /tmp/seth would require the same permissions as opening
>>>> the file containing the mounted filesystem. These attributes would have to
>>>> be immutable, or at least demonstrably more restrictive (chmod might be
>>>> allowed in some cases, but chown would never be) when changed. I don't see
>>>> how a user other than seth could create a new file, as you'd either have
>>>> a magical change in ownership or a false sense of security.
>>> This would be a very harsh restriction.  Seth might legitimately want
>>> to give a user access to a file on backing store he owns without
>>> giving that user access to the backing store.  Root on a normal system
>>> does that all the time.
>> You already said that it was impossible for Wilma to get
>> access, so how is this more restrictive? Besides, Seth can
>> always set the mode on ~/seth so that Wilma can't read the
>> files it contains. This isn't an old problem or a novel
>> solution.
> Seth can pass an fd around.  This is actually a plausible thing to do:
> Seth creates a userns to sandbox himself, mounts some FUSE thing in
> there, and passes an fd out for the benefit of some daemon.  That
> daemon had better validate the thing before using it, though.

Point. It won't, but it should.

> I really don't see the benefit of making up extra rules that apply to
> users outside a userns who try to access specifically a filesystem
> with backing store.  They wouldn't make sense for filesystems without
> backing store.

Sure it would. For Smack, it would be the label a file would be
created with, which would be the label of the process creating
the memory based filesystem. For SELinux the rules are a
touch more sophisticated, but I'm sure that Paul or Stephen could
come up with how to determine it.

The point, looping all the way back to the beginning, where we
were talking about just ignoring the labels on the filesystem,
is that if you use the same Smack label on the files in the
filesystem as the backing store file has, we'll all be happy.
If that label isn't something the user can write to, he won't be
able to write to the mounted objects, either. If there is no
backing store then use the label of the process creating the
filesystem, which will be the user, which will mean everything
will work hunky dory.

Yes, there's work involved, but I doubt there's a lot. Getting
the label from the backing store or the creating process is
simple enough.


>>>> If you can mount a filesystem such that the labels are ignored you
>>>> are effectively specifying that the Smack label on the files be
>>>> determined by the defaulting rules. With CAP_MAC_ADMIN that's fine.
>>>> Without it, it's not.
>>> Can you explain what the threat model is here?  I don't see what it is
>>> that you're trying to prevent.
>> Um, OK.
>> The filesystem has files with a hundred different Smack labels on it.
>> I mount it as an unlabeled filesystem and everything is readable by
>> everyone. Bad jojo.
> I still don't understand.  If it's a filesystem backed by a file that
> Seth has RW access to, then Seth can read everything on it, full stop.
> The security labels in the filesystem are irrelevant.

Well, they can't be trusted, if that's what you mean.
That's why I'm saying that the objects exposed by mounting
this backing store need to be treated with the same security
attributes as the backing store. Fudge it for DAC if you are
so inclined, but I think it's the right way to go for MAC.

> This is like saying that, if you put restrictive labels in the
> filesystem that lives on /dev/sda2 and give Seth ownership of
> /dev/sda2, then you expect Seth to be unable to bypass the policy
> specified by your labels.

Consider the Smack label on /dev/sda2. Smack does not care
who owns it, just what the Smack label is. Just like on
~/seth/myfs. The backing store "object" is /dev/sda2 in the
one case, ~/seth/myfs in the other, and something in the ether
for a memory based filesystem. So long as the labels of the
files exposed on the mount point match those of the backing
store "object", Smack is going to be happy. Since you're
running without privilege, you can't change the labels on
the files.
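
As a rough illustration of the rule described above (a toy model in
Python, not Smack code; the function names and the example labels are
made up for this sketch): the mount takes the label of its backing
store "object", falls back to the label of the creating process when
there is no backing store, and a file is accepted only if its stored
label matches.

```python
from typing import Optional


def mount_label(backing_store_label: Optional[str], creator_label: str) -> str:
    """Label applied to every object exposed by an unprivileged mount."""
    # A file or block device backing the mount supplies the label.
    if backing_store_label is not None:
        return backing_store_label
    # Memory based filesystem: no backing store, use the mounting process.
    return creator_label


def label_acceptable(on_disk_label: str, label_for_mount: str) -> bool:
    """Accept a file only if its stored label matches the mount's label."""
    return on_disk_label == label_for_mount


# Hypothetical labels, for illustration only.
loop_mount = mount_label("SethData", creator_label="Seth")
tmpfs_mount = mount_label(None, creator_label="Seth")
print(loop_mount, tmpfs_mount)
print(label_acceptable("SethData", loop_mount),
      label_acceptable("System", loop_mount))
```

Under this model, relabeling files offline can only cause them to be
rejected; it cannot widen what the mounter is able to reach.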

Now Seth, being the sneaky person that he is, could change
the Smack labels on the files in the backing store while it's
offline. Since he has access to the backing store, he can't
give himself more access by changing the labels within the
filesystem. He can give himself less, but I'm OK with that.

> Or maybe I'm misunderstanding you.

Probably, but I'm undoubtedly doing the same.

If you're going to be at LinuxCon in Seattle we should
continue this discussion over the beverage of your choice.

>>>>> Your point is taken about my less-than-expert opinion about the other
>>>>> security modules. We should at minimum get acks from the maintainers of
>>>>> those modules that unprivileged mounts will not compromise MAC.
>>>> I am the Smack maintainer. Unprivileged mounts as you have
>>>> described them compromise MAC. They compromise DAC, too.
>>>>
>>> How do they compromise DAC?
>> Wilma's expectation (or the application running with a mapped UID)
>> that chmod will keep Seth out of the file.
> That was never true.  If Seth has an open fd, Wilma can chmod all day
> and it won't matter.  In this example, Seth owns the entire filesystem
> along with its backing store.
>
> --Andy



* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-17  0:09             ` Dave Chinner
@ 2015-07-17  0:42               ` Eric W. Biederman
  2015-07-17  2:47                 ` Dave Chinner
  2015-07-18  0:07                 ` Serge E. Hallyn
  0 siblings, 2 replies; 69+ messages in thread
From: Eric W. Biederman @ 2015-07-17  0:42 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Casey Schaufler, Andy Lutomirski, Seth Forshee, Alexander Viro,
	Linux FS Devel, LSM List, SELinux-NSA, Serge Hallyn,
	linux-kernel

Dave Chinner <david@fromorbit.com> writes:

> On Wed, Jul 15, 2015 at 11:47:08PM -0500, Eric W. Biederman wrote:
>> Casey Schaufler <casey@schaufler-ca.com> writes:
>> > On 7/15/2015 6:08 PM, Andy Lutomirski wrote:
>> >> If I mount an unprivileged filesystem, then either the contents were
>> >> put there *by me*, in which case letting me access them is fine, or
>> >> (with Seth's patches and then some) I control the backing store, in
>> >> which case I can do whatever I want regardless of what LSM thinks.
>> >>
>> >> So I don't see the problem.  Why would Smack or any other LSM care at
>> >> all, unless it wants to prevent me from mounting the fs in the first
>> >> place?
>> >
>> > First off, I don't cotton to the notion that you should be able
>> > to mount filesystems without privilege. But it seems I'm being
>> > outvoted on that. I suspect that there are cases where it might
>> > be safe, but I can't think of one off the top of my head.
>> 
>> There are two fundamental issues with mounting filesystems without privilege,
>> by which I actually mean mounting filesystems as the root user in a user
>> namespace.
>> 
>> - Are the semantics safe?
>> - Is the extra attack surface a problem?
>
> I think the attack surface this exposes is the biggest problem
> facing this proposal.

I completely agree.

>> Figuring out how to make semantics safe is what we are talking about.
>> 
>> Once we sort out the semantics we can look at the handful of filesystems
>> like fuse where the extra attack surface is not a concern.
>> 
>> With that said desktop environments have for a long time been
>> automatically mounting whichever filesystem you place in your computer,
>> so in practice what this is really about is trying to align the kernel
>> with how people use filesystems.
>
> The key difference is that desktops only do this when you physically
> plug in a device. With unprivileged mounts, a hostile attacker
> doesn't need physical access to the machine to exploit lurking
> kernel filesystem bugs. i.e. they can just use loopback mounts, and
> they can keep mounting corrupted images until they find something
> that works.

Yep.  That magnifies the problem quite a bit.

> User namespaces are supposed to provide trust separation.  The
> kernel filesystems simply aren't hardened against unprivileged
> attacks from below - there is a trust relationship between root and
> the filesystem in that they are the only things that can write to
> the disk. Mounts from within a userns destroy this relationship as
> the userns root, by definition, is not a trusted actor.

I talked to Ted Ts'o a while back and ext4 is at least in principle
already hardened against that kind of attack.  I am not certain I
believe it, but if it is true I think it is fantastic.

At this point I figure any setting of the FS_USER_MOUNT flag needs to go
through the filesystem maintainer's tree, and they need to be aware of and
agree to deal with the attack from below issue.

The one filesystem I truly expect we can make work is fuse.  fuse has
been designed to deal with some variation of the attack from below issue
since day one.  We looked at what the patches to fuse would look like
with the current state of the vfs and it was not pretty.

We very much need to sort through as much as possible at the vfs layer,
and in generic code.  Allow everyone to see what is going on and how
it works before proceeding with enabling any filesystems.



I truly hope we can find a small set of block device filesystems that we
can harden against attack from below.  That would allow linux to have serious
defenses against evil usb stick attacks.  I think that is going to take
a lot of careful coding, testing and validation and advancing the state
of the art to get there.

Eric


* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-16 11:16     ` Lukasz Pawelczyk
@ 2015-07-17  0:10       ` Eric W. Biederman
  2015-07-17 10:13         ` Lukasz Pawelczyk
  0 siblings, 1 reply; 69+ messages in thread
From: Eric W. Biederman @ 2015-07-17  0:10 UTC (permalink / raw)
  To: Lukasz Pawelczyk
  Cc: Casey Schaufler, Seth Forshee, Alexander Viro, linux-fsdevel,
	linux-security-module, selinux, Serge Hallyn, Andy Lutomirski,
	linux-kernel

Lukasz Pawelczyk <l.pawelczyk@samsung.com> writes:

> On Wed, 2015-07-15 at 16:06 -0500, Eric W. Biederman wrote:
>> 
>> I am on the fence with Lukasz Pawelczyk's patches.  Some parts I liked,
>> some parts I had issues with.  As I recall one of my issues was that
>> those patches conflicted in detail if not in principle with this
>> approach.
>> 
>> If these patches do not do a good job of laying the ground work for
>> supporting security labels that unprivileged users can set then Seth
>> could really use some feedback.  Figuring out how to properly deal with
>> the LSMs has been one of his challenges.
>
> I fail to see how those 2 are in any conflict. 

Like I said.  They don't really conflict, and actually to really support
things well for smack we probably need something like your patches.

At the same time a patch written without dealing with s_user_ns is going
to fail to take a lot of important details into account.

Right now after fixing the mount namespace issues the top priority is to
work through the details and get s_user_ns implemented.  By that I mean
some version of patch 1 of Seth's series.

s_user_ns fundamentally changes how the concepts are represented in the
kernel in a way that is easier to secure, and that fundamentally better
matches things.  And sigh.  This review has shown we don't quite have
all of the details worked out.

> If your approach here is to treat user ns mounted filesystems as if they
> didn't support xattrs at all then my patches don't conflict here any
> more than Smack itself already does.

The end game if people developing smack choose to play, is to figure out
how to store your unmapped labels in a filesystem contained by a
user namespace and a smack label namespace root.

> If the filesystem will get a default (e.g. by smack* mount options)
> label then this label will co-work with Smack namespaces.

A default, but I don't know if it will be smack mount options that will
give that default.  The devil is in the details and there are a lot
of details.

Eric


* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-16  4:47           ` Eric W. Biederman
@ 2015-07-17  0:09             ` Dave Chinner
  2015-07-17  0:42               ` Eric W. Biederman
  2015-07-20 17:54             ` Colin Walters
  1 sibling, 1 reply; 69+ messages in thread
From: Dave Chinner @ 2015-07-17  0:09 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Casey Schaufler, Andy Lutomirski, Seth Forshee, Alexander Viro,
	Linux FS Devel, LSM List, SELinux-NSA, Serge Hallyn,
	linux-kernel

On Wed, Jul 15, 2015 at 11:47:08PM -0500, Eric W. Biederman wrote:
> Casey Schaufler <casey@schaufler-ca.com> writes:
> > On 7/15/2015 6:08 PM, Andy Lutomirski wrote:
> >> If I mount an unprivileged filesystem, then either the contents were
> >> put there *by me*, in which case letting me access them is fine, or
> >> (with Seth's patches and then some) I control the backing store, in
> >> which case I can do whatever I want regardless of what LSM thinks.
> >>
> >> So I don't see the problem.  Why would Smack or any other LSM care at
> >> all, unless it wants to prevent me from mounting the fs in the first
> >> place?
> >
> > First off, I don't cotton to the notion that you should be able
> > to mount filesystems without privilege. But it seems I'm being
> > outvoted on that. I suspect that there are cases where it might
> > be safe, but I can't think of one off the top of my head.
> 
> There are two fundamental issues with mounting filesystems without privilege,
> by which I actually mean mounting filesystems as the root user in a user
> namespace.
> 
> - Are the semantics safe?
> - Is the extra attack surface a problem?

I think the attack surface this exposes is the biggest problem
facing this proposal.

> Figuring out how to make semantics safe is what we are talking about.
> 
> Once we sort out the semantics we can look at the handful of filesystems
> like fuse where the extra attack surface is not a concern.
> 
> With that said desktop environments have for a long time been
> automatically mounting whichever filesystem you place in your computer,
> so in practice what this is really about is trying to align the kernel
> with how people use filesystems.

The key difference is that desktops only do this when you physically
plug in a device. With unprivileged mounts, a hostile attacker
doesn't need physical access to the machine to exploit lurking
kernel filesystem bugs. i.e. they can just use loopback mounts, and
they can keep mounting corrupted images until they find something
that works.

User namespaces are supposed to provide trust separation.  The
kernel filesystems simply aren't hardened against unprivileged
attacks from below - there is a trust relationship between root and
the filesystem in that they are the only things that can write to
the disk. Mounts from within a userns destroy this relationship as
the userns root, by definition, is not a trusted actor.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-16 23:08             ` Casey Schaufler
@ 2015-07-16 23:29               ` Andy Lutomirski
  2015-07-17  0:45                 ` Casey Schaufler
  0 siblings, 1 reply; 69+ messages in thread
From: Andy Lutomirski @ 2015-07-16 23:29 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Seth Forshee, Eric W. Biederman, Alexander Viro, Linux FS Devel,
	LSM List, SELinux-NSA, Serge Hallyn, linux-kernel

On Thu, Jul 16, 2015 at 4:08 PM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> On 7/16/2015 3:27 PM, Andy Lutomirski wrote:
>> On Thu, Jul 16, 2015 at 2:42 PM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>> You want to provide a mechanism whereby an unprivileged user (Seth)
>>> can mount a filesystem for his own use. You want full filesystem
>>> semantics, but you're willing to accept restrictions on certain
>>> filesystem features to avoid opening security holes. You are not
>>> willing to accept restrictions that make the filesystem unusable,
>>> such as making it read-only.
>>>
>>> I am going to present a suggestion. Feel free to correct my
>>> assumptions and my reasoning. For simplicity let's use loop-back
>>> mounting of a filesystem contained in a file as an example. The
>>> principles should apply to newly created memory based filesystems
>>> or disk partitions "owned" by Seth.
>>>
>>> Seth wants to mount a file (~seth/myfs) which contains an ext4
>>> filesystem. There is already a filesystem object, with security
>>> attributes, that the system knows how to deal with. If Seth mounts
>>> this as a filesystem he, and potentially other people, will be
>>> able to access the content of this object without accessing the
>>> object itself.
>>>
>>>         seth$ mount --justforme -t ext4 ~seth/myfs /tmp/seth
>>>         seth$ chmod 777 /tmp/seth
>>>         seth$ ls -la /tmp/seth
>>>         drwxrwxrwx.  3  seth     seth   260 Jul 16 12:59 .
>>>         drwxrwxrwt. 18  root     root  4069 Jul 16 11:13 ..
>>>         seth$
>>>
>>> Everything's fine at this point. Wilma is also using the system,
>>> being the sort who likes to hide things in out of the way places
>>>
>>>         wilma$ cp ~/scandals /tmp/seth
>>>         wilma$ chmod 600 /tmp/seth/scandals
>> This is already impossible as described.  Seth can only mount the
>> filesystem in a private mount namespace inside a user namespace that
>> he created.  Wilma can't see it unless Seth passes an fd to Wilma and
>> Wilma accepts and uses it.
>
> But you do have multiple UIDs within your user namespace, right?
> There are processes running as someone other than seth, right?
>

Only if root set it up that way.  For example, root could set up
"subuids" (this is a userspace concept) that belong to Seth.  These
would be uids that Seth controls and that represent subsets of Seth's
authority. Wilma wouldn't be one of these subuids unless she was
somehow part of Seth (or if root completely screwed up).
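
For reference, a minimal sketch of how such a delegation can be looked
up, assuming the "name:first_uid:count" line format of /etc/subuid.
The sample entry, the uids and the helper names are invented for
illustration.

```python
def parse_subuid(text):
    """Parse /etc/subuid-style text into (owner, first_uid, count) tuples."""
    ranges = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        owner, first, count = line.split(":")
        ranges.append((owner, int(first), int(count)))
    return ranges


def owner_of(uid, ranges):
    """Return the user a subordinate uid was delegated to, or None."""
    for owner, first, count in ranges:
        if first <= uid < first + count:
            return owner
    return None


# Hypothetical entry: root delegated uids 100000..165535 to seth.
SAMPLE = "seth:100000:65536\n"
ranges = parse_subuid(SAMPLE)
print(owner_of(100000, ranges))  # inside seth's delegated range
print(owner_of(1001, ranges))    # e.g. Wilma's real uid: not delegated
```

A uid outside every delegated range, such as Wilma's own, never shows
up as one of Seth's subuids unless root writes the file that way.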

>>
>>> puts her list of scandals on the unsuspecting filesystem, and changes
>>> the mode to ensure that no one can find out what went on after the
>>> office party.
>>>
>>> Seth unmounts /tmp/seth. He looks in ~seth/myfs, finds out what really
>>> happened at the office party, and the story goes from there.
>>>
>>> Wilma did everything correctly according to the system security policy,
>>> but the system security policy did not protect her as advertised. The
>>> system was tricked into behaving as if it was in control of the content
>>> of the filesystem when in fact it was not.
>>
>> I would argue that, if Wilma writes to some place described by an fd
>> and doesn't verify where she's writing to, then she has no expectation
>> of privacy.  After all, she could just *tell* Seth directly whatever
>> she wants (assuming she can communicate with Seth in the first place).
>
> Don't ascribe either wisdom or good intentions to Wilma.

In that case, I'll mention the futility of solving the problem, even
without user namespaces.  If Wilma tells Seth something, he's going to
find out.  If Wilma pokes it (in whatever form) into an fd provided by
Seth, then Seth is extremely likely to find out, regardless of what
root or the MAC owner tries to do.

If Wilma writes to a path that's mounted in her namespace, then, sure,
overall policy associated with her namespace (which, in your example,
is the root namespace) must apply.  But Seth can't mount things into
Wilma's namespace without having CAP_SYS_ADMIN in that namespace and,
if he has CAP_SYS_ADMIN, it's already game over.

>
>>> One way to fix this problem is for unprivileged mounts to recognize the
>>> attributes of the object mounted and to propagate those attributes to all
>>> the objects they present. All files on /tmp/seth would be owned by seth
>>> and protected by the mode bits, ACL and LSM requirements of ~/seth/myfs.
>> This is impossible to enforce, because Seth could use FUSE instead of ext4.
>
> I never said that things aren't already broken. And, if you want
> to ignore the potential DAC issues (read, negative groups) just
> do it for the LSM xattrs.
>

Negative groups are a solved problem, I believe.

>
>>
>>> opening a file on /tmp/seth would require the same permissions as opening
>>> the file containing the mounted filesystem. These attributes would have to
>>> be immutable, or at least demonstrably more restrictive (chmod might be
>>> allowed in some cases, but chown would never be) when changed. I don't see
>>> how a user other than seth could create a new file, as you'd either have
>>> a magical change in ownership or a false sense of security.
>> This would be a very harsh restriction.  Seth might legitimately want
>> to give a user access to a file on backing store he owns without
>> giving that user access to the backing store.  Root on a normal system
>> does that all the time.
>
> You already said that it was impossible for Wilma to get
> access, so how is this more restrictive? Besides, Seth can
> always set the mode on ~/seth so that Wilma can't read the
> files it contains. This isn't an old problem or a novel
> solution.

Seth can pass an fd around.  This is actually a plausible thing to do:
Seth creates a userns to sandbox himself, mounts some FUSE thing in
there, and passes an fd out for the benefit of some daemon.  That
daemon had better validate the thing before using it, though.

I really don't see the benefit of making up extra rules that apply to
users outside a userns who try to access specifically a filesystem
with backing store.  They wouldn't make sense for filesystems without
backing store.

>
>>> If you can mount a filesystem such that the labels are ignored you
>>> are effectively specifying that the Smack label on the files be
>>> determined by the defaulting rules. With CAP_MAC_ADMIN that's fine.
>>> Without it, it's not.
>> Can you explain what the threat model is here?  I don't see what it is
>> that you're trying to prevent.
>
> Um, OK.
> The filesystem has files with a hundred different Smack labels on it.
> I mount it as an unlabeled filesystem and everything is readable by
> everyone. Bad jojo.

I still don't understand.  If it's a filesystem backed by a file that
Seth has RW access to, then Seth can read everything on it, full stop.
The security labels in the filesystem are irrelevant.

This is like saying that, if you put restrictive labels in the
filesystem that lives on /dev/sda2 and give Seth ownership of
/dev/sda2, then you expect Seth to be unable to bypass the policy
specified by your labels.

Or maybe I'm misunderstanding you.

>
>>
>>>> Your point is taken about my less-than-expert opinion about the other
>>>> security modules. We should at minimum get acks from the maintainers of
>>>> those modules that unprivileged mounts will not compromise MAC.
>>> I am the Smack maintainer. Unprivileged mounts as you have
>>> described them compromise MAC. They compromise DAC, too.
>>>
>> How do they compromise DAC?
>
> Wilma's expectation (or the application running with a mapped UID)
> that chmod will keep Seth out of the file.

That was never true.  If Seth has an open fd, Wilma can chmod all day
and it won't matter.  In this example, Seth owns the entire filesystem
along with its backing store.
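
A minimal sketch of the open-fd point, in Python and using a scratch
file as a stand-in (one process plays both roles here, so this only
shows that permission bits are checked at open() time, not on every
read):

```python
import os
import tempfile


def read_after_chmod():
    """Open a file, strip all mode bits, then read through the old fd."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"scandals")
        os.close(fd)
        # "Seth" opens the file while he still has permission.
        seth_fd = os.open(path, os.O_RDONLY)
        try:
            # "Wilma" removes every permission bit afterwards.
            os.chmod(path, 0)
            # The read still goes through the already-open descriptor.
            return os.read(seth_fd, 64)
        finally:
            os.close(seth_fd)
    finally:
        os.unlink(path)


print(read_after_chmod())
```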

--Andy


* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-16 22:27           ` Andy Lutomirski
@ 2015-07-16 23:08             ` Casey Schaufler
  2015-07-16 23:29               ` Andy Lutomirski
  0 siblings, 1 reply; 69+ messages in thread
From: Casey Schaufler @ 2015-07-16 23:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Seth Forshee, Eric W. Biederman, Alexander Viro, Linux FS Devel,
	LSM List, SELinux-NSA, Serge Hallyn, linux-kernel

On 7/16/2015 3:27 PM, Andy Lutomirski wrote:
> On Thu, Jul 16, 2015 at 2:42 PM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>> You want to provide a mechanism whereby an unprivileged user (Seth)
>> can mount a filesystem for his own use. You want full filesystem
>> semantics, but you're willing to accept restrictions on certain
>> filesystem features to avoid opening security holes. You are not
>> willing to accept restrictions that make the filesystem unusable,
>> such as making it read-only.
>>
>> I am going to present a suggestion. Feel free to correct my
>> assumptions and my reasoning. For simplicity let's use loop-back
>> mounting of a filesystem contained in a file as an example. The
>> principles should apply to newly created memory based filesystems
>> or disk partitions "owned" by Seth.
>>
>> Seth wants to mount a file (~seth/myfs) which contains an ext4
>> filesystem. There is already a filesystem object, with security
>> attributes, that the system knows how to deal with. If Seth mounts
>> this as a filesystem he, and potentially other people, will be
>> able to access the content of this object without accessing the
>> object itself.
>>
>>         seth$ mount --justforme -t ext4 ~seth/myfs /tmp/seth
>>         seth$ chmod 777 /tmp/seth
>>         seth$ ls -la /tmp/seth
>>         drwxrwxrwx.  3  seth     seth   260 Jul 16 12:59 .
>>         drwxrwxrwt. 18  root     root  4069 Jul 16 11:13 ..
>>         seth$
>>
>> Everything's fine at this point. Wilma is also using the system,
>> being the sort who likes to hide things in out of the way places
>>
>>         wilma$ cp ~/scandals /tmp/seth
>>         wilma$ chmod 600 /tmp/seth/scandals
> This is already impossible as described.  Seth can only mount the
> filesystem in a private mount namespace inside a user namespace that
> he created.  Wilma can't see it unless Seth passes an fd to Wilma and
> Wilma accepts and uses it.

But you do have multiple UIDs within your user namespace, right?
There are processes running as someone other than seth, right?

>
>> puts her list of scandals on the unsuspecting filesystem, and changes
>> the mode to ensure that no one can find out what went on after the
>> office party.
>>
>> Seth unmounts /tmp/seth. He looks in ~seth/myfs, finds out what really
>> happened at the office party, and the story goes from there.
>>
>> Wilma did everything correctly according to the system security policy,
>> but the system security policy did not protect her as advertised. The
>> system was tricked into behaving as if it was in control of the content
>> of the filesystem when in fact it was not.
>
> I would argue that, if Wilma writes to some place described by an fd
> and doesn't verify where she's writing to, then she has no expectation
> of privacy.  After all, she could just *tell* Seth directly whatever
> she wants (assuming she can communicate with Seth in the first place).

Don't ascribe either wisdom or good intentions to Wilma.

>> One way to fix this problem is for unprivileged mounts to recognize the
>> attributes of the object mounted and to propagate those attributes to all
>> the objects they present. All files on /tmp/seth would be owned by seth
>> and protected by the mode bits, ACL and LSM requirements of ~/seth/myfs.
> This is impossible to enforce, because Seth could use FUSE instead of ext4.

I never said that things aren't already broken. And, if you want
to ignore the potential DAC issues (read, negative groups) just
do it for the LSM xattrs.


>
>> opening a file on /tmp/seth would require the same permissions as opening
>> the file containing the mounted filesystem. These attributes would have to
>> be immutable, or at least demonstrably more restrictive (chmod might be
>> allowed in some cases, but chown would never be) when changed. I don't see
>> how a user other than seth could create a new file, as you'd either have
>> a magical change in ownership or a false sense of security.
> This would be a very harsh restriction.  Seth might legitimately want
> to give a user access to a file on backing store he owns without
> giving that user access to the backing store.  Root on a normal system
> does that all the time.

You already said that it was impossible for Wilma to get
access, so how is this more restrictive? Besides, Seth can
always set the mode on ~/seth so that Wilma can't read the
files it contains. This isn't an old problem or a novel
solution.

>> If you can mount a filesystem such that the labels are ignored you
>> are effectively specifying that the Smack label on the files be
>> determined by the defaulting rules. With CAP_MAC_ADMIN that's fine.
>> Without it, it's not.
> Can you explain what the threat model is here?  I don't see what it is
> that you're trying to prevent.

Um, OK.
The filesystem has files with a hundred different Smack labels on it.
I mount it as an unlabeled filesystem and everything is readable by
everyone. Bad jojo.

>
>>> Your point is taken about my less-than-expert opinion about the other
>>> security modules. We should at minimum get acks from the maintainers of
>>> those modules that unprivileged mounts will not compromise MAC.
>> I am the Smack maintainer. Unprivileged mounts as you have
>> described them compromise MAC. They compromise DAC, too.
>>
> How do they compromise DAC?

Wilma's expectation (or the application running with a mapped UID)
that chmod will keep Seth out of the file.

> --Andy
>



* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-16 21:42         ` Casey Schaufler
@ 2015-07-16 22:27           ` Andy Lutomirski
  2015-07-16 23:08             ` Casey Schaufler
  2015-07-17 13:21           ` Seth Forshee
  1 sibling, 1 reply; 69+ messages in thread
From: Andy Lutomirski @ 2015-07-16 22:27 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Seth Forshee, Eric W. Biederman, Alexander Viro, Linux FS Devel,
	LSM List, SELinux-NSA, Serge Hallyn, linux-kernel

On Thu, Jul 16, 2015 at 2:42 PM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> You want to provide a mechanism whereby an unprivileged user (Seth)
> can mount a filesystem for his own use. You want full filesystem
> semantics, but you're willing to accept restrictions on certain
> filesystem features to avoid opening security holes. You are not
> willing to accept restrictions that make the filesystem unusable,
> such as making it read-only.
>
> I am going to present a suggestion. Feel free to correct my
> assumptions and my reasoning. For simplicity let's use loop-back
> mounting of a filesystem contained in a file as an example. The
> principles should apply to newly created memory based filesystems
> or disk partitions "owned" by Seth.
>
> Seth wants to mount a file (~seth/myfs) which contains an ext4
> filesystem. There is already a filesystem object, with security
> attributes, that the system knows how to deal with. If Seth mounts
> this as a filesystem he, and potentially other people, will be
> able to access the content of this object without accessing the
> object itself.
>
>         seth$ mount --justforme -t ext4 ~seth/myfs /tmp/seth
>         seth$ chmod 777 /tmp/seth
>         seth$ ls -la /tmp/seth
>         drwxrwxrwx.  3  seth     seth   260 Jul 16 12:59 .
>         drwxrwxrwt. 18  root     root  4069 Jul 16 11:13 ..
>         seth$
>
> Everything's fine at this point. Wilma is also using the system,
> being the sort who likes to hide things in out of the way places
>
>         wilma$ cp ~/scandals /tmp/seth
>         wilma$ chmod 600 /tmp/seth/scandals

This is already impossible as described.  Seth can only mount the
filesystem in a private mount namespace inside a user namespace that
he created.  Wilma can't see it unless Seth passes an fd to Wilma and
Wilma accepts and uses it.

>
> puts her list of scandals on the unsuspecting filesystem, and changes
> the mode to ensure that no one can find out what went on after the
> office party.
>
> Seth unmounts /tmp/seth. He looks in ~seth/myfs, finds out what really
> happened at the office party, and the story goes from there.
>
> Wilma did everything correctly according to the system security policy,
> but the system security policy did not protect her as advertised. The
> system was tricked into behaving as if it was in control of the content
> of the filesystem when in fact it was not.


I would argue that, if Wilma writes to some place described by an fd
and doesn't verify where she's writing to, then she has no expectation
of privacy.  After all, she could just *tell* Seth directly whatever
she wants (assuming she can communicate with Seth in the first place).

>
> One way to fix this problem is for unprivileged mounts to recognize the
> attributes of the object mounted and to propagate those attributes to all
> the objects they present. All files on /tmp/seth would be owned by seth
> and protected by the mode bits, ACL and LSM requirements of ~/seth/myfs.

This is impossible to enforce, because Seth could use FUSE instead of ext4.

> opening a file on /tmp/seth would require the same permissions as opening
> the file containing the mounted filesystem. These attributes would have to
> be immutable, or at least demonstrably more restrictive (chmod might be
> allowed in some cases, but chown would never be) when changed. I don't see
> how a user other than seth could create a new file, as you'd either have
> a magical change in ownership or a false sense of security.

This would be a very harsh restriction.  Seth might legitimately want
to give a user access to a file on backing store he owns without
giving that user access to the backing store.  Root on a normal system
does that all the time.

> If you can mount a filesystem such that the labels are ignored you
> are effectively specifying that the Smack label on the files be
> determined by the defaulting rules. With CAP_MAC_ADMIN that's fine.
> Without it, it's not.

Can you explain what the threat model is here?  I don't see what it is
that you're trying to prevent.

>> Your point is taken about my less-than-expert opinion about the other
>> security modules. We should at minimum get acks from the maintainers of
>> those modules that unprivileged mounts will not compromise MAC.
>
> I am the Smack maintainer. Unprivileged mounts as you have
> described them compromise MAC. They compromise DAC, too.
>

How do they compromise DAC?

--Andy

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-16 18:57       ` Seth Forshee
@ 2015-07-16 21:42         ` Casey Schaufler
  2015-07-16 22:27           ` Andy Lutomirski
  2015-07-17 13:21           ` Seth Forshee
  0 siblings, 2 replies; 69+ messages in thread
From: Casey Schaufler @ 2015-07-16 21:42 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Eric W. Biederman, Alexander Viro, linux-fsdevel,
	linux-security-module, selinux, Serge Hallyn, Andy Lutomirski,
	linux-kernel

On 7/16/2015 11:57 AM, Seth Forshee wrote:
> On Thu, Jul 16, 2015 at 08:09:20AM -0700, Casey Schaufler wrote:
>> On 7/16/2015 6:59 AM, Seth Forshee wrote:
>>> On Wed, Jul 15, 2015 at 10:15:21PM -0500, Eric W. Biederman wrote:
>>>> Seth I think for the LSMs we should start with:
>>>>
>>>> diff --git a/security/security.c b/security/security.c
>>>> index 062f3c997fdc..5b6ece92a8e5 100644
>>>> --- a/security/security.c
>>>> +++ b/security/security.c
>>>> @@ -310,6 +310,8 @@ int security_sb_statfs(struct dentry *dentry)
>>>>  int security_sb_mount(const char *dev_name, struct path *path,
>>>>                         const char *type, unsigned long flags, void *data)
>>>>  {
>>>> +       if (current_user_ns() != &init_user_ns)
>>>> +               return -EPERM;
>>>>         return call_int_hook(sb_mount, 0, dev_name, path, type, flags, data);
>>>>  }
>>> This just makes it impossible to mount from a user namespace. Every
>>> mount from current_user_ns() != &init_user_ns will fail.
>>>
>>>> Then we should push this down into all of the lsms.
>>>> Then we should remove or relax or change the check as appropriate
>>>> in each lsm.
>>>>
>>>> The point is this is good enough to see that it is trivially safe,
>>>> and this allows us to focus on the core issues, and stop worrying about
>>>> the lsms for a bit.
>> Given the extent to which LSMs are deployed I find it a bit
>> worrisome that they might not be considered a "core issue".
>>
>>>> Then we can focus on each lsm one at a time and take the time to really
>>>> understand them and talk with their maintainers etc to make certain
>>>> we get things correct.
>> The "Do the easy stuff, fix the hard stuff after we've sold the product"
>> approach works really well until you get to the point of fixing the hard
>> stuff. This is the origin of the 90/90 rule of software development.
>>
>>>> This should remove the need for your patches 5, 6 and 7. For the
>>>> immediate future.
>>> I'm still not entirely sure what you were trying to do, maybe refuse to
>>> mount whenever a security module is loaded? I think this could be a good
>>> option to start, but couldn't we restrict it to only the LSMs which use
>>> xattrs for security labels? In situations where the filesystem cannot
>>> supply security policy metadata I can't think of any reason to disallow
>>> the mounts.
>> This whole notion of mounting a generic filesystem (e.g. ext4) that
>> is "owned" by a user (as opposed to the system) has lots of implications,
>> and I seriously doubt that many of them have been accounted for.
>>
>> Think back to the "negative group access" issue. You can't just
>> ignore issues that are inconvenient, or claim that you have a reasonable
>> system just because *you* can't think of a problem.
> I've spent a lot of time considering the implications and previous
> vulnerabilities, and I've addressed everything I turned up. Now I'm
> asking for review from those with more experience with and expertise in
> the code in question. I'm not sure what more I should be doing.

Part of the problem I see is that you're looking at the details
when there's an architectural issue. That's OK, it happens all
the time, but we have to pull the issue up slightly higher in
order to address the underlying difficulties.

You want to provide a mechanism whereby an unprivileged user (Seth)
can mount a filesystem for his own use. You want full filesystem
semantics, but you're willing to accept restrictions on certain
filesystem features to avoid opening security holes. You are not
willing to accept restrictions that make the filesystem unusable,
such as making it read-only.

I am going to present a suggestion. Feel free to correct my
assumptions and my reasoning. For simplicity let's use loop-back
mounting of a filesystem contained in a file as an example. The
principles should apply to newly created memory based filesystems
or disk partitions "owned" by Seth.

Seth wants to mount a file (~seth/myfs) which contains an ext4
filesystem. There is already a filesystem object, with security
attributes, that the system knows how to deal with. If Seth mounts
this as a filesystem he, and potentially other people, will be
able to access the content of this object without accessing the
object itself.

	seth$ mount --justforme -t ext4 ~seth/myfs /tmp/seth
	seth$ chmod 777 /tmp/seth
	seth$ ls -la /tmp/seth
	drwxrwxrwx.  3  seth     seth   260 Jul 16 12:59 .
	drwxrwxrwt  18  root     root  4069 Jul 16 11:13 ..
	seth$

Everything's fine at this point. Wilma is also using the system,
being the sort who likes to hide things in out of the way places

	wilma$ cp ~/scandals /tmp/seth
	wilma$ chmod 600 /tmp/seth/scandals

puts her list of scandals on the unsuspecting filesystem, and changes
the mode to ensure that no one can find out what went on after the
office party.

Seth unmounts /tmp/seth. He looks in ~seth/myfs, finds out what really
happened at the office party, and the story goes from there.

Wilma did everything correctly according to the system security policy,
but the system security policy did not protect her as advertised. The
system was tricked into behaving as if it was in control of the content
of the filesystem when in fact it was not.

One way to fix this problem is for unprivileged mounts to recognize the
attributes of the object mounted and to propagate those attributes to all
the objects they present. All files on /tmp/seth would be owned by seth
and protected by the mode bits, ACL and LSM requirements of ~seth/myfs.
Opening a file on /tmp/seth would require the same permissions as opening
the file containing the mounted filesystem. These attributes would have to
be immutable, or at least demonstrably more restrictive (chmod might be
allowed in some cases, but chown would never be) when changed. I don't see
how a user other than seth could create a new file, as you'd either have
a magical change in ownership or a false sense of security.
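
As a rough illustration of that rule, here is a userspace sketch in C. It
is an assumption-laden model, not kernel code: struct obj_attrs,
may_read(), may_read_on_mount() and chmod_allowed() are hypothetical
names, and group bits, ACLs and LSM labels are left out so that only the
owner/other read bits are modelled.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical: the DAC attributes of one filesystem object. */
struct obj_attrs {
	uid_t uid;
	mode_t mode;
};

/* Owner/other read check only; groups, ACLs and LSMs are omitted. */
bool may_read(const struct obj_attrs *a, uid_t uid)
{
	if (uid == a->uid)
		return (a->mode & 0400) != 0;
	return (a->mode & 0004) != 0;
}

/*
 * Files on the unprivileged mount are judged by the attributes of the
 * backing file; whatever the image claims on disk is ignored.
 */
bool may_read_on_mount(const struct obj_attrs *backing,
		       const struct obj_attrs *on_disk, uid_t uid)
{
	(void)on_disk;
	return may_read(backing, uid);
}

/* A mode change is allowed only if it grants no new permission bits. */
bool chmod_allowed(mode_t old_mode, mode_t new_mode)
{
	return (new_mode & ~old_mode & 07777) == 0;
}
```

In this model, with ~seth/myfs owned by seth at mode 0600, wilma's uid
fails the check for every file on /tmp/seth, including one that the image
itself says she owns.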

I don't see that the presence of user namespaces changes anything. You
may reduce the set of uids available, but the problem with putting a
uid into someone else's file is just as real.

> I welcome feedback about anything I've missed, but stating generally
> that you think I probably missed something isn't very helpful.

True enough. I hope I've explained myself above.

> The LSM issue is thornier than the rest of it though, which is why I
> specifically asked for review there in the cover letter. There's a lot
> of complexity and nuance, and I still don't have a grasp on all the
> subtleties. One such subtlety is the full impact of simply ignoring the
> security labels on disk (but I am still confused as to why this is
> different from filesystems which don't support xattrs at all).

If you can mount a filesystem such that the labels are ignored you
are effectively specifying that the Smack label on the files be 
determined by the defaulting rules. With CAP_MAC_ADMIN that's fine.
Without it, it's not.

> I was unaware of Lukasz's patches until yesterday, and I will have a
> look at them. But since we don't have the LSM support for user
> namespaces yet, I don't see the problem with doing something safe for
> LSMs initially and evolving the LSM integration for user ns mounts along
> with the rest of the user ns integration.

Ignoring the security attributes is not safe!

> Your point is taken about my less-than-expert opinion about the other
> security modules. We should at minimum get acks from the maintainers of
> those modules that unprivileged mounts will not compromise MAC.

I am the Smack maintainer. Unprivileged mounts as you have
described them compromise MAC. They compromise DAC, too.


> For Smack specifically, I believe my only concern was the SMACK64EXEC
> attribute, as all the other attributes only affected subjects' access to
> the files. So maybe it would be possible to simply ignore this attribute
> in unprivileged mounts and respect the others, even lacking more
> complete LSM support for user namespaces.

SMACK64EXEC is analogous to the setuid bit, but I would rather see
exec() of programs with this attribute refused than for it to be
blindly ignored.

> Seth
>


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-16 15:09     ` Casey Schaufler
@ 2015-07-16 18:57       ` Seth Forshee
  2015-07-16 21:42         ` Casey Schaufler
  0 siblings, 1 reply; 69+ messages in thread
From: Seth Forshee @ 2015-07-16 18:57 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Eric W. Biederman, Alexander Viro, linux-fsdevel,
	linux-security-module, selinux, Serge Hallyn, Andy Lutomirski,
	linux-kernel

On Thu, Jul 16, 2015 at 08:09:20AM -0700, Casey Schaufler wrote:
> On 7/16/2015 6:59 AM, Seth Forshee wrote:
> > On Wed, Jul 15, 2015 at 10:15:21PM -0500, Eric W. Biederman wrote:
> >> Seth I think for the LSMs we should start with:
> >>
> >> diff --git a/security/security.c b/security/security.c
> >> index 062f3c997fdc..5b6ece92a8e5 100644
> >> --- a/security/security.c
> >> +++ b/security/security.c
> >> @@ -310,6 +310,8 @@ int security_sb_statfs(struct dentry *dentry)
> >>  int security_sb_mount(const char *dev_name, struct path *path,
> >>                         const char *type, unsigned long flags, void *data)
> >>  {
> >> +       if (current_user_ns() != &init_user_ns)
> >> +               return -EPERM;
> >>         return call_int_hook(sb_mount, 0, dev_name, path, type, flags, data);
> >>  }
> > This just makes it impossible to mount from a user namespace. Every
> > mount from current_user_ns() != &init_user_ns will fail.
> >
> >> Then we should push this down into all of the lsms.
> >> Then we should remove or relax or change the check as appropriate
> >> in each lsm.
> >>
> >> The point is this is good enough to see that it is trivially safe,
> >> and this allows us to focus on the core issues, and stop worrying about
> >> the lsms for a bit.
> 
> Given the extent to which LSMs are deployed I find it a bit
> worrisome that they might not be considered a "core issue".
> 
> >> Then we can focus on each lsm one at a time and take the time to really
> >> understand them and talk with their maintainers etc to make certain
> >> we get things correct.
> 
> The "Do the easy stuff, fix the hard stuff after we've sold the product"
> approach works really well until you get to the point of fixing the hard
> stuff. This is the origin of the 90/90 rule of software development.
> 
> >>
> >> This should remove the need for your patches 5, 6 and 7. For the
> >> immediate future.
> > I'm still not entirely sure what you were trying to do, maybe refuse to
> > mount whenever a security module is loaded? I think this could be a good
> > option to start, but couldn't we restrict it to only the LSMs which use
> > xattrs for security labels? In situations where the filesystem cannot
> > supply security policy metadata I can't think of any reason to disallow
> > the mounts.
> 
> This whole notion of mounting a generic filesystem (e.g. ext4) that
> is "owned" by a user (as opposed to the system) has lots of implications,
> and I seriously doubt that many of them have been accounted for.
> 
> Think back to the "negative group access" issue. You can't just
> ignore issues that are inconvenient, or claim that you have a reasonable
> system just because *you* can't think of a problem.

I've spent a lot of time considering the implications and previous
vulnerabilities, and I've addressed everything I turned up. Now I'm
asking for review from those with more experience with and expertise in
the code in question. I'm not sure what more I should be doing.

I welcome feedback about anything I've missed, but stating generally
that you think I probably missed something isn't very helpful.

The LSM issue is thornier than the rest of it though, which is why I
specifically asked for review there in the cover letter. There's a lot
of complexity and nuance, and I still don't have a grasp on all the
subtleties. One such subtlety is the full impact of simply ignoring the
security labels on disk (but I am still confused as to why this is
different from filesystems which don't support xattrs at all).

I was unaware of Lukasz's patches until yesterday, and I will have a
look at them. But since we don't have the LSM support for user
namespaces yet, I don't see the problem with doing something safe for
LSMs initially and evolving the LSM integration for user ns mounts along
with the rest of the user ns integration.

Your point is taken about my less-than-expert opinion about the other
security modules. We should at minimum get acks from the maintainers of
those modules that unprivileged mounts will not compromise MAC.

For Smack specifically, I believe my only concern was the SMACK64EXEC
attribute, as all the other attributes only affected subjects' access to
the files. So maybe it would be possible to simply ignore this attribute
in unprivileged mounts and respect the others, even lacking more
complete LSM support for user namespaces.

Seth

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-16 13:59   ` Seth Forshee
  2015-07-16 15:09     ` Casey Schaufler
@ 2015-07-16 15:59     ` Seth Forshee
  1 sibling, 0 replies; 69+ messages in thread
From: Seth Forshee @ 2015-07-16 15:59 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Alexander Viro, linux-fsdevel, linux-security-module, selinux,
	Serge Hallyn, Andy Lutomirski, linux-kernel, Casey Schaufler

On Thu, Jul 16, 2015 at 08:59:47AM -0500, Seth Forshee wrote:
> On Wed, Jul 15, 2015 at 10:15:21PM -0500, Eric W. Biederman wrote:
> > 
> > Seth I think for the LSMs we should start with:
> > 
> > diff --git a/security/security.c b/security/security.c
> > index 062f3c997fdc..5b6ece92a8e5 100644
> > --- a/security/security.c
> > +++ b/security/security.c
> > @@ -310,6 +310,8 @@ int security_sb_statfs(struct dentry *dentry)
> >  int security_sb_mount(const char *dev_name, struct path *path,
> >                         const char *type, unsigned long flags, void *data)
> >  {
> > +       if (current_user_ns() != &init_user_ns)
> > +               return -EPERM;
> >         return call_int_hook(sb_mount, 0, dev_name, path, type, flags, data);
> >  }
> 
> This just makes it impossible to mount from a user namespace. Every
> mount from current_user_ns() != &init_user_ns will fail.

What might work instead is to add a check in security_sb_kern_mount
that looks at s_user_ns. That way, if proc, sysfs, etc. use
sget_userns(..., &init_user_ns), they can still be mounted in
containers.
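
A minimal sketch of the difference, as a userspace model rather than
kernel code. The struct layouts and the function names
(check_by_mounter, check_by_sb_owner) are hypothetical stand-ins; only
s_user_ns and init_user_ns come from the patches, and a real hook would
go through the LSM call chain instead of returning directly.

```c
#include <errno.h>

/* Hypothetical userspace stand-ins for the kernel objects. */
struct user_ns {
	const char *name;
};

struct super_block {
	const struct user_ns *s_user_ns;	/* namespace that owns the sb */
};

const struct user_ns init_user_ns = { "init" };

/*
 * Check keyed on the mounting task, as in the quoted diff: every mount
 * attempted from a non-init user namespace is refused.
 */
int check_by_mounter(const struct user_ns *current_ns)
{
	return current_ns != &init_user_ns ? -EPERM : 0;
}

/*
 * Check keyed on the superblock owner: only superblocks owned by a
 * non-init user namespace are refused, regardless of who mounts them.
 */
int check_by_sb_owner(const struct super_block *sb)
{
	return sb->s_user_ns != &init_user_ns ? -EPERM : 0;
}
```

In this model a proc-like superblock created with &init_user_ns passes
even when mounted from a container, while an ext4-like superblock owned
by the container's namespace is refused.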

It would be nicer to have a hook after sget but before fill_super so
that a bunch of work doesn't have to be done and then undone. Right now
there doesn't seem to be any suitable hook.

> > Then we should push this down into all of the lsms.
> > Then we should remove or relax or change the check as appropriate
> > in each lsm.
> > 
> > The point is this is good enough to see that it is trivially safe,
> > and this allows us to focus on the core issues, and stop worrying about
> > the lsms for a bit.
> > 
> > Then we can focus on each lsm one at a time and take the time to really
> > understand them and talk with their maintainers etc to make certain
> > we get things correct.
> > 
> > This should remove the need for your patches 5, 6 and 7. For the
> > immediate future.
> 
> I'm still not entirely sure what you were trying to do, maybe refuse to
> mount whenever a security module is loaded? I think this could be a good
> option to start, but couldn't we restrict it to only the LSMs which use
> xattrs for security labels? In situations where the filesystem cannot
> supply security policy metadata I can't think of any reason to disallow
> the mounts.
> 
> Seth

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-16 13:59   ` Seth Forshee
@ 2015-07-16 15:09     ` Casey Schaufler
  2015-07-16 18:57       ` Seth Forshee
  2015-07-16 15:59     ` Seth Forshee
  1 sibling, 1 reply; 69+ messages in thread
From: Casey Schaufler @ 2015-07-16 15:09 UTC (permalink / raw)
  To: Seth Forshee, Eric W. Biederman
  Cc: Alexander Viro, linux-fsdevel, linux-security-module, selinux,
	Serge Hallyn, Andy Lutomirski, linux-kernel

On 7/16/2015 6:59 AM, Seth Forshee wrote:
> On Wed, Jul 15, 2015 at 10:15:21PM -0500, Eric W. Biederman wrote:
>> Seth I think for the LSMs we should start with:
>>
>> diff --git a/security/security.c b/security/security.c
>> index 062f3c997fdc..5b6ece92a8e5 100644
>> --- a/security/security.c
>> +++ b/security/security.c
>> @@ -310,6 +310,8 @@ int security_sb_statfs(struct dentry *dentry)
>>  int security_sb_mount(const char *dev_name, struct path *path,
>>                         const char *type, unsigned long flags, void *data)
>>  {
>> +       if (current_user_ns() != &init_user_ns)
>> +               return -EPERM;
>>         return call_int_hook(sb_mount, 0, dev_name, path, type, flags, data);
>>  }
> This just makes it impossible to mount from a user namespace. Every
> mount from current_user_ns() != &init_user_ns will fail.
>
>> Then we should push this down into all of the lsms.
>> Then we should remove or relax or change the check as appropriate
>> in each lsm.
>>
>> The point is this is good enough to see that it is trivially safe,
>> and this allows us to focus on the core issues, and stop worrying about
>> the lsms for a bit.

Given the extent to which LSMs are deployed I find it a bit
worrisome that they might not be considered a "core issue".

>> Then we can focus on each lsm one at a time and take the time to really
>> understand them and talk with their maintainers etc to make certain
>> we get things correct.

The "Do the easy stuff, fix the hard stuff after we've sold the product"
approach works really well until you get to the point of fixing the hard
stuff. This is the origin of the 90/90 rule of software development.

>>
>> This should remove the need for your patches 5, 6 and 7. For the
>> immediate future.
> I'm still not entirely sure what you were trying to do, maybe refuse to
> mount whenever a security module is loaded? I think this could be a good
> option to start, but couldn't we restrict it to only the LSMs which use
> xattrs for security labels? In situations where the filesystem cannot
> supply security policy metadata I can't think of any reason to disallow
> the mounts.

This whole notion of mounting a generic filesystem (e.g. ext4) that
is "owned" by a user (as opposed to the system) has lots of implications,
and I seriously doubt that many of them have been accounted for.

Think back to the "negative group access" issue. You can't just
ignore issues that are inconvenient, or claim that you have a reasonable
system just because *you* can't think of a problem.

> Seth
>


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-16  3:15 ` Eric W. Biederman
@ 2015-07-16 13:59   ` Seth Forshee
  2015-07-16 15:09     ` Casey Schaufler
  2015-07-16 15:59     ` Seth Forshee
  0 siblings, 2 replies; 69+ messages in thread
From: Seth Forshee @ 2015-07-16 13:59 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Alexander Viro, linux-fsdevel, linux-security-module, selinux,
	Serge Hallyn, Andy Lutomirski, linux-kernel, Casey Schaufler

On Wed, Jul 15, 2015 at 10:15:21PM -0500, Eric W. Biederman wrote:
> 
> Seth I think for the LSMs we should start with:
> 
> diff --git a/security/security.c b/security/security.c
> index 062f3c997fdc..5b6ece92a8e5 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -310,6 +310,8 @@ int security_sb_statfs(struct dentry *dentry)
>  int security_sb_mount(const char *dev_name, struct path *path,
>                         const char *type, unsigned long flags, void *data)
>  {
> +       if (current_user_ns() != &init_user_ns)
> +               return -EPERM;
>         return call_int_hook(sb_mount, 0, dev_name, path, type, flags, data);
>  }

This just makes it impossible to mount from a user namespace. Every
mount from current_user_ns() != &init_user_ns will fail.

> Then we should push this down into all of the lsms.
> Then we should remove or relax or change the check as appropriate
> in each lsm.
> 
> The point is this is good enough to see that it is trivially safe,
> and this allows us to focus on the core issues, and stop worrying about
> the lsms for a bit.
> 
> Then we can focus on each lsm one at a time and take the time to really
> understand them and talk with their maintainers etc to make certain
> we get things correct.
> 
> This should remove the need for your patches 5, 6 and 7. For the
> immediate future.

I'm still not entirely sure what you were trying to do, maybe refuse to
mount whenever a security module is loaded? I think this could be a good
option to start, but couldn't we restrict it to only the LSMs which use
xattrs for security labels? In situations where the filesystem cannot
supply security policy metadata I can't think of any reason to disallow
the mounts.

Seth

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-16  1:05         ` Andy Lutomirski
  2015-07-16  2:20           ` Eric W. Biederman
@ 2015-07-16 13:12           ` Stephen Smalley
  1 sibling, 0 replies; 69+ messages in thread
From: Stephen Smalley @ 2015-07-16 13:12 UTC (permalink / raw)
  To: Andy Lutomirski, Eric W. Biederman
  Cc: Serge Hallyn, Seth Forshee, linux-kernel, LSM List,
	Alexander Viro, SELinux-NSA, Linux FS Devel

On 07/15/2015 09:05 PM, Andy Lutomirski wrote:
> On Jul 15, 2015 3:34 PM, "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>>
>> Seth Forshee <seth.forshee@canonical.com> writes:
>>
>>> On Wed, Jul 15, 2015 at 04:06:35PM -0500, Eric W. Biederman wrote:
>>>> Casey Schaufler <casey@schaufler-ca.com> writes:
>>>>
>>>>> On 7/15/2015 12:46 PM, Seth Forshee wrote:
>>>>>> These are the first in a larger set of patches that I've been working on
>>>>>> (with help from Eric Biederman) to support mounting ext4 and fuse
>>>>>> filesystems from within user namespaces. I've pushed the full series to:
>>>>>>
>>>>>>   git://kernel.ubuntu.com/sforshee/linux.git userns-mounts
>>>>>>
>>>>>> Taking the series as a whole, the strategy is to handle as much of the
>>>>>> heavy lifting as possible in the vfs so the filesystems don't have to
>>>>>> handle weird edge cases. If you look at the full series you'll find that
>>>>>> the changes in ext4 to support user namespace mounts turn out to be
>>>>>> fairly minimal (fuse is a bit more complicated though as it must deal
>>>>>> with translating ids for a userspace process which is running in pid and
>>>>>> user namespaces).
>>>>>>
>>>>>> The patches I'm sending today lay some of the groundwork in the vfs and
>>>>>> related code. They fall into two broad groups:
>>>>>>
>>>>>>  1. Patches 1-2 add s_user_ns and simplify MNT_NODEV handling. These are
>>>>>>     pretty straightforward, and Eric has expressed interest in merging
>>>>>>     these patches soon. Note that patch 2 won't apply cleanly without
>>>>>>     Eric's noexec patches for proc and sys [1].
>>>>>>
>>>>>>  2. Patches 2-7 tighten down security for mounts with s_user_ns !=
>>>>>>     &init_user_ns. This includes updates to how file caps and suid are
>>>>>>     handled and LSM updates to ignore security labels on superblocks
>>>>>>     from non-init namespaces.
>>>>>>
>>>>>>     The LSM changes in particular may not be optimal, as I don't have a
>>>>>>     lot of familiarity with this code, so I'd be especially appreciative
>>>>>>     of review of these changes and suggestions on how to improve them.
>>>>>
>>>>> Lukasz Pawelczyk <l.pawelczyk@samsung.com> proposed
>>>>> LSM support in user namespaces ([RFC] lsm: namespace hooks)
>>>>> that make a whole lot more sense than just turning off
>>>>> the option of using labels on files. Gutting the ability
>>>>> to use MAC in a namespace is a step down the road of
>>>>> making MAC and namespaces incompatible.
>>>>
>>>> This is not "turning off the option to use labels on files".
>>>>
>>>> This is supporting mounting filesystems like ext4 by unprivileged users
>>>> and not trusting the labels they set in the same way as we trust labels
>>>> on filesystems mounted by privileged users.
>>>>
>>>> The first step needs to be not trusting those labels and treating such
>>>> filesystems as filesystems without label support.  I hope that is what
>>>> Seth has implemented.
>>>>
>>>> In the long run we can do more interesting things with such filesystems
>>>> once the appropriate LSM policy is in place.
>>>
>>> Yes, this exactly. Right now it looks to me like the only safe thing to
>>> do with mounts from unprivileged users is to ignore the security labels,
>>> so that's what I'm trying to do with these changes. If there's some
>>> better thing to do, or some better way to do it, I'm more than happy to
>>> receive that feedback.
>>
>> Ugh.
>>
>> This made me realize that we have an interesting problem here.  An
>> unprivileged mount of tmpfs probably needs to have
>> s_user_ns == &init_user_ns.
>>
>> Otherwise we will break security labels on tmpfs for no good reason.
>> ramfs and sysfs also seem to have similar concerns.
>>
>> Because they have no backing store we can trust those filesystems with
>> security labels.  Plus for at least sysfs there is the security label
>> bleed through issue, that we need to make certain works.
>>
>> Perhaps these filesystems with trusted backing store need to call
>> "sget_userns(..., &init_user_ns)".
>>
>> If we don't get this right we will have significant regressions with
>> respect to security labels, and that is not ok.
> 
> That's only a problem if there's anyone who sets security labels on
> such a mount.  You need global caps to do that (I hope), which
> requires someone outside the userns to help, which means there's a
> good chance that literally no one does this.

Setting of security.selinux attributes is governed by SELinux permission
checks, not by capabilities.

Also, files are always assigned a label at creation time; a tmpfs inode
will be labeled based on its creator without any userspace entity ever
calling setxattr() at all.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-15 21:06   ` Eric W. Biederman
  2015-07-15 21:48     ` Seth Forshee
  2015-07-15 22:39     ` Casey Schaufler
@ 2015-07-16 11:16     ` Lukasz Pawelczyk
  2015-07-17  0:10       ` Eric W. Biederman
  2 siblings, 1 reply; 69+ messages in thread
From: Lukasz Pawelczyk @ 2015-07-16 11:16 UTC (permalink / raw)
  To: Eric W. Biederman, Casey Schaufler
  Cc: Seth Forshee, Alexander Viro, linux-fsdevel,
	linux-security-module, selinux, Serge Hallyn, Andy Lutomirski,
	linux-kernel

On Wed, 2015-07-15 at 16:06 -0500, Eric W. Biederman wrote:
> 
> I am on the fence with Lukasz Pawelczyk's patches.  Some parts I liked,
> some parts I had issues with.  As I recall one of my issues was that
> those patches conflicted in detail if not in principle with this
> approach.
> 
> If these patches do not do a good job of laying the ground work for
> supporting security labels that unprivileged users can set then Seth
> could really use some feedback.  Figuring out how to properly deal with
> the LSMs has been one of his challenges.

I fail to see how those 2 are in any conflict. Smack namespace is just
a means of limiting the view of Smack labels within a user namespace, to
be able to give some limited capabilities to processes in the namespace
to make it possible to partially administer Smack there. It doesn't
change Smack behaviour or mode of operation in any way.

If your approach here is to treat user ns mounted filesystems as if they
didn't support xattrs at all then my patches don't conflict here any
more than Smack itself already does.

If the filesystem gets a default label (e.g. via the smack* mount
options) then this label will work together with Smack namespaces.


-- 
Lukasz Pawelczyk
Samsung R&D Institute Poland
Samsung Electronics




^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-16  2:54         ` Casey Schaufler
@ 2015-07-16  4:47           ` Eric W. Biederman
  2015-07-17  0:09             ` Dave Chinner
  2015-07-20 17:54             ` Colin Walters
  0 siblings, 2 replies; 69+ messages in thread
From: Eric W. Biederman @ 2015-07-16  4:47 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Andy Lutomirski, Seth Forshee, Alexander Viro, Linux FS Devel,
	LSM List, SELinux-NSA, Serge Hallyn, linux-kernel

Casey Schaufler <casey@schaufler-ca.com> writes:

> On 7/15/2015 6:08 PM, Andy Lutomirski wrote:
>> On Wed, Jul 15, 2015 at 3:39 PM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>> On 7/15/2015 2:06 PM, Eric W. Biederman wrote:
>>>> Casey Schaufler <casey@schaufler-ca.com> writes:
>>>> The first step needs to be not trusting those labels and treating such
>>>> filesystems as filesystems without label support.  I hope that is what
>>>> Seth has implemented.
>>> A filesystem with Smack labels gets mounted in a namespace. The labels
>>> are ignored. Instead, the filesystem defaults (potentially specified as
>>> mount options smackfsdef="something", but usually the floor label ("_"))
>>> are used, giving the user the ability to read everything and (usually)
>>> change nothing. This is both dangerous (unintended read access to files)
>>> and pointless (can't make changes).
>> I don't get it.
>>
>> If I mount an unprivileged filesystem, then either the contents were
>> put there *by me*, in which case letting me access them are fine, or
>> (with Seth's patches and then some) I control the backing store, in
>> which case I can do whatever I want regardless of what LSM thinks.
>>
>> So I don't see the problem.  Why would Smack or any other LSM care at
>> all, unless it wants to prevent me from mounting the fs in the first
>> place?
>
> First off, I don't cotton to the notion that you should be able
> to mount filesystems without privilege. But it seems I'm being
> outvoted on that. I suspect that there are cases where it might
> be safe, but I can't think of one off the top of my head.

There are two fundamental issues with mounting filesystems without
privilege, by which I actually mean mounting filesystems as the root
user in a user namespace.

- Are the semantics safe?
- Is the extra attack surface a problem?

Figuring out how to make semantics safe is what we are talking about.

Once we sort out the semantics we can look at the handful of filesystems
like fuse where the extra attack surface is not a concern.

With that said desktop environments have for a long time been
automatically mounting whichever filesystem you place in your computer,
so in practice what this is really about is trying to align the kernel
with how people use filesystems.

I haven't looked closely but I think docker is just about as bad as
those desktop environments when it comes to mounting filesystems.

> If you do mount a filesystem it needs to behave according to the
> rules of the system.

I agree.

> If you have a security module that uses
> attributes on the filesystem you can't ignore them just because
> it's "your data". Mandatory access control schemes, including
> Smack and SELinux don't give a fig about who you are. It's the
> label on the data and the process that matter. If "you" get to
> muck the labels up, you've broken the mandatory access control.

So there are filesystems like fat and minix that can not store a label.
Since it is not possible to store labels securely in filesystems mounted
by unprivileged users (at least in the normal sense) the intent would be
to treat a filesystem mounted without the privileges of the global root
user as a filesystem that does not support xattrs.

Treating such a filesystem as a filesystem that does not support xattrs
is the only possible way to support such a filesystem securely, because
as you have said someone who can muck up the labels breaks mandatory
access control.
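
A minimal userspace sketch of that rule, for illustration only. The
struct layouts, sb_labels_trusted() and effective_label() are
hypothetical stand-ins and not kernel code; only the names s_user_ns and
init_user_ns are taken from this thread. It assumes the LSM falls back
to a mount-wide default label, as it would for fat or minix:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for kernel types; the layouts are illustrative. */
struct user_namespace { int level; };
struct user_namespace init_user_ns = { 0 };

/* Toy superblock: who mounted it, and can the format store xattrs? */
struct super_block {
	const struct user_namespace *s_user_ns;
	bool fs_has_xattrs;
};

/*
 * On-disk labels are honoured only when the format can store them
 * and the filesystem was mounted with global root privileges.
 */
bool sb_labels_trusted(const struct super_block *sb)
{
	assert(sb != NULL);
	return sb->fs_has_xattrs && sb->s_user_ns == &init_user_ns;
}

/*
 * Pick the label an LSM would act on: the on-disk one if trusted,
 * otherwise the mount-wide default.
 */
const char *effective_label(const struct super_block *sb,
			    const char *on_disk, const char *fs_default)
{
	if (sb_labels_trusted(sb) && on_disk != NULL)
		return on_disk;
	return fs_default;
}
```

In this model a superblock owned by a non-init user namespace behaves
exactly like one whose format has no xattrs: the on-disk label is never
consulted.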

Given how non-trivial it is to grasp the nuances of different LSMs'
mandatory access control semantics, I am asking Seth for the first pass
to simply forbid mounting of filesystems with just user namespace
permissions when there is an LSM active.

Once we get that far smack may never need to support such systems.

Eric

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-15 19:46 Seth Forshee
  2015-07-15 20:36 ` Casey Schaufler
@ 2015-07-16  3:15 ` Eric W. Biederman
  2015-07-16 13:59   ` Seth Forshee
  1 sibling, 1 reply; 69+ messages in thread
From: Eric W. Biederman @ 2015-07-16  3:15 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Alexander Viro, linux-fsdevel, linux-security-module, selinux,
	Serge Hallyn, Andy Lutomirski, linux-kernel, Casey Schaufler


Seth I think for the LSMs we should start with:

diff --git a/security/security.c b/security/security.c
index 062f3c997fdc..5b6ece92a8e5 100644
--- a/security/security.c
+++ b/security/security.c
@@ -310,6 +310,8 @@ int security_sb_statfs(struct dentry *dentry)
 int security_sb_mount(const char *dev_name, struct path *path,
                        const char *type, unsigned long flags, void *data)
 {
+       if (current_user_ns() != &init_user_ns)
+               return -EPERM;
        return call_int_hook(sb_mount, 0, dev_name, path, type, flags, data);
 }


Then we should push this down into all of the LSMs.
Then we should remove, relax, or change the check as appropriate
in each LSM.

The point is this is good enough to see that it is trivially safe,
and this allows us to focus on the core issues, and stop worrying about
the lsms for a bit.

Then we can focus on each LSM one at a time and take the time to really
understand them and talk with their maintainers etc. to make certain
we get things correct.
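
The staged plan above can be sketched in userspace roughly as follows.
This is a toy model, not kernel code: the hook type, both hook
functions, and call_sb_mount_hooks() are hypothetical names, and the
real sb_mount hook takes more arguments than shown here. The assumption
is that each LSM starts out with the blanket refusal and individual LSMs
relax it once their semantics have been reviewed:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Userspace stand-ins for kernel types; the layouts are illustrative. */
struct user_namespace { int level; };
struct user_namespace init_user_ns = { 0 };

/* Hypothetical per-LSM mount hook, reduced to what the check needs. */
typedef int (*sb_mount_hook)(const struct user_namespace *mounter_ns);

/* Step 1: the blanket check, pushed down into an LSM unchanged. */
int strict_sb_mount(const struct user_namespace *mounter_ns)
{
	if (mounter_ns != &init_user_ns)
		return -EPERM;
	return 0;
}

/* Step 2: an LSM that has been reviewed and has dropped the check. */
int relaxed_sb_mount(const struct user_namespace *mounter_ns)
{
	(void)mounter_ns;
	return 0;
}

/* Call every registered hook in order; the first refusal wins. */
int call_sb_mount_hooks(const sb_mount_hook *hooks, size_t n,
			const struct user_namespace *mounter_ns)
{
	assert(hooks != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		int rc = hooks[i](mounter_ns);
		if (rc != 0)
			return rc;
	}
	return 0;
}
```

As long as any one active LSM still carries the strict hook, a mount
from a non-init user namespace is refused, which is what makes it
possible to convert the LSMs one at a time.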

This should remove the need for your patches 5, 6 and 7 for the
immediate future.

Eric

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-16  1:08       ` Andy Lutomirski
@ 2015-07-16  2:54         ` Casey Schaufler
  2015-07-16  4:47           ` Eric W. Biederman
  0 siblings, 1 reply; 69+ messages in thread
From: Casey Schaufler @ 2015-07-16  2:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Eric W. Biederman, Seth Forshee, Alexander Viro, Linux FS Devel,
	LSM List, SELinux-NSA, Serge Hallyn, linux-kernel

On 7/15/2015 6:08 PM, Andy Lutomirski wrote:
> On Wed, Jul 15, 2015 at 3:39 PM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>> On 7/15/2015 2:06 PM, Eric W. Biederman wrote:
>>> Casey Schaufler <casey@schaufler-ca.com> writes:
>>> The first step needs to be not trusting those labels and treating such
>>> filesystems as filesystems without label support.  I hope that is Seth
>>> has implemented.
>> A filesystem with Smack labels gets mounted in a namespace. The labels
>> are ignored. Instead, the filesystem defaults (potentially specified as
>> mount options smackfsdef="something", but usually the floor label ("_"))
>> are used, giving the user the ability to read everything and (usually)
>> change nothing. This is both dangerous (unintended read access to files)
>> and pointless (can't make changes).
> I don't get it.
>
> If I mount an unprivileged filesystem, then either the contents were
> put there *by me*, in which case letting me access them are fine, or
> (with Seth's patches and then some) I control the backing store, in
> which case I can do whatever I want regardless of what LSM thinks.
>
> So I don't see the problem.  Why would Smack or any other LSM care at
> all, unless it wants to prevent me from mounting the fs in the first
> place?

First off, I don't cotton to the notion that you should be able
to mount filesystems without privilege. But it seems I'm being
outvoted on that. I suspect that there are cases where it might
be safe, but I can't think of one off the top of my head.

If you do mount a filesystem it needs to behave according to the
rules of the system. If you have a security module that uses
attributes on the filesystem you can't ignore them just because
it's "your data". Mandatory access control schemes, including
Smack and SELinux don't give a fig about who you are. It's the
label on the data and the process that matter. If "you" get to
muck the labels up, you've broken the mandatory access control.

> --Andy


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-16  1:05         ` Andy Lutomirski
@ 2015-07-16  2:20           ` Eric W. Biederman
  2015-07-16 13:12           ` Stephen Smalley
  1 sibling, 0 replies; 69+ messages in thread
From: Eric W. Biederman @ 2015-07-16  2:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: SELinux-NSA, Serge Hallyn, Alexander Viro, linux-kernel,
	LSM List, Linux FS Devel, Casey Schaufler, Seth Forshee

Andy Lutomirski <luto@amacapital.net> writes:

> On Jul 15, 2015 3:34 PM, "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>>
>> Seth Forshee <seth.forshee@canonical.com> writes:
>>
>> > On Wed, Jul 15, 2015 at 04:06:35PM -0500, Eric W. Biederman wrote:
>> >> Casey Schaufler <casey@schaufler-ca.com> writes:
>> >>
>> >> > On 7/15/2015 12:46 PM, Seth Forshee wrote:
>> >> >> These are the first in a larger set of patches that I've been working on
>> >> >> (with help from Eric Biederman) to support mounting ext4 and fuse
>> >> >> filesystems from within user namespaces. I've pushed the full series to:
>> >> >>
>> >> >>   git://kernel.ubuntu.com/sforshee/linux.git userns-mounts
>> >> >>
>> >> >> Taking the series as a whole, the strategy is to handle as much of the
>> >> >> heavy lifting as possible in the vfs so the filesystems don't have to
>> >> >> handle weird edge cases. If you look at the full series you'll find that
>> >> >> the changes in ext4 to support user namespace mounts turn out to be
>> >> >> fairly minimal (fuse is a bit more complicated though as it must deal
>> >> >> with translating ids for a userspace process which is running in pid and
>> >> >> user namespaces).
>> >> >>
>> >> >> The patches I'm sending today lay some of the groundwork in the vfs and
>> >> >> related code. They fall into two broad groups:
>> >> >>
>> >> >>  1. Patches 1-2 add s_user_ns and simplify MNT_NODEV handling. These are
>> >> >>     pretty straightforward, and Eric has expressed interest in merging
>> >> >>     these patches soon. Note that patch 2 won't apply cleanly without
>> >> >>     Eric's noexec patches for proc and sys [1].
>> >> >>
>> >> >>  2. Patches 2-7 tighten down security for mounts with s_user_ns !=
>> >> >>     &init_user_ns. This includes updates to how file caps and suid are
>> >> >>     handled and LSM updates to ignore security labels on superblocks
>> >> >>     from non-init namespaces.
>> >> >>
>> >> >>     The LSM changes in particular may not be optimal, as I don't have a
>> >> >>     lot of familiarity with this code, so I'd be especially appreciative
>> >> >>     of review of these changes and suggestions on how to improve them.
>> >> >
>> >> > Lukasz Pawelczyk <l.pawelczyk@samsung.com> proposed
>> >> > LSM support in user namespaces ([RFC] lsm: namespace hooks)
>> >> > that make a whole lot more sense than just turning off
>> >> > the option of using labels on files. Gutting the ability
>> >> > to use MAC in a namespace is a step down the road of
>> >> > making MAC and namespaces incompatible.
>> >>
>> >> This is not "turning off the option to use labels on files".
>> >>
>> >> This is supporting mounting filesystems like ext4 by unprivileged users
>> >> and not trusting the labels they set in the same way as we trust labels
>> >> on filesystems mounted by privileged users.
>> >>
>> >> The first step needs to be not trusting those labels and treating such
>> >> filesystems as filesystems without label support.  I hope that is Seth
>> >> has implemented.
>> >>
>> >> In the long run we can do more interesting things with such filesystems
>> >> once the appropriate LSM policy is in place.
>> >
>> > Yes, this exactly. Right now it looks to me like the only safe thing to
>> > do with mounts from unprivileged users is to ignore the security labels,
>> > so that's what I'm trying to do with these changes. If there's some
>> > better thing to do, or some better way to do it, I'm more than happy to
>> > receive that feedback.
>>
>> Ugh.
>>
>> This made me realize that we have an interesting problem here.  An
>> unprivileged mount of tmpfs probably needs to have
>> s_user_ns == &init_user_ns.
>>
>> Otherwise we will break security labels on tmpfs for no good reason.
>> ramfs and sysfs also seem to have similar concerns.
>>
>> Because they have no backing store we can trust those filesystems with
>> security labels.  Plus for at least sysfs there is the security label
>> bleed through issue, that we need to make certain works.
>>
>> Perhaps these filesystems with trusted backing store need to call
>> "sget_userns(..., &init_user_ns)".
>>
>> If we don't get this right we will have significant regressions with
>> respect to security labels, and that is not ok.
>
> That's only a problem if there's anyone who sets security labels on
> such a mount.  You need global caps to do that (I hope), which
> requires someone outside the userns to help, which means there's a
> good chance that literally no one does this.

Fair enough.  That is however something we need to test.  If no one
puts security labels or file caps on such a mount we can change things.
Otherwise we can't, because it would introduce regressions.
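
For reference, the idea quoted above (filesystems without a
user-controlled backing store keep s_user_ns == &init_user_ns) could be
modelled in userspace like this. It is a sketch under assumptions:
struct fs_type, its kernel_only_store flag, and pick_s_user_ns() are
hypothetical, and nothing here reflects an actual sget_userns()
signature:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for kernel types; the layouts are illustrative. */
struct user_namespace { int level; };
struct user_namespace init_user_ns = { 0 };

/*
 * Hypothetical per-filesystem-type flag: true for tmpfs-, ramfs- or
 * sysfs-like filesystems whose contents only the kernel can write.
 */
struct fs_type {
	const char *name;
	bool kernel_only_store;
};

/*
 * Choose the owner recorded in the superblock. With no backing store
 * for the mounter to tamper with, labels stay trustworthy, so the
 * superblock is owned by the initial namespace.
 */
const struct user_namespace *
pick_s_user_ns(const struct fs_type *type,
	       const struct user_namespace *mounter_ns)
{
	assert(type != NULL && mounter_ns != NULL);
	if (type->kernel_only_store)
		return &init_user_ns;
	return mounter_ns;
}
```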

Eric

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-15 22:39     ` Casey Schaufler
@ 2015-07-16  1:08       ` Andy Lutomirski
  2015-07-16  2:54         ` Casey Schaufler
  0 siblings, 1 reply; 69+ messages in thread
From: Andy Lutomirski @ 2015-07-16  1:08 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Eric W. Biederman, Seth Forshee, Alexander Viro, Linux FS Devel,
	LSM List, SELinux-NSA, Serge Hallyn, linux-kernel

On Wed, Jul 15, 2015 at 3:39 PM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> On 7/15/2015 2:06 PM, Eric W. Biederman wrote:
>> Casey Schaufler <casey@schaufler-ca.com> writes:
>
>> The first step needs to be not trusting those labels and treating such
>> filesystems as filesystems without label support.  I hope that is Seth
>> has implemented.
>
> A filesystem with Smack labels gets mounted in a namespace. The labels
> are ignored. Instead, the filesystem defaults (potentially specified as
> mount options smackfsdef="something", but usually the floor label ("_"))
> are used, giving the user the ability to read everything and (usually)
> change nothing. This is both dangerous (unintended read access to files)
> and pointless (can't make changes).

I don't get it.

If I mount an unprivileged filesystem, then either the contents were
put there *by me*, in which case letting me access them is fine, or
(with Seth's patches and then some) I control the backing store, in
which case I can do whatever I want regardless of what LSM thinks.

So I don't see the problem.  Why would Smack or any other LSM care at
all, unless it wants to prevent me from mounting the fs in the first
place?

--Andy

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-15 22:28       ` Eric W. Biederman
@ 2015-07-16  1:05         ` Andy Lutomirski
  2015-07-16  2:20           ` Eric W. Biederman
  2015-07-16 13:12           ` Stephen Smalley
  0 siblings, 2 replies; 69+ messages in thread
From: Andy Lutomirski @ 2015-07-16  1:05 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: SELinux-NSA, Serge Hallyn, Alexander Viro, linux-kernel,
	LSM List, Linux FS Devel, Casey Schaufler, Seth Forshee

On Jul 15, 2015 3:34 PM, "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>
> Seth Forshee <seth.forshee@canonical.com> writes:
>
> > On Wed, Jul 15, 2015 at 04:06:35PM -0500, Eric W. Biederman wrote:
> >> Casey Schaufler <casey@schaufler-ca.com> writes:
> >>
> >> > On 7/15/2015 12:46 PM, Seth Forshee wrote:
> >> >> These are the first in a larger set of patches that I've been working on
> >> >> (with help from Eric Biederman) to support mounting ext4 and fuse
> >> >> filesystems from within user namespaces. I've pushed the full series to:
> >> >>
> >> >>   git://kernel.ubuntu.com/sforshee/linux.git userns-mounts
> >> >>
> >> >> Taking the series as a whole, the strategy is to handle as much of the
> >> >> heavy lifting as possible in the vfs so the filesystems don't have to
> >> >> handle weird edge cases. If you look at the full series you'll find that
> >> >> the changes in ext4 to support user namespace mounts turn out to be
> >> >> fairly minimal (fuse is a bit more complicated though as it must deal
> >> >> with translating ids for a userspace process which is running in pid and
> >> >> user namespaces).
> >> >>
> >> >> The patches I'm sending today lay some of the groundwork in the vfs and
> >> >> related code. They fall into two broad groups:
> >> >>
> >> >>  1. Patches 1-2 add s_user_ns and simplify MNT_NODEV handling. These are
> >> >>     pretty straightforward, and Eric has expressed interest in merging
> >> >>     these patches soon. Note that patch 2 won't apply cleanly without
> >> >>     Eric's noexec patches for proc and sys [1].
> >> >>
> >> >>  2. Patches 2-7 tighten down security for mounts with s_user_ns !=
> >> >>     &init_user_ns. This includes updates to how file caps and suid are
> >> >>     handled and LSM updates to ignore security labels on superblocks
> >> >>     from non-init namespaces.
> >> >>
> >> >>     The LSM changes in particular may not be optimal, as I don't have a
> >> >>     lot of familiarity with this code, so I'd be especially appreciative
> >> >>     of review of these changes and suggestions on how to improve them.
> >> >
> >> > Lukasz Pawelczyk <l.pawelczyk@samsung.com> proposed
> >> > LSM support in user namespaces ([RFC] lsm: namespace hooks)
> >> > that make a whole lot more sense than just turning off
> >> > the option of using labels on files. Gutting the ability
> >> > to use MAC in a namespace is a step down the road of
> >> > making MAC and namespaces incompatible.
> >>
> >> This is not "turning off the option to use labels on files".
> >>
> >> This is supporting mounting filesystems like ext4 by unprivileged users
> >> and not trusting the labels they set in the same way as we trust labels
> >> on filesystems mounted by privileged users.
> >>
> >> The first step needs to be not trusting those labels and treating such
> >> filesystems as filesystems without label support.  I hope that is Seth
> >> has implemented.
> >>
> >> In the long run we can do more interesting things with such filesystems
> >> once the appropriate LSM policy is in place.
> >
> > Yes, this exactly. Right now it looks to me like the only safe thing to
> > do with mounts from unprivileged users is to ignore the security labels,
> > so that's what I'm trying to do with these changes. If there's some
> > better thing to do, or some better way to do it, I'm more than happy to
> > receive that feedback.
>
> Ugh.
>
> This made me realize that we have an interesting problem here.  An
> unprivileged mount of tmpfs probably needs to have
> s_user_ns == &init_user_ns.
>
> Otherwise we will break security labels on tmpfs for no good reason.
> ramfs and sysfs also seem to have similar concerns.
>
> Because they have no backing store we can trust those filesystems with
> security labels.  Plus for at least sysfs there is the security label
> bleed through issue, that we need to make certain works.
>
> Perhaps these filesystems with trusted backing store need to call
> "sget_userns(..., &init_user_ns)".
>
> If we don't get this right we will have significant regressions with
> respect to security labels, and that is not ok.

That's only a problem if there's anyone who sets security labels on
such a mount.  You need global caps to do that (I hope), which
requires someone outside the userns to help, which means there's a
good chance that literally no one does this.

--Andy

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-15 21:48     ` Seth Forshee
  2015-07-15 22:28       ` Eric W. Biederman
@ 2015-07-15 23:04       ` Casey Schaufler
  1 sibling, 0 replies; 69+ messages in thread
From: Casey Schaufler @ 2015-07-15 23:04 UTC (permalink / raw)
  To: Seth Forshee, Eric W. Biederman
  Cc: Alexander Viro, linux-fsdevel, linux-security-module, selinux,
	Serge Hallyn, Andy Lutomirski, linux-kernel

On 7/15/2015 2:48 PM, Seth Forshee wrote:
> On Wed, Jul 15, 2015 at 04:06:35PM -0500, Eric W. Biederman wrote:
>> Casey Schaufler <casey@schaufler-ca.com> writes:
>>
>>> On 7/15/2015 12:46 PM, Seth Forshee wrote:
>>>> These are the first in a larger set of patches that I've been working on
>>>> (with help from Eric Biederman) to support mounting ext4 and fuse
>>>> filesystems from within user namespaces. I've pushed the full series to:
>>>>
>>>>   git://kernel.ubuntu.com/sforshee/linux.git userns-mounts
>>>>
>>>> Taking the series as a whole, the strategy is to handle as much of the
>>>> heavy lifting as possible in the vfs so the filesystems don't have to
>>>> handle weird edge cases. If you look at the full series you'll find that
>>>> the changes in ext4 to support user namespace mounts turn out to be
>>>> fairly minimal (fuse is a bit more complicated though as it must deal
>>>> with translating ids for a userspace process which is running in pid and
>>>> user namespaces).
>>>>
>>>> The patches I'm sending today lay some of the groundwork in the vfs and
>>>> related code. They fall into two broad groups:
>>>>
>>>>  1. Patches 1-2 add s_user_ns and simplify MNT_NODEV handling. These are
>>>>     pretty straightforward, and Eric has expressed interest in merging
>>>>     these patches soon. Note that patch 2 won't apply cleanly without
>>>>     Eric's noexec patches for proc and sys [1].
>>>>
>>>>  2. Patches 2-7 tighten down security for mounts with s_user_ns !=
>>>>     &init_user_ns. This includes updates to how file caps and suid are
>>>>     handled and LSM updates to ignore security labels on superblocks
>>>>     from non-init namespaces.
>>>>
>>>>     The LSM changes in particular may not be optimal, as I don't have a
>>>>     lot of familiarity with this code, so I'd be especially appreciative
>>>>     of review of these changes and suggestions on how to improve them.
>>> Lukasz Pawelczyk <l.pawelczyk@samsung.com> proposed
>>> LSM support in user namespaces ([RFC] lsm: namespace hooks)
>>> that make a whole lot more sense than just turning off
>>> the option of using labels on files. Gutting the ability
>>> to use MAC in a namespace is a step down the road of
>>> making MAC and namespaces incompatible.
>> This is not "turning off the option to use labels on files".
>>
>> This is supporting mounting filesystems like ext4 by unprivileged users
>> and not trusting the labels they set in the same way as we trust labels
>> on filesystems mounted by privileged users.
>>
>> The first step needs to be not trusting those labels and treating such
>> filesystems as filesystems without label support.  I hope that is Seth
>> has implemented.
>>
>> In the long run we can do more interesting things with such filesystems
>> once the appropriate LSM policy is in place.
> Yes, this exactly. Right now it looks to me like the only safe thing to
> do with mounts from unprivileged users is to ignore the security labels,
> so that's what I'm trying to do with these changes. If there's some
> better thing to do, or some better way to do it, I'm more than happy to
> receive that feedback.

If you ignore Smack labels you get a system that is broken.
Without specifying Smack mount options (requires CAP_MAC_ADMIN)
all your files will be labeled with the floor ("_") label. Unless
you're running with the floor label (Smack systems generally don't)
there won't be anything you can write to. You will be able to read
everything, which is also something you're unlikely to want. Like
I said, broken.
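
To make that concrete, here is a simplified userspace model of the
default Smack access rules. It is only a sketch: toy_smack_allows() and
the enum are hypothetical names, explicitly loaded rules are left out,
and only the built-in handling of the "_", "^" and "*" labels plus the
same-label rule is modelled:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum toy_access { TOY_READ, TOY_WRITE, TOY_EXEC };

/*
 * Simplified model of the built-in Smack rules, checked in order.
 * Rules loaded by the administrator are deliberately not modelled.
 */
bool toy_smack_allows(const char *subject, const char *object,
		      enum toy_access want)
{
	assert(subject != NULL && object != NULL);
	bool read_like = (want == TOY_READ || want == TOY_EXEC);

	if (strcmp(subject, "*") == 0)
		return false;		/* star subjects get nothing */
	if (strcmp(subject, "^") == 0 && read_like)
		return true;		/* hat subjects may read anything */
	if (strcmp(object, "_") == 0 && read_like)
		return true;		/* floor objects are world-readable */
	if (strcmp(object, "*") == 0)
		return true;		/* star objects allow any access */
	if (strcmp(subject, object) == 0)
		return true;		/* same label: any access */
	return false;			/* everything else is denied */
}
```

With every file on the mount forced to "_", a task labeled "User" passes
the read check on all of them and fails the write check on all of them.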

Personally, I don't believe that the goal of supporting
unprivileged mounts is especially sane. I am willing to
be educated, but I don't see a rational solution.

> Seth


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-15 21:06   ` Eric W. Biederman
  2015-07-15 21:48     ` Seth Forshee
@ 2015-07-15 22:39     ` Casey Schaufler
  2015-07-16  1:08       ` Andy Lutomirski
  2015-07-16 11:16     ` Lukasz Pawelczyk
  2 siblings, 1 reply; 69+ messages in thread
From: Casey Schaufler @ 2015-07-15 22:39 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Seth Forshee, Alexander Viro, linux-fsdevel,
	linux-security-module, selinux, Serge Hallyn, Andy Lutomirski,
	linux-kernel

On 7/15/2015 2:06 PM, Eric W. Biederman wrote:
> Casey Schaufler <casey@schaufler-ca.com> writes:
>
>> On 7/15/2015 12:46 PM, Seth Forshee wrote:
>>> These are the first in a larger set of patches that I've been working on
>>> (with help from Eric Biederman) to support mounting ext4 and fuse
>>> filesystems from within user namespaces. I've pushed the full series to:
>>>
>>>   git://kernel.ubuntu.com/sforshee/linux.git userns-mounts
>>>
>>> Taking the series as a whole, the strategy is to handle as much of the
>>> heavy lifting as possible in the vfs so the filesystems don't have to
>>> handle weird edge cases. If you look at the full series you'll find that
>>> the changes in ext4 to support user namespace mounts turn out to be
>>> fairly minimal (fuse is a bit more complicated though as it must deal
>>> with translating ids for a userspace process which is running in pid and
>>> user namespaces).
>>>
>>> The patches I'm sending today lay some of the groundwork in the vfs and
>>> related code. They fall into two broad groups:
>>>
>>>  1. Patches 1-2 add s_user_ns and simplify MNT_NODEV handling. These are
>>>     pretty straightforward, and Eric has expressed interest in merging
>>>     these patches soon. Note that patch 2 won't apply cleanly without
>>>     Eric's noexec patches for proc and sys [1].
>>>
>>>  2. Patches 2-7 tighten down security for mounts with s_user_ns !=
>>>     &init_user_ns. This includes updates to how file caps and suid are
>>>     handled and LSM updates to ignore security labels on superblocks
>>>     from non-init namespaces.
>>>
>>>     The LSM changes in particular may not be optimal, as I don't have a
>>>     lot of familiarity with this code, so I'd be especially appreciative
>>>     of review of these changes and suggestions on how to improve them.
>> Lukasz Pawelczyk <l.pawelczyk@samsung.com> proposed
>> LSM support in user namespaces ([RFC] lsm: namespace hooks)
>> that make a whole lot more sense than just turning off
>> the option of using labels on files. Gutting the ability
>> to use MAC in a namespace is a step down the road of
>> making MAC and namespaces incompatible.
> This is not "turning off the option to use labels on files".

It gives an unprivileged user the ability to ignore
the Smack labels that are on files and to create files
with labels that do not match the rules laid down by the
security module.

> This is supporting mounting filesystems like ext4 by unprivileged users
> and not trusting the labels they set in the same way as we trust labels
> on filesystems mounted by privileged users.

OK, you don't trust the metadata on a filesystem mounted by an untrusted
user. That's fair. 


> The first step needs to be not trusting those labels and treating such
> filesystems as filesystems without label support.  I hope that is Seth
> has implemented.

A filesystem with Smack labels gets mounted in a namespace. The labels
are ignored. Instead, the filesystem defaults (potentially specified as
mount options smackfsdef="something", but usually the floor label ("_"))
are used, giving the user the ability to read everything and (usually)
change nothing. This is both dangerous (unintended read access to files)
and pointless (can't make changes).

I can't speak authoritatively for SELinux, but it looks to me like you
may have similar issues there.

> In the long run we can do more interesting things with such filesystems
> once the appropriate LSM policy is in place.

The problem is not that the short term behavior is uninteresting,
it's that it is broken. Mounting a filesystem with xattrs and ignoring
those xattrs results in incorrect access control decisions.

> Getting s_user_ns present on struct super, properly set, and all of the
> appropriate checks against it present in the vfs so that filesystems
> don't need to duplicate logic is important if we are going do more
> interesting things with user namespaces (as users have been asking for).

OK, but the fact that someone wants to do something they shouldn't
doesn't mean you get to break things that work now to accommodate
them. There are reasons why mounting filesystems requires privilege!

> It is important for things as small as making it safe to allow
> truly unprivileged users to mount fuse filesystems.

If it isn't safe you shouldn't be doing it, even if it's "small"
and something that would make life easier for some set of users.

> I am on the fence with Lukasz Pawelczyk's patches.  Some parts I liked
> some parts I had issues with.  As I recall one of my issues was that
> those patches conflicted in detail if not in principle with this
> approach.
>
> If these patches do not do a good job of laying the ground work for
> supporting security labels that unprivileged users can set then Seth
> could really use some feedback.  Figuring out how to properly deal with
> the LSMs has been one of his challenges.

The feedback is that you can't pick and
choose when you are going to pay attention to the security attributes
on a filesystem. It's possible that it will work out the way you want
it, but it probably won't. Smack doesn't allow you to choose if you're
using xattrs. SELinux does, but certainly doesn't expect you to be
flipping it on and off. I'm not convinced that it's safe to do for
capability sets, either, but I'm not up to arguing PIxFE+ vector
calculations just now.

> I am hoping I can finish working through the patches to fix the
> semantics of rename and bind mounts before the next merge window opens,
> so I can have enough cycles to lift the feature freeze on user
> namespaces.  Except for maybe his first two patches (which fix a small
> userspace API breakage) none of Seth's patches get to go in until I lift
> the freeze.

Thanks. I know (believe me, I know) how frustrating it can be when
you get the big NAK on something that seems like it's addressed.
Unfortunately, the proposed approach (not just the specifics of
implementation) does not work. 

> Which is probably too much information but I hope this makes it clear
> that the point of this work is as an enabler for future developments,
> not as something to make user namespaces and LSMs incompatible.

I am paranoid, but not to the extent that I think anyone
is trying to break the interaction between security modules
and namespaces. Having worked with Lukasz on his security
namespace patches it is clear to me that this is not a simple
problem and that it is unlikely to have the simple solution
everyone would like to see. I also don't see an intermediate
state that works while the "real" solution is being refined.
As always, I'm willing to be proven wrong.

> Eric


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-15 21:48     ` Seth Forshee
@ 2015-07-15 22:28       ` Eric W. Biederman
  2015-07-16  1:05         ` Andy Lutomirski
  2015-07-15 23:04       ` Casey Schaufler
  1 sibling, 1 reply; 69+ messages in thread
From: Eric W. Biederman @ 2015-07-15 22:28 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Casey Schaufler, Alexander Viro, linux-fsdevel,
	linux-security-module, selinux, Serge Hallyn, Andy Lutomirski,
	linux-kernel

Seth Forshee <seth.forshee@canonical.com> writes:

> On Wed, Jul 15, 2015 at 04:06:35PM -0500, Eric W. Biederman wrote:
>> Casey Schaufler <casey@schaufler-ca.com> writes:
>> 
>> > On 7/15/2015 12:46 PM, Seth Forshee wrote:
>> >> These are the first in a larger set of patches that I've been working on
>> >> (with help from Eric Biederman) to support mounting ext4 and fuse
>> >> filesystems from within user namespaces. I've pushed the full series to:
>> >>
>> >>   git://kernel.ubuntu.com/sforshee/linux.git userns-mounts
>> >>
>> >> Taking the series as a whole, the strategy is to handle as much of the
>> >> heavy lifting as possible in the vfs so the filesystems don't have to
>> >> handle weird edge cases. If you look at the full series you'll find that
>> >> the changes in ext4 to support user namespace mounts turn out to be
>> >> fairly minimal (fuse is a bit more complicated though as it must deal
>> >> with translating ids for a userspace process which is running in pid and
>> >> user namespaces).
>> >>
>> >> The patches I'm sending today lay some of the groundwork in the vfs and
>> >> related code. They fall into two broad groups:
>> >>
>> >>  1. Patches 1-2 add s_user_ns and simplify MNT_NODEV handling. These are
>> >>     pretty straightforward, and Eric has expressed interest in merging
>> >>     these patches soon. Note that patch 2 won't apply cleanly without
>> >>     Eric's noexec patches for proc and sys [1].
>> >>
>> >>  2. Patches 2-7 tighten down security for mounts with s_user_ns !=
>> >>     &init_user_ns. This includes updates to how file caps and suid are
>> >>     handled and LSM updates to ignore security labels on superblocks
>> >>     from non-init namespaces.
>> >>
>> >>     The LSM changes in particular may not be optimal, as I don't have a
>> >>     lot of familiarity with this code, so I'd be especially appreciative
>> >>     of review of these changes and suggestions on how to improve them.
>> >
>> > Lukasz Pawelczyk <l.pawelczyk@samsung.com> proposed
>> > LSM support in user namespaces ([RFC] lsm: namespace hooks)
>> > that make a whole lot more sense than just turning off
>> > the option of using labels on files. Gutting the ability
>> > to use MAC in a namespace is a step down the road of
>> > making MAC and namespaces incompatible.
>> 
>> This is not "turning off the option to use labels on files".
>> 
>> This is supporting mounting filesystems like ext4 by unprivileged users
>> and not trusting the labels they set in the same way as we trust labels
>> on filesystems mounted by privileged users.
>> 
>> The first step needs to be not trusting those labels and treating such
>> filesystems as filesystems without label support.  I hope that is what
>> Seth has implemented.
>> 
>> In the long run we can do more interesting things with such filesystems
>> once the appropriate LSM policy is in place.
>
> Yes, this exactly. Right now it looks to me like the only safe thing to
> do with mounts from unprivileged users is to ignore the security labels,
> so that's what I'm trying to do with these changes. If there's some
> better thing to do, or some better way to do it, I'm more than happy to
> receive that feedback.

Ugh.

This made me realize that we have an interesting problem here.  An
unprivileged mount of tmpfs probably needs to have
s_user_ns == &init_user_ns.

Otherwise we will break security labels on tmpfs for no good reason.
ramfs and sysfs also seem to have similar concerns.

Because they have no backing store we can trust those filesystems with
security labels.  Plus for at least sysfs there is the security label
bleed through issue, that we need to make certain works.

Perhaps these filesystems with trusted backing store need to call
"sget_userns(..., &init_user_ns)".
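A minimal userspace sketch of the idea above, not kernel code: the struct
definitions are stand-ins, and the kernel_backed flag and the model_*
helper names are hypothetical. It assumes a filesystem type can declare
that it has no user-controlled backing store, in which case the superblock
records &init_user_ns instead of the mounter's namespace, so its labels
stay trusted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types; not the real definitions. */
struct user_namespace { int level; };
struct super_block { const struct user_namespace *s_user_ns; };

static const struct user_namespace init_user_ns = { 0 };

/* Hypothetical per-filesystem flag: true when the filesystem has no
 * user-controlled backing store (tmpfs, ramfs, sysfs in the text above). */
struct fs_type_model {
    const char *name;
    bool kernel_backed;
};

/* Pick the namespace recorded in s_user_ns when a superblock is created.
 * Kernel-backed filesystems always get &init_user_ns; everything else
 * is owned by the namespace of whoever mounted it. */
static void model_set_sb_userns(struct super_block *sb,
                                const struct fs_type_model *fs,
                                const struct user_namespace *mounter_ns)
{
    assert(sb != NULL && fs != NULL && mounter_ns != NULL);
    sb->s_user_ns = fs->kernel_backed ? &init_user_ns : mounter_ns;
}

/* Labels are trusted only on superblocks owned by the initial namespace. */
static bool model_labels_trusted(const struct super_block *sb)
{
    return sb->s_user_ns == &init_user_ns;
}
```

With this model, a tmpfs mounted from a child namespace still reports
trusted labels, while an ext4 image mounted from the same namespace does
not.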

If we don't get this right we will have significant regressions with
respect to security labels, and that is not ok.

Eric

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-15 21:06   ` Eric W. Biederman
@ 2015-07-15 21:48     ` Seth Forshee
  2015-07-15 22:28       ` Eric W. Biederman
  2015-07-15 23:04       ` Casey Schaufler
  2015-07-15 22:39     ` Casey Schaufler
  2015-07-16 11:16     ` Lukasz Pawelczyk
  2 siblings, 2 replies; 69+ messages in thread
From: Seth Forshee @ 2015-07-15 21:48 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Casey Schaufler, Alexander Viro, linux-fsdevel,
	linux-security-module, selinux, Serge Hallyn, Andy Lutomirski,
	linux-kernel

On Wed, Jul 15, 2015 at 04:06:35PM -0500, Eric W. Biederman wrote:
> Casey Schaufler <casey@schaufler-ca.com> writes:
> 
> > On 7/15/2015 12:46 PM, Seth Forshee wrote:
> >> These are the first in a larger set of patches that I've been working on
> >> (with help from Eric Biederman) to support mounting ext4 and fuse
> >> filesystems from within user namespaces. I've pushed the full series to:
> >>
> >>   git://kernel.ubuntu.com/sforshee/linux.git userns-mounts
> >>
> >> Taking the series as a whole, the strategy is to handle as much of the
> >> heavy lifting as possible in the vfs so the filesystems don't have to
> >> handle weird edge cases. If you look at the full series you'll find that
> >> the changes in ext4 to support user namespace mounts turn out to be
> >> fairly minimal (fuse is a bit more complicated though as it must deal
> >> with translating ids for a userspace process which is running in pid and
> >> user namespaces).
> >>
> >> The patches I'm sending today lay some of the groundwork in the vfs and
> >> related code. They fall into two broad groups:
> >>
> >>  1. Patches 1-2 add s_user_ns and simplify MNT_NODEV handling. These are
> >>     pretty straightforward, and Eric has expressed interest in merging
> >>     these patches soon. Note that patch 2 won't apply cleanly without
> >>     Eric's noexec patches for proc and sys [1].
> >>
> >>  2. Patches 3-7 tighten down security for mounts with s_user_ns !=
> >>     &init_user_ns. This includes updates to how file caps and suid are
> >>     handled and LSM updates to ignore security labels on superblocks
> >>     from non-init namespaces.
> >>
> >>     The LSM changes in particular may not be optimal, as I don't have a
> >>     lot of familiarity with this code, so I'd be especially appreciative
> >>     of review of these changes and suggestions on how to improve them.
> >
> > Lukasz Pawelczyk <l.pawelczyk@samsung.com> proposed
> > LSM support in user namespaces ([RFC] lsm: namespace hooks)
> > that make a whole lot more sense than just turning off
> > the option of using labels on files. Gutting the ability
> > to use MAC in a namespace is a step down the road of
> > making MAC and namespaces incompatible.
> 
> This is not "turning off the option to use labels on files".
> 
> This is supporting mounting filesystems like ext4 by unprivileged users
> and not trusting the labels they set in the same way as we trust labels
> on filesystems mounted by privileged users.
> 
> The first step needs to be not trusting those labels and treating such
> filesystems as filesystems without label support.  I hope that is what
> Seth has implemented.
> 
> In the long run we can do more interesting things with such filesystems
> once the appropriate LSM policy is in place.

Yes, this exactly. Right now it looks to me like the only safe thing to
do with mounts from unprivileged users is to ignore the security labels,
so that's what I'm trying to do with these changes. If there's some
better thing to do, or some better way to do it, I'm more than happy to
receive that feedback.

Seth

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-15 20:36 ` Casey Schaufler
@ 2015-07-15 21:06   ` Eric W. Biederman
  2015-07-15 21:48     ` Seth Forshee
                       ` (2 more replies)
  0 siblings, 3 replies; 69+ messages in thread
From: Eric W. Biederman @ 2015-07-15 21:06 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Seth Forshee, Alexander Viro, linux-fsdevel,
	linux-security-module, selinux, Serge Hallyn, Andy Lutomirski,
	linux-kernel

Casey Schaufler <casey@schaufler-ca.com> writes:

> On 7/15/2015 12:46 PM, Seth Forshee wrote:
>> These are the first in a larger set of patches that I've been working on
>> (with help from Eric Biederman) to support mounting ext4 and fuse
>> filesystems from within user namespaces. I've pushed the full series to:
>>
>>   git://kernel.ubuntu.com/sforshee/linux.git userns-mounts
>>
>> Taking the series as a whole, the strategy is to handle as much of the
>> heavy lifting as possible in the vfs so the filesystems don't have to
>> handle weird edge cases. If you look at the full series you'll find that
>> the changes in ext4 to support user namespace mounts turn out to be
>> fairly minimal (fuse is a bit more complicated though as it must deal
>> with translating ids for a userspace process which is running in pid and
>> user namespaces).
>>
>> The patches I'm sending today lay some of the groundwork in the vfs and
>> related code. They fall into two broad groups:
>>
>>  1. Patches 1-2 add s_user_ns and simplify MNT_NODEV handling. These are
>>     pretty straightforward, and Eric has expressed interest in merging
>>     these patches soon. Note that patch 2 won't apply cleanly without
>>     Eric's noexec patches for proc and sys [1].
>>
>>  2. Patches 3-7 tighten down security for mounts with s_user_ns !=
>>     &init_user_ns. This includes updates to how file caps and suid are
>>     handled and LSM updates to ignore security labels on superblocks
>>     from non-init namespaces.
>>
>>     The LSM changes in particular may not be optimal, as I don't have a
>>     lot of familiarity with this code, so I'd be especially appreciative
>>     of review of these changes and suggestions on how to improve them.
>
> Lukasz Pawelczyk <l.pawelczyk@samsung.com> proposed
> LSM support in user namespaces ([RFC] lsm: namespace hooks)
> that make a whole lot more sense than just turning off
> the option of using labels on files. Gutting the ability
> to use MAC in a namespace is a step down the road of
> making MAC and namespaces incompatible.

This is not "turning off the option to use labels on files".

This is supporting mounting filesystems like ext4 by unprivileged users
and not trusting the labels they set in the same way as we trust labels
on filesystems mounted by privileged users.

The first step needs to be not trusting those labels and treating such
filesystems as filesystems without label support.  I hope that is what
Seth has implemented.

In the long run we can do more interesting things with such filesystems
once the appropriate LSM policy is in place.
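A minimal userspace sketch of "treating such filesystems as filesystems
without label support", not the code in the series: the struct
definitions are stand-ins, and inode_model, model_effective_label and the
default_label parameter are hypothetical names. It assumes the LSM falls
back to a default label whenever the superblock is owned by a non-init
user namespace, exactly as it would for a file with no label at all.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace stand-ins for the kernel types; not the real definitions. */
struct user_namespace { int level; };
static const struct user_namespace init_user_ns = { 0 };

struct super_block { const struct user_namespace *s_user_ns; };

/* Hypothetical model of an inode: on_disk_label is NULL when the file
 * carries no security xattr. */
struct inode_model {
    const struct super_block *sb;
    const char *on_disk_label;
};

/* Return the label an LSM would act on in this model. On-disk labels are
 * used only when the superblock belongs to the initial user namespace;
 * otherwise the file is handled as if the filesystem had no labels. */
static const char *model_effective_label(const struct inode_model *inode,
                                         const char *default_label)
{
    assert(inode != NULL && inode->sb != NULL && default_label != NULL);
    if (inode->sb->s_user_ns != &init_user_ns)
        return default_label;   /* untrusted mount: ignore the xattr */
    return inode->on_disk_label ? inode->on_disk_label : default_label;
}
```

The same on-disk label therefore yields different results depending on
who owns the superblock, which is the behaviour described above.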

Getting s_user_ns present on struct super, properly set, and all of the
appropriate checks against it present in the vfs so that filesystems
don't need to duplicate logic is important if we are going to do more
interesting things with user namespaces (as users have been asking for).

It is important for things as small as making it safe to allow
truly unprivileged users to mount fuse filesystems.

I am on the fence with Lukasz Pawelczyk's patches.  Some parts I liked,
some parts I had issues with.  As I recall one of my issues was that
those patches conflicted in detail if not in principle with this
approach.

If these patches do not do a good job of laying the groundwork for
supporting security labels that unprivileged users can set then Seth
could really use some feedback.  Figuring out how to properly deal with
the LSMs has been one of his challenges.

I am hoping I can finish working through the patches to fix the
semantics of rename and bind mounts before the next merge window opens,
so I can have enough cycles to lift the feature freeze on user
namespaces.  Except for maybe his first two patches (which fix a small
userspace API breakage) none of Seth's patches get to go in until I lift
the freeze.

Which is probably too much information but I hope this makes it clear
that the point of this work is as an enabler for future developments,
not as something to make user namespaces and LSMs incompatible.

Eric


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/7] Initial support for user namespace owned mounts
  2015-07-15 19:46 Seth Forshee
@ 2015-07-15 20:36 ` Casey Schaufler
  2015-07-15 21:06   ` Eric W. Biederman
  2015-07-16  3:15 ` Eric W. Biederman
  1 sibling, 1 reply; 69+ messages in thread
From: Casey Schaufler @ 2015-07-15 20:36 UTC (permalink / raw)
  To: Seth Forshee, Eric W. Biederman, Alexander Viro, linux-fsdevel,
	linux-security-module, selinux
  Cc: Serge Hallyn, Andy Lutomirski, linux-kernel

On 7/15/2015 12:46 PM, Seth Forshee wrote:
> These are the first in a larger set of patches that I've been working on
> (with help from Eric Biederman) to support mounting ext4 and fuse
> filesystems from within user namespaces. I've pushed the full series to:
>
>   git://kernel.ubuntu.com/sforshee/linux.git userns-mounts
>
> Taking the series as a whole, the strategy is to handle as much of the
> heavy lifting as possible in the vfs so the filesystems don't have to
> handle weird edge cases. If you look at the full series you'll find that
> the changes in ext4 to support user namespace mounts turn out to be
> fairly minimal (fuse is a bit more complicated though as it must deal
> with translating ids for a userspace process which is running in pid and
> user namespaces).
>
> The patches I'm sending today lay some of the groundwork in the vfs and
> related code. They fall into two broad groups:
>
>  1. Patches 1-2 add s_user_ns and simplify MNT_NODEV handling. These are
>     pretty straightforward, and Eric has expressed interest in merging
>     these patches soon. Note that patch 2 won't apply cleanly without
>     Eric's noexec patches for proc and sys [1].
>
>  2. Patches 3-7 tighten down security for mounts with s_user_ns !=
>     &init_user_ns. This includes updates to how file caps and suid are
>     handled and LSM updates to ignore security labels on superblocks
>     from non-init namespaces.
>
>     The LSM changes in particular may not be optimal, as I don't have a
>     lot of familiarity with this code, so I'd be especially appreciative
>     of review of these changes and suggestions on how to improve them.

Lukasz Pawelczyk <l.pawelczyk@samsung.com> proposed
LSM support in user namespaces ([RFC] lsm: namespace hooks)
that make a whole lot more sense than just turning off
the option of using labels on files. Gutting the ability
to use MAC in a namespace is a step down the road of
making MAC and namespaces incompatible.



>
> Subsequent patches will update the vfs for id translation, handling
> various corner cases, giving privileges to the user namespace which owns
> a superblock, and finally supporting user namespace mounts for ext4 and
> fuse.
>
> Thanks,
> Seth
>
> [1] http://lkml.kernel.org/r/87mvz4yomp.fsf_-_@x220.int.ebiederm.org
>
>
> Andy Lutomirski (1):
>   fs: Treat foreign mounts as nosuid
>
> Eric W. Biederman (1):
>   userns: Simpilify MNT_NODEV handling.
>
> Seth Forshee (5):
>   fs: Add user namesapace member to struct super_block
>   fs: Ignore file caps in mounts from other user namespaces
>   security: Restrict security attribute updates for userns mounts
>   selinux: Ignore security labels on user namespace mounts
>   smack: Don't use security labels for user namespace mounts
>
>  fs/block_dev.c                 |  2 +-
>  fs/exec.c                      |  2 +-
>  fs/namei.c                     |  9 ++++++++-
>  fs/namespace.c                 | 34 ++++++++++++++++++++--------------
>  fs/proc/root.c                 |  3 ++-
>  fs/super.c                     | 38 +++++++++++++++++++++++++++++++++-----
>  include/linux/fs.h             |  9 +++++++++
>  include/linux/mount.h          |  1 +
>  include/linux/user_namespace.h |  8 ++++++++
>  kernel/user_namespace.c        | 14 ++++++++++++++
>  security/commoncap.c           |  4 +++-
>  security/security.c            | 10 +++++++++-
>  security/selinux/hooks.c       | 16 +++++++++++++++-
>  security/smack/smack_lsm.c     | 12 ++++++++++--
>  14 files changed, 134 insertions(+), 28 deletions(-)
>


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 0/7] Initial support for user namespace owned mounts
@ 2015-07-15 19:46 Seth Forshee
  2015-07-15 20:36 ` Casey Schaufler
  2015-07-16  3:15 ` Eric W. Biederman
  0 siblings, 2 replies; 69+ messages in thread
From: Seth Forshee @ 2015-07-15 19:46 UTC (permalink / raw)
  To: Eric W. Biederman, Alexander Viro, linux-fsdevel,
	linux-security-module, selinux
  Cc: Serge Hallyn, Andy Lutomirski, Seth Forshee, linux-kernel

These are the first in a larger set of patches that I've been working on
(with help from Eric Biederman) to support mounting ext4 and fuse
filesystems from within user namespaces. I've pushed the full series to:

  git://kernel.ubuntu.com/sforshee/linux.git userns-mounts

Taking the series as a whole, the strategy is to handle as much of the
heavy lifting as possible in the vfs so the filesystems don't have to
handle weird edge cases. If you look at the full series you'll find that
the changes in ext4 to support user namespace mounts turn out to be
fairly minimal (fuse is a bit more complicated though as it must deal
with translating ids for a userspace process which is running in pid and
user namespaces).

The patches I'm sending today lay some of the groundwork in the vfs and
related code. They fall into two broad groups:

 1. Patches 1-2 add s_user_ns and simplify MNT_NODEV handling. These are
    pretty straightforward, and Eric has expressed interest in merging
    these patches soon. Note that patch 2 won't apply cleanly without
    Eric's noexec patches for proc and sys [1].

 2. Patches 3-7 tighten down security for mounts with s_user_ns !=
    &init_user_ns. This includes updates to how file caps and suid are
    handled and LSM updates to ignore security labels on superblocks
    from non-init namespaces.

    The LSM changes in particular may not be optimal, as I don't have a
    lot of familiarity with this code, so I'd be especially appreciative
    of review of these changes and suggestions on how to improve them.
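A minimal userspace sketch of the suid side of group 2, assuming a
simplified rule that is not taken from the patches themselves: setuid
bits are honored only when the mount is not nosuid and the task's user
namespace is the superblock's s_user_ns or one of its descendants. The
struct definitions are stand-ins and the model_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types; not the real definitions. */
struct user_namespace { const struct user_namespace *parent; };
static const struct user_namespace init_user_ns = { NULL };

struct super_block { const struct user_namespace *s_user_ns; };

/* True when ns is target itself or one of target's descendants. */
static bool model_in_userns(const struct user_namespace *ns,
                            const struct user_namespace *target)
{
    for (; ns != NULL; ns = ns->parent)
        if (ns == target)
            return true;
    return false;
}

/* Assumed rule: a mount whose superblock is owned by a namespace the
 * task is not inside is "foreign" and is treated as if it were nosuid. */
static bool model_may_honor_suid(const struct super_block *sb,
                                 const struct user_namespace *task_ns,
                                 bool mnt_nosuid)
{
    assert(sb != NULL && task_ns != NULL);
    return !mnt_nosuid && model_in_userns(task_ns, sb->s_user_ns);
}
```

Under this rule a setuid binary on a filesystem mounted from a child
namespace has no effect for a task in the initial namespace, while
binaries on init-owned filesystems keep working for tasks in any
descendant namespace.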

Subsequent patches will update the vfs for id translation, handling
various corner cases, giving privileges to the user namespace which owns
a superblock, and finally supporting user namespace mounts for ext4 and
fuse.

Thanks,
Seth

[1] http://lkml.kernel.org/r/87mvz4yomp.fsf_-_@x220.int.ebiederm.org


Andy Lutomirski (1):
  fs: Treat foreign mounts as nosuid

Eric W. Biederman (1):
  userns: Simpilify MNT_NODEV handling.

Seth Forshee (5):
  fs: Add user namesapace member to struct super_block
  fs: Ignore file caps in mounts from other user namespaces
  security: Restrict security attribute updates for userns mounts
  selinux: Ignore security labels on user namespace mounts
  smack: Don't use security labels for user namespace mounts

 fs/block_dev.c                 |  2 +-
 fs/exec.c                      |  2 +-
 fs/namei.c                     |  9 ++++++++-
 fs/namespace.c                 | 34 ++++++++++++++++++++--------------
 fs/proc/root.c                 |  3 ++-
 fs/super.c                     | 38 +++++++++++++++++++++++++++++++++-----
 include/linux/fs.h             |  9 +++++++++
 include/linux/mount.h          |  1 +
 include/linux/user_namespace.h |  8 ++++++++
 kernel/user_namespace.c        | 14 ++++++++++++++
 security/commoncap.c           |  4 +++-
 security/security.c            | 10 +++++++++-
 security/selinux/hooks.c       | 16 +++++++++++++++-
 security/smack/smack_lsm.c     | 12 ++++++++++--
 14 files changed, 134 insertions(+), 28 deletions(-)


^ permalink raw reply	[flat|nested] 69+ messages in thread

end of thread, other threads:[~2015-08-01 17:01 UTC | newest]

Thread overview: 69+ messages
2015-07-30  4:24 [PATCH 0/7] Initial support for user namespace owned mounts Amir Goldstein
2015-07-30 13:55 ` Seth Forshee
2015-07-30 14:47   ` Amir Goldstein
2015-07-30 15:33     ` Casey Schaufler
2015-07-30 15:52       ` Colin Walters
2015-07-30 16:15         ` Eric W. Biederman
2015-07-30 13:57 ` Serge Hallyn
2015-07-30 15:09   ` Amir Goldstein
  -- strict thread matches above, loose matches on Subject: below --
2015-07-31  8:11 Amir Goldstein
2015-07-31 19:56 ` Casey Schaufler
2015-08-01 17:01   ` Amir Goldstein
2015-07-15 19:46 Seth Forshee
2015-07-15 20:36 ` Casey Schaufler
2015-07-15 21:06   ` Eric W. Biederman
2015-07-15 21:48     ` Seth Forshee
2015-07-15 22:28       ` Eric W. Biederman
2015-07-16  1:05         ` Andy Lutomirski
2015-07-16  2:20           ` Eric W. Biederman
2015-07-16 13:12           ` Stephen Smalley
2015-07-15 23:04       ` Casey Schaufler
2015-07-15 22:39     ` Casey Schaufler
2015-07-16  1:08       ` Andy Lutomirski
2015-07-16  2:54         ` Casey Schaufler
2015-07-16  4:47           ` Eric W. Biederman
2015-07-17  0:09             ` Dave Chinner
2015-07-17  0:42               ` Eric W. Biederman
2015-07-17  2:47                 ` Dave Chinner
2015-07-21 17:37                   ` J. Bruce Fields
2015-07-22  7:56                     ` Dave Chinner
2015-07-22 14:09                       ` J. Bruce Fields
2015-07-22 16:52                         ` Austin S Hemmelgarn
2015-07-22 17:41                           ` J. Bruce Fields
2015-07-23  1:51                             ` Dave Chinner
2015-07-23 13:19                               ` J. Bruce Fields
2015-07-23 23:48                                 ` Dave Chinner
2015-07-18  0:07                 ` Serge E. Hallyn
2015-07-20 17:54             ` Colin Walters
2015-07-16 11:16     ` Lukasz Pawelczyk
2015-07-17  0:10       ` Eric W. Biederman
2015-07-17 10:13         ` Lukasz Pawelczyk
2015-07-16  3:15 ` Eric W. Biederman
2015-07-16 13:59   ` Seth Forshee
2015-07-16 15:09     ` Casey Schaufler
2015-07-16 18:57       ` Seth Forshee
2015-07-16 21:42         ` Casey Schaufler
2015-07-16 22:27           ` Andy Lutomirski
2015-07-16 23:08             ` Casey Schaufler
2015-07-16 23:29               ` Andy Lutomirski
2015-07-17  0:45                 ` Casey Schaufler
2015-07-17  0:59                   ` Andy Lutomirski
2015-07-17 14:28                     ` Serge E. Hallyn
2015-07-17 14:56                       ` Seth Forshee
2015-07-21 20:35                     ` Seth Forshee
2015-07-22  1:52                       ` Casey Schaufler
2015-07-22 15:56                         ` Seth Forshee
2015-07-22 18:10                           ` Casey Schaufler
2015-07-22 19:32                             ` Seth Forshee
2015-07-23  0:05                               ` Casey Schaufler
2015-07-23  0:15                                 ` Eric W. Biederman
2015-07-23  5:15                                   ` Seth Forshee
2015-07-23 21:48                                   ` Casey Schaufler
2015-07-28 20:40                                 ` Seth Forshee
2015-07-30 16:18                                   ` Casey Schaufler
2015-07-30 17:05                                     ` Eric W. Biederman
2015-07-30 17:25                                       ` Seth Forshee
2015-07-30 17:33                                         ` Eric W. Biederman
2015-07-17 13:21           ` Seth Forshee
2015-07-17 17:14             ` Casey Schaufler
2015-07-16 15:59     ` Seth Forshee
