Re: [PATCH v2 01/10] vfs: move cap_convert_nscap() call into vfs_setxattr()

From: Miklos Szeredi <miklos@szeredi.hu>
To: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>,
	linux-fsdevel@vger.kernel.org,
	overlayfs <linux-unionfs@vger.kernel.org>,
	LSM <linux-security-module@vger.kernel.org>,
	linux-kernel@vger.kernel.org,
	"Serge E. Hallyn" <serge@hallyn.com>
Subject: Re: [PATCH v2 01/10] vfs: move cap_convert_nscap() call into vfs_setxattr()
Date: Tue, 12 Jan 2021 10:43:05 +0100	[thread overview]
Message-ID: <CAJfpegtKMwTZwENX7hrVGUVRWgNTf4Tr_bRxYrPpPAH_D2fH-Q@mail.gmail.com> (raw)
In-Reply-To: <874kjnm2p2.fsf@x220.int.ebiederm.org>

On Tue, Jan 12, 2021 at 1:15 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Miklos Szeredi <miklos@szeredi.hu> writes:
>
> > On Fri, Jan 01, 2021 at 11:35:16AM -0600, Eric W. Biederman wrote:

> > For one: a v2 fscap is supposed to be equivalent to a v3 fscap with a rootid of
> > zero, right?
>
> Yes.  This assumes that everything is translated into the uids of the
> target filesystem.
>
> > If so, why does cap_inode_getsecurity() treat them differently (v2 fscap
> > succeeding unconditionally while v3 one being either converted to v2, rejected
> > or left as v3 depending on current_user_ns())?
>
> As I understand it v2 fscaps have always succeeded unconditionally.  The
> only case I can see for a v2 fscap might not succeed when read is if the
> filesystem is outside of the initial user namespace.

Looking again, it's rather confusing.  cap_inode_getsecurity()
currently handles the following cases:

v1: -> fails with -EINVAL

v2: -> returns unconverted xattr

v3:
 a) rootid is mapped in the current namespace to non-zero:
     -> convert rootid

 b) rootid owns the current or ancerstor namespace:
     -> convert to v2

 c) rootid is not mapped and is not owner:
     -> return -EOPNOTSUPP -> falls back to unconverted v3

So lets take the example, where a tmpfs is created in a private user
namespace and one file has a v2 cap and the other an equivalent v3 cap
with a zero rootid.  This is the result when looking at it from

1) the namespace of the fs:
---------------------------------------
t = cap_dac_override+eip
tt = cap_dac_override+eip

2) the initial namespace:
---------------------------------------
t = cap_dac_override+eip
tt = cap_dac_override+eip [rootid=1000]

3) an unrelated namespace:
---------------------------------------
t = cap_dac_override+eip
tt = cap_dac_override+eip

Note: in this last case getxattr will actually return a v3 cap with
zero rootid for "tt" which getcap does not display due to being zero.
I could do a setup with a nested namespaces that better demonstrate
the confusing nature of this, but I think this also proves the point.

At this point userspace simply cannot determine whether the returned
cap is in any way valid or not.

The following semantics would make a ton more sense, since getting a
v2 would indicate that rootid is unknown:

- if cap is v2 convert to v3 with zero rootid
- after this, check if rootid needs to be translated, if not return v3
- if yes, try to translate to current ns, if succeeds return translated v3
- if not mappable, return v2

Hmm?

> > Anyway, here's a patch that I think fixes getxattr() layering for
> > security.capability.  Does basically what you suggested.  Slight change of
> > semantics vs. v1 caps, not sure if that is still needed, getxattr()/setxattr()
> > hasn't worked for these since the introduction of v3 in 4.14.
> > Untested.
>
> Taking a look.  The goal of change how these operate is to make it so
> that layered filesystems can just pass through the data if they don't
> want to change anything (even with the user namespaces of the
> filesystems in question are different).
>
> Feedback on the code below:
> - cap_get should be in inode_operations like get_acl and set_acl.

So it's not clear to me why xattr ops are per-sb and acl ops are per-inode.

> - cap_get should return a cpu_vfs_cap_data.
>
>   Which means that only make_kuid is needed when reading the cap from
>   disk.

It also means translating the cap bits back and forth between disk and
cpu endian.  Not a big deal, but...

>   Which means that except for the rootid_owns_currentns check (which
>   needs to happen elsewhere) default_cap_get should be today's
>   get_vfs_cap_from_disk.

That's true.   So what's the deal with v1 caps?  Support was silently
dropped for getxattr/setxattr but remained in get_vfs_caps_from_disk()
(I guess to not break legacy disk images), but maybe it's time to
deprecate v1 caps completely?

> - With the introduction of cap_get I believe commoncap should stop
>   implementing the security_inode_getsecurity hook, and rather have
>   getxattr observe is the file capability xatter and call the new
>   vfs_cap_get then translate to a v2 or v3 cap as appropriate when
>   returning the cap to userspace.

Confused.  vfs_cap_get() is the one the layered filesystem will
recurse with, so it must not translate the cap.   The one to do that
would be __vfs_getxattr(), right?

Thanks,
Miklos