Re: [PATCH v4 0/2] NFS: Fix interaction between fs_context and user namespaces

From: Sargun Dhillon <sargun@sargun.me>
To: Trond Myklebust <trondmy@hammerspace.com>
Cc: "anna.schumaker@netapp.com" <anna.schumaker@netapp.com>,
	"smayhew@redhat.com" <smayhew@redhat.com>,
	"chuck.lever@oracle.com" <chuck.lever@oracle.com>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"dhowells@redhat.com" <dhowells@redhat.com>,
	"schumaker.anna@gmail.com" <schumaker.anna@gmail.com>,
	"alban.crequy@gmail.com" <alban.crequy@gmail.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"mauricio@kinvolk.io" <mauricio@kinvolk.io>,
	"bfields@fieldses.org" <bfields@fieldses.org>,
	"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH v4 0/2] NFS: Fix interaction between fs_context and user namespaces
Date: Wed, 11 Nov 2020 18:57:28 +0000	[thread overview]
Message-ID: <20201111185727.GA27945@ircssh-2.c.rugged-nimbus-611.internal> (raw)
In-Reply-To: <8feccf45f6575a204da03e796391cc135283eb88.camel@hammerspace.com>

On Wed, Nov 11, 2020 at 02:38:11PM +0000, Trond Myklebust wrote:
> On Wed, 2020-11-11 at 11:12 +0000, Sargun Dhillon wrote:
> > On Tue, Nov 10, 2020 at 08:12:01PM +0000, Trond Myklebust wrote:
> > > On Tue, 2020-11-10 at 17:43 +0100, Alban Crequy wrote:
> > > > Hi,
> > > > 
> > > > I tested the patches on top of 5.10.0-rc3+ and I could mount an
> > > > NFS
> > > > share with a different user namespace. fsopen() is done in the
> > > > container namespaces (user, mnt and net namespaces) while
> > > > fsconfig(),
> > > > fsmount() and move_mount() are done on the host namespaces. The
> > > > mount
> > > > on the host is available in the container via mount propagation
> > > > from
> > > > the host mount.
> > > > 
> > > > With this, the files on the NFS server with uid 0 are available
> > > > in
> > > > the
> > > > container with uid 0. On the host, they are available with uid
> > > > 4294967294 (make_kuid(&init_user_ns, -2)).
> > > > 
> > > 
> > > Can someone please tell me what is broken with the _current_ design
> > > before we start trying to push "fixes" that clearly break it?
> > Currently the mechanism of mounting nfs4 in a user namespace is as
> > follows:
> > 
> > Parent: fork()
> > Child: setns(userns)
> > C: fsopen("nfs4") = 3
> > C->P: Send FD 3
> > P: FSConfig...
> > P: fsmount... (This is where the CAP_SYS_ADMIN check happens))
> > 
> > 
> > Right now, when you mount an NFS filesystem in a non-init user
> > namespace, and you have UIDs / GIDs on, the UIDs / GIDs which
> > are sent to the server are not the UIDs from the mounting namespace,
> > instead they are the UIDs from the init user ns.
> > 
> > The reason for this is that you can call fsopen("nfs4") in the
> > unprivileged 
> > namespace, and that configures fs_context with all the right
> > information for 
> > that user namespace, but we currently require CAP_SYS_ADMIN in the
> > init user 
> > namespace to call fsmount. This means that the superblock's user
> > namespace is 
> > set "correctly" to the container, but there's absolutely no way
> > nfs4uidmap
> > to consume an unprivileged user namespace.
> > 
> > This behaviour happens "the other way" as well, where the UID in the
> > container
> > may be 0, but the corresponding kuid is 1000. When a response from an
> > NFS
> > server comes in we decode it according to the idmap userns[1]. The
> > userns
> > used to get create idmap is generated at fsmount time, and not as
> > fsopen
> > time. So, even if the filesystem is in the user namespace, and the
> > server
> > responds with UID 0, it'll come up with an unmapped UID.
> > 
> > This is because we do
> > Server UID 0 -> idmap make_kuid(init_user_ns, 0) -> VFS
> > from_kuid(container_ns, 0) -> invalid uid
> > 
> > This is broken behaviour, in my humble opinion as is it makes it
> > impossible to 
> > use NFSv4 (and v3 for that matter) out of the box with unprivileged
> > user 
> > namespaces. At least in our environment, using usernames / GSS isn't
> > an option,
> > so we have to rely on UIDs being set correctly [at least from the
> > container's
> > perspective].
> > 
> 
> The current code for setting server->cred was developed independently
> of fsopen() (and predates it actually). I'm fine with the change to
> have server->cred be the cred of the user that called fsopen(). That's
> in line with what we used to do for sys_mount().
> 
Just curious, without FS_USERNS, how were you mounting NFSv4 in an
unprivileged user ns?

> However all the other stuff to throw errors when the user namespace is
> not init_user_ns introduces massive regressions.
> 

I can remove that and respin the patch. How do you feel about that?  I would 
still like to keep the log lines though because it is a uapi change. I am 
worried that someone might exercise this path with GSS and allow for upcalls 
into the main namespaces by accident -- or be confused of why they're seeing 
upcalls "in a different namespace".

Are you okay with picking up ("NFS: NFSv2/NFSv3: Use cred from fs_context during 
mount") without any changes?

I can respin ("NFSv4: Refactor NFS to use user namespaces") without:
/*
 * nfs4idmap is not fully isolated by user namespaces. It is currently
 * only network namespace aware. If upcalls never happen, we do not
 * need to worry as nfs_client instances aren't shared between
 * user namespaces.
 */
if (idmap_userns(server->nfs_client->cl_idmap) != &init_user_ns && 
	!(server->caps & NFS_CAP_UIDGID_NOMAP)) {
	error = -EINVAL;
	errorf(fc, "Mount credentials are from non init user namespace and ID mapping is enabled. This is not allowed.");
	goto error;
}

(and making it so we can call idmap_userns)

> > > 
> > > The current design assumes that the user namespace being used is
> > > the one where 
> > > the mount itself is performed. That means that the uids and gids or
> > > usernames 
> > > and groupnames that go on the wire match the uids and gids of the
> > > container in 
> > > which the mount occurred.
> > > 
> > 
> > Right now, NFS does not have the ability for the fsmount() call to be
> > called in an unprivileged user namespace. We can change that
> > behaviour
> > elsewhere if we want, but it's orthogonal to this.
> > 
> > > The assumption is that the server has authenticated that client as
> > > belonging to a domain that it recognises (either through strong
> > > RPCSEC_GSS/krb5 authentication, or through weaker matching of IP
> > > addresses to a list of acceptable clients).
> > > 
> > I added a rejection for upcalls because upcalls can happen in the
> > init 
> > namespaces. We can drop that restriction from the nfs4 patch if you'd
> > like. I
> > *believe* (and I'm not a little out of my depth) that the request-key
> > handler gets called with the *network namespace* of the NFS mount,
> > but the userns is a privileged one, allowing for potential hazards.
> > 
> 
> The idmapper already rejects upcalls to the keyring '/sbin/request-key'
> utility if you're running with your own user namespace.
> 
> Quite frankly, switching to using the keyring was a mistake which I'd
> undo if I could. Aside from not supporting containers, it is horribly
> slow due to requiring a full process startup/teardown for every upcall,
> so it scales poorly to large numbers of identities (particularly with
> an operation like readdir() in which you're doing serial upcalls).
> 
> However nothing stops you from using the old NFSv4 idmapper daemon
> (a.k.a. rpc.idmapd) in the context of the container that called
> fsopen() so that it can translate identities correctly using whatever
> userspace tools (ldap, sssd, winbind...) that the container has
> configured.
> 

1. We see this as a potential security risk [this being upcalls] into the 
unconfined portion of the system. Although, I'm sure that the userspace handlers 
are written perfectly well, it allows for information leakage to occur.

2. Is there a way to do this for NFSv3? 

3. Can rpc.idmapd get the user namespace that the call is from (and is the 
keyring per-userns?). In general, I think that this change follows the principal 
of least surprise.

> > The reason I added that block there is that I didn't imagine anyone
> > was running 
> > NFS in an unprivileged user namespace, and relying on upcalls
> > (potentially into 
> > privileged namespaces) in order to do authz.
> > 
> > 
> > > If you go ahead and change the user namespace on the client without
> > > going through the mount process again to mount a different super
> > > block
> > > with a different user namespace, then you will now get the exact
> > > same
> > > behaviour as if you do that with any other filesystem.
> > 
> > Not exactly, because other filesystems *only* use the s_user_ns for
> > conversion 
> > of UIDs, whereas NFS uses the currend_cred() acquired at mount time,
> > which 
> > doesn't match s_user_ns, leading to this behaviour.
> > 
> > 1. Mistranslated UIDs in encoding RPCs
> > 2. The UID / GID exposed to VFS do not match the user ns.
> > 
> > > 
> > > -- 
> > > Trond Myklebust
> > > Linux NFS client maintainer, Hammerspace
> > > trond.myklebust@hammerspace.com
> > > 
> > > 
> > -Thanks,
> > Sargun
> > 
> > [1]:  
> > https://elixir.bootlin.com/linux/v5.9.8/source/fs/nfs/nfs4idmap.c#L782
> > [2]:  
> > https://elixir.bootlin.com/linux/v5.9.8/source/fs/nfs/nfs4client.c#L1154
> 
> -- 
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com
> 
>