Re: [LSF/MM TOPIC] Containers and distributed filesystems

From: Trond Myklebust <trondmy@hammerspace.com>
To: "James.Bottomley@HansenPartnership.com" 
	<James.Bottomley@HansenPartnership.com>,
	"lsf-pc@lists.linux-foundation.org" 
	<lsf-pc@lists.linux-foundation.org>
Cc: "linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>
Subject: Re: [LSF/MM TOPIC] Containers and distributed filesystems
Date: Wed, 23 Jan 2019 20:50:41 +0000	[thread overview]
Message-ID: <28ba1a0012e84a789d2f402d292935e98266212b.camel@hammerspace.com> (raw)
In-Reply-To: <1548271299.2949.41.camel@HansenPartnership.com>

On Wed, 2019-01-23 at 11:21 -0800, James Bottomley wrote:
> On Wed, 2019-01-23 at 18:10 +0000, Trond Myklebust wrote:
> > Hi,
> > 
> > I'd like to propose an LSF/MM discussion around the topic of
> > containers and distributed filesystems.
> > 
> > The background is that we have a number of decisions to make around
> > dealing with namespaces when the filesystem is distributed.
> > 
> > On the one hand, there is the issue of which user namespace we
> > should
> > be using when putting uids/gids on the wire, or when translating
> > into
> > alternative identities (user/group name, cifs SIDs,...). There are
> > two main competing proposals: the first proposal is to select the
> > user namespace of the process that mounted the distributed
> > filesystem. The second proposal is to (continue to) use the user
> > namespace pointed to by init_nsproxy. It seems that whichever
> > choice
> > we make, we probably want to ensure that all the major distributed
> > filesystems (AFS, CIFS, NFS) have consistent handling of these
> > situations.
> 
> I don't think there's much disagreement among container people: most
> would agree the uids on the wire should match the uids in the
> container.  If you're running your remote fs via fuse in an
> unprivileged container, you have no access to the kuid/kgid anyway,
> so
> it's the way you have to run.
> 
> I think the latter comes about because most of the container
> implementations still have difficulty consuming the user namespace,
> so
> most run without it (where kuid = uid) or mis-implement it, which is
> where you might get the mismatch.  Is there an actual use case where
> you'd want to see the kuid at the remote end, bearing in mind that
> when
> user namespaces are properly set up kuid is often the product of
> internal subuid mapping.

Wouldn't the above basically allow you to spoof root on any existing
mounted NFS client using the unprivileged command 'unshare -U -r'?

Eric Biederman was the one proposing the 'match the namespace of the
process that mounted the filesystem' approach. My main questions about
that approach would be:
1) Are we guaranteed to always have a mapping between an arbitrary
uid/gid from the user namespace in the container, to the user namespace
of the parent orchestrator process that set up the mount?
2) How do we reconcile that approach with the requirement that NFSv4 be
able to convert uids/gids into stringified user/group names (which is
usually solved using an upcall mechanism)?

> > Another issue arises around the question of identifying containers
> > when they are migrated. At least the NFSv4 client needs to be able
> > to
> > send a unique identifier that is preserved across container
> > migration. The uts_namespace is typically insufficient for this
> > purpose, since most containers don't bother to set a unique
> > hostname.
> 
> We did have a discussion in plumbers about the container ID, but I'm
> not sure it reached a useful conclusion for you (video, I'm afraid):
> 
> https://linuxplumbersconf.org/event/2/contributions/215/

I have a concrete proposal for how we can do this using 'udev', and I'm
looking for a forum in which to discuss it.

> > Finally, there is an issue that may be unique to NFS (in which case
> > I'd be happy to see it as a hallway discussion or a BoF session)
> > around preserving file state across container migrations.
> 
> If by file state, you mean the internal kernel struct file state,
> doesn't CRIU already do that? or do you mean some other state?

I thought CRIU was unable to deal with file locking state?

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com