From: David Howells <dhowells@redhat.com>
To: Amir Goldstein <amir73il@gmail.com>
Cc: dhowells@redhat.com, lsf-pc@lists.linux-foundation.org,
	Trond Myklebust <trond.myklebust@hammerspace.com>,
	Anna Schumaker <anna.schumaker@netapp.com>,
	Steve French <sfrench@samba.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Jeff Layton <jlayton@redhat.com>,
	Miklos Szeredi <miklos@szeredi.hu>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] How to make disconnected operation work?
Date: Mon, 27 Jan 2020 16:32:41 +0000	[thread overview]
Message-ID: <1477632.1580142761@warthog.procyon.org.uk> (raw)
In-Reply-To: <CAOQ4uxicFmiFKz7ZkHYuzduuTDaCTDqo26fo02-VjTMmQaaf+A@mail.gmail.com>

Amir Goldstein <amir73il@gmail.com> wrote:

> My thinking is: Can't we implement a stackable cachefs which interfaces
> with fscache and whose API to the netfs is pure vfs APIs, just like
> overlayfs interfaces with lower fs?

In short, no - doing it purely with the VFS APIs that we have is not that
simple (yes, Solaris does it with a stacking filesystem, and I don't know
anything about the API details, but there must be an auxiliary API).  You need
to handle the following (a rough sketch of what such an auxiliary API might
have to look like follows the list):

 (1) Remote invalidation.  The netfs needs to tell the cache layer
     asynchronously about remote modifications - where the modification can
     modify not just file content but also directory structure, and even file
     data invalidation may be partial.

 (2) Unique file group matching.  The info required to match a group of files
     (e.g. an NFS server, an AFS volume, a CIFS share) is not necessarily
     available through the VFS API - I'm not sure even the export API makes
     this available since it's built on the assumption that it's exporting
     local files.

 (3) File matching.  The info required to match a file to the cache is not
     necessarily available through the VFS API.  NFS has file handles, for
     example; the YFS variant of AFS has 96-bit 'inode numbers'.  (This might
     be doable with the export API, if that counts.)  Further, the file
     identifier may not be unique outside the file group.

 (4) Coherency management.  The netfs must tell the cache whether or not the
     data contained in the cache is valid.  This information is not
     necessarily available through the VFS APIs (NFS change IDs, AFS data
     version, AFS volume sync info).  It's also highly filesystem specific.
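
Something like the following - purely a hypothetical sketch, none of these
structure or function names exist in the kernel - is the shape of the
auxiliary interface each netfs would have to provide on top of the plain VFS
operations:

#include <linux/fs.h>

/* Hypothetical per-netfs auxiliary ops for a stackable cache. */
struct cachefs_netfs_ops {
	/* (1) Remote invalidation: the netfs tells the cache that the
	 * server changed part of a file or a directory's contents.
	 */
	void (*invalidate)(struct inode *inode, loff_t start, loff_t len);

	/* (2) Unique file group matching: a key identifying the
	 * server/volume/share (NFS server + FSID, AFS cell + volume ID,
	 * CIFS share) that this superblock represents.
	 */
	int (*volume_key)(struct super_block *sb, void *buf, size_t *len);

	/* (3) File matching: a key unique within the file group (NFS
	 * file handle, YFS 96-bit vnode ID).
	 */
	int (*file_key)(struct inode *inode, void *buf, size_t *len);

	/* (4) Coherency: the netfs's change attribute (NFS change ID,
	 * AFS data version and volume sync info) so the cache can tell
	 * whether what it holds is still valid.
	 */
	int (*coherency_data)(struct inode *inode, void *buf, size_t *len);
};

All four of those hand over information that only the netfs has, which is the
point: the VFS doesn't carry it for you.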

It might also have security implications for netfs's that handle their own
security (as AFS does), but that might fall out naturally.

> As long as netfs supports direct_IO() (all except afs do) then the active page
> cache could be that of the stackable cachefs and network IO is always
> direct from/to cachefs pages.

What about objects that don't support DIO?  Directories, symbolic links and
automount points?  All of these things are cacheable objects with AFS.

And speaking of automount points - how would you deal with those beyond simply
caching the contents?  Create a new stacked instance over it?  How do you see
the automount point itself?

I see that the NFS FH encoder doesn't handle automount points.

> If netfs supports export_operations (all except afs do), then indexing
> the cache objects could be done in a generic manner using fsid and
> file handle, just like overlayfs index feature works today.

FSID isn't unique and doesn't exist for all filesystems.  Two NFS servers, for
example, can give you the same FSID, but referring to different things.  AFS
has a textual cell name and a volume ID that you need to combine; it doesn't
have an FSID.

This may work for overlayfs as the FSID can be confined to a particular
overlay.  However, that's not what we're dealing with.  We would be talking
about an index that potentially covers *all* the mounted netfs.

Also, from your description that sounds like a bug in overlayfs.  If the
overlain NFS tree does a referral to a different server, you no longer have a
unique FSID or a unique FH within that FSID so your index is broken.
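
To put it another way: an index spanning every mounted netfs would need a key
shaped something like this (again a purely hypothetical sketch; none of these
names exist anywhere), and fsid-plus-filehandle only gives you the last two
components of it:

#include <linux/types.h>

/* Hypothetical globally-unique cache index key.  A bare (fsid, file handle)
 * pair isn't sufficient: two NFS servers can hand out the same FSID, and AFS
 * has no FSID at all, only a textual cell name plus a volume ID.
 */
struct cache_index_key {
	const char *netfs_type;		/* "nfs", "afs", "cifs", ... */
	const void *server_id;		/* NFS server identity, AFS cell name,
					 * CIFS server - netfs-specific */
	size_t server_id_len;
	const void *volume_id;		/* NFS FSID, AFS volume ID, CIFS share */
	size_t volume_id_len;
	const void *file_id;		/* NFS file handle, YFS vnode ID, ... */
	size_t file_id_len;
};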

> Would it not be a maintenance win if all (or most of) the fscache logic
> was yanked out of all the specific netfs's?

Actually, it may not help enormously with disconnected operation.  A certain
amount of the logic probably has to be implemented in the netfs as each netfs
provides different facilities for managing this.

Yes, it gets some of the I/O stuff out - but I want to move some of that down
into the VM if I can, and librarifying the rest should take care of that.

> Can you think of reasons why the stackable cachefs model cannot work
> or why it is inferior to the current fscache integration model with netfs's?

Yes.  It's a lot more operationally expensive and it's harder to use.  The
cache driver would also have to get a lot bigger, but that would be
reasonable.

Firstly, the expense: you have to double up all the inodes and dentries that
are in use - and that's not counting the resources used inside the cache
itself.

Secondly, the administration: I'm assuming you're suggesting the way I think
Solaris does it, where you have to make two mounts: first you mount the netfs
and then you mount the cache over it.  It's much simpler if you only need to
make the netfs mount, which then goes and uses the cache if it's available -
it's also simple to bring the cache online after the fact, meaning caching can
even be applied retroactively to a root filesystem.

You also have the issue of what happens if someone bind-mounts the netfs mount
and mounts the cache over only one of the views.  Now you have a coherency
management problem that the cache cannot see.  It's only visible to the netfs,
but the netfs doesn't know about the cache.

There's also file locking.  Overlayfs doesn't support file locking as far as I
can see, but NFS, AFS and CIFS all do.


Anyway, you might be able to guess that I'm really against using stackable
filesystems for things like this and like UID shifting.  I think it adds more
expense and complexity than it's necessarily worth.

I was more inclined to go with unionfs than overlayfs and do the filesystem
union in the VFS as it ought to be cheaper if you're using it (whereas
overlayfs is cheaper if you're not).

One final thing - even if we did want to switch to a stacked approach, we
might still have to maintain the current way since people are using it.

David


