Re: [PATCH v1] shmem: stable directory cookies

From: Jeff Layton <jlayton@kernel.org>
To: Chuck Lever III <chuck.lever@oracle.com>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: Chuck Lever <cel@kernel.org>,
	"hughd@google.com" <hughd@google.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH v1] shmem: stable directory cookies
Date: Thu, 04 May 2023 13:21:41 -0400
Message-ID: <cbd955c08432a82014cc21f36e42afc67962a718.camel@kernel.org>
In-Reply-To: <30E5A657-4005-4126-A962-A8E6D90240AB@oracle.com>

On Wed, 2023-05-03 at 00:43 +0000, Chuck Lever III wrote:
> 
> > On May 2, 2023, at 8:12 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> > 
> > On Mon, 17 Apr 2023 15:23:10 -0400 Chuck Lever <cel@kernel.org> wrote:
> > 
> > > From: Chuck Lever <chuck.lever@oracle.com>
> > > 
> > > The current cursor-based directory cookie mechanism doesn't work
> > > when a tmpfs filesystem is exported via NFS. This is because NFS
> > > clients do not open directories: each READDIR operation has to open
> > > the directory on the server, read it, then close it. The cursor
> > > state for that directory, being associated strictly with the opened
> > > struct file, is then discarded.
> > > 
> > > Directory cookies are cached not only by NFS clients, but also by
> > > user space libraries on those clients. Essentially there is no way
> > > to invalidate those caches when directory offsets have changed on
> > > an NFS server after the offset-to-dentry mapping changes.
> > > 
> > > The solution we've come up with is to make the directory cookie for
> > > each file in a tmpfs filesystem stable for the life of the directory
> > > entry it represents.
> > > 
> > > Add a per-directory xarray. shmem_readdir() uses this to map each
> > > directory offset (an loff_t integer) to the memory address of a
> > > struct dentry.
> > > 
> > 
> > How have people survived for this long with this problem?
> 
> It's less of a problem without NFS in the picture; local
> applications can hold the directory open, and that preserves
> the seek cursor. But you can still trigger it.
> 
> Also, a plurality of applications are well-behaved in this
> regard. It's just the more complex and more useful ones
> (like git) that seem to trigger issues.
> 
> It became less bearable for NFS because of a recent change
> on the Linux NFS client to optimize directory read behavior:
> 
> 85aa8ddc3818 ("NFS: Trigger the "ls -l" readdir heuristic sooner")
> 
> Trond argued that tmpfs directory cookie behavior has always
> been problematic (e.g., broken) therefore this commit does not
> count as a regression. However, it does make tmpfs exports
> less usable, breaking some tests that have always worked.
> 
> 
> > It's a lot of new code -
> 
> I don't feel that this is a lot of new code:
> 
> include/linux/shmem_fs.h |    2 
> mm/shmem.c               |  213 +++++++++++++++++++++++++++++++++++++++++++---
> 2 files changed, 201 insertions(+), 14 deletions(-)
> 
> But I agree it might look a little daunting on first review.
> I am happy to try to break this single patch up or consider
> other approaches.
> 

I wonder whether you really need an xarray here?

dcache_readdir walks the d_subdirs list. We add things to d_subdirs at
d_alloc time (and in d_move). If you were to assign its dirindex when
the dentry gets added to d_subdirs (maybe in ->d_init?) then you'd have
a list already ordered by index, and could deal with missing indexes
easily.

It's not as efficient as the xarray if you have to seek through a big
dir, but if keeping the changes tiny is a goal then that might be
another way to do this.
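
To illustrate the idea, here is a rough userspace model, not kernel code.
Everything in it (struct entry, struct dir, dir_add, dir_remove, dir_seek)
is hypothetical and only stands in for d_subdirs and a per-dentry dirindex.
It assumes new entries go on the tail with a monotonically increasing
index, and that indexes 0 and 1 are reserved for "." and "..", so callers
would start next_index at 2:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a dentry on its parent's child list. */
struct entry {
	unsigned long index;	/* assigned once at insertion, never reused */
	const char *name;
	struct entry *next;
};

/* Hypothetical stand-in for the parent directory. */
struct dir {
	struct entry *head, *tail;
	unsigned long next_index;
};

/* Append a child; the list stays sorted by index with no extra work. */
static struct entry *dir_add(struct dir *dir, const char *name)
{
	struct entry *e;

	assert(dir != NULL);
	e = malloc(sizeof(*e));
	if (!e)
		return NULL;
	e->index = dir->next_index++;
	e->name = name;
	e->next = NULL;
	if (dir->tail)
		dir->tail->next = e;
	else
		dir->head = e;
	dir->tail = e;
	return e;
}

/* Unlink a child; the indexes of the remaining children do not change. */
static void dir_remove(struct dir *dir, struct entry *victim)
{
	struct entry **pp = &dir->head, *prev = NULL;

	while (*pp && *pp != victim) {
		prev = *pp;
		pp = &(*pp)->next;
	}
	if (!*pp)
		return;
	*pp = victim->next;
	if (dir->tail == victim)
		dir->tail = prev;
	free(victim);
}

/*
 * Seek: return the first child whose index is >= cookie. A cookie that
 * pointed at a since-removed child simply lands on the next survivor.
 * This is the O(n) walk that an xarray lookup would avoid.
 */
static struct entry *dir_seek(const struct dir *dir, unsigned long cookie)
{
	for (struct entry *e = dir->head; e; e = e->next)
		if (e->index >= cookie)
			return e;
	return NULL;
}
```

In this model, a reader that wants to resume after an entry would save that
entry's index + 1 as its cookie. Removing entries on either side of that
point does not shift the cookie, which is the property the patch is after.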

> We could, for instance, tuck a little more of this into
> lib/fs. Copying the readdir and directory seeking
> implementation from simplefs to tmpfs is one reason
> the insertion count is worrisome.
> 
> 
> > can we get away with simply disallowing
> > exports of tmpfs?
> 
> I think the bottom line is that you /can/ trigger this
> behavior without NFS, just not as quickly. The threshold
> is high enough that most use cases aren't bothered by
> this right now.
> 
> We'd rather not disallow exporting tmpfs. It's a very
> good testing platform for us, and disallowing it would
> be a noticeable regression for some folks.
> 
> 

Yeah, I'd not be in favor of that either. We've had an exportable tmpfs
for a long time. It's a good way to do testing of the entire NFS server
stack, without having to deal with underlying storage.

> > How can we maintain this?  Is it possible to come up with a test
> > harness for inclusion in kernel selftests?
> 
> There is very little directory cookie testing that I know of
> in the obvious place: fstests. That would be where this stuff
> should be unit tested, IMO.
> 

I'd like to see this too. It's easy for programs to get this wrong. In
this case, could we emulate the NFS behavior by doing this in a loop
over a large directory?

opendir
seekdir (to result of last telldir)
readdir
unlink
telldir
closedir

At the end of it, check whether there are any entries left over.
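
A minimal userspace sketch of that loop follows. It is untested against
fstests and is not taken from any existing test; the helper names
(populate_dir, count_entries, drain_dir) are made up for illustration.
drain_dir() reopens the directory for every step, which is roughly what an
NFS server does for each READDIR, and returns the number of entries that
were missed:

```c
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Return nonzero for "." and "..", which the loop never unlinks. */
static int is_dot(const char *name)
{
	return strcmp(name, ".") == 0 || strcmp(name, "..") == 0;
}

/* Create directory `path` holding `n` empty files; 0 on success, -1 on error. */
static int populate_dir(const char *path, int n)
{
	char name[4096];

	if (mkdir(path, 0700) != 0)
		return -1;
	for (int i = 0; i < n; i++) {
		int len = snprintf(name, sizeof(name), "%s/file-%d", path, i);
		int fd;

		if (len < 0 || (size_t)len >= sizeof(name))
			return -1;
		fd = open(name, O_CREAT | O_WRONLY | O_EXCL, 0600);
		if (fd < 0)
			return -1;
		close(fd);
	}
	return 0;
}

/* Count entries other than "." and ".." in a single pass; -1 on error. */
static int count_entries(const char *path)
{
	DIR *d = opendir(path);
	struct dirent *de;
	int n = 0;

	if (!d)
		return -1;
	while ((de = readdir(d)) != NULL)
		if (!is_dot(de->d_name))
			n++;
	closedir(d);
	return n;
}

/*
 * opendir, seekdir to the last telldir result, readdir, unlink, telldir,
 * closedir; repeat until readdir reports end of directory. Returns the
 * number of entries left behind (0 if cookies are stable), -1 on error.
 */
static int drain_dir(const char *path)
{
	char name[4096];
	long pos = 0;
	int have_pos = 0;

	assert(path != NULL);
	for (;;) {
		DIR *d = opendir(path);
		struct dirent *de;
		int len;

		if (!d)
			return -1;
		if (have_pos)
			seekdir(d, pos);
		do {
			de = readdir(d);
		} while (de && is_dot(de->d_name));
		if (!de) {
			closedir(d);
			break;
		}
		len = snprintf(name, sizeof(name), "%s/%s", path, de->d_name);
		if (len < 0 || (size_t)len >= sizeof(name) || unlink(name) != 0) {
			closedir(d);
			return -1;
		}
		pos = telldir(d);	/* cookie to resume from after reopening */
		have_pos = 1;
		closedir(d);
	}
	return count_entries(path);
}
```

A caller would run populate_dir() on a fresh path under the filesystem
being tested and then treat any positive return from drain_dir() as a
failure. On a filesystem whose offsets shift when an entry is removed, the
saved position is expected to skip entries, so some would be left over.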
-- 
Jeff Layton <jlayton@kernel.org>