Re: git regression failures with v6.2-rc NFS client

From: Benjamin Coddington <bcodding@redhat.com>
To: Chuck Lever III <chuck.lever@oracle.com>
Cc: Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
	Trond Myklebust <trond.myklebust@hammerspace.com>,
	Anna Schumaker <anna@kernel.org>
Subject: Re: git regression failures with v6.2-rc NFS client
Date: Fri, 03 Feb 2023 15:01:04 -0500	[thread overview]
Message-ID: <104B6879-5223-485F-B099-767F741EB15B@redhat.com> (raw)
In-Reply-To: <79261B77-35D0-4E36-AA29-C7BF9FB734CC@oracle.com>

On 3 Feb 2023, at 13:03, Chuck Lever III wrote:
> Naive suggestion: Would another option be to add server
> support for the directory verifier?

I don't think it would resolve this bug, because what can the client do to
find its place in the directory stream after getting NFS4ERR_NOT_SAME?

Directory verifiers might help another class of bugs, where a linear walk
through the directory produces duplicate entries.. but I think it only helps
some of those cases.

Knfsd /could/ use the directory verifier to keep a reference to an opened
directory.  Perhaps knfsd's open file cache can be used.  Then, as long as
we're doing a linear walk through the directory, the local directory's
file->private cursor would continue to be valid to produce consistent linear
results even if the dentries are removed in between calls to READDIR.

.. but that's not how the verifier is supposed to be used - rather it's
supposed to signal when the client's cookies are invalid, which, for tmpfs,
would be anytime any changes are made to the directory.

> Well, let's not argue semantics. The optimization exposes
> this (previously known) bug to a wider set of workloads.

Ok, yes.

> Please, open some bugs that document it. Otherwise this stuff
> will never get addressed. Can't fix what we don't know about.
>
> I'm not claiming tmpfs is perfect. But the optimization seems
> to make things worse for more workloads. That's always been a
> textbook definition of regression, and a non-starter for
> upstream.

I guess I can - my impression has been there's no interest in fixing tmpfs
for use over NFS..  after my last round of work on it I resolved to tell any
customers that asked for it to do:

[root@fedora ~]# modprobe brd rd_size=1048576 rd_nr=1
[root@fedora ~]# mkfs.xfs /dev/ram0

.. instead.  Though, I think that tmpfs is going to have better performance.

>> I spent a great deal of time making points about it and arguing that the
>> client should try to work around them, and ultimately was told exactly this:
>> https://lore.kernel.org/linux-nfs/a9411640329ed06dd7306bbdbdf251097c5e3411.camel@hammerspace.com/
>
> Ah, well "client should work around them" is going to be a
> losing argument every time.

Yeah, I agree with that, though at the time I naively thought the issues
could be solved by better validation of changing directories.

So.. I guess I lost?  I wasn't aware of the stable cookie issues fully
until Trond pointed them out and I compared tmpfs and xfs.  At that point, I
saw there's not really much of a path forward for tmpfs, unless we want to
work pretty hard at it.  But why?  Any production server wanting performance
and stability on an NFS export isn't going to use tmpfs on knfsd.  There are
good memory-speed alternatives.

>> The optimization you'd like to revert fixes a performance regression on
>> workloads across exports of all filesystems.  This is a fix we've had many
>> folks asking for repeatedly.
>
> Does the community agree that tmpfs has never been a first-tier
> filesystem for exporting? That's news to me. I don't recall us
> deprecating support for tmpfs.

I'm definitely not telling folks to use it as exported from knfsd.  I'm
instead telling them, "Directory listings are going to be unstable, you'll
see problems." That includes any filesystems with file_operations of
simple_dir_operations.

That said, the whole opendir, getdents, unlink, getdents pattern is maybe
fine for a test that can assume it has exclusive rights (writes?) to a
directory, but also probably insane for anything else that wants to reliably
remove the thing, and we'll find that's why `rm -rf` still works.  It does
opendir, getdents, getdents, getdents, unlink, unlink, unlink, .. rmdir.
This more closely corresponds to POSIX stating:

    If a file is removed from or added to the directory after the most recent
    call to opendir() or rewinddir(), whether a subsequent call to readdir()
    returns an entry for that file is unspecified.

> If an optimization broke ext4, xfs, or btrfs, it would be
> reverted until the situation was properly addressed. I don't
> see why the situation is different for tmpfs just because it
> has more bugs.

Maybe it isn't?  We've yet to hear from any authoritative sources on this.

>> I hope the maintainers decide not to revert
>> it, and that we can also find a way to make readdir work reliably on tmpfs.
>
> IMO the guidelines go the other way: readdir on tmpfs needs to
> be addressed first. Reverting is a last resort, but I don't see
> a fix coming before v6.2. Am I wrong?

Is your opinion wrong?  :)  IMO, yes, because this is just another round of
"we don't fix the client for broken server behaviors".  If we're going to
disagree, know that its entirely amicable from my side.  :)

> What I fear will happen is that the optimization will go in, and
> then nobody will address (or even document) the tmpfs problem,
> making it unusable. That would be unfortunate and wrong.
>
> Please look at tmpfs and see how difficult it will be to address
> the cookie problem properly.

I think tmpfs doesn't have any requirement to report consistent offsets in
between calls to close(dir) and open(dir) - so if you re-open the directory
a second time and seek to some position, you're not going to be able to
connect some earlier part of the stream to what you see now.

So, I don't think its a tmpfs problem - its a bit of a combination of issues
that contrive to make READDIR results inconsistent on NFS.  The best place
to improve things might be to have knfsd try to keep the directory open in
between calls to READDIR for filesystems of simple_dir_operations.

Really, the root of the problem is that there's no stateful way to walk a
directory that's changing, especially if you're expecting things to be the
same in between close() and open().  XFS and ext4 do a pretty good job of
making that problem disappear by hashing out their dentries across a very
large range, but without some previous state or versioning, it seems a hard
problem to solve.

Ben