Re: [PATCH 01/10] VFS: support parallel updates in the one directory.

From: Linus Torvalds <torvalds@linux-foundation.org>
To: NeilBrown <neilb@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>, Daire Byrne <daire@dneg.com>,
	Trond Myklebust <trond.myklebust@hammerspace.com>,
	Chuck Lever <chuck.lever@oracle.com>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
	linux-fsdevel@vger.kernel.org,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 01/10] VFS: support parallel updates in the one directory.
Date: Fri, 26 Aug 2022 17:13:38 -0700
Message-ID: <CAHk-=whz69y=98udgGB5ujH6bapYuapwfHS2esWaFrKEoi9-Ow@mail.gmail.com>
In-Reply-To: <166155521174.27490.456427475820966571@noble.neil.brown.name>

On Fri, Aug 26, 2022 at 4:07 PM NeilBrown <neilb@suse.de> wrote:
>
> As you note, by the end of the series "create" is not more different
> from "rename" than it already is.  I only broke up the patches to make
> review more manageable.

Yes, I understand. But I'm saying that maybe a filesystem actually
might want to treat them differently.

That said, the really nasty part was that 'wq' thing that meant that
different paths had different directory locking not because of
low-level filesystem issues, but because of caller issues.

So that's the one I _really_ disliked, and that I don't think should
exist even as a partial first step.

The "tie every operation together with one flag" I can live with, in
case it turns out that yes, that one flag is all anybody ever really
wants.

> Alternate option is to never pass in a wq for create operation, and use
> var_waitqueue() (or something similar) to provide a global shared wait
> queue (which is essentially what I am using to wait for
> DCACHE_PAR_UPDATE to clear).

I _think_ this is what I would prefer.

I say that I _think_ I prefer that, because maybe there are issues
with it. But since you basically do that DCACHE_PAR_UPDATE thing
anyway, and it's one of the main users of this var_waitqueue, it feels
right to me.

But then if it just ends up not working well for some practical
reason, at that point maybe I'd just say "I was wrong, I thought it
would work, but it's better to spread it out to be a per-thread
wait-queue on the stack".

IOW, my preference would be to simply just try it, knowing that you
*can* do the "pass explicit wait-queue down" thing if we need to.
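
As a rough illustration of the "global shared wait queue" idea, here is
a minimal user-space sketch. It is a toy model, not the kernel's
var_waitqueue() implementation: every name starting with toy_ is
made up for this example, and the table size and hash constant are
arbitrary assumptions. The point is only that any address maps to one
of a fixed set of shared queues, so no caller has to pass a wait-queue
down.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy user-space model; all toy_* names are hypothetical, not kernel API. */
#define TOY_WAIT_TABLE_BITS 8
#define TOY_WAIT_TABLE_SIZE (1u << TOY_WAIT_TABLE_BITS)

struct toy_waitqueue {
	unsigned int nr_waiters;	/* stand-in for a real list of sleepers */
};

static struct toy_waitqueue toy_wait_table[TOY_WAIT_TABLE_SIZE];

/* Map any object address to one of the shared queues. */
static struct toy_waitqueue *toy_var_waitqueue(const void *p)
{
	uint64_t h;

	assert(p != NULL);
	/* Multiplicative hash; the top bits select the bucket. */
	h = (uint64_t)(uintptr_t)p * 0x9E3779B97F4A7C15ull;
	return &toy_wait_table[h >> (64 - TOY_WAIT_TABLE_BITS)];
}

/* Register a waiter on the queue that this address hashes to. */
static void toy_add_waiter(const void *p)
{
	toy_var_waitqueue(p)->nr_waiters++;
}

/* Wake everybody on that queue; returns how many waiters were woken. */
static unsigned int toy_wake_all(const void *p)
{
	struct toy_waitqueue *q = toy_var_waitqueue(p);
	unsigned int woken = q->nr_waiters;

	q->nr_waiters = 0;
	return woken;
}
```

Unrelated addresses can hash to the same bucket, so a woken waiter has
to recheck its own condition (here that would be the DCACHE_PAR_UPDATE
bit) before proceeding.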

Hmm?

> > Instead of it being up to the filesystem to say "I can do parallel
> > creates, but I need to serialize renames", this whole thing has been
> > set up to be about the caller making that decision.
>
> I think that is a misunderstanding.  The caller isn't making a decision
> - except the IS_PAR_UPDATE() test which is simply acting on the fs
> request.  What you are seeing is a misguided attempt to leave in place
> some existing interfaces which assumed exclusive locking and didn't
> provide wqs.

Ok. I still would prefer to have unified locking, not that "do this
for one filesystem, do that for another" conditional one.

> >  (b) aim for the inode lock being taken *after* the _lookup_hash(),
> > since the VFS layer side has to be able to handle the concurrency on
> > the dcache side anyway
>
> I think you are suggesting that we change ->lookup call to NOT
> require i_rwsem be held.

Yes and no.

One issue for me is that with your change as-is, for the 99% of
people who don't happen to use NFS, the inode lock still gives all
that VFS code mutual exclusion.

Take that lookup_hash_update() function as a practical case: all the
*common* filesystems will be running with that function basically 100%
serialized per directory, because they'll be doing that

        inode_lock_nested(dir);
        ...
        inode_unlock(dir);

around it all.

At the same time, all that code is supposed to work even *without* the
lock, because once it's an IS_PAR_UPDATE() filesystem, there's
effectively no locking at all. What exclusive directory locks even
remain at that point?

IOW, to me it feels like you are trying to make the code go towards a
future with basically no locking at all as far as the VFS layer is
concerned (because once all the directory modifications take the inode
lock as shared, *all* the inode locking is shared, and is basically a
no-op).

BUT you are doing so while not having most people even *test* that situation.

See what I'm trying to say (but possibly expressing very badly)?
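
A toy model of that point, as a minimal sketch: this is a
single-threaded counter model, not the kernel rwsem, and all the toy_*
names are invented for the illustration. It only counts how many
would-be directory updaters are admitted at the same time when each one
takes the lock exclusively versus shared.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-threaded model of a reader/writer lock; not the kernel rwsem. */
struct toy_rwsem {
	int readers;	/* number of shared holders */
	bool writer;	/* true while held exclusively */
};

/* Exclusive mode: only succeeds when nobody else holds the lock. */
static bool toy_trylock_excl(struct toy_rwsem *s)
{
	if (s->writer || s->readers > 0)
		return false;
	s->writer = true;
	return true;
}

/* Shared mode: succeeds alongside any number of other shared holders. */
static bool toy_trylock_shared(struct toy_rwsem *s)
{
	if (s->writer)
		return false;
	s->readers++;
	return true;
}

/*
 * How many updaters get into the directory at once if each of them
 * takes the lock in the given mode and nobody has unlocked yet?
 */
static int toy_concurrent_updaters(bool shared, int nr_tasks)
{
	struct toy_rwsem dir = { 0, false };
	int admitted = 0;

	assert(nr_tasks >= 0);
	for (int i = 0; i < nr_tasks; i++) {
		if (shared ? toy_trylock_shared(&dir) : toy_trylock_excl(&dir))
			admitted++;
	}
	return admitted;
}
```

With exclusive locking only one of the tasks is ever inside, which is
the case nearly everybody would keep running; with shared locking all
of them are inside together, so the lock provides no exclusion between
updaters and the code under it has to cope on its own.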

So I feel like if the VFS code cannot rely on locking *anyway* in the
general case, and should work without it, then we really shouldn't
have any locking around any of the VFS operations.

The logical conclusion of that would be to push it all down into the
filesystem (possibly with the help of a coccinelle script).

Now it doesn't have to go that far - at least not initially - but I do
think we should at least make sure that as much as possible of the
actual VFS code sees that "new world order" of no directory locking,
so that that situation gets *tested* as widely as possible.

> That is not a small change.

Now, that I agree with. I guess we won't get there soon (or ever). But
see above what I dislike about the directory locking model change.

> It might be nice to take a shared lock in VFS, and let the FS upgrade it
> to exclusive if needed, but we don't have upgrade_read() ...  maybe it
> would be deadlock-prone.

Yes, upgrading a read lock is fundamentally impossible and will
deadlock trivially (think just two readers that both want to do the
upgrade - they'll block each other from doing so).

So it's not actually a possible operation.
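
To make the two-reader case concrete, here is a minimal sketch. It is a
toy model with made-up toy_* names, not a real lock implementation, and
it assumes the usual upgrade rule: a reader can become the writer only
once it is the sole remaining holder, and it does not drop its read
lock while waiting.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: a lock currently held in shared mode by some readers. */
struct toy_shared_lock {
	int readers;
};

/* A reader may become the writer only once it is the sole holder. */
static bool toy_try_upgrade(const struct toy_shared_lock *l)
{
	assert(l->readers >= 1);
	return l->readers == 1;
}

/*
 * Readers A and B both hold the lock shared and both want to upgrade
 * without dropping it. Returns true if either one makes progress
 * within max_rounds attempts.
 */
static bool toy_two_upgraders_progress(int max_rounds)
{
	struct toy_shared_lock l = { .readers = 2 };

	for (int i = 0; i < max_rounds; i++) {
		if (toy_try_upgrade(&l))	/* A's attempt: B still holds it */
			return true;
		if (toy_try_upgrade(&l))	/* B's attempt: A still holds it */
			return true;
	}
	return false;	/* neither ever gets in: deadlock */
}
```

Neither reader can proceed until the other lets go, and neither lets go
because it is waiting to upgrade, so the function returns false no
matter how many rounds it is given.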

                    Linus