Re: [PATCH 01/10] VFS: support parallel updates in the one directory.

From: "NeilBrown" <neilb@suse.de>
To: "Linus Torvalds" <torvalds@linux-foundation.org>
Cc: "Al Viro" <viro@zeniv.linux.org.uk>,
	"Daire Byrne" <daire@dneg.com>,
	"Trond Myklebust" <trond.myklebust@hammerspace.com>,
	"Chuck Lever" <chuck.lever@oracle.com>,
	"Linux NFS Mailing List" <linux-nfs@vger.kernel.org>,
	linux-fsdevel@vger.kernel.org,
	"LKML" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 01/10] VFS: support parallel updates in the one directory.
Date: Sat, 27 Aug 2022 09:06:51 +1000	[thread overview]
Message-ID: <166155521174.27490.456427475820966571@noble.neil.brown.name> (raw)
In-Reply-To: <CAHk-=wi_wwTxPTnFXsG8zdaem5YDnSd4OsCeP78yJgueQCb-1g@mail.gmail.com>

On Sat, 27 Aug 2022, Linus Torvalds wrote:
> On Thu, Aug 25, 2022 at 7:16 PM NeilBrown <neilb@suse.de> wrote:
> >
> > If a filesystem supports parallel modification in directories, it sets
> > FS_PAR_DIR_UPDATE on the file_system_type.  lookup_open() and the new
> > lookup_hash_update() notice the flag and take a shared lock on the
> > directory, and rely on a lock-bit in d_flags, much like parallel lookup
> > relies on DCACHE_PAR_LOOKUP.
> 
> Ugh.

Thanks :-) - no, really - thanks for the high-level review!

> 
> I absolutely believe in the DCACHE_PAR_LOOKUP model, and in "parallel
> updates" being important, but I *despise* locking code like this
> 
> +       if (wq && IS_PAR_UPDATE(dir))
> +               inode_lock_shared_nested(dir, I_MUTEX_PARENT);
> +       else
> +               inode_lock_nested(dir, I_MUTEX_PARENT);
> 
> and I really really hope there's some better model for this.
> 
> That "wq" test in particular is just absolutely disgusting. So now it
> doesn't just depend on whether the directory supports parallel
> updates, now the caller can choose whether to do the parallel thing or
> not, and that's how "create" is different from "rename".

As you note, by the end of the series "create" is not more different
from "rename" than it already is.  I only broke up the patches to make
review more manageable.

The "wq" can be removed.  There are two options.
One is to change every kern_path_create() or user_path_create() caller
to passed in a wq.  Then we can assume that a wq is always available.
There are about a dozen of these calls, so not an enormous change, but
one that I didn't want to think about just yet.  I could add a patch at
the front of the series which did this.

Alternate option is to never pass in a wq for create operation, and use
var_waitqueue() (or something similar) to provide a global shared wait
queue (which is essentially what I am using to wait for
DCACHE_PAR_UPDATE to clear).
The more I think about it, the more I think this is the best way
forward.   Maybe we'll want to increase WAIT_TABLE_BITS ... I wonder how
to measure how much contention there is on these shared queues.

> 
> And that last difference is, I think, the one that makes me go "No. HELL NO".
> 
> Instead of it being up to the filesystem to say "I can do parallel
> creates, but I need to serialize renames", this whole thing has been
> set up to be about the caller making that decision.

I think that is a misunderstanding.  The caller isn't making a decision
- except the IS_PAR_UPDATE() test which is simply acting on the fs
request.  What you are seeing is a misguided attempt to leave in place
some existing interfaces which assumed exclusive locking and didn't
provide wqs.

> 
> That's just feels horribly horribly wrong.
> 
> Yes, I realize that to you that's just a "later patches will do
> renames", but what if it really is a filesystem issue where the
> filesystem can easily handle new names, but needs something else for
> renames because it has various cross-directory issues, perhaps?

Obviously a filesystem can add its own locking - and they will have to,
though at a finer grain that the VFS can do.

> 
> So I feel this is fundamentally wrong, and this ugliness is a small
> effect of that wrongness.
> 
> I think we should strive for
> 
>  (a) make that 'wq' and DCACHE_PAR_LOOKUP bit be unconditional

Agreed (in an earlier version DCACHE_PAR_LOOKUP was optional, but I
realised that you wouldn't like that :-)

> 
>  (b) aim for the inode lock being taken *after* the _lookup_hash(),
> since the VFS layer side has to be able to handle the concurrency on
> the dcache side anyway

I think you are suggesting that we change ->lookup call to NOT
require i_rwsem be held.  That is not a small change.
I agree that it makes sense in the long term.  Getting there ....  won't
be a quick as I'd hoped.

> 
>  (c) at that point, the serialization really ends up being about the
> call into the filesystem, and aim to simply move the
> inode_lock{_shared]_nested() into the filesystem so that there's no
> need for a flag and related conditional locking at all.

It might be nice to take a shared lock in VFS, and let the FS upgrade it
to exclusive if needed, but we don't have upgrade_read() ...  maybe it
would be deadlock-prone.

> 
> Because right now I think the main reason we cannot move the lock into
> the filesystem is literally that we've made the lock cover not just
> the filesystem part, but the "lookup and create dentry" part too.
> 
> But once you have that "DCACHE_PAR_LOOKUP" bit and the
> d_alloc_parallel() logic to serialize a _particular_ dentry being
> created (as opposed to serializing all the sleeping ops to that
> directly), I really think we should strive to move the locking - that
> no longer helps the VFS dcache layer - closer to the filesystem call
> and eventually into it.
> 
> Please? I think these patches are "mostly going in the right
> direction", but at the same time I feel like there's some serious
> mis-design going on.

Hmmm....  I'll dig more deeply into ->lookup and see if I can understand
the locking well enough to feel safe removing i_rwsem from it.

Thanks,
NeilBrown