linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "NeilBrown" <neilb@suse.de>
To: "Linus Torvalds" <torvalds@linux-foundation.org>
Cc: "Al Viro" <viro@zeniv.linux.org.uk>,
	"Daire Byrne" <daire@dneg.com>,
	"Trond Myklebust" <trond.myklebust@hammerspace.com>,
	"Chuck Lever" <chuck.lever@oracle.com>,
	"Linux NFS Mailing List" <linux-nfs@vger.kernel.org>,
	linux-fsdevel@vger.kernel.org,
	"LKML" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 01/10] VFS: support parallel updates in the one directory.
Date: Sat, 27 Aug 2022 09:06:51 +1000	[thread overview]
Message-ID: <166155521174.27490.456427475820966571@noble.neil.brown.name> (raw)
In-Reply-To: <CAHk-=wi_wwTxPTnFXsG8zdaem5YDnSd4OsCeP78yJgueQCb-1g@mail.gmail.com>

On Sat, 27 Aug 2022, Linus Torvalds wrote:
> On Thu, Aug 25, 2022 at 7:16 PM NeilBrown <neilb@suse.de> wrote:
> >
> > If a filesystem supports parallel modification in directories, it sets
> > FS_PAR_DIR_UPDATE on the file_system_type.  lookup_open() and the new
> > lookup_hash_update() notice the flag and take a shared lock on the
> > directory, and rely on a lock-bit in d_flags, much like parallel lookup
> > relies on DCACHE_PAR_LOOKUP.
> 
> Ugh.

Thanks :-) - no, really - thanks for the high-level review!

> 
> I absolutely believe in the DCACHE_PAR_LOOKUP model, and in "parallel
> updates" being important, but I *despise* locking code like this
> 
> +       if (wq && IS_PAR_UPDATE(dir))
> +               inode_lock_shared_nested(dir, I_MUTEX_PARENT);
> +       else
> +               inode_lock_nested(dir, I_MUTEX_PARENT);
> 
> and I really really hope there's some better model for this.
> 
> That "wq" test in particular is just absolutely disgusting. So now it
> doesn't just depend on whether the directory supports parallel
> updates, now the caller can choose whether to do the parallel thing or
> not, and that's how "create" is different from "rename".

As you note, by the end of the series "create" is not more different
from "rename" than it already is.  I only broke up the patches to make
review more manageable.

The "wq" can be removed.  There are two options.
One is to change every kern_path_create() or user_path_create() caller
to passed in a wq.  Then we can assume that a wq is always available.
There are about a dozen of these calls, so not an enormous change, but
one that I didn't want to think about just yet.  I could add a patch at
the front of the series which did this.

Alternate option is to never pass in a wq for create operation, and use
var_waitqueue() (or something similar) to provide a global shared wait
queue (which is essentially what I am using to wait for
DCACHE_PAR_UPDATE to clear).
The more I think about it, the more I think this is the best way
forward.   Maybe we'll want to increase WAIT_TABLE_BITS ... I wonder how
to measure how much contention there is on these shared queues.

> 
> And that last difference is, I think, the one that makes me go "No. HELL NO".
> 
> Instead of it being up to the filesystem to say "I can do parallel
> creates, but I need to serialize renames", this whole thing has been
> set up to be about the caller making that decision.

I think that is a misunderstanding.  The caller isn't making a decision
- except the IS_PAR_UPDATE() test which is simply acting on the fs
request.  What you are seeing is a misguided attempt to leave in place
some existing interfaces which assumed exclusive locking and didn't
provide wqs.

> 
> That's just feels horribly horribly wrong.
> 
> Yes, I realize that to you that's just a "later patches will do
> renames", but what if it really is a filesystem issue where the
> filesystem can easily handle new names, but needs something else for
> renames because it has various cross-directory issues, perhaps?

Obviously a filesystem can add its own locking - and they will have to,
though at a finer grain that the VFS can do.


> 
> So I feel this is fundamentally wrong, and this ugliness is a small
> effect of that wrongness.
> 
> I think we should strive for
> 
>  (a) make that 'wq' and DCACHE_PAR_LOOKUP bit be unconditional

Agreed (in an earlier version DCACHE_PAR_LOOKUP was optional, but I
realised that you wouldn't like that :-)

> 
>  (b) aim for the inode lock being taken *after* the _lookup_hash(),
> since the VFS layer side has to be able to handle the concurrency on
> the dcache side anyway

I think you are suggesting that we change ->lookup call to NOT
require i_rwsem be held.  That is not a small change.
I agree that it makes sense in the long term.  Getting there ....  won't
be a quick as I'd hoped.

> 
>  (c) at that point, the serialization really ends up being about the
> call into the filesystem, and aim to simply move the
> inode_lock{_shared]_nested() into the filesystem so that there's no
> need for a flag and related conditional locking at all.

It might be nice to take a shared lock in VFS, and let the FS upgrade it
to exclusive if needed, but we don't have upgrade_read() ...  maybe it
would be deadlock-prone.

> 
> Because right now I think the main reason we cannot move the lock into
> the filesystem is literally that we've made the lock cover not just
> the filesystem part, but the "lookup and create dentry" part too.
> 
> But once you have that "DCACHE_PAR_LOOKUP" bit and the
> d_alloc_parallel() logic to serialize a _particular_ dentry being
> created (as opposed to serializing all the sleeping ops to that
> directly), I really think we should strive to move the locking - that
> no longer helps the VFS dcache layer - closer to the filesystem call
> and eventually into it.
> 
> Please? I think these patches are "mostly going in the right
> direction", but at the same time I feel like there's some serious
> mis-design going on.

Hmmm....  I'll dig more deeply into ->lookup and see if I can understand
the locking well enough to feel safe removing i_rwsem from it.

Thanks,
NeilBrown

  reply	other threads:[~2022-08-26 23:07 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-26  2:10 [PATCH/RFC 00/10 v5] Improve scalability of directory operations NeilBrown
2022-08-26  2:10 ` [PATCH 09/10] VFS: add LOOKUP_SILLY_RENAME NeilBrown
2022-08-27  1:21   ` Al Viro
2022-08-29  3:15     ` NeilBrown
2022-08-26  2:10 ` [PATCH 01/10] VFS: support parallel updates in the one directory NeilBrown
2022-08-26 19:06   ` Linus Torvalds
2022-08-26 23:06     ` NeilBrown [this message]
2022-08-27  0:13       ` Linus Torvalds
2022-08-27  0:23         ` Al Viro
2022-08-27 21:14         ` Al Viro
2022-08-27  0:17     ` Al Viro
2022-09-01  0:31       ` NeilBrown
2022-09-01  3:44         ` Al Viro
2022-08-27  3:43   ` Al Viro
2022-08-29  1:59     ` NeilBrown
2022-09-03  0:06       ` Al Viro
2022-09-03  1:40         ` NeilBrown
2022-09-03  2:12           ` Al Viro
2022-09-03 17:52             ` Al Viro
2022-09-04 23:33               ` NeilBrown
2022-08-26  2:10 ` [PATCH 08/10] NFSD: allow parallel creates from nfsd NeilBrown
2022-08-27  4:37   ` Al Viro
2022-08-29  3:12     ` NeilBrown
2022-08-26  2:10 ` [PATCH 05/10] VFS: export done_path_update() NeilBrown
2022-08-26  2:10 ` [PATCH 02/10] VFS: move EEXIST and ENOENT tests into lookup_hash_update() NeilBrown
2022-08-26  2:10 ` [PATCH 06/10] VFS: support concurrent renames NeilBrown
2022-08-27  4:12   ` Al Viro
2022-08-29  3:08     ` NeilBrown
2022-08-26  2:10 ` [PATCH 10/10] NFS: support parallel updates in the one directory NeilBrown
2022-08-26 15:31   ` John Stoffel
2022-08-26 23:13     ` NeilBrown
2022-08-26  2:10 ` [PATCH 03/10] VFS: move want_write checks into lookup_hash_update() NeilBrown
2022-08-27  3:48   ` Al Viro
2022-08-26  2:10 ` [PATCH 04/10] VFS: move dput() and mnt_drop_write() into done_path_update() NeilBrown
2022-08-26  2:10 ` [PATCH 07/10] VFS: hold DCACHE_PAR_UPDATE lock across d_revalidate() NeilBrown
2022-08-26 14:42 ` [PATCH/RFC 00/10 v5] Improve scalability of directory operations John Stoffel
2022-08-26 23:30   ` NeilBrown

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=166155521174.27490.456427475820966571@noble.neil.brown.name \
    --to=neilb@suse.de \
    --cc=chuck.lever@oracle.com \
    --cc=daire@dneg.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    --cc=trond.myklebust@hammerspace.com \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).