All of lore.kernel.org
 help / color / mirror / Atom feed
From: NeilBrown <neilb@suse.com>
To: Jonathan Corbet <corbet@lwn.net>,
	Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	LKML <linux-kernel@vger.kernel.org>
Subject: [PATCH] Documentation: update path-lookup.md for parallel lookups
Date: Mon, 19 Nov 2018 11:55:46 +1100	[thread overview]
Message-ID: <87k1l9dgx9.fsf@notabene.neil.brown.name> (raw)

[-- Attachment #1: Type: text/plain, Size: 9611 bytes --]


Since this document was written, i_mutex has been replace with
i_rwsem, and shared locks are utilized to allow lookups in the one
directory to happen in parallel.

So replace i_mutex with i_rwsem, and explain how this is used for
parallel lookups.

Signed-off-by: NeilBrown <neilb@suse.com>
---

Just FYI

$ git grep -w i_mutex | wc -l
262

it was 276 before this patch... so a small improvement.

NeilBrown


 Documentation/filesystems/path-lookup.md | 85 ++++++++++++++++++------
 1 file changed, 66 insertions(+), 19 deletions(-)

diff --git a/Documentation/filesystems/path-lookup.md b/Documentation/filesystems/path-lookup.md
index e2edd45c4bc0..06151b178f80 100644
--- a/Documentation/filesystems/path-lookup.md
+++ b/Documentation/filesystems/path-lookup.md
@@ -12,6 +12,10 @@ This write-up is based on three articles published at lwn.net:
 - <https://lwn.net/Articles/650786/> A walk among the symlinks
 
 Written by Neil Brown with help from Al Viro and Jon Corbet.
+It has subsequently been updated to reflect changes in the kernel
+including:
+
+- per-directory parallel name lookup.
 
 Introduction
 ------------
@@ -231,37 +235,80 @@ renamed.  If `d_lookup` finds that a rename happened while it
 unsuccessfully scanned a chain in the hash table, it simply tries
 again.
 
-### inode->i_mutex ###
+### inode->i_rwsem ###
 
-`i_mutex` is a mutex that serializes all changes to a particular
+`i_rwsem` is a read/write semaphore that serializes all changes to a particular
 directory.  This ensures that, for example, an `unlink()` and a `rename()`
 cannot both happen at the same time.  It also keeps the directory
 stable while the filesystem is asked to look up a name that is not
-currently in the dcache.
+currently in the dcache or, optionally, when the list of entries in a
+directory is being retrieved with `readdir()`.
 
-This has a complementary role to that of `d_lock`: `i_mutex` on a
+This has a complementary role to that of `d_lock`: `i_rwsem` on a
 directory protects all of the names in that directory, while `d_lock`
 on a name protects just one name in a directory.  Most changes to the
-dcache hold `i_mutex` on the relevant directory inode and briefly take
+dcache hold `i_rwsem` on the relevant directory inode and briefly take
 `d_lock` on one or more the dentries while the change happens.  One
 exception is when idle dentries are removed from the dcache due to
-memory pressure.  This uses `d_lock`, but `i_mutex` plays no role.
+memory pressure.  This uses `d_lock`, but `i_rwsem` plays no role.
 
-The mutex affects pathname lookup in two distinct ways.  Firstly it
-serializes lookup of a name in a directory.  `walk_component()` uses
+The semaphore affects pathname lookup in two distinct ways.  Firstly it
+prevents changes during lookup of a name in a directory.  `walk_component()` uses
 `lookup_fast()` first which, in turn, checks to see if the name is in the cache,
 using only `d_lock` locking.  If the name isn't found, then `walk_component()`
-falls back to `lookup_slow()` which takes `i_mutex`, checks again that
+falls back to `lookup_slow()` which takes a shared lock on `i_rwsem`, checks again that
 the name isn't in the cache, and then calls in to the filesystem to get a
 definitive answer.  A new dentry will be added to the cache regardless of
 the result.
 
 Secondly, when pathname lookup reaches the final component, it will
-sometimes need to take `i_mutex` before performing the last lookup so
+sometimes need to take an exclusive lock on `i_rwsem` before performing the last lookup so
 that the required exclusion can be achieved.  How path lookup chooses
-to take, or not take, `i_mutex` is one of the
+to take, or not take, `i_rwsem` is one of the
 issues addressed in a subsequent section.
 
+If two threads attempt to look up the same name at the same time - a
+name that is not yet in the dcache - the shared lock on `i_rwsem` will
+not prevent them both adding new dentries with the same name.  As this
+would result in confusion an extra level of interlocking is used,
+based around a secondary hash table (`in_lookup_hashtable`) and a
+per-dentry flag bit (`DCACHE_PAR_LOOKUP`).
+
+To add a new dentry to the cache while only holding a shared lock on
+`i_rwsem`, a thread must call `d_alloc_parallel()`.  This allocates a
+dentry, stores the required name and parent in it, checks if there
+is already a matching dentry in the primary or secondary hash
+tables, and if not, stores the newly allocated dentry in the secondary
+hash table, with `DCACHE_PAR_LOOKUP` set.
+
+If a matching dentry was found in the primary hash table then that is
+returned and the caller can know that it lost a race with some other
+thread adding the entry.  If no matching dentry is found in either
+cache, the newly allocated dentry is returned and the caller can
+detect this from the presence of `DCACHE_PAR_LOOKUP`.  In this case it
+knows that it has won any race and now is responsible for asking the
+filesystem to perform the lookup and find the matching inode.  When
+the lookup is complete, it must call `d_lookup_done()` which clears
+the flag and does some other house keeping, including removing the
+dentry from the secondary hash table - it will normally have been
+added to the primary hash table already.  Note that a `struct
+waitqueue_head` is passed to `d_alloc_parallel()`, and
+`d_lookup_done()` must be called while this `waitqueue_head` is still
+in scope.
+
+If a matching dentry is found in the secondary hash table,
+`d_alloc_parallel()` has a little more work to do. It first waits for
+`DCACHE_PAR_LOOKUP` to be cleared, using a wait_queue that was passed
+to the instance of `d_alloc_parallel()` that won the race and that
+will be woken by the call to `d_lookup_done()`.  It then checks to see
+if the dentry has now been added to the primary hash table.  If it
+has, the dentry is returned and the caller just sees that it lost any
+race.  If it hasn't been added to the primary hash table, the most
+likely explanation is that some other dentry was added instead using
+`d_splice_alias()`.  In any case, `d_alloc_parallel()` repeats all the
+look ups from the start and will normally return something from the
+primary hash table.
+
 ### mnt->mnt_count ###
 
 `mnt_count` is a per-CPU reference counter on "`mount`" structures.
@@ -376,7 +423,7 @@ described.  If it finds a `LAST_NORM` component it first calls
 "`lookup_fast()`" which only looks in the dcache, but will ask the
 filesystem to revalidate the result if it is that sort of filesystem.
 If that doesn't get a good result, it calls "`lookup_slow()`" which
-takes the `i_mutex`, rechecks the cache, and then asks the filesystem
+takes `i_rwsem`, rechecks the cache, and then asks the filesystem
 to find a definitive answer.  Each of these will call
 `follow_managed()` (as described below) to handle any mount points.
 
@@ -408,7 +455,7 @@ of housekeeping around `link_path_walk()` and returns the parent
 directory and final component to the caller.  The caller will be either
 aiming to create a name (via `filename_create()`) or remove or rename
 a name (in which case `user_path_parent()` is used).  They will use
-`i_mutex` to exclude other changes while they validate and then
+`i_rwsem` to exclude other changes while they validate and then
 perform their operation.
 
 `path_lookupat()` is nearly as simple - it is used when an existing
@@ -429,7 +476,7 @@ complexity needed to handle the different subtleties of O_CREAT (with
 or without O_EXCL), final "`/`" characters, and trailing symbolic
 links.  We will revisit this in the final part of this series, which
 focuses on those symbolic links.  "`do_last()`" will sometimes, but
-not always, take `i_mutex`, depending on what it finds.
+not always, take `i_rwsem`, depending on what it finds.
 
 Each of these, or the functions which call them, need to be alert to
 the possibility that the final component is not `LAST_NORM`.  If the
@@ -728,12 +775,12 @@ checking the `seq` number of the old exactly mirrors the process of
 getting a counted reference to the new dentry before dropping that for
 the old dentry which we saw in REF-walk.
 
-### No `inode->i_mutex` or even `rename_lock` ###
+### No `inode->i_rwsem` or even `rename_lock` ###
 
-A mutex is a fairly heavyweight lock that can only be taken when it is
+A semaphore is a fairly heavyweight lock that can only be taken when it is
 permissible to sleep.  As `rcu_read_lock()` forbids sleeping,
-`inode->i_mutex` plays no role in RCU-walk.  If some other thread does
-take `i_mutex` and modifies the directory in a way that RCU-walk needs
+`inode->i_rwsem` plays no role in RCU-walk.  If some other thread does
+take `i_rwsem` and modifies the directory in a way that RCU-walk needs
 to notice, the result will be either that RCU-walk fails to find the
 dentry that it is looking for, or it will find a dentry which
 `read_seqretry()` won't validate.  In either case it will drop down to
@@ -1134,7 +1181,7 @@ and `do_last()`, each of which use the same convention as
 to be followed.
 
 Of these, `do_last()` is the most interesting as it is used for
-opening a file.  Part of `do_last()` runs with `i_mutex` held and this
+opening a file.  Part of `do_last()` runs with `i_rwsem` held and this
 part is in a separate function: `lookup_open()`.
 
 Explaining `do_last()` completely is beyond the scope of this article,
-- 
2.19.0.216.g2d3b1c576c85


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

             reply	other threads:[~2018-11-19  0:56 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-11-19  0:55 NeilBrown [this message]
2018-11-20 16:38 ` [PATCH] Documentation: update path-lookup.md for parallel lookups Jonathan Corbet
2018-12-04 23:02 ` [PATCH 1/2] Documentation: convert path-lookup from markdown to resturctured text NeilBrown
2018-12-04 23:04   ` [PATCH 2/2] Documentation: path-lookup - update externel refs NeilBrown
2018-12-05 19:46   ` [PATCH 1/2] Documentation: convert path-lookup from markdown to resturctured text Jonathan Corbet
2018-12-05 21:00     ` NeilBrown
2018-12-06 17:09       ` Jonathan Corbet
2018-12-09 22:58         ` NeilBrown
2018-12-20 15:54           ` Jonathan Corbet
2018-12-21  0:33             ` NeilBrown

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87k1l9dgx9.fsf@notabene.neil.brown.name \
    --to=neilb@suse.com \
    --cc=corbet@lwn.net \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.