linux-fsdevel.vger.kernel.org archive mirror
* [PATCH 3/3] VFS: close race between getcwd() and d_move()
  2017-11-09  3:22 [PATCH 0/3] Three VFS patch resends NeilBrown
  2017-11-09  3:22 ` [PATCH 1/3] VFS: use synchronize_rcu_expedited() in namespace_unlock() NeilBrown
  2017-11-09  3:22 ` [PATCH 2/3] Improve fairness when locking the per-superblock s_anon list NeilBrown
@ 2017-11-09  3:22 ` NeilBrown
  2017-11-09 11:41   ` Nikolay Borisov
  2017-11-09 20:23   ` Linus Torvalds
  2 siblings, 2 replies; 21+ messages in thread
From: NeilBrown @ 2017-11-09  3:22 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Linus Torvalds, linux-kernel

d_move() will call __d_drop() and then __d_rehash()
on the dentry being moved.  This creates a small window
when the dentry appears to be unhashed.  Many tests
of d_unhashed() are made under ->d_lock and so are safe
from racing with this window, but some aren't.
In particular, getcwd() calls d_unlinked() (which calls
d_unhashed()) without d_lock protection, so it can race.
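
For reference, the lockless check boils down to this (paraphrased from
include/linux/dcache.h; simplified, and details may differ between
kernel versions):

  static inline int d_unhashed(const struct dentry *dentry)
  {
  	/* true when d_hash.pprev == NULL, which __d_drop() sets */
  	return hlist_bl_unhashed(&dentry->d_hash);
  }

  static inline int d_unlinked(const struct dentry *dentry)
  {
  	/* getcwd() calls this without holding d_lock */
  	return d_unhashed(dentry) && !IS_ROOT(dentry);
  }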

This race has been seen in practice with lustre, which uses d_move() as
part of name lookup.  See:
   https://jira.hpdd.intel.com/browse/LU-9735
It could race with a regular rename(), and result in ENOENT instead
of either the 'before' or 'after' name.

The race can be demonstrated with a simple program which
has two threads, one renaming a directory back and forth
while another calls getcwd() within that directory: it should never
fail, but does.  See:
  https://patchwork.kernel.org/patch/9455345/
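
The heart of such a reproducer looks roughly like this (a minimal
sketch, not the exact program from the link above; error handling
omitted, paths made up, build with -pthread):

  #include <limits.h>
  #include <pthread.h>
  #include <stdio.h>
  #include <sys/stat.h>
  #include <unistd.h>

  static void *renamer(void *arg)
  {
  	for (;;) {
  		/* rename the directory back and forth, forever */
  		rename("/tmp/a", "/tmp/b");
  		rename("/tmp/b", "/tmp/a");
  	}
  	return NULL;
  }

  int main(void)
  {
  	pthread_t t;
  	char buf[PATH_MAX];

  	mkdir("/tmp/a", 0755);
  	chdir("/tmp/a");	/* the cwd always has a valid name ... */
  	pthread_create(&t, NULL, renamer, NULL);
  	for (;;) {
  		if (!getcwd(buf, sizeof(buf))) {
  			/* ... so this should never fail, but it does */
  			perror("getcwd");
  			return 1;
  		}
  	}
  }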

We could fix this race by taking d_lock and rechecking when
d_unhashed() reports true.  Alternatively we can remove the window,
which is the approach this patch takes.

When __d_drop and __d_rehash are used to move a dentry, an extra
flag is passed which prevents d_hash.pprev from being cleared or
tested.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 fs/dcache.c |   31 ++++++++++++++++++++-----------
 1 file changed, 20 insertions(+), 11 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index d5952306206b..3130d62f29c9 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -471,8 +471,11 @@ static void dentry_lru_add(struct dentry *dentry)
  * reason (NFS timeouts or autofs deletes).
  *
  * __d_drop requires dentry->d_lock.
+ * ___d_drop takes an extra @moving argument.
+ * If true, d_hash.pprev is not cleared, so there is no transient d_unhashed()
+ * state.
  */
-void __d_drop(struct dentry *dentry)
+static inline void ___d_drop(struct dentry *dentry, bool moving)
 {
 	if (!d_unhashed(dentry)) {
 		struct hlist_bl_head *b;
@@ -493,12 +496,18 @@ void __d_drop(struct dentry *dentry)
 		} else
 			hlist_bl_lock(b);
 		__hlist_bl_del(&dentry->d_hash);
-		dentry->d_hash.pprev = NULL;
+		if (likely(!moving))
+			dentry->d_hash.pprev = NULL;
 		hlist_bl_unlock(b);
 		/* After this call, in-progress rcu-walk path lookup will fail. */
 		write_seqcount_invalidate(&dentry->d_seq);
 	}
 }
+
+void __d_drop(struct dentry *dentry)
+{
+	___d_drop(dentry, false);
+}
 EXPORT_SYMBOL(__d_drop);
 
 void d_drop(struct dentry *dentry)
@@ -2387,10 +2396,10 @@ void d_delete(struct dentry * dentry)
 }
 EXPORT_SYMBOL(d_delete);
 
-static void __d_rehash(struct dentry *entry)
+static void __d_rehash(struct dentry *entry, bool moving)
 {
 	struct hlist_bl_head *b = d_hash(entry->d_name.hash);
-	BUG_ON(!d_unhashed(entry));
+	BUG_ON(!moving && !d_unhashed(entry));
 	hlist_bl_lock(b);
 	hlist_bl_add_head_rcu(&entry->d_hash, b);
 	hlist_bl_unlock(b);
@@ -2406,7 +2415,7 @@ static void __d_rehash(struct dentry *entry)
 void d_rehash(struct dentry * entry)
 {
 	spin_lock(&entry->d_lock);
-	__d_rehash(entry);
+	__d_rehash(entry, false);
 	spin_unlock(&entry->d_lock);
 }
 EXPORT_SYMBOL(d_rehash);
@@ -2580,7 +2589,7 @@ static inline void __d_add(struct dentry *dentry, struct inode *inode)
 		raw_write_seqcount_end(&dentry->d_seq);
 		fsnotify_update_flags(dentry);
 	}
-	__d_rehash(dentry);
+	__d_rehash(dentry, false);
 	if (dir)
 		end_dir_add(dir, n);
 	spin_unlock(&dentry->d_lock);
@@ -2642,7 +2651,7 @@ struct dentry *d_exact_alias(struct dentry *entry, struct inode *inode)
 			alias = NULL;
 		} else {
 			__dget_dlock(alias);
-			__d_rehash(alias);
+			__d_rehash(alias, false);
 			spin_unlock(&alias->d_lock);
 		}
 		spin_unlock(&inode->i_lock);
@@ -2828,8 +2837,8 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
 
 	/* unhash both */
 	/* __d_drop does write_seqcount_barrier, but they're OK to nest. */
-	__d_drop(dentry);
-	__d_drop(target);
+	___d_drop(dentry, true);
+	___d_drop(target, exchange);
 
 	/* Switch the names.. */
 	if (exchange)
@@ -2838,9 +2847,9 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
 		copy_name(dentry, target);
 
 	/* rehash in new place(s) */
-	__d_rehash(dentry);
+	__d_rehash(dentry, true);
 	if (exchange)
-		__d_rehash(target);
+		__d_rehash(target, true);
 
 	/* ... and switch them in the tree */
 	if (IS_ROOT(dentry)) {


* [PATCH 1/3] VFS: use synchronize_rcu_expedited() in namespace_unlock()
  2017-11-09  3:22 [PATCH 0/3] Three VFS patch resends NeilBrown
@ 2017-11-09  3:22 ` NeilBrown
  2017-11-09  3:22 ` [PATCH 2/3] Improve fairness when locking the per-superblock s_anon list NeilBrown
  2017-11-09  3:22 ` [PATCH 3/3] VFS: close race between getcwd() and d_move() NeilBrown
  2 siblings, 0 replies; 21+ messages in thread
From: NeilBrown @ 2017-11-09  3:22 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Linus Torvalds, linux-kernel

The synchronize_rcu() in namespace_unlock() is called every time
a filesystem is unmounted.  If a great many filesystems are mounted,
this can cause a noticeable slow-down in, for example, system shutdown.

The sequence:
  mkdir -p /tmp/Mtest/{0..5000}
  time for i in /tmp/Mtest/*; do mount -t tmpfs tmpfs $i ; done
  time umount /tmp/Mtest/*

on a 4-cpu VM can report 8 seconds to mount the tmpfs filesystems, and
100 seconds to unmount them.

Boot the same VM with 1 CPU and it takes 18 seconds to mount the
tmpfs filesystems, but only 36 seconds to unmount them.

If we change the synchronize_rcu() to synchronize_rcu_expedited(),
the umount time on a 4-cpu VM drops to 0.6 seconds.

I think this roughly 170-fold speed-up is worth the slightly higher system
impact of using synchronize_rcu_expedited().

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> (from general rcu perspective)
Signed-off-by: NeilBrown <neilb@suse.com>
---
 fs/namespace.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index d18deb4c410b..23cdf6c62895 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1420,7 +1420,7 @@ static void namespace_unlock(void)
 	if (likely(hlist_empty(&head)))
 		return;
 
-	synchronize_rcu();
+	synchronize_rcu_expedited();
 
 	group_pin_kill(&head);
 }


* [PATCH 0/3] Three VFS patch resends
@ 2017-11-09  3:22 NeilBrown
  2017-11-09  3:22 ` [PATCH 1/3] VFS: use synchronize_rcu_expedited() in namespace_unlock() NeilBrown
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: NeilBrown @ 2017-11-09  3:22 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Linus Torvalds, linux-kernel

Hi
 these three patches are all completely separate, but are
 all related to the VFS.  I've posted them all previously
 but with, as yet, no response (except an ack from Paul McKenney
 for the RCU one).  One was, admittedly, posted early last year...

 I would appreciate review and, hopefully, acceptance.

Thanks,
NeilBrown


---

NeilBrown (3):
      VFS: use synchronize_rcu_expedited() in namespace_unlock()
      Improve fairness when locking the per-superblock s_anon list
      VFS: close race between getcwd() and d_move()


 fs/dcache.c        |   42 ++++++++++++++++++++++++++++++------------
 fs/namespace.c     |    2 +-
 fs/super.c         |    1 +
 include/linux/fs.h |    1 +
 4 files changed, 33 insertions(+), 13 deletions(-)

--
Signature


* [PATCH 2/3] Improve fairness when locking the per-superblock s_anon list
  2017-11-09  3:22 [PATCH 0/3] Three VFS patch resends NeilBrown
  2017-11-09  3:22 ` [PATCH 1/3] VFS: use synchronize_rcu_expedited() in namespace_unlock() NeilBrown
@ 2017-11-09  3:22 ` NeilBrown
  2017-11-09 19:52   ` Linus Torvalds
  2017-11-09  3:22 ` [PATCH 3/3] VFS: close race between getcwd() and d_move() NeilBrown
  2 siblings, 1 reply; 21+ messages in thread
From: NeilBrown @ 2017-11-09  3:22 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Linus Torvalds, linux-kernel

Bit-spin-locks, as used for dcache hash chains, are not fair.
This is not a problem for the dcache hash table, as different CPUs are
likely to access different entries in the hash table, so high contention
is not expected.
However, anonymous dentries (created by NFSD) all live on a single hash
chain, "s_anon", and the bitlock on this can be highly contended, resulting
in soft-lockup warnings.
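
(For context: hlist_bl_lock() is bit_spin_lock(0, ...) on the chain
head.  Stripped of the preemption handling it is just an unqueued
test-and-set loop, roughly:

  	while (test_and_set_bit_lock(0, addr))
  		while (test_bit(0, addr))
  			cpu_relax();	/* no queue: no fairness */

so waiters are not served in any order and one CPU can keep losing the
race indefinitely.  A ticket or queued spinlock, by contrast, grants the
lock in arrival order.)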

So introduce a per-sb (fair) spinlock and take it before grabbing the
bitlock on s_anon.  This provides fairness and makes the warnings go away.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 fs/dcache.c        |   11 ++++++++++-
 fs/super.c         |    1 +
 include/linux/fs.h |    1 +
 3 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index f90141387f01..d5952306206b 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -51,6 +51,8 @@
  *   - the dcache hash table
  * s_anon bl list spinlock protects:
  *   - the s_anon list (see __d_drop)
+ * dentry->d_sb->s_anon_lock protects:
+ *   - the s_anon bl bitlock, ensuring fairness.
  * dentry->d_sb->s_dentry_lru_lock protects:
  *   - the dcache lru lists and counters
  * d_lock protects:
@@ -484,7 +486,12 @@ void __d_drop(struct dentry *dentry)
 		else
 			b = d_hash(dentry->d_name.hash);
 
-		hlist_bl_lock(b);
+		if (b == &dentry->d_sb->s_anon) {
+			spin_lock(&dentry->d_sb->s_anon_lock);
+			hlist_bl_lock(b);
+			spin_unlock(&dentry->d_sb->s_anon_lock);
+		} else
+			hlist_bl_lock(b);
 		__hlist_bl_del(&dentry->d_hash);
 		dentry->d_hash.pprev = NULL;
 		hlist_bl_unlock(b);
@@ -1965,7 +1972,9 @@ static struct dentry *__d_obtain_alias(struct inode *inode, int disconnected)
 	spin_lock(&tmp->d_lock);
 	__d_set_inode_and_type(tmp, inode, add_flags);
 	hlist_add_head(&tmp->d_u.d_alias, &inode->i_dentry);
+	spin_lock(&tmp->d_sb->s_anon_lock);
 	hlist_bl_lock(&tmp->d_sb->s_anon);
+	spin_unlock(&tmp->d_sb->s_anon_lock);
 	hlist_bl_add_head(&tmp->d_hash, &tmp->d_sb->s_anon);
 	hlist_bl_unlock(&tmp->d_sb->s_anon);
 	spin_unlock(&tmp->d_lock);
diff --git a/fs/super.c b/fs/super.c
index 994db21f59bf..af644ae93445 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -209,6 +209,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
 	if (s->s_user_ns != &init_user_ns)
 		s->s_iflags |= SB_I_NODEV;
 	INIT_HLIST_NODE(&s->s_instances);
+	spin_lock_init(&s->s_anon_lock);
 	INIT_HLIST_BL_HEAD(&s->s_anon);
 	mutex_init(&s->s_sync_lock);
 	INIT_LIST_HEAD(&s->s_inodes);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 885266aae2d7..9df9bace53fb 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1359,6 +1359,7 @@ struct super_block {
 
 	const struct fscrypt_operations	*s_cop;
 
+	spinlock_t		s_anon_lock;	/* needed for fairness */
 	struct hlist_bl_head	s_anon;		/* anonymous dentries for (nfs) exporting */
 	struct list_head	s_mounts;	/* list of mounts; _not_ for fs use */
 	struct block_device	*s_bdev;


* Re: [PATCH 3/3] VFS: close race between getcwd() and d_move()
  2017-11-09  3:22 ` [PATCH 3/3] VFS: close race between getcwd() and d_move() NeilBrown
@ 2017-11-09 11:41   ` Nikolay Borisov
  2017-11-09 13:08     ` Matthew Wilcox
  2017-11-09 20:23   ` Linus Torvalds
  1 sibling, 1 reply; 21+ messages in thread
From: Nikolay Borisov @ 2017-11-09 11:41 UTC (permalink / raw)
  To: NeilBrown, Al Viro; +Cc: linux-fsdevel, Linus Torvalds, linux-kernel



On  9.11.2017 05:22, NeilBrown wrote:
> d_move() will call __d_drop() and then __d_rehash()
> on the dentry being moved.  This creates a small window
> when the dentry appears to be unhashed.  Many tests
> of d_unhashed() are made under ->d_lock and so are safe
> from racing with this window, but some aren't.
> In particular, getcwd() calls d_unlinked() (which calls
> d_unhashed()) without d_lock protection, so it can race.
> 
> This race has been seen in practice with lustre, which uses d_move() as
> part of name lookup.  See:
>    https://jira.hpdd.intel.com/browse/LU-9735
> It could race with a regular rename(), and result in ENOENT instead
> of either the 'before' or 'after' name.
> 
> The race can be demonstrated with a simple program which
> has two threads, one renaming a directory back and forth
> while another calls getcwd() within that directory: it should never
> fail, but does.  See:
>   https://patchwork.kernel.org/patch/9455345/
> 
> We could fix this race by taking d_lock and rechecking when
> d_unhashed() reports true.  Alternatively we can remove the window,
> which is the approach this patch takes.
> 
> When __d_drop and __d_rehash are used to move a dentry, an extra
> flag is passed which prevents d_hash.pprev from being cleared or
> tested.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  fs/dcache.c |   31 ++++++++++++++++++++-----------
>  1 file changed, 20 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index d5952306206b..3130d62f29c9 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -471,8 +471,11 @@ static void dentry_lru_add(struct dentry *dentry)
>   * reason (NFS timeouts or autofs deletes).
>   *
>   * __d_drop requires dentry->d_lock.
> + * ___d_drop takes an extra @moving argument.
> + * If true, d_hash.pprev is not cleared, so there is no transient d_unhashed()
> + * state.
>   */
> -void __d_drop(struct dentry *dentry)
> +static inline void ___d_drop(struct dentry *dentry, bool moving)
>  {
>  	if (!d_unhashed(dentry)) {
>  		struct hlist_bl_head *b;
> @@ -493,12 +496,18 @@ void __d_drop(struct dentry *dentry)
>  		} else
>  			hlist_bl_lock(b);
>  		__hlist_bl_del(&dentry->d_hash);
> -		dentry->d_hash.pprev = NULL;
> +		if (likely(!moving))
> +			dentry->d_hash.pprev = NULL;

nit: wouldn't it be a bit more explicit as if (unlikely(moving))?  I
suspect the end result is the same; however, it's easy to miss the !.
It's not a big deal, but just wondering.


* Re: [PATCH 3/3] VFS: close race between getcwd() and d_move()
  2017-11-09 11:41   ` Nikolay Borisov
@ 2017-11-09 13:08     ` Matthew Wilcox
  2017-11-09 16:02       ` Nikolay Borisov
  0 siblings, 1 reply; 21+ messages in thread
From: Matthew Wilcox @ 2017-11-09 13:08 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: NeilBrown, Al Viro, linux-fsdevel, Linus Torvalds, linux-kernel

On Thu, Nov 09, 2017 at 01:41:24PM +0200, Nikolay Borisov wrote:
> On  9.11.2017 05:22, NeilBrown wrote:
> > @@ -493,12 +496,18 @@ void __d_drop(struct dentry *dentry)
> >  		} else
> >  			hlist_bl_lock(b);
> >  		__hlist_bl_del(&dentry->d_hash);
> > -		dentry->d_hash.pprev = NULL;
> > +		if (likely(!moving))
> > +			dentry->d_hash.pprev = NULL;
> 
> nit: wouldn't it be a bit more explicit as if (unlikely(moving))?  I
> suspect the end result is the same; however, it's easy to miss the !.
> It's not a big deal, but just wondering.

umm ... you just suggested the exact opposite of what the patch is
intended to do.  likely()/unlikely() only hint to the compiler the
probabilities of the branch; they don't change the meaning of the
condition.


* Re: [PATCH 3/3] VFS: close race between getcwd() and d_move()
  2017-11-09 13:08     ` Matthew Wilcox
@ 2017-11-09 16:02       ` Nikolay Borisov
  0 siblings, 0 replies; 21+ messages in thread
From: Nikolay Borisov @ 2017-11-09 16:02 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: NeilBrown, Al Viro, linux-fsdevel, Linus Torvalds, linux-kernel



On  9.11.2017 15:08, Matthew Wilcox wrote:
> On Thu, Nov 09, 2017 at 01:41:24PM +0200, Nikolay Borisov wrote:
>> On  9.11.2017 05:22, NeilBrown wrote:
>>> @@ -493,12 +496,18 @@ void __d_drop(struct dentry *dentry)
>>>  		} else
>>>  			hlist_bl_lock(b);
>>>  		__hlist_bl_del(&dentry->d_hash);
>>> -		dentry->d_hash.pprev = NULL;
>>> +		if (likely(!moving))
>>> +			dentry->d_hash.pprev = NULL;
>>
>> nit: wouldn't it be a bit more explicit as if (unlikely(moving))?  I
>> suspect the end result is the same; however, it's easy to miss the !.
>> It's not a big deal, but just wondering.
> 
> umm ... you just suggested the exact opposite of what the patch is
> intended to do.  likely()/unlikely() only hint to the compiler the
> probabilities of the branch; they don't change the meaning of the
> condition.

brainfart, disregard my comment doh....

> 
> 


* Re: [PATCH 2/3] Improve fairness when locking the per-superblock s_anon list
  2017-11-09  3:22 ` [PATCH 2/3] Improve fairness when locking the per-superblock s_anon list NeilBrown
@ 2017-11-09 19:52   ` Linus Torvalds
  2017-11-09 20:50     ` Al Viro
  2017-11-10  8:50     ` Christoph Hellwig
  0 siblings, 2 replies; 21+ messages in thread
From: Linus Torvalds @ 2017-11-09 19:52 UTC (permalink / raw)
  To: NeilBrown; +Cc: Al Viro, linux-fsdevel, Linux Kernel Mailing List

On Wed, Nov 8, 2017 at 7:22 PM, NeilBrown <neilb@suse.com> wrote:
> Bit-spin-locks, as used for dcache hash chains, are not fair.
> This is not a problem for the dcache hash table, as different CPUs are
> likely to access different entries in the hash table, so high contention
> is not expected.
> However, anonymous dentries (created by NFSD) all live on a single hash
> chain, "s_anon", and the bitlock on this can be highly contended, resulting
> in soft-lockup warnings.

This really seems idiotic.

Let me rephrase this commit message so that you can see why I think it's wrong:

 "NFSd uses a single hash chain for all dentries, which can cause
horrible lock contention, in ways that the normal hashing does not.

  This horrendous contention can cause the machine to have bad latency
spikes, which can cause soft lockup warnings.

  Instead of just fixing the bad design decision that causes this
horrible contention, we'll just special-case this code and use an
additional lock that hides the problem by serializing the actual
contending access".

Honestly, looking at the code, the whole s_anon thing seems entirely
broken. There doesn't even seem to be much reason for it. In pretty
much all cases, we could just hash the damn dentry.

The only reason for actually having s_anon seems to be that we want
some per-superblock list of these unconnected dentries for
shrink_dcache_for_umount().

Everything else would actually be *much* happier with just having the
dentry on the regular hash table. It would entirely get rid of this
stupid performance problem, and it would actually simplify all the
code elsewhere, because it would remove special cases like this

                if (unlikely(IS_ROOT(dentry)))
                        b = &dentry->d_sb->s_anon;
                else
                        b = d_hash(dentry->d_name.hash);

and just turn them into

                b = d_hash(dentry->d_name.hash);

so I really wonder if we could just get rid of s_anon entirely.

Yes, getting rid of s_anon might involve crazy things like "let's just
walk all the dentries at umount time", but honestly, that sounds
preferable. Especially if we can just then do something like

 - set a special flag in the superblock if we ever use __d_obtain_alias()

 - only scan all the dentries on umount if that flag is set.

Hmm?

The other alternative would be to just try to make the bitlocks
fairer. I would find that much less distasteful than this nasty hack.

I could imagine, for example, just turning the bitlocks into "spinlock
on hashed array". That's basically what we did with the page lock,
which _used_ to be basically a blocking bitlock, and where there was
no possible way to not have contention on some pages.
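
Roughly something like this, I mean (hypothetical sketch, untested;
the locks would be initialized at boot):

   #define BL_LOCK_BITS 8
   static spinlock_t bl_locks[1 << BL_LOCK_BITS];	/* fair, queued locks */

   static spinlock_t *bl_lock_for(struct hlist_bl_head *b)
   {
           /* hash the chain head address into the lock array */
           return &bl_locks[hash_ptr(b, BL_LOCK_BITS)];
   }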

We could afford to have two bits for the bitlock (lock and contention)
to make something like that more viable.

But I really get the feeling that the problem here is not the locking
primitive, but that whole s_anon decision. And my gut feeling is that
it really should be fixable.

Because wouldn't it be much nicer if the horrible nfsd contention just
went away rather than be worked around?

But maybe I'm missing some _other_ reason why s_anon has to be its own
separate list?

          Linus


* Re: [PATCH 3/3] VFS: close race between getcwd() and d_move()
  2017-11-09  3:22 ` [PATCH 3/3] VFS: close race between getcwd() and d_move() NeilBrown
  2017-11-09 11:41   ` Nikolay Borisov
@ 2017-11-09 20:23   ` Linus Torvalds
  2017-11-09 22:14     ` NeilBrown
  1 sibling, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2017-11-09 20:23 UTC (permalink / raw)
  To: NeilBrown; +Cc: Al Viro, linux-fsdevel, Linux Kernel Mailing List

On Wed, Nov 8, 2017 at 7:22 PM, NeilBrown <neilb@suse.com> wrote:
> d_move() will call __d_drop() and then __d_rehash()
> on the dentry being moved.  This creates a small window
> when the dentry appears to be unhashed.  Many tests
> of d_unhashed() are made under ->d_lock and so are safe
> from racing with this window, but some aren't.
> In particular, getcwd() calls d_unlinked() (which calls
> d_unhashed()) without d_lock protection, so it can race.

Hmm.

I see what you're doing, but I don't necessarily agree.

I would actually almost prefer that we simply change __d_move() itself.

The problem is that __d_move() really wants to move the hashes things
atomically, but instead of doing that it does a "unhash and then
rehash".

How nasty would it be to just expand the calls to __d_drop/__d_rehash
into __d_move itself, and take both hash list locks at the same time
(with the usual ordering and checking if it's the same list, of
course).

                     Linus


* Re: [PATCH 2/3] Improve fairness when locking the per-superblock s_anon list
  2017-11-09 19:52   ` Linus Torvalds
@ 2017-11-09 20:50     ` Al Viro
  2017-11-09 23:09       ` NeilBrown
  2017-11-10  0:02       ` Linus Torvalds
  2017-11-10  8:50     ` Christoph Hellwig
  1 sibling, 2 replies; 21+ messages in thread
From: Al Viro @ 2017-11-09 20:50 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: NeilBrown, linux-fsdevel, Linux Kernel Mailing List

On Thu, Nov 09, 2017 at 11:52:48AM -0800, Linus Torvalds wrote:
> Honestly, looking at the code, the whole s_anon thing seems entirely
> broken. There doesn't even seem to be much reason for it. In pretty
> much all cases, we could just hash the damn dentry.
> 
> The only reason for actually having s_anon seems to be that we want
> some per-superblock list of these unconnected dentries for
> shrink_dcache_for_umount().
> 
> Everything else would actually be *much* happier with just having the
> dentry on the regular hash table. It would entirely get rid of this
> stupid performance problem, and it would actually simplify all the
> code elsewhere, because it would remove special cases like this
> 
>                 if (unlikely(IS_ROOT(dentry)))
>                         b = &dentry->d_sb->s_anon;
>                 else
>                         b = d_hash(dentry->d_name.hash);
> 
> and just turn them into
> 
>                 b = d_hash(dentry->d_name.hash);
> 
> so I really wonder if we could just get rid of s_anon entirely.
> 
> Yes, getting rid of s_anon might involve crazy things like "let's just
> walk all the dentries at umount time", but honestly, that sounds
> preferable. Especially if we can just then do something like
> 
>  - set a special flag in the superblock if we ever use __d_obtain_alias()

Automatically set for a lot of NFS mounts (whenever you mount more than one
tree from the same server, IIRC)...

>  - only scan all the dentries on umount if that flag is set.
> 
> Hmm?

That looks like a bloody painful approach, IMO.  I'm not saying I like
Neil's patch, but I doubt that "let's just walk the entire dcache on
umount" is a good idea.

I wonder if separating the d_obtain_alias() and d_obtain_root() would be
a good idea; the former outnumber the latter by many orders of magnitude.
The tricky part is that we could have a disconnected directory from
d_obtain_alias() with children already connected to it (and thus normally
hashed by d_splice_alias()) and fail to connect the whole thing to parent.

That leaves us with an orphaned tree that might stick around past the
time when we drop all references to dentries in it.  And we want to
get those hunted down and shot on umount.  Could we
	* make s_anon hold d_obtain_root ones + orphans from such
failed reconnects
	* make final dput() treat hashed IS_ROOT as "don't retain it"
	* have d_obtain_alias() put into normal hash, leaving the
"move to s_anon" part to reconnect failures.
	* keep umount side of things unchanged.

I agree that temporary insertions into ->s_anon are bogus; hell, I'm not
even sure we want to put it on _any_ list initially - we want it to look
like it's hashed, so we could set ->next to NULL and have ->pprev point
to itself.  Then normal case for d_obtain_alias() would not bother
the hash chains at all at allocation time, then have it put into the
right hash chain on reconnect.  And on reconnect failure the caller
would've moved it to orphan list (i.e. ->s_anon).
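
(I.e. the hlist_bl equivalent of what hlist_add_fake() does - as a
sketch:

	dentry->d_hash.next = NULL;
	dentry->d_hash.pprev = &dentry->d_hash.next;	/* looks hashed */

d_unhashed() only checks ->pprev against NULL, so such a dentry passes
for hashed without sitting on any chain.)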


* Re: [PATCH 3/3] VFS: close race between getcwd() and d_move()
  2017-11-09 20:23   ` Linus Torvalds
@ 2017-11-09 22:14     ` NeilBrown
  2017-11-10  1:40       ` Linus Torvalds
  0 siblings, 1 reply; 21+ messages in thread
From: NeilBrown @ 2017-11-09 22:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, linux-fsdevel, Linux Kernel Mailing List


On Thu, Nov 09 2017, Linus Torvalds wrote:

> On Wed, Nov 8, 2017 at 7:22 PM, NeilBrown <neilb@suse.com> wrote:
>> d_move() will call __d_drop() and then __d_rehash()
>> on the dentry being moved.  This creates a small window
>> when the dentry appears to be unhashed.  Many tests
>> of d_unhashed() are made under ->d_lock and so are safe
>> from racing with this window, but some aren't.
>> In particular, getcwd() calls d_unlinked() (which calls
>> d_unhashed()) without d_lock protection, so it can race.
>
> Hmm.
>
> I see what you're doing, but I don't necessarily agree.
>
> I would actually almost prefer that we simply change __d_move() itself.
>
> The problem is that __d_move() really wants to move the hashed things
> atomically, but instead of doing that it does an "unhash and then
> rehash".
>
> How nasty would it be to just expand the calls to __d_drop/__d_rehash
> into __d_move itself, and take both hash list locks at the same time
> (with the usual ordering and checking if it's the same list, of
> course).
>
>                      Linus

Something like this?
I can probably do better than "b1" and "b2".
I assume target must always be hashed??
Do you like it enough for me to make it into a real patch?
I'd probably move hlist_bl_lock_two() into list_bl.h.
I'm not generally keen on open-coding subtle code, but maybe it is
justified here.

Thanks,
NeilBrown


diff --git a/fs/dcache.c b/fs/dcache.c
index f90141387f01..1a329fedf23c 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -472,6 +472,9 @@ static void dentry_lru_add(struct dentry *dentry)
  */
 void __d_drop(struct dentry *dentry)
 {
+	/* WARNING: any changes here should be reflected in __d_move()
+	 * which open-codes some of this functionality.
+	 */
 	if (!d_unhashed(dentry)) {
 		struct hlist_bl_head *b;
 		/*
@@ -2380,6 +2383,9 @@ EXPORT_SYMBOL(d_delete);
 
 static void __d_rehash(struct dentry *entry)
 {
+	/* WARNING: any changes here should be reflected in __d_move()
+	 * which open-codes some of this functionality.
+	 */
 	struct hlist_bl_head *b = d_hash(entry->d_name.hash);
 	BUG_ON(!d_unhashed(entry));
 	hlist_bl_lock(b);
@@ -2796,11 +2802,23 @@ static void dentry_unlock_for_move(struct dentry *dentry, struct dentry *target)
  * rename_lock, the i_mutex of the source and target directories,
  * and the sb->s_vfs_rename_mutex if they differ. See lock_rename().
  */
+static void hlist_bl_lock_two(struct hlist_bl_head *b1, struct hlist_bl_head *b2)
+{
+	if (b1 && b1 < b2)
+		hlist_bl_lock(b1);
+	if (b2)
+		hlist_bl_lock(b2);
+	if (b1 > b2)
+		hlist_bl_lock(b1);
+}
+
 static void __d_move(struct dentry *dentry, struct dentry *target,
 		     bool exchange)
 {
 	struct inode *dir = NULL;
 	unsigned n;
+	struct hlist_bl_head *b1, *b2;
+
 	if (!dentry->d_inode)
 		printk(KERN_WARNING "VFS: moving negative dcache entry\n");
 
@@ -2817,10 +2835,24 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
 	write_seqcount_begin(&dentry->d_seq);
 	write_seqcount_begin_nested(&target->d_seq, DENTRY_D_LOCK_NESTED);
 
-	/* unhash both */
-	/* __d_drop does write_seqcount_barrier, but they're OK to nest. */
-	__d_drop(dentry);
-	__d_drop(target);
+	/* We want to unhash both, change names, then rehash one or both.
+	 * If we use __d_drop() and __d_rehash() there will be a window
+	 * when dentry appears to be d_unhashed() which can race with lockless
+	 * checking.  So instead we open-code the important parts of __d_drop()
+	 * and __d_rehash().
+	 * @target must already be hashed, @dentry must be if @exchange.
+	 */
+	BUG_ON(d_unhashed(dentry) && exchange);
+	BUG_ON(d_unhashed(target));
+
+	b1 = d_unhashed(dentry) ? NULL : d_hash(dentry->d_name.hash);
+	b2 = d_hash(target->d_name.hash);
+	hlist_bl_lock_two(b1, b2);
+	if (b1)
+		__hlist_bl_del(&dentry->d_hash);
+	__hlist_bl_del(&target->d_hash);
+	write_seqcount_invalidate(&dentry->d_seq);
+	write_seqcount_invalidate(&target->d_seq);
 
 	/* Switch the names.. */
 	if (exchange)
@@ -2829,9 +2861,14 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
 		copy_name(dentry, target);
 
 	/* rehash in new place(s) */
-	__d_rehash(dentry);
+	hlist_bl_add_head_rcu(&dentry->d_hash, b2);
 	if (exchange)
-		__d_rehash(target);
+		hlist_bl_add_head_rcu(&target->d_hash, b1);
+	else
+		target->d_hash.pprev = NULL;
+	if (b1 && b1 != b2)
+		hlist_bl_unlock(b1);
+	hlist_bl_unlock(b2);
 
 	/* ... and switch them in the tree */
 	if (IS_ROOT(dentry)) {



* Re: [PATCH 2/3] Improve fairness when locking the per-superblock s_anon list
  2017-11-09 20:50     ` Al Viro
@ 2017-11-09 23:09       ` NeilBrown
  2017-11-09 23:19         ` Al Viro
  2017-11-10  0:02       ` Linus Torvalds
  1 sibling, 1 reply; 21+ messages in thread
From: NeilBrown @ 2017-11-09 23:09 UTC (permalink / raw)
  To: Al Viro, Linus Torvalds; +Cc: linux-fsdevel, Linux Kernel Mailing List


On Thu, Nov 09 2017, Al Viro wrote:

> On Thu, Nov 09, 2017 at 11:52:48AM -0800, Linus Torvalds wrote:
>> Honestly, looking at the code, the whole s_anon thing seems entirely
>> broken. There doesn't even seem to be much reason for it. In pretty
>> much all cases, we could just hash the damn dentry.
>> 
>> The only reason for actually having s_anon seems to be that we want
>> some per-superblock list of these unconnected dentries for
>> shrink_dcache_for_umount().
>> 
>> Everything else would actually be *much* happier with just having the
>> dentry on the regular hash table. It would entirely get rid of this
>> stupid performance problem, and it would actually simplify all the
>> code elsewhere, because it would remove special cases like this
>> 
>>                 if (unlikely(IS_ROOT(dentry)))
>>                         b = &dentry->d_sb->s_anon;
>>                 else
>>                         b = d_hash(dentry->d_name.hash);
>> 
>> and just turn them into
>> 
>>                 b = d_hash(dentry->d_name.hash);
>> 
>> so I really wonder if we could just get rid of s_anon entirely.
>> 
>> Yes, getting rid of s_anon might involve crazy things like "let's just
>> walk all the dentries at umount time", but honestly, that sounds
>> preferable. Especially if we can just then do something like
>> 
>>  - set a special flag in the superblock if we ever use __d_obtain_alias()
>
> Automatically set for a lot of NFS mounts (whenever you mount more than one
> tree from the same server, IIRC)...
>
>>  - only scan all the dentries on umount if that flag is set.
>> 
>> Hmm?
>
> That looks like a bloody painful approach, IMO.  I'm not saying I like
> Neil's patch, but I doubt that "let's just walk the entire dcache on
> umount" is a good idea.
>
> I wonder if separating the d_obtain_alias() and d_obtain_root() would be
> a good idea; the former outnumber the latter by many orders of magnitude.
> The tricky part is that we could have a disconnected directory from
> d_obtain_alias() with children already connected to it (and thus normally
> hashed by d_splice_alias()) and fail to connect the whole thing to parent.
>
> That leaves us with an orphaned tree that might stick around past the
> time when we drop all references to dentries in it.  And we want to
> get those hunted down and shot on umount.  Could we
> 	* make s_anon hold d_obtain_root ones + orphans from such
> failed reconnects
> 	* make final dput() treat hashed IS_ROOT as "don't retain it"
> 	* have d_obtain_alias() put into normal hash, leaving the
> "move to s_anon" part to reconnect failures.
> 	* keep umount side of things unchanged.
>
> I agree that temporary insertions into ->s_anon are bogus; hell, I'm not
> even sure we want to put it on _any_ list initially - we want it to look
> like it's hashed, so we could set ->next to NULL and have ->pprev point
> to itself.  Then normal case for d_obtain_alias() would not bother
> the hash chains at all at allocation time, then have it put into the
> right hash chain on reconnect.  And on reconnect failure the caller
> would've moved it to orphan list (i.e. ->s_anon).

I looked back at the original bug report, and it was d_obtain_alias()
that was particularly suffering.  As this holds two spinlocks while
waiting for the bl_lock, a delay can affect other code that doesn't
touch s_anon.
So improving d_obtain_alias() would be a good goal (as long as we don't
just move the problem of course).

It isn't only "reconnect failure" that leaves things on s_anon.
We normally don't bother trying to connect non-directories.
So if an NFS server is getting lots of read/write requests without opens
or other pathname lookups, it could easily have lots of disconnected
files being repeatedly accessed.  Keeping the dentries on s_anon means
we don't need to keep allocating new ones for every request.

So I'm not keen on dropping an IS_ROOT() dentry at final dput(), but
it might make sense to add the dentry to the per-fs list of IS_ROOT
dentries at that time.

One possible approach would be to use d_child rather than d_hash to link
together dentries that don't have a parent.
We could assign a random number to d_name.hash so it could appear to be
hashed without imposing on any one hash chain.  We would still need a
spinlock in the superblock to manage the s_anon list that links the
d_child fields together...
I might try to see how the code looks.

Thanks,
NeilBrown




* Re: [PATCH 2/3] Improve fairness when locking the per-superblock s_anon list
  2017-11-09 23:09       ` NeilBrown
@ 2017-11-09 23:19         ` Al Viro
  0 siblings, 0 replies; 21+ messages in thread
From: Al Viro @ 2017-11-09 23:19 UTC (permalink / raw)
  To: NeilBrown; +Cc: Linus Torvalds, linux-fsdevel, Linux Kernel Mailing List

On Fri, Nov 10, 2017 at 10:09:09AM +1100, NeilBrown wrote:

> So if an NFS server is getting lots of read/write requests without opens
> or other pathname lookups, it could easily have lots of disconnected
> files being repeatedly accessed.  Keeping the dentries on s_anon means
> we don't need to keep allocating new ones for every request.
> 
> So I'm not keen on dropping an IS_ROOT() dentry at final dput(), but
> it might make sense to add the dentry to the per-fs list of IS_ROOT
> dentries at that time.

Watch out for dput() fast path (see fast_dput()) if you go that way.

> One possible approach would be to use d_child rather than d_hash to link
> together dentries that don't have a parent.
> We could assign a random number to d_name.hash so it could appear to be
> hashed without imposing on any one hash chain.  We would still need a
> spinlock in the superblock to manage the s_anon list that links the
> d_child fields together...
> I might try to see how the code looks.

Keep in mind that d_hash() includes bits of ->d_parent (which for these
dentries is the dentry itself).  So no need for fake ->d_name.hash;
you'll get spread from that part.

->d_child is... delicate.  There are very interesting games around d_walk
vs. dput already; I'd be very careful with that one.


* Re: [PATCH 2/3] Improve fairness when locking the per-superblock s_anon list
  2017-11-09 20:50     ` Al Viro
  2017-11-09 23:09       ` NeilBrown
@ 2017-11-10  0:02       ` Linus Torvalds
  1 sibling, 0 replies; 21+ messages in thread
From: Linus Torvalds @ 2017-11-10  0:02 UTC (permalink / raw)
  To: Al Viro; +Cc: NeilBrown, linux-fsdevel, Linux Kernel Mailing List

On Thu, Nov 9, 2017 at 12:50 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> That looks like a bloody painful approach, IMO.  I'm not saying I like
> Neil's patch, but I doubt that "let's just walk the entire dcache on
> umount" is a good idea.

There may be other approaches, but how often do you umount a
filesystem that you have NFS-exported? Not often.

So making that slower doesn't really sound horrible.

> I wonder if separating the d_obtain_alias() and d_obtain_root() would be
> a good idea; the former outnumber the latter by many orders of magnitude.
> The tricky part is that we could have a disconnected directory from
> d_obtain_alias() with children already connected to it (and thus normally
> hashed by d_splice_alias()) and fail to connect the whole thing to parent.

Well, the real issue (I think) is that we abuse that hash list for the
s_anon thing. I'd like to get rid of that part.

Now, the reason we do that is because we want to have some way to find
dentries that come from that filesystem, but aren't connected.

So I really would like to get rid of s_anon entirely.

We could easily use _another_ list entirely for holding the
"unconnected dentries". Because we do have such a list: the dentry
sibling list.

So what if, instead of using s_anon, we create a "disconnected
root" and make all the disconnected dentries children of that
disconnected root instead?

Wouldn't that be nicer? The "reconnect" would be basically a "move
from disconnected parent to right place".

Maybe I'm missing something, but that sounds much more logical than
the current s_anon list.

> That leaves us with an orphaned tree that might stick around past the
> time when we drop all references to dentries in it.  And we want to
> get those hunted down and shot on umount.  Could we
>         * make s_anon hold d_obtain_root ones + orphans from such
> failed reconnects
>         * make final dput() treat hashed IS_ROOT as "don't retain it"
>         * have d_obtain_alias() put into normal hash, leaving the
> "move to s_anon" part to reconnect failures.
>         * keep umount side of things unchanged.

I guess that would work too, but I'm not seeing why s_anon is so wonderful.

If we want to make those entries look hashed, let's just hash them,
and use a special parent for it (that "disconnected root"). Why
wouldn't that work?

That would make the umount side _simpler_, because the disconnected
root would be handled exactly like we currently handle the real root.

So we'd just do that same "do_one_tree()" on the disconnected root,
the same way we do on the real one.

That sounds _so_ straightforward that I'm probably missing something
important. And if the "move from disconnected state" is very common,
then maybe the disconnected root lock ends up being a horrible
choke-point instead, simply because we'd take that parent lock all the
time for the add/move operations...

So maybe it's a bad idea. But it sounds clean.

             Linus


* Re: [PATCH 3/3] VFS: close race between getcwd() and d_move()
  2017-11-09 22:14     ` NeilBrown
@ 2017-11-10  1:40       ` Linus Torvalds
  2017-11-10  4:45         ` NeilBrown
  0 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2017-11-10  1:40 UTC (permalink / raw)
  To: NeilBrown; +Cc: Al Viro, linux-fsdevel, Linux Kernel Mailing List

On Thu, Nov 9, 2017 at 2:14 PM, NeilBrown <neilb@suse.com> wrote:
> On Thu, Nov 09 2017, Linus Torvalds wrote:
>>
>> How nasty would it be to just expand the calls to __d_drop/__d_rehash
>> into __d_move itself, and take both hash list locks at the same time
>> (with the usual ordering and checking if it's the same list, of
>> course).
>
> something like this?

Yes.

This looks nicer to me. Partly because I hate those "pass flags to
functions that modify their behavior" kinds of patches. I'd rather see
just straight-line unconditional code with some possible duplication.

That said, your conversion isn't the obvious "no semantic changes"
one. The BUG_ON() conditions you added are a bit odd.  You've taken
the BUG_ON() from __d_rehash() that no longer makes any sense (because
we just explicitly unhashed it), and replaced it with a BUG_ON() that
didn't exist before, and is also not the condition that __d_drop
actually had (or the condition that means that the hash list might be
different - i.e. the whole IS_ROOT() case).

So the patch looks conceptually good, but I worry about the details.
They may be right, but that odd IS_ROOT case in __d_drop really
worries me.  Can we rename one of those disconnected entries?

Of course, that then ties into the other thread, where those
disconnected entries are a bit too special anyway.

I also do wonder if we can avoid all the unhash/rehash games entirely
(and avoid the hash list locking) if it turns out that the dentry and
target hash lists are the same.

I'm not claiming that as an optimization (it's an unusual case), I'm
more thinking that it might fall out fairly naturally from the "lock
both" case, since that one needs to check for the same list anyway.

> Do you like it enough for me to make it into a real patch?

Let's see if Al hates this approach, but I definitely prefer it
assuming my worries are groundless.

> I'd probably move hlist_bl_lock_two() into list_bl.h.

Maybe not - see the above issue where maybe the "same hash" might
actually merit a different codepath.

            Linus


* Re: [PATCH 3/3] VFS: close race between getcwd() and d_move()
  2017-11-10  1:40       ` Linus Torvalds
@ 2017-11-10  4:45         ` NeilBrown
  2017-11-10 19:52           ` Linus Torvalds
  2017-11-10 20:53           ` Al Viro
  0 siblings, 2 replies; 21+ messages in thread
From: NeilBrown @ 2017-11-10  4:45 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, linux-fsdevel, Linux Kernel Mailing List


On Thu, Nov 09 2017, Linus Torvalds wrote:

> On Thu, Nov 9, 2017 at 2:14 PM, NeilBrown <neilb@suse.com> wrote:
>> On Thu, Nov 09 2017, Linus Torvalds wrote:
>>>
>>> How nasty would it be to just expand the calls to __d_drop/__d_rehash
>>> into __d_move itself, and take both hash list locks at the same time
>>> (with the usual ordering and checking if it's the same list, of
>>> course).
>>
>> something like this?
>
> Yes.
>
> This looks nicer to me. Partly because I hate those "pass flags to
> functions that modify their behavior" kinds of patches. I'd rather see
> just straight-line unconditional code with some possible duplication.
...
>
> I also do wonder if we can avoid all the unhash/rehash games entirely
> (and avoid the hash list locking) if it turns out that the dentry and
> target hash lists are the same.

I'm not convinced.  I haven't actually tried it, but the matrix of
possibilities seems a little large.
The source dentry may or may not be hashed (it is not hashed in the
"disconnected IS_ROOT" case), and the target may or may not want to be rehashed
afterwards (depending on 'exchange').  We could skip the lock
for an exchange if they both had the same hash, but not for a simple
move.

However your description of what it was that you didn't like gave me an
idea - I can take the same approach as my original, but not pass flags
around.
I quite like how this turned out.
Dropping the BUG_ON() in __d_rehash() isn't ideal; maybe we could add
___d_rehash() without the BUG_ON() and call that from __d_rehash()?

Thanks,
NeilBrown


From: NeilBrown <neilb@suse.com>
Date: Fri, 10 Nov 2017 15:20:06 +1100
Subject: [PATCH] VFS: close race between getcwd() and d_move()

d_move() will call __d_drop() and then __d_rehash()
on the dentry being moved.  This creates a small window
when the dentry appears to be unhashed.  Many tests
of d_unhashed() are made under ->d_lock and so are safe
from racing with this window, but some aren't.
In particular, getcwd() calls d_unlinked() (which calls
d_unhashed()) without d_lock protection, so it can race.

This race has been seen in practice with lustre, which uses d_move() as
part of name lookup.  See:
   https://jira.hpdd.intel.com/browse/LU-9735
It could race with a regular rename(), and result in ENOENT instead
of either the 'before' or 'after' name.

The race can be demonstrated with a simple program which
has two threads, one renaming a directory back and forth
while another calls getcwd() within that directory: it should never
fail, but does.  See:
  https://patchwork.kernel.org/patch/9455345/

We could fix this race by taking d_lock and rechecking when
d_unhashed() reports true.  Alternatively we can remove the window,
which is the approach this patch takes.

___d_drop() is introduced, which does *not* clear d_hash.pprev,
so the dentry still appears to be hashed.  __d_drop() calls
___d_drop(), then clears d_hash.pprev.
__d_move() now uses ___d_drop() and only clears d_hash.pprev
when not rehashing.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 fs/dcache.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index f90141387f01..8c83543f5065 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -468,9 +468,11 @@ static void dentry_lru_add(struct dentry *dentry)
  * d_drop() is used mainly for stuff that wants to invalidate a dentry for some
  * reason (NFS timeouts or autofs deletes).
  *
- * __d_drop requires dentry->d_lock.
+ * __d_drop requires dentry->d_lock
+ * ___d_drop doesn't mark dentry as "unhashed"
+ *   (dentry->d_hash.pprev will be LIST_POISON2, not NULL).
  */
-void __d_drop(struct dentry *dentry)
+static void ___d_drop(struct dentry *dentry)
 {
 	if (!d_unhashed(dentry)) {
 		struct hlist_bl_head *b;
@@ -486,12 +488,15 @@ void __d_drop(struct dentry *dentry)
 
 		hlist_bl_lock(b);
 		__hlist_bl_del(&dentry->d_hash);
-		dentry->d_hash.pprev = NULL;
 		hlist_bl_unlock(b);
 		/* After this call, in-progress rcu-walk path lookup will fail. */
 		write_seqcount_invalidate(&dentry->d_seq);
 	}
 }
+void __d_drop(struct dentry *dentry) {
+	___d_drop(dentry);
+	dentry->d_hash.pprev = NULL;
+}
 EXPORT_SYMBOL(__d_drop);
 
 void d_drop(struct dentry *dentry)
@@ -2381,7 +2386,7 @@ EXPORT_SYMBOL(d_delete);
 static void __d_rehash(struct dentry *entry)
 {
 	struct hlist_bl_head *b = d_hash(entry->d_name.hash);
-	BUG_ON(!d_unhashed(entry));
+
 	hlist_bl_lock(b);
 	hlist_bl_add_head_rcu(&entry->d_hash, b);
 	hlist_bl_unlock(b);
@@ -2818,9 +2823,9 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
 	write_seqcount_begin_nested(&target->d_seq, DENTRY_D_LOCK_NESTED);
 
 	/* unhash both */
-	/* __d_drop does write_seqcount_barrier, but they're OK to nest. */
-	__d_drop(dentry);
-	__d_drop(target);
+	/* ___d_drop does write_seqcount_barrier, but they're OK to nest. */
+	___d_drop(dentry);
+	___d_drop(target);
 
 	/* Switch the names.. */
 	if (exchange)
@@ -2832,6 +2837,8 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
 	__d_rehash(dentry);
 	if (exchange)
 		__d_rehash(target);
+	else
+		target->d_hash.pprev = NULL;
 
 	/* ... and switch them in the tree */
 	if (IS_ROOT(dentry)) {
-- 
2.14.0.rc0.dirty




* Re: [PATCH 2/3] Improve fairness when locking the per-superblock s_anon list
  2017-11-09 19:52   ` Linus Torvalds
  2017-11-09 20:50     ` Al Viro
@ 2017-11-10  8:50     ` Christoph Hellwig
  1 sibling, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2017-11-10  8:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: NeilBrown, Al Viro, linux-fsdevel, Linux Kernel Mailing List

On Thu, Nov 09, 2017 at 11:52:48AM -0800, Linus Torvalds wrote:
> Yes, getting rid of s_anon might involve crazy things like "let's just
> walk all the dentries at umount time", but honestly, that sounds
> preferable. Especially if we can just then do something like
> 
>  - set a special flag in the superblock if we ever use __d_obtain_alias()
> 
>  - only scan all the dentries on umount if that flag is set.
> 
> Hmm?

The scan would be extremely painful with our current global dcache.

But then again, the global dcache isn't exactly a good idea to start
with.  If we had a per-sb dcache using e.g. the resizable hash table
the networking people seem to be fond of, your idea would work really
well.
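
(Purely illustrative - per-sb rhashtable usage would look something like
this, with made-up structure and field names:

  struct dcache_entry {			/* hypothetical per-sb entry */
  	struct rhash_head node;
  	u64 key;
  };

  static const struct rhashtable_params dcache_ht_params = {
  	.key_len	= sizeof(u64),
  	.key_offset	= offsetof(struct dcache_entry, key),
  	.head_offset	= offsetof(struct dcache_entry, node),
  	.automatic_shrinking = true,	/* table grows and shrinks */
  };

  err = rhashtable_init(&sb_ht, &dcache_ht_params);	/* sb_ht: hypothetical per-sb table */

with inserts and lookups via rhashtable_insert_fast() and
rhashtable_lookup_fast().)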


* Re: [PATCH 3/3] VFS: close race between getcwd() and d_move()
  2017-11-10  4:45         ` NeilBrown
@ 2017-11-10 19:52           ` Linus Torvalds
  2017-11-10 20:53           ` Al Viro
  1 sibling, 0 replies; 21+ messages in thread
From: Linus Torvalds @ 2017-11-10 19:52 UTC (permalink / raw)
  To: NeilBrown; +Cc: Al Viro, linux-fsdevel, Linux Kernel Mailing List

On Thu, Nov 9, 2017 at 8:45 PM, NeilBrown <neilb@suse.com> wrote:
>
> However, your description of what it was that you didn't like gave me an
> idea - I can take the same approach as my original, but not pass flags
> around.
> I quite like how this turned out.
> Dropping the BUG_ON() in __d_rehash() isn't ideal; maybe we could add
> ___d_rehash() without the BUG_ON() and call that from __d_rehash()?

This looks more palatable to me, yes.

It still worries me a bit that we end up having that dangling pprev
pointer despite having dropped the hash list lock, but as long as
nobody uses it for anything but that "is it hashed" test without
taking the dentry lock, it all *should* be safe.

Please fix the whitespace, though. This is not how we do function definitions:

    void __d_drop(struct dentry *dentry) {

but otherwise I think this patch is acceptable.

Still want commentary from Al (and preferably going through his vfs
tree for 4.15 or whatever).

             Linus


* Re: [PATCH 3/3] VFS: close race between getcwd() and d_move()
  2017-11-10  4:45         ` NeilBrown
  2017-11-10 19:52           ` Linus Torvalds
@ 2017-11-10 20:53           ` Al Viro
  2017-11-21 23:50             ` Al Viro
  1 sibling, 1 reply; 21+ messages in thread
From: Al Viro @ 2017-11-10 20:53 UTC (permalink / raw)
  To: NeilBrown; +Cc: Linus Torvalds, linux-fsdevel, Linux Kernel Mailing List

On Fri, Nov 10, 2017 at 03:45:41PM +1100, NeilBrown wrote:
> -void __d_drop(struct dentry *dentry)
> +static void ___d_drop(struct dentry *dentry)
>  {
>  	if (!d_unhashed(dentry)) {
>  		struct hlist_bl_head *b;
> @@ -486,12 +488,15 @@ void __d_drop(struct dentry *dentry)
>  
>  		hlist_bl_lock(b);
>  		__hlist_bl_del(&dentry->d_hash);
> -		dentry->d_hash.pprev = NULL;
>  		hlist_bl_unlock(b);
>  		/* After this call, in-progress rcu-walk path lookup will fail. */
>  		write_seqcount_invalidate(&dentry->d_seq);
>  	}
>  }
> +void __d_drop(struct dentry *dentry) {
> +	___d_drop(dentry);
> +	dentry->d_hash.pprev = NULL;

Umm...  That reordering (unhashed vs. ->d_seq) might be a problem
on the RCU side.  I'm not sure it is, we might get away with that,
actually, but I want to finish digging through the pathwalk-related
code.  Cursing it for being too subtle for its own good, as usual...


* Re: [PATCH 3/3] VFS: close race between getcwd() and d_move()
  2017-11-10 20:53           ` Al Viro
@ 2017-11-21 23:50             ` Al Viro
  2017-11-22  1:31               ` NeilBrown
  0 siblings, 1 reply; 21+ messages in thread
From: Al Viro @ 2017-11-21 23:50 UTC (permalink / raw)
  To: NeilBrown; +Cc: Linus Torvalds, linux-fsdevel, Linux Kernel Mailing List

On Fri, Nov 10, 2017 at 08:53:28PM +0000, Al Viro wrote:
> On Fri, Nov 10, 2017 at 03:45:41PM +1100, NeilBrown wrote:
> > -void __d_drop(struct dentry *dentry)
> > +static void ___d_drop(struct dentry *dentry)
> >  {
> >  	if (!d_unhashed(dentry)) {
> >  		struct hlist_bl_head *b;
> > @@ -486,12 +488,15 @@ void __d_drop(struct dentry *dentry)
> >  
> >  		hlist_bl_lock(b);
> >  		__hlist_bl_del(&dentry->d_hash);
> > -		dentry->d_hash.pprev = NULL;
> >  		hlist_bl_unlock(b);
> >  		/* After this call, in-progress rcu-walk path lookup will fail. */
> >  		write_seqcount_invalidate(&dentry->d_seq);
> >  	}
> >  }
> > +void __d_drop(struct dentry *dentry) {
> > +	___d_drop(dentry);
> > +	dentry->d_hash.pprev = NULL;
> 
> Umm...  That reordering (unhashed vs. ->d_seq) might be a problem
> on the RCU side.  I'm not sure it is, we might get away with that,
> actually, but I want to finish digging through the pathwalk-related
> code.  Cursing it for being too subtle for its own good, as usual...

OK, I believe that it's survivable, but I'd prefer to keep it in -next
for a while and give it more testing.


* Re: [PATCH 3/3] VFS: close race between getcwd() and d_move()
  2017-11-21 23:50             ` Al Viro
@ 2017-11-22  1:31               ` NeilBrown
  0 siblings, 0 replies; 21+ messages in thread
From: NeilBrown @ 2017-11-22  1:31 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel, Linux Kernel Mailing List


On Tue, Nov 21 2017, Al Viro wrote:

> On Fri, Nov 10, 2017 at 08:53:28PM +0000, Al Viro wrote:
>> On Fri, Nov 10, 2017 at 03:45:41PM +1100, NeilBrown wrote:
>> > -void __d_drop(struct dentry *dentry)
>> > +static void ___d_drop(struct dentry *dentry)
>> >  {
>> >  	if (!d_unhashed(dentry)) {
>> >  		struct hlist_bl_head *b;
>> > @@ -486,12 +488,15 @@ void __d_drop(struct dentry *dentry)
>> >  
>> >  		hlist_bl_lock(b);
>> >  		__hlist_bl_del(&dentry->d_hash);
>> > -		dentry->d_hash.pprev = NULL;
>> >  		hlist_bl_unlock(b);
>> >  		/* After this call, in-progress rcu-walk path lookup will fail. */
>> >  		write_seqcount_invalidate(&dentry->d_seq);
>> >  	}
>> >  }
>> > +void __d_drop(struct dentry *dentry) {
>> > +	___d_drop(dentry);
>> > +	dentry->d_hash.pprev = NULL;
>> 
>> Umm...  That reordering (unhashed vs. ->d_seq) might be a problem
>> on the RCU side.  I'm not sure it is, we might get away with that,
>> actually, but I want to finish digging through the pathwalk-related
>> code.  Cursing it for being too subtle for its own good, as usual...
>
> OK, I believe that it's survivable, but I'd prefer to keep in -next
> for a while and give it more testing.

Great, thanks.  I assume you will fix the silly '{' at the end of the
line when defining __d_drop().  Let me know if you would rather I
resend.

Thanks,
NeilBrown


