linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/6] fs/dcache: avoid trylock loops
@ 2018-02-22 23:50 John Ogness
  2018-02-22 23:50 ` [PATCH v2 1/6] fs/dcache: Remove stale comment from dentry_kill() John Ogness
                   ` (6 more replies)
  0 siblings, 7 replies; 46+ messages in thread
From: John Ogness @ 2018-02-22 23:50 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Al Viro, Linus Torvalds, Christoph Hellwig, Thomas Gleixner,
	Peter Zijlstra, Sebastian Andrzej Siewior, linux-kernel

Thank you Al Viro, Linus Torvalds, Peter Zijlstra, Amir Goldstein
for the detailed feedback.

Changes in v2:
. dentry_lock_inode()
  - added quick out on trylock success
  - added comments to rcu section
  - expanded rcu coverage to all inode usage
  - if inode changes during d_lock window,
    restart the function (do not return)
  - track the dentry refcount and return false if it changes
. dentry_kill()
  - do not expect refcount=1 from caller
  - if refcount changes during d_lock window,
    return ERR_PTR(-EINTR) and do not drop d_lock
. dentry_put_kill() [new helper function]
  - dentry_kill() wrapper to implement the v1 dentry_kill()
    semantics needed by dput() and shrink_dentry_list()
. dput()
  - use new dentry_put_kill() instead of dentry_kill()
. d_delete()
  - refactor code flow using reworked dentry_lock_inode()
. shrink_dentry_list()
  - use dentry_kill() for dispose list killing
  - use dentry_put_kill() for ancestor pruning

This patchset removes all trylock loops from within the dcache code.

The functions d_delete(), dentry_kill(), and shrink_dentry_list()
hold dentry->d_lock while needing to acquire dentry->d_inode->i_lock.
The latter 2 functions also need to acquire dentry->d_parent->d_lock.
This cannot be done directly with spin_lock() operations because it
is the reverse of the regular lock order. To avoid ABBA deadlocks it
is done using trylock loops.

Trylock loops are problematic in two scenarios:

  1) PREEMPT_RT converts spinlocks to 'sleeping' spinlocks, which are
     preemptible. As a consequence the i_lock holder can be preempted
     by a higher priority task. If that task executes the trylock
     loop it will do so forever and live lock.

  2) In virtual machines trylock loops are problematic as well. The
     VCPU on which the i_lock holder runs can be scheduled out and a
     task on a different VCPU can loop for a whole time slice. In the
     worst case this can lead to starvation. Commits 47be61845c77
     ("fs/dcache.c: avoid soft-lockup in dput()") and 046b961b45f9
     ("shrink_dentry_list(): take parent's d_lock earlier") are
     addressing exactly those symptoms.

The trylock loops can be avoided with functionality similar to
lock_parent(); temporarily dropping dentry->d_lock while holding
the rcu_read_lock so that the desired locks can be acquired in the
correct order. After the locks are acquired it is necessary to verify
that the relevant dentry members did not change while dentry->d_lock
was dropped. If they changed, the whole operation must be repeated.

John Ogness (6):
  fs/dcache: Remove stale comment from dentry_kill()
  fs/dcache: Move dentry_kill() below lock_parent()
  fs/dcache: Avoid the try_lock loop in d_delete()
  fs/dcache: Avoid the try_lock loops in dentry_kill()
  fs/dcache: Avoid a try_lock loop in shrink_dentry_list()
  fs/dcache: Avoid remaining try_lock loop in shrink_dentry_list()

 fs/dcache.c | 273 ++++++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 183 insertions(+), 90 deletions(-)

-- 
2.11.0

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH v2 1/6] fs/dcache: Remove stale comment from dentry_kill()
  2018-02-22 23:50 [PATCH v2 0/6] fs/dcache: avoid trylock loops John Ogness
@ 2018-02-22 23:50 ` John Ogness
  2018-02-22 23:50 ` [PATCH v2 2/6] fs/dcache: Move dentry_kill() below lock_parent() John Ogness
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 46+ messages in thread
From: John Ogness @ 2018-02-22 23:50 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Al Viro, Linus Torvalds, Christoph Hellwig, Thomas Gleixner,
	Peter Zijlstra, Sebastian Andrzej Siewior, linux-kernel

Commit 0d98439ea3c6 ("vfs: use lockred "dead" flag to mark unrecoverably
dead dentries") removed the `ref' parameter in dentry_kill() but its
documentation remained. Remove it.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
---
 fs/dcache.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 7c38f39958bc..3c7e0148b453 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -592,7 +592,6 @@ static void __dentry_kill(struct dentry *dentry)
 /*
  * Finish off a dentry we've decided to kill.
  * dentry->d_lock must be held, returns with it unlocked.
- * If ref is non-zero, then decrement the refcount too.
  * Returns dentry requiring refcount drop, or NULL if we're done.
  */
 static struct dentry *dentry_kill(struct dentry *dentry)
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v2 2/6] fs/dcache: Move dentry_kill() below lock_parent()
  2018-02-22 23:50 [PATCH v2 0/6] fs/dcache: avoid trylock loops John Ogness
  2018-02-22 23:50 ` [PATCH v2 1/6] fs/dcache: Remove stale comment from dentry_kill() John Ogness
@ 2018-02-22 23:50 ` John Ogness
  2018-02-22 23:50 ` [PATCH v2 3/6] fs/dcache: Avoid the try_lock loop in d_delete() John Ogness
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 46+ messages in thread
From: John Ogness @ 2018-02-22 23:50 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Al Viro, Linus Torvalds, Christoph Hellwig, Thomas Gleixner,
	Peter Zijlstra, Sebastian Andrzej Siewior, linux-kernel

A subsequent patch will modify dentry_kill() to call lock_parent().
Move the dentry_kill() implementation "as is" below lock_parent()
first. This will help simplify the review of the subsequent patch
with dentry_kill() changes.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
---
 fs/dcache.c | 62 ++++++++++++++++++++++++++++++-------------------------------
 1 file changed, 31 insertions(+), 31 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 3c7e0148b453..9fed398687c9 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -589,37 +589,6 @@ static void __dentry_kill(struct dentry *dentry)
 		dentry_free(dentry);
 }
 
-/*
- * Finish off a dentry we've decided to kill.
- * dentry->d_lock must be held, returns with it unlocked.
- * Returns dentry requiring refcount drop, or NULL if we're done.
- */
-static struct dentry *dentry_kill(struct dentry *dentry)
-	__releases(dentry->d_lock)
-{
-	struct inode *inode = dentry->d_inode;
-	struct dentry *parent = NULL;
-
-	if (inode && unlikely(!spin_trylock(&inode->i_lock)))
-		goto failed;
-
-	if (!IS_ROOT(dentry)) {
-		parent = dentry->d_parent;
-		if (unlikely(!spin_trylock(&parent->d_lock))) {
-			if (inode)
-				spin_unlock(&inode->i_lock);
-			goto failed;
-		}
-	}
-
-	__dentry_kill(dentry);
-	return parent;
-
-failed:
-	spin_unlock(&dentry->d_lock);
-	return dentry; /* try again with same dentry */
-}
-
 static inline struct dentry *lock_parent(struct dentry *dentry)
 {
 	struct dentry *parent = dentry->d_parent;
@@ -655,6 +624,37 @@ static inline struct dentry *lock_parent(struct dentry *dentry)
 }
 
 /*
+ * Finish off a dentry we've decided to kill.
+ * dentry->d_lock must be held, returns with it unlocked.
+ * Returns dentry requiring refcount drop, or NULL if we're done.
+ */
+static struct dentry *dentry_kill(struct dentry *dentry)
+	__releases(dentry->d_lock)
+{
+	struct inode *inode = dentry->d_inode;
+	struct dentry *parent = NULL;
+
+	if (inode && unlikely(!spin_trylock(&inode->i_lock)))
+		goto failed;
+
+	if (!IS_ROOT(dentry)) {
+		parent = dentry->d_parent;
+		if (unlikely(!spin_trylock(&parent->d_lock))) {
+			if (inode)
+				spin_unlock(&inode->i_lock);
+			goto failed;
+		}
+	}
+
+	__dentry_kill(dentry);
+	return parent;
+
+failed:
+	spin_unlock(&dentry->d_lock);
+	return dentry; /* try again with same dentry */
+}
+
+/*
  * Try to do a lockless dput(), and return whether that was successful.
  *
  * If unsuccessful, we return false, having already taken the dentry lock.
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v2 3/6] fs/dcache: Avoid the try_lock loop in d_delete()
  2018-02-22 23:50 [PATCH v2 0/6] fs/dcache: avoid trylock loops John Ogness
  2018-02-22 23:50 ` [PATCH v2 1/6] fs/dcache: Remove stale comment from dentry_kill() John Ogness
  2018-02-22 23:50 ` [PATCH v2 2/6] fs/dcache: Move dentry_kill() below lock_parent() John Ogness
@ 2018-02-22 23:50 ` John Ogness
  2018-02-23  2:08   ` Al Viro
  2018-02-22 23:50 ` [PATCH v2 4/6] fs/dcache: Avoid the try_lock loops in dentry_kill() John Ogness
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 46+ messages in thread
From: John Ogness @ 2018-02-22 23:50 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Al Viro, Linus Torvalds, Christoph Hellwig, Thomas Gleixner,
	Peter Zijlstra, Sebastian Andrzej Siewior, linux-kernel

d_delete() holds dentry->d_lock and needs to acquire
dentry->d_inode->i_lock. This cannot be done with a spin_lock()
operation because it's the reverse of the regular lock order. To avoid
the ABBA deadlock it is done with a trylock loop.

Trylock loops are problematic in two scenarios:

  1) PREEMPT_RT converts spinlocks to 'sleeping' spinlocks, which are
     preemptible. As a consequence the i_lock holder can be preempted
     by a higher priority task. If that task executes the trylock loop
     it will do so forever and live lock.

  2) In virtual machines trylock loops are problematic as well. The
     VCPU on which the i_lock holder runs can be scheduled out and a
     task on a different VCPU can loop for a whole time slice. In the
     worst case this can lead to starvation. Commits 47be61845c77
     ("fs/dcache.c: avoid soft-lockup in dput()") and 046b961b45f9
     ("shrink_dentry_list(): take parent's d_lock earlier") are
     addressing exactly those symptoms.

The trylock loop can be avoided with functionality similar to
lock_parent(). The fast path tries the trylock first, which is likely
to succeed. In the contended case it attempts locking in the correct
order. This requires to drop dentry->d_lock first, which allows
another task to free d_inode. This can be prevented by the following
mechanism:

   inode = dentry->d_inode;
   rcu_read_lock();        <-- Protects d_inode from being freed,
                               i.e. inode is a valid pointer even
                               after dentry->d_lock is dropped
   unlock(dentry->d_lock);
   lock(inode->i_lock);
   lock(&dentry->d_lock);
   if (error)
       unlock(inode->i_lock);
   rcu_read_unlock();

After the locks are acquired it's necessary to verify whether
dentry->d_inode is still pointing to inode as it might have been
changed after dropping dentry->d_lock. If it matches d_delete() can
proceed, if not the whole operation has to be repeated.

Implement this in a new function dentry_lock_inode() which will be
used in a subsequent patch as well.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
---
 fs/dcache.c | 94 ++++++++++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 77 insertions(+), 17 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 9fed398687c9..bfdf1ff237f2 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -623,6 +623,71 @@ static inline struct dentry *lock_parent(struct dentry *dentry)
 	return parent;
 }
 
+/**
+ * dentry_lock_inode - Lock dentry->d_inode->i_lock
+ * @dentry: The dentry to operate on
+ *
+ * Tries to acquire @dentry->d_inode->i_lock with a trylock first. If
+ * that fails it retries in correct lock order, which requires dropping
+ * @dentry->d_lock under RCU protection and then reacquiring it after
+ * locking @dentry->d_inode->i_lock.
+ *
+ * If @dentry->d_lockref.count changes while trying to acquire
+ * @dentry->d_inode->i_lock, drop @dentry->d_inode->i_lock and return
+ * false. Otherwise return true.
+ *
+ * Note that all relevant struct members of @dentry must be reevaluated by
+ * the caller since @dentry->d_lock might have been temporarily dropped.
+ */
+static bool dentry_lock_inode(struct dentry *dentry)
+{
+	int saved_count = dentry->d_lockref.count;
+	struct inode *inode;
+
+	lockdep_assert_held(&dentry->d_lock);
+again:
+	inode = dentry->d_inode;
+	if (likely(spin_trylock(&inode->i_lock)))
+		return true;
+
+	/*
+	 * The inode struct pointed to by "inode" is protected by RCU,
+	 * i.e. destroy_inode() uses call_rcu() to reclaim the memory.
+	 * Using rcu_read_lock() ensures that the inode struct remains
+	 * valid after dropping @dentry->d_lock, independent of whether
+	 * or not @dentry->d_inode continues to point to that inode.
+	 */
+	rcu_read_lock();
+
+	spin_unlock(&dentry->d_lock);
+	spin_lock(&inode->i_lock);
+	spin_lock(&dentry->d_lock);
+
+	/*
+	 * @dentry->d_lockref.count might have changed after dropping
+	 * @dentry->d_lock. If so, release @inode->i_lock and tell caller.
+	 */
+	if (unlikely(dentry->d_lockref.count != saved_count)) {
+		spin_unlock(&inode->i_lock);
+		rcu_read_unlock();
+		return false;
+	}
+
+	/*
+	 * @dentry->d_inode might have changed after dropping @dentry->d_lock.
+	 * If so, release @inode->i_lock and restart.
+	 */
+	if (unlikely(inode != dentry->d_inode)) {
+		spin_unlock(&inode->i_lock);
+		rcu_read_unlock();
+		goto again;
+	}
+
+	rcu_read_unlock();
+
+	return true;
+}
+
 /*
  * Finish off a dentry we've decided to kill.
  * dentry->d_lock must be held, returns with it unlocked.
@@ -2373,32 +2438,27 @@ EXPORT_SYMBOL(d_hash_and_lookup);
  
 void d_delete(struct dentry * dentry)
 {
-	struct inode *inode;
-	int isdir = 0;
+	int isdir;
 	/*
 	 * Are we the only user?
 	 */
-again:
+
 	spin_lock(&dentry->d_lock);
-	inode = dentry->d_inode;
-	isdir = S_ISDIR(inode->i_mode);
-	if (dentry->d_lockref.count == 1) {
-		if (!spin_trylock(&inode->i_lock)) {
-			spin_unlock(&dentry->d_lock);
-			cpu_relax();
-			goto again;
-		}
-		dentry->d_flags &= ~DCACHE_CANT_MOUNT;
-		dentry_unlink_inode(dentry);
-		fsnotify_nameremove(dentry, isdir);
-		return;
-	}
 
+	if (dentry->d_lockref.count > 1 || !dentry_lock_inode(dentry))
+		goto drop;
+
+	dentry->d_flags &= ~DCACHE_CANT_MOUNT;
+	isdir = S_ISDIR(dentry->d_inode->i_mode);
+	dentry_unlink_inode(dentry);
+	fsnotify_nameremove(dentry, isdir);
+	return;
+drop:
 	if (!d_unhashed(dentry))
 		__d_drop(dentry);
 
+	isdir = S_ISDIR(dentry->d_inode->i_mode);
 	spin_unlock(&dentry->d_lock);
-
 	fsnotify_nameremove(dentry, isdir);
 }
 EXPORT_SYMBOL(d_delete);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v2 4/6] fs/dcache: Avoid the try_lock loops in dentry_kill()
  2018-02-22 23:50 [PATCH v2 0/6] fs/dcache: avoid trylock loops John Ogness
                   ` (2 preceding siblings ...)
  2018-02-22 23:50 ` [PATCH v2 3/6] fs/dcache: Avoid the try_lock loop in d_delete() John Ogness
@ 2018-02-22 23:50 ` John Ogness
  2018-02-23  2:22   ` Al Viro
  2018-02-22 23:50 ` [PATCH v2 5/6] fs/dcache: Avoid a try_lock loop in shrink_dentry_list() John Ogness
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 46+ messages in thread
From: John Ogness @ 2018-02-22 23:50 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Al Viro, Linus Torvalds, Christoph Hellwig, Thomas Gleixner,
	Peter Zijlstra, Sebastian Andrzej Siewior, linux-kernel

dentry_kill() holds dentry->d_lock and needs to acquire both
dentry->d_inode->i_lock and dentry->d_parent->d_lock. This cannot be
done with spin_lock() operations because it's the reverse of the
regular lock order. To avoid ABBA deadlocks it is done with two
trylock loops.

Trylock loops are problematic in two scenarios:

  1) PREEMPT_RT converts spinlocks to 'sleeping' spinlocks, which are
     preemptible. As a consequence the i_lock holder can be preempted
     by a higher priority task. If that task executes the trylock loop
     it will do so forever and live lock.

  2) In virtual machines trylock loops are problematic as well. The
     VCPU on which the i_lock holder runs can be scheduled out and a
     task on a different VCPU can loop for a whole time slice. In the
     worst case this can lead to starvation. Commits 47be61845c77
     ("fs/dcache.c: avoid soft-lockup in dput()") and 046b961b45f9
     ("shrink_dentry_list(): take parent's d_lock earlier") are
     addressing exactly those symptoms.

Avoid the trylock loops by using dentry_lock_inode() and lock_parent()
which take the locks in the appropriate order. As both functions might
drop dentry->lock briefly, this requires rechecking of the dentry
content as it might have changed while the lock was dropped.
dentry_lock_inode() performs the checks internally, but lock_parent()
relies on the caller to perform the checks.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
---
 fs/dcache.c | 70 +++++++++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 54 insertions(+), 16 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index bfdf1ff237f2..082361939b84 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -696,27 +696,67 @@ static bool dentry_lock_inode(struct dentry *dentry)
 static struct dentry *dentry_kill(struct dentry *dentry)
 	__releases(dentry->d_lock)
 {
-	struct inode *inode = dentry->d_inode;
-	struct dentry *parent = NULL;
+	int saved_count = dentry->d_lockref.count;
+	struct dentry *parent;
+	struct inode *inode;
 
-	if (inode && unlikely(!spin_trylock(&inode->i_lock)))
-		goto failed;
+again:
+	inode = dentry->d_inode;
 
-	if (!IS_ROOT(dentry)) {
-		parent = dentry->d_parent;
-		if (unlikely(!spin_trylock(&parent->d_lock))) {
-			if (inode)
-				spin_unlock(&inode->i_lock);
-			goto failed;
-		}
+	/*
+	 * Lock the inode. It will fail if the refcount
+	 * changed while trying to lock the inode.
+	 */
+	if (inode && !dentry_lock_inode(dentry))
+		goto out_ref_changed;
+
+	parent = lock_parent(dentry);
+
+	/*
+	 * Check refcount because it might have changed
+	 * if d_lock was temporarily dropped.
+	 */
+	if (unlikely(dentry->d_lockref.count != saved_count)) {
+		if (parent)
+			spin_unlock(&parent->d_lock);
+		if (inode)
+			spin_unlock(&inode->i_lock);
+		goto out_ref_changed;
+	}
+
+	/*
+	 * d_inode might have changed if d_lock was temporarily
+	 * dropped. If it changed it is necessary to start over
+	 * because a wrong inode (or no inode) lock is held.
+	 */
+	if (unlikely(inode != dentry->d_inode)) {
+		if (parent)
+			spin_unlock(&parent->d_lock);
+		if (inode)
+			spin_unlock(&inode->i_lock);
+		goto again;
 	}
 
 	__dentry_kill(dentry);
 	return parent;
 
-failed:
+out_ref_changed:
+	/*
+	 * The refcount was incremented while dentry->d_lock was dropped.
+	 * Just decrement the refcount, unlock, and tell the caller to
+	 * stop the directory walk.
+	 *
+	 * For paranoia reasons check whether the refcount is < 1. If so,
+	 * report the detection and avoid the decrement which would just
+	 * cause a problem in some other place. The warning might be
+	 * helpful to decode the root cause of the refcounting bug.
+	 */
+	if (!WARN_ON(dentry->d_lockref.count < 1))
+		dentry->d_lockref.count--;
+
 	spin_unlock(&dentry->d_lock);
-	return dentry; /* try again with same dentry */
+
+	return NULL;
 }
 
 /*
@@ -888,10 +928,8 @@ void dput(struct dentry *dentry)
 
 kill_it:
 	dentry = dentry_kill(dentry);
-	if (dentry) {
-		cond_resched();
+	if (dentry)
 		goto repeat;
-	}
 }
 EXPORT_SYMBOL(dput);
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v2 5/6] fs/dcache: Avoid a try_lock loop in shrink_dentry_list()
  2018-02-22 23:50 [PATCH v2 0/6] fs/dcache: avoid trylock loops John Ogness
                   ` (3 preceding siblings ...)
  2018-02-22 23:50 ` [PATCH v2 4/6] fs/dcache: Avoid the try_lock loops in dentry_kill() John Ogness
@ 2018-02-22 23:50 ` John Ogness
  2018-02-23  3:48   ` Al Viro
  2018-02-22 23:50 ` [PATCH v2 6/6] fs/dcache: Avoid remaining " John Ogness
  2018-02-23  0:59 ` [PATCH v2 0/6] fs/dcache: avoid trylock loops Linus Torvalds
  6 siblings, 1 reply; 46+ messages in thread
From: John Ogness @ 2018-02-22 23:50 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Al Viro, Linus Torvalds, Christoph Hellwig, Thomas Gleixner,
	Peter Zijlstra, Sebastian Andrzej Siewior, linux-kernel

shrink_dentry_list() holds dentry->d_lock and needs to acquire
dentry->d_inode->i_lock. This cannot be done with a spin_lock()
operation because it's the reverse of the regular lock order.
To avoid ABBA deadlocks it is done with a trylock loop.

Trylock loops are problematic in two scenarios:

  1) PREEMPT_RT converts spinlocks to 'sleeping' spinlocks, which are
     preemptible. As a consequence the i_lock holder can be preempted
     by a higher priority task. If that task executes the trylock loop
     it will do so forever and live lock.

  2) In virtual machines trylock loops are problematic as well. The
     VCPU on which the i_lock holder runs can be scheduled out and a
     task on a different VCPU can loop for a whole time slice. In the
     worst case this can lead to starvation. Commits 47be61845c77
     ("fs/dcache.c: avoid soft-lockup in dput()") and 046b961b45f9
     ("shrink_dentry_list(): take parent's d_lock earlier") are
     addressing exactly those symptoms.

Avoid the trylock loop by using dentry_kill(). When pruning ancestors,
the same code applies that is used to kill a dentry in dput(). This
also has the benefit that the locking order is now the same. First
the inode is locked, then the parent.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
---
 fs/dcache.c | 22 ++--------------------
 1 file changed, 2 insertions(+), 20 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 082361939b84..e470d49daa54 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1130,26 +1130,8 @@ static void shrink_dentry_list(struct list_head *list)
 		 * fragmentation.
 		 */
 		dentry = parent;
-		while (dentry && !lockref_put_or_lock(&dentry->d_lockref)) {
-			parent = lock_parent(dentry);
-			if (dentry->d_lockref.count != 1) {
-				dentry->d_lockref.count--;
-				spin_unlock(&dentry->d_lock);
-				if (parent)
-					spin_unlock(&parent->d_lock);
-				break;
-			}
-			inode = dentry->d_inode;	/* can't be NULL */
-			if (unlikely(!spin_trylock(&inode->i_lock))) {
-				spin_unlock(&dentry->d_lock);
-				if (parent)
-					spin_unlock(&parent->d_lock);
-				cpu_relax();
-				continue;
-			}
-			__dentry_kill(dentry);
-			dentry = parent;
-		}
+		while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
+			dentry = dentry_kill(dentry);
 	}
 }
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v2 6/6] fs/dcache: Avoid remaining try_lock loop in shrink_dentry_list()
  2018-02-22 23:50 [PATCH v2 0/6] fs/dcache: avoid trylock loops John Ogness
                   ` (4 preceding siblings ...)
  2018-02-22 23:50 ` [PATCH v2 5/6] fs/dcache: Avoid a try_lock loop in shrink_dentry_list() John Ogness
@ 2018-02-22 23:50 ` John Ogness
  2018-02-23  3:58   ` Al Viro
  2018-02-23  0:59 ` [PATCH v2 0/6] fs/dcache: avoid trylock loops Linus Torvalds
  6 siblings, 1 reply; 46+ messages in thread
From: John Ogness @ 2018-02-22 23:50 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Al Viro, Linus Torvalds, Christoph Hellwig, Thomas Gleixner,
	Peter Zijlstra, Sebastian Andrzej Siewior, linux-kernel

shrink_dentry_list() holds dentry->d_lock and needs to acquire
dentry->d_inode->i_lock. This cannot be done with a spin_lock()
operation because it's the reverse of the regular lock order.
To avoid ABBA deadlocks it is done with a trylock loop.

Trylock loops are problematic in two scenarios:

  1) PREEMPT_RT converts spinlocks to 'sleeping' spinlocks, which are
     preemptible. As a consequence the i_lock holder can be preempted
     by a higher priority task. If that task executes the trylock loop
     it will do so forever and live lock.

  2) In virtual machines trylock loops are problematic as well. The
     VCPU on which the i_lock holder runs can be scheduled out and a
     task on a different VCPU can loop for a whole time slice. In the
     worst case this can lead to starvation. Commits 47be61845c77
     ("fs/dcache.c: avoid soft-lockup in dput()") and 046b961b45f9
     ("shrink_dentry_list(): take parent's d_lock earlier") are
     addressing exactly those symptoms.

Avoid the trylock loop by using dentry_kill(). When killing dentries
from the dispose list, it is very similar to killing a dentry in
dput(). The difference is that dput() expects to be the last user of
the dentry (refcount=1) and will deref whereas shrink_dentry_list()
expects there to be no user (refcount=0). In order to handle both
situations with the same code, move the deref code from dentry_kill()
into a new wrapper function dentry_put_kill(), which can be used
by previous dentry_kill() users. Then shrink_dentry_list() can use
the dentry_kill() to cleanup the dispose list.

This also has the benefit that the locking order is now the same.
First the inode is locked, then the parent.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
---
 fs/dcache.c | 58 ++++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 36 insertions(+), 22 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index e470d49daa54..23a90a64d74f 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -690,11 +690,17 @@ static bool dentry_lock_inode(struct dentry *dentry)
 
 /*
  * Finish off a dentry we've decided to kill.
- * dentry->d_lock must be held, returns with it unlocked.
- * Returns dentry requiring refcount drop, or NULL if we're done.
+ * dentry->d_lock must be held and returns with it unlocked if the
+ * dentry was killed.
+ *
+ * Returns parent dentry requiring refcount drop or NULL if we're done
+ * or ERR_PTR(-EINTR) if the dentry was not killed because its refcount
+ * changed while preparing to kill.
+ *
+ * Note, if the dentry was not killed, i.e. ERR_PTR(-EINTR) returned,
+ * dentry->d_lock is left locked!
  */
 static struct dentry *dentry_kill(struct dentry *dentry)
-	__releases(dentry->d_lock)
 {
 	int saved_count = dentry->d_lockref.count;
 	struct dentry *parent;
@@ -741,6 +747,27 @@ static struct dentry *dentry_kill(struct dentry *dentry)
 	return parent;
 
 out_ref_changed:
+	/* dentry->d_lock still locked! */
+	return ERR_PTR(-EINTR);
+}
+
+/*
+ * Finish off a dentry where we are the last user (refcount=1) and
+ * we've decided to kill it.
+ * dentry->d_lock must be held, returns with it unlocked.
+ * Returns dentry requiring refcount drop, or NULL if we're done.
+ */
+static struct dentry *dentry_put_kill(struct dentry *dentry)
+	__releases(dentry->d_lock)
+{
+	struct dentry *parent;
+
+	parent = dentry_kill(dentry);
+	if (likely(!IS_ERR(parent)))
+		return parent;
+
+	/* dentry->d_lock still held */
+
 	/*
 	 * The refcount was incremented while dentry->d_lock was dropped.
 	 * Just decrement the refcount, unlock, and tell the caller to
@@ -927,7 +954,7 @@ void dput(struct dentry *dentry)
 	return;
 
 kill_it:
-	dentry = dentry_kill(dentry);
+	dentry = dentry_put_kill(dentry);
 	if (dentry)
 		goto repeat;
 }
@@ -1078,10 +1105,8 @@ static void shrink_dentry_list(struct list_head *list)
 	struct dentry *dentry, *parent;
 
 	while (!list_empty(list)) {
-		struct inode *inode;
 		dentry = list_entry(list->prev, struct dentry, d_lru);
 		spin_lock(&dentry->d_lock);
-		parent = lock_parent(dentry);
 
 		/*
 		 * The dispose list is isolated and dentries are not accounted
@@ -1089,15 +1114,13 @@ static void shrink_dentry_list(struct list_head *list)
 		 * here regardless of whether it is referenced or not.
 		 */
 		d_shrink_del(dentry);
-
+again:
 		/*
 		 * We found an inuse dentry which was not removed from
 		 * the LRU because of laziness during lookup. Do not free it.
 		 */
 		if (dentry->d_lockref.count > 0) {
 			spin_unlock(&dentry->d_lock);
-			if (parent)
-				spin_unlock(&parent->d_lock);
 			continue;
 		}
 
@@ -1105,23 +1128,14 @@ static void shrink_dentry_list(struct list_head *list)
 		if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
 			bool can_free = dentry->d_flags & DCACHE_MAY_FREE;
 			spin_unlock(&dentry->d_lock);
-			if (parent)
-				spin_unlock(&parent->d_lock);
 			if (can_free)
 				dentry_free(dentry);
 			continue;
 		}
 
-		inode = dentry->d_inode;
-		if (inode && unlikely(!spin_trylock(&inode->i_lock))) {
-			d_shrink_add(dentry, list);
-			spin_unlock(&dentry->d_lock);
-			if (parent)
-				spin_unlock(&parent->d_lock);
-			continue;
-		}
-
-		__dentry_kill(dentry);
+		parent = dentry_kill(dentry);
+		if (unlikely(IS_ERR(parent)))
+			goto again;
 
 		/*
 		 * We need to prune ancestors too. This is necessary to prevent
@@ -1131,7 +1145,7 @@ static void shrink_dentry_list(struct list_head *list)
 		 */
 		dentry = parent;
 		while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
-			dentry = dentry_kill(dentry);
+			dentry = dentry_put_kill(dentry);
 	}
 }
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 0/6] fs/dcache: avoid trylock loops
  2018-02-22 23:50 [PATCH v2 0/6] fs/dcache: avoid trylock loops John Ogness
                   ` (5 preceding siblings ...)
  2018-02-22 23:50 ` [PATCH v2 6/6] fs/dcache: Avoid remaining " John Ogness
@ 2018-02-23  0:59 ` Linus Torvalds
  6 siblings, 0 replies; 46+ messages in thread
From: Linus Torvalds @ 2018-02-23  0:59 UTC (permalink / raw)
  To: John Ogness
  Cc: linux-fsdevel, Al Viro, Christoph Hellwig, Thomas Gleixner,
	Peter Zijlstra, Sebastian Andrzej Siewior,
	Linux Kernel Mailing List

On Thu, Feb 22, 2018 at 3:50 PM, John Ogness <john.ogness@linutronix.de> wrote:
>
> This patchset removes all trylock loops from within the dcache code.

I'm not hating it.

It may be because I got used (resigned?) to this from the discussion
around the previous version. But it looks ok to me.

But this is very much an area where I do want Al to ack it all, so
let's see if he's any happier now.

              Linus

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 3/6] fs/dcache: Avoid the try_lock loop in d_delete()
  2018-02-22 23:50 ` [PATCH v2 3/6] fs/dcache: Avoid the try_lock loop in d_delete() John Ogness
@ 2018-02-23  2:08   ` Al Viro
  0 siblings, 0 replies; 46+ messages in thread
From: Al Viro @ 2018-02-23  2:08 UTC (permalink / raw)
  To: John Ogness
  Cc: linux-fsdevel, Linus Torvalds, Christoph Hellwig,
	Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior,
	linux-kernel

On Fri, Feb 23, 2018 at 12:50:22AM +0100, John Ogness wrote:

> The trylock loop can be avoided with functionality similar to
> lock_parent(). The fast path tries the trylock first, which is likely
> to succeed. In the contended case it attempts locking in the correct
> order. This requires to drop dentry->d_lock first, which allows
> another task to free d_inode.

Wait a minute.  _What_ allows another task to free ->d_inode on
a dentry we are holding a reference to?  Any place like that is
a serious bug - after all, what's to prevent the same place
doing that to dentry of an opened file, with obvious ugly
results.

That's the whole reason why d_delete() is *NOT* making dentry
negative when refcount is greater than 1 (i.e. when somebody
else is holding a reference).

Rules for ->d_inode:

* initially NULL.

* only changes under ->d_lock

* __dentry_kill() makes it NULL after dentry has been
	+ marked dead
	+ evicted from all lists except possibly shrink one.
  with ->d_lock held through all of that.  The only thing
  that can be done by anybody else with the ones stuck on
  shrink list is actually freeing them.

  Note that once __dentry_kill() is called, that's it - dentry
  is ours, for all practical purposes.  There'd better be no
  other references to that sucker and we make sure that no new
  ones will arise.

* prior to the call of __dentry_kill() any would-be changer
  of ->d_inode must be holding a reference to dentry.

* changes from non-NULL to NULL are possible only when there's
  nobody else holding references.

Changes from NULL to non-NULL _are_ possible (caller must be
holding a reference, but that's it).  However, feeding a negative
dentry to your dentry_lock_inode() is an instant oops - it won't
live to the point where you would recheck ->d_inode for changes.

So if you see any place where positive could be changed to negative
under us, we do have a problem.  Big one.

Refcount can change once we drop ->d_lock, but it can't get to zero -
our reference is still with us.

Note that ->d_parent *CAN* change, no matter how many references are
held.  That's what rcu games in lock_parent() are about - dentry
can be moved and ex-parent could've been freed if that was the last
reference.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 4/6] fs/dcache: Avoid the try_lock loops in dentry_kill()
  2018-02-22 23:50 ` [PATCH v2 4/6] fs/dcache: Avoid the try_lock loops in dentry_kill() John Ogness
@ 2018-02-23  2:22   ` Al Viro
  2018-02-23  3:12     ` Al Viro
  0 siblings, 1 reply; 46+ messages in thread
From: Al Viro @ 2018-02-23  2:22 UTC (permalink / raw)
  To: John Ogness
  Cc: linux-fsdevel, Linus Torvalds, Christoph Hellwig,
	Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior,
	linux-kernel

On Fri, Feb 23, 2018 at 12:50:23AM +0100, John Ogness wrote:
>  static struct dentry *dentry_kill(struct dentry *dentry)
>  	__releases(dentry->d_lock)
>  {
> -	struct inode *inode = dentry->d_inode;
> -	struct dentry *parent = NULL;
> +	int saved_count = dentry->d_lockref.count;

	Umm...  How can that be not 1?  After all, fast_dput() should
never return false without ->d_lock being held *and* ->d_count being
equal to 1.

> +	/*
> +	 * d_inode might have changed if d_lock was temporarily
> +	 * dropped. If it changed it is necessary to start over
> +	 * because a wrong inode (or no inode) lock is held.
> +	 */

If it might have changed, we are fucked.

> +out_ref_changed:
> +	/*
> +	 * The refcount was incremented while dentry->d_lock was dropped.
> +	 * Just decrement the refcount, unlock, and tell the caller to
> +	 * stop the directory walk.
> +	 */
> +	if (!WARN_ON(dentry->d_lockref.count < 1))
> +		dentry->d_lockref.count--;
> +
>  	spin_unlock(&dentry->d_lock);
> -	return dentry; /* try again with same dentry */
> +
> +	return NULL;

No.  This is completely wrong.  If somebody else has found the sucker
while we dropped the lock and even got around to playing with refcount,
they might have done more than that.

In particular, they might have *dropped* their reference, after e.g.
picking it as our inode's alias and rehashed the fucker.  Making
our decision not to retain it no longer valid.  And your code will
not notice that.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 4/6] fs/dcache: Avoid the try_lock loops in dentry_kill()
  2018-02-23  2:22   ` Al Viro
@ 2018-02-23  3:12     ` Al Viro
  2018-02-23  3:16       ` Al Viro
  2018-02-23  5:46       ` Al Viro
  0 siblings, 2 replies; 46+ messages in thread
From: Al Viro @ 2018-02-23  3:12 UTC (permalink / raw)
  To: John Ogness
  Cc: linux-fsdevel, Linus Torvalds, Christoph Hellwig,
	Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior,
	linux-kernel

On Fri, Feb 23, 2018 at 02:22:43AM +0000, Al Viro wrote:

> No.  This is completely wrong.  If somebody else has found the sucker
> while we dropped the lock and even got around to playing with refcount,
> they might have done more than that.
> 
> In particular, they might have *dropped* their reference, after e.g.
> picking it as our inode's alias and rehashed the fucker.  Making
> our decision not to retain it no longer valid.  And your code will
> not notice that.

PS: I really wonder if we should treat the failure to trylock ->i_lock
and parent's ->d_lock at that point (we are already off the fast path
here) as
	* drop all spinlocks we'd got
	* grab ->i_lock
	* grab ->d_lock
	* lock_parent()
	* act as if fast_dput() has returned 0, only remember to drop ->i_lock
and unlock parent before dropping ->d_lock if we decide to keep it.

IOW, add

static inline bool retain_dentry(struct dentry *dentry)
{
        WARN_ON(d_in_lookup(dentry));

        /* Unreachable? Get rid of it */
        if (unlikely(d_unhashed(dentry)))
                return false;

        if (unlikely(dentry->d_flags & DCACHE_DISCONNECTED))
                return false;

        if (unlikely(dentry->d_flags & DCACHE_OP_DELETE)) {
                if (dentry->d_op->d_delete(dentry))
                        return false;
        }

	dentry_lru_add(dentry);
	dentry->d_lockref.count--;
	return true;
}

then have dput() do
{
        if (unlikely(!dentry))
                return;
repeat:
        might_sleep();

        rcu_read_lock();
        if (likely(fast_dput(dentry))) {
                rcu_read_unlock();
                return;
        }

        /* Slow case: now with the dentry lock held */
        rcu_read_unlock();
	if (likely(retain_dentry(dentry))) {
		spin_unlock(&dentry->d_lock);
		return;
	}
	dentry = dentry_kill(dentry);
	if (dentry)
		goto repeat;
}

with dentry_kill() being pretty much as it is now, except that
it would be ending on

failed:
	spin_unlock(&dentry->d_lock);
	spin_lock(&inode->i_lock);
	spin_lock(&dentry->d_lock);
	parent = lock_parent(dentry);
	/* retain_dentry() needs ->count == 1 already checked)
	if (dentry->d_lockref.count == 1 && !retain_dentry(dentry)) {
		__dentry_kill(dentry);
		return parent;
	}
	/* we are keeping it, after all */
	if (inode)
		spin_unlock(&inode->i_lock);
	spin_unlock(&dentry->d_lock);
	if (parent)
		spin_unlock(&parent->d_lock);
	return NULL;
}

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 4/6] fs/dcache: Avoid the try_lock loops in dentry_kill()
  2018-02-23  3:12     ` Al Viro
@ 2018-02-23  3:16       ` Al Viro
  2018-02-23  5:46       ` Al Viro
  1 sibling, 0 replies; 46+ messages in thread
From: Al Viro @ 2018-02-23  3:16 UTC (permalink / raw)
  To: John Ogness
  Cc: linux-fsdevel, Linus Torvalds, Christoph Hellwig,
	Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior,
	linux-kernel

On Fri, Feb 23, 2018 at 03:12:14AM +0000, Al Viro wrote:

> 	/* retain_dentry() needs ->count == 1 already checked)

... obviously not even compile-tested ;-)

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 5/6] fs/dcache: Avoid a try_lock loop in shrink_dentry_list()
  2018-02-22 23:50 ` [PATCH v2 5/6] fs/dcache: Avoid a try_lock loop in shrink_dentry_list() John Ogness
@ 2018-02-23  3:48   ` Al Viro
  0 siblings, 0 replies; 46+ messages in thread
From: Al Viro @ 2018-02-23  3:48 UTC (permalink / raw)
  To: John Ogness
  Cc: linux-fsdevel, Linus Torvalds, Christoph Hellwig,
	Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior,
	linux-kernel

On Fri, Feb 23, 2018 at 12:50:24AM +0100, John Ogness wrote:
> -		while (dentry && !lockref_put_or_lock(&dentry->d_lockref)) {
> -			parent = lock_parent(dentry);
> -			if (dentry->d_lockref.count != 1) {
> -				dentry->d_lockref.count--;
> -				spin_unlock(&dentry->d_lock);
> -				if (parent)
> -					spin_unlock(&parent->d_lock);
> -				break;
> -			}
> -			inode = dentry->d_inode;	/* can't be NULL */
> -			if (unlikely(!spin_trylock(&inode->i_lock))) {
> -				spin_unlock(&dentry->d_lock);
> -				if (parent)
> -					spin_unlock(&parent->d_lock);
> -				cpu_relax();
> -				continue;
> -			}
> -			__dentry_kill(dentry);
> -			dentry = parent;
> -		}
> +		while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
> +			dentry = dentry_kill(dentry);

Hmm...  OK, that's interesting.  I agree that it looks similar to dentry_kill()
loop, with one exception - here we are aggressively pruning the branch.  None
of the "do we want to retain that sucker" stuff here.  It doesn't matter for
most of the callers, with one exception: prune_dcache_sb().  OTOH, there it
just might be the right thing to do anyway - after all, it matters only if
somebody has grabbed and dropped the sucker while we'd been trying to do
lock_parent().  Had we lost the race with their dput(), we would've left
the damn thing alone, and we are called from a memory shrinker, so we'll get
called again if needed.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 6/6] fs/dcache: Avoid remaining try_lock loop in shrink_dentry_list()
  2018-02-22 23:50 ` [PATCH v2 6/6] fs/dcache: Avoid remaining " John Ogness
@ 2018-02-23  3:58   ` Al Viro
  2018-02-23  4:08     ` Al Viro
  0 siblings, 1 reply; 46+ messages in thread
From: Al Viro @ 2018-02-23  3:58 UTC (permalink / raw)
  To: John Ogness
  Cc: linux-fsdevel, Linus Torvalds, Christoph Hellwig,
	Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior,
	linux-kernel

On Fri, Feb 23, 2018 at 12:50:25AM +0100, John Ogness wrote:

> Avoid the trylock loop by using dentry_kill(). When killing dentries
> from the dispose list, it is very similar to killing a dentry in
> dput(). The difference is that dput() expects to be the last user of
> the dentry (refcount=1) and will deref whereas shrink_dentry_list()
> expects there to be no user (refcount=0). In order to handle both
> situations with the same code, move the deref code from dentry_kill()
> into a new wrapper function dentry_put_kill(), which can be used
> by previous dentry_kill() users. Then shrink_dentry_list() can use
> the dentry_kill() to cleanup the dispose list.
> 
> This also has the benefit that the locking order is now the same.
> First the inode is locked, then the parent.

Current code moves the sucker to the end of list in that case; I'm not
at all sure that what you are doing will improve the situation at all...

You *still* have a trylock loop there - only it keeps banging at the
same dentry instead of going through the rest first...

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 6/6] fs/dcache: Avoid remaining try_lock loop in shrink_dentry_list()
  2018-02-23  3:58   ` Al Viro
@ 2018-02-23  4:08     ` Al Viro
  2018-02-23 13:57       ` John Ogness
  0 siblings, 1 reply; 46+ messages in thread
From: Al Viro @ 2018-02-23  4:08 UTC (permalink / raw)
  To: John Ogness
  Cc: linux-fsdevel, Linus Torvalds, Christoph Hellwig,
	Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior,
	linux-kernel

On Fri, Feb 23, 2018 at 03:58:14AM +0000, Al Viro wrote:
> On Fri, Feb 23, 2018 at 12:50:25AM +0100, John Ogness wrote:
> 
> > Avoid the trylock loop by using dentry_kill(). When killing dentries
> > from the dispose list, it is very similar to killing a dentry in
> > dput(). The difference is that dput() expects to be the last user of
> > the dentry (refcount=1) and will deref whereas shrink_dentry_list()
> > expects there to be no user (refcount=0). In order to handle both
> > situations with the same code, move the deref code from dentry_kill()
> > into a new wrapper function dentry_put_kill(), which can be used
> > by previous dentry_kill() users. Then shrink_dentry_list() can use
> > the dentry_kill() to cleanup the dispose list.
> > 
> > This also has the benefit that the locking order is now the same.
> > First the inode is locked, then the parent.
> 
> Current code moves the sucker to the end of list in that case; I'm not
> at all sure that what you are doing will improve the situation at all...
> 
> You *still* have a trylock loop there - only it keeps banging at the
> same dentry instead of going through the rest first...

Actually, it's even worse - _here_ you are dealing with something that
really can change inode under you.  This is one and only case where
we are kicking out a zero-refcount dentry without having already held
->i_lock.  At the very least, it's bloody different from regular
dentry_kill().  In this case, dentry itself is protected from freeing
by being on the shrink list - that's what makes __dentry_kill() to
leave the sucker allocated.   We are not holding references, it is
hashed and anybody could come, pick it, d_delete() it, etc.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 4/6] fs/dcache: Avoid the try_lock loops in dentry_kill()
  2018-02-23  3:12     ` Al Viro
  2018-02-23  3:16       ` Al Viro
@ 2018-02-23  5:46       ` Al Viro
  1 sibling, 0 replies; 46+ messages in thread
From: Al Viro @ 2018-02-23  5:46 UTC (permalink / raw)
  To: John Ogness
  Cc: linux-fsdevel, Linus Torvalds, Christoph Hellwig,
	Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior,
	linux-kernel

On Fri, Feb 23, 2018 at 03:12:14AM +0000, Al Viro wrote:

> failed:
> 	spin_unlock(&dentry->d_lock);
> 	spin_lock(&inode->i_lock);
> 	spin_lock(&dentry->d_lock);
> 	parent = lock_parent(dentry);

Hmm...  Negative dentry case obviously is trickier - not to mention oopsen,
it might have become positive under us.  Bugger...  OTOH, it's not much
trickier - with negative dentry we can only fail on trying to lock the
parent, in which case we should just check that it's still negative before
killing it off.  If it has gone positive on us, we'll just unlock the
parent and we are back to the normal "positive dentry, only ->d_lock held"
case.  At most one retry there - once it's positive, it stays positive.
So,

static struct dentry *dentry_kill(struct dentry *dentry)
        __releases(dentry->d_lock)
{
	struct inode *inode = dentry->d_inode;
	struct dentry *parent = NULL;

	if (inode && unlikely(!spin_trylock(&inode->i_lock)))
		goto no_locks;

	if (!IS_ROOT(dentry)) {
		parent = dentry->d_parent;
		if (unlikely(!spin_trylock(&parent->d_lock))) {
			if (inode) {
				spin_unlock(&inode->i_lock);
				goto no_locks;
			}
			goto need_parent;
		}
	}
kill_it:
	__dentry_kill(dentry);
	return parent;

no_locks:	/* positive, only ->d_lock held */
	spin_unlock(&dentry->d_lock);
	spin_lock(&inode->i_lock);
	spin_lock(&dentry->d_lock);
need_parent:
	parent = lock_parent(dentry);
	if (unlikely(dentry->d_lockref.count != 1 || retain_dentry(dentry))) {
		/* we are keeping it, after all */
		if (inode)
			spin_unlock(&inode->i_lock);
		spin_unlock(&dentry->d_lock);
		if (parent)
			spin_unlock(&parent->d_lock);
		return NULL;
	}
	/* it should die */
	if (inode)	/* was positive, ->d_inode unchanged, locks held */
		goto kill_it;
	inode = dentry->d_inode;	// READ_ONCE?
	if (!inode)	/* still negative, locks held */
		goto kill_it;
	/* negative became positive; it can't become negative again */
	if (parent)
		spin_unlock(&parent->d_lock);
	goto no_locks;	/* once */
}

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 6/6] fs/dcache: Avoid remaining try_lock loop in shrink_dentry_list()
  2018-02-23  4:08     ` Al Viro
@ 2018-02-23 13:57       ` John Ogness
  2018-02-23 15:09         ` Al Viro
  0 siblings, 1 reply; 46+ messages in thread
From: John Ogness @ 2018-02-23 13:57 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, Linus Torvalds, Christoph Hellwig,
	Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior,
	linux-kernel

Hi Al,

I am responding to these comments first because they touch on some
fundemental reasons behind the motivation of the patchset.

On 2018-02-23, Al Viro <viro@ZenIV.linux.org.uk> wrote:
>>> Avoid the trylock loop by using dentry_kill(). When killing dentries
>>> from the dispose list, it is very similar to killing a dentry in
>>> dput(). The difference is that dput() expects to be the last user of
>>> the dentry (refcount=1) and will deref whereas shrink_dentry_list()
>>> expects there to be no user (refcount=0). In order to handle both
>>> situations with the same code, move the deref code from dentry_kill()
>>> into a new wrapper function dentry_put_kill(), which can be used
>>> by previous dentry_kill() users. Then shrink_dentry_list() can use
>>> the dentry_kill() to cleanup the dispose list.
>>> 
>>> This also has the benefit that the locking order is now the same.
>>> First the inode is locked, then the parent.
>> 
>> Current code moves the sucker to the end of list in that case; I'm not
>> at all sure that what you are doing will improve the situation at all...
>> 
>> You *still* have a trylock loop there - only it keeps banging at the
>> same dentry instead of going through the rest first...

What I am doing is a loop, but it is not a trylock loop. The trylocks in
the code only exist as likely() optimization. What I am doing is
spinning on each lock I need in the correct locking order until I get
them all. This is quite different than looping until I chance to get the
locks I need by using only trylocks and in the incorrect locking order.

> Actually, it's even worse - _here_ you are dealing with something that
> really can change inode under you.  This is one and only case where we
> are kicking out a zero-refcount dentry without having already held
> ->i_lock.  At the very least, it's bloody different from regular
> dentry_kill().  In this case, dentry itself is protected from freeing
> by being on the shrink list - that's what makes __dentry_kill() to
> leave the sucker allocated.  We are not holding references, it is
> hashed and anybody could come, pick it, d_delete() it, etc.

Yes, and that is why the new dentry_lock_inode() and dentry_kill()
functions react to any changes in refcount and check for inode
changes. Obviously for d_delete() the helper functions are checking way
more than they need to. But if we've missed the trylock optimization
we're already in the unlikely case, so the extra checks _may_ be
acceptable in order to have simplified code. As Linus already pointed
out, the cost of spinning will likely overshadow the cost of a few
compares.

These patches consolidate the 4 trylocking loops into 3 helper
functions: dentry_lock_inode(), dentry_kill(), dentry_put_kill(). And
these helper functions base off one another. Through that consolidation,
all former trylock-loops are now spinning on locks in the correct
locking order. This will directly address the two problems that are
mentioned in the commit messages: PREEMPT_RT sleeping spinlocks with
priorities and current cpu_relax()/cond_resched() efforts to try to
handle bad luck in trylock loops.

Do you recommend I avoid consolidating the 4 trylock loops into the same
set of helper functions and instead handle them all separately (as is
the case in mainline)?

Or maybe the problem is how my patchset is assembling the final
result. If patch 3 and 4 were refined to address your concerns about
them but then by the end of the 6th patch we still end up where we are
now, is that something that is palatable?

IOW, do the patches only need (possibly a lot of) refinement or do you
consider this approach fundamentally flawed?

John Ogness

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 6/6] fs/dcache: Avoid remaining try_lock loop in shrink_dentry_list()
  2018-02-23 13:57       ` John Ogness
@ 2018-02-23 15:09         ` Al Viro
  2018-02-23 17:42           ` Al Viro
  0 siblings, 1 reply; 46+ messages in thread
From: Al Viro @ 2018-02-23 15:09 UTC (permalink / raw)
  To: John Ogness
  Cc: linux-fsdevel, Linus Torvalds, Christoph Hellwig,
	Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior,
	linux-kernel

On Fri, Feb 23, 2018 at 02:57:23PM +0100, John Ogness wrote:

> > Actually, it's even worse - _here_ you are dealing with something that
> > really can change inode under you.  This is one and only case where we
> > are kicking out a zero-refcount dentry without having already held
> > ->i_lock.  At the very least, it's bloody different from regular
> > dentry_kill().  In this case, dentry itself is protected from freeing
> > by being on the shrink list - that's what makes __dentry_kill() to
> > leave the sucker allocated.  We are not holding references, it is
> > hashed and anybody could come, pick it, d_delete() it, etc.
> 
> Yes, and that is why the new dentry_lock_inode() and dentry_kill()
> functions react to any changes in refcount and check for inode
> changes. Obviously for d_delete() the helper functions are checking way
> more than they need to. But if we've missed the trylock optimization
> we're already in the unlikely case, so the extra checks _may_ be
> acceptable in order to have simplified code. As Linus already pointed
> out, the cost of spinning will likely overshadow the cost of a few
> compares.

It's not that you are checking extra things - you are checking the wrong
things.  "Refcount has returned to original" is useless.

> Do you recommend I avoid consolidating the 4 trylock loops into the same
> set of helper functions and instead handle them all separately (as is
> the case in mainline)?
> 
> Or maybe the problem is how my patchset is assembling the final
> result. If patch 3 and 4 were refined to address your concerns about
> them but then by the end of the 6th patch we still end up where we are
> now, is that something that is palatable?

No.  The place where you end up with dput() is flat-out wrong.

> IOW, do the patches only need (possibly a lot of) refinement or do you
> consider this approach fundamentally flawed?

You are conflating the "we have a reference" cases with this one, and
they are very different.  Note, BTW, that had we raced with somebody
else grabbing a reference, we would've quietly dropped dentry from
the shrink list; what if we do the following: just after checking that
refcount is not positive, do
	inode = dentry->d_inode;
	if unlikely(inode && !spin_trylock...)
		rcu_read_lock
		drop ->d_lock
		grab inode->i_lock
		grab ->d_lock
		if unlikely(dentry->d_inode != inode)
			drop inode->i_lock
			rcu_read_unlock
			if !killed
				drop ->d_lock
				drop parent's ->d_lock
				continue;
		else
			rcu_read_unlock
*before* going into
                if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
                        bool can_free = dentry->d_flags & DCACHE_MAY_FREE;
                        spin_unlock(&dentry->d_lock);
			...
part?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 6/6] fs/dcache: Avoid remaining try_lock loop in shrink_dentry_list()
  2018-02-23 15:09         ` Al Viro
@ 2018-02-23 17:42           ` Al Viro
  2018-02-23 20:13             ` [BUG] lock_parent() breakage when used from shrink_dentry_list() (was Re: [PATCH v2 6/6] fs/dcache: Avoid remaining try_lock loop in shrink_dentry_list()) Al Viro
  0 siblings, 1 reply; 46+ messages in thread
From: Al Viro @ 2018-02-23 17:42 UTC (permalink / raw)
  To: John Ogness
  Cc: linux-fsdevel, Linus Torvalds, Christoph Hellwig,
	Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior,
	linux-kernel

On Fri, Feb 23, 2018 at 03:09:28PM +0000, Al Viro wrote:
> You are conflating the "we have a reference" cases with this one, and
> they are very different.  Note, BTW, that had we raced with somebody
> else grabbing a reference, we would've quietly dropped dentry from
> the shrink list; what if we do the following: just after checking that
> refcount is not positive, do
> 	inode = dentry->d_inode;
> 	if unlikely(inode && !spin_trylock...)
> 		rcu_read_lock
> 		drop ->d_lock
> 		grab inode->i_lock
> 		grab ->d_lock
> 		if unlikely(dentry->d_inode != inode)
> 			drop inode->i_lock
> 			rcu_read_unlock
> 			if !killed
> 				drop ->d_lock
> 				drop parent's ->d_lock
> 				continue;
> 		else
> 			rcu_read_unlock
> *before* going into
>                 if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
>                         bool can_free = dentry->d_flags & DCACHE_MAY_FREE;
>                         spin_unlock(&dentry->d_lock);
> 			...
> part?

Owww....  It's actually even nastier than I realized - dropping ->d_lock
opens us to having the sucker freed by dput() from another thread here.
IOW, between d_shrink_del(dentry) and __dentry_kill(dentry) dropping ->d_lock
is dangerous...

It's really very different from all other cases, and the trickiest by far.

FWIW, my impression from the series:
	1) dentry_kill() should deal with trylock failures on its own, leaving
the callers only the real "we need to drop the parent" case.  See upthread for
one variant of doing that.
	2) switching parent eviction in shrink_dentry_list() to dentry_kill()
is fine.
	3) for d_delete() trylock loop is wrong; however, it does not need
anything more elaborate than
{
        struct inode *inode;
        int isdir = d_is_dir(dentry);
        /*
         * Are we the only user?
         */
        spin_lock(&dentry->d_lock);
        if (dentry->d_lockref.count != 1)
		goto Shared;

        inode = dentry->d_inode;
	if (unlikely(!spin_trylock(&inode->i_lock))) {
		spin_unlock(&dentry->d_lock);
		spin_lock(&inode->i_lock);
		spin_lock(&dentry->d_lock);
		if (dentry->d_lockref.count != 1) {
			spin_unlock(&inode->i_lock);
			goto Shared;
		}
	}
           
	dentry->d_flags &= ~DCACHE_CANT_MOUNT;
	dentry_unlink_inode(dentry);
	fsnotify_nameremove(dentry, isdir);
	return;

Shared:	/* can't make it negative, must unhash */
        if (!d_unhashed(dentry))
                __d_drop(dentry);
        spin_unlock(&dentry->d_lock);

        fsnotify_nameremove(dentry, isdir);
}

If not an outright "lock inode first from the very beginning" - note that
inode is stable (and non-NULL) here.  IOW, that needs to be compared with
{
        struct inode *inode = dentry->d_inode;
        int isdir = d_is_dir(dentry);
        spin_lock(&inode->i_lock);
        spin_lock(&dentry->d_lock);
        /*
         * Are we the only user?
         */
        if (dentry->d_lockref.count == 1) {
		dentry->d_flags &= ~DCACHE_CANT_MOUNT;
		dentry_unlink_inode(dentry);
	} else {
		if (!d_unhashed(dentry))
			__d_drop(dentry);
		spin_unlock(&dentry->d_lock);
		spin_unlock(&inode->i_lock);
	}
	fsnotify_nameremove(dentry, isdir);
}

That costs an extra boinking the ->i_lock in case dentry is shared, but it's
much shorter and simpler that way.  Needs profiling; if the second variant
does not give worse performance, I would definitely prefer that one.
	4) the nasty one - shrink_dentry_list() evictions of zero-count dentries.
_That_ calls for careful use of RCU, etc. - none of the others need that.  Need
to think how to deal with that sucker; in any case, I do not believe that sharing
said RCU use, etc. with any other cases would do anything other than obfuscating
the rest.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [BUG] lock_parent() breakage when used from shrink_dentry_list() (was Re: [PATCH v2 6/6] fs/dcache: Avoid remaining try_lock loop in shrink_dentry_list())
  2018-02-23 17:42           ` Al Viro
@ 2018-02-23 20:13             ` Al Viro
  2018-02-23 21:35               ` Linus Torvalds
  0 siblings, 1 reply; 46+ messages in thread
From: Al Viro @ 2018-02-23 20:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-fsdevel, Christoph Hellwig, Thomas Gleixner,
	Peter Zijlstra, Sebastian Andrzej Siewior, linux-kernel,
	John Ogness

On Fri, Feb 23, 2018 at 05:42:16PM +0000, Al Viro wrote:

> 	4) the nasty one - shrink_dentry_list() evictions of zero-count dentries.
> _That_ calls for careful use of RCU, etc. - none of the others need that.  Need
> to think how to deal with that sucker; in any case, I do not believe that sharing
> said RCU use, etc. with any other cases would do anything other than obfuscating
> the rest.

Arrrgh...  Actually, there's a nasty corner case where the variant in mainline is
broken.  Look:
	dentry placed on a shrink list
	we pick the fucker from the list and lock it.
	we call lock_parent() on it.
		dentry is not a root and it's not deleted, so we proceed.
		trylock fails.
		we grab rcu_read_lock()
		we drop dentry->d_lock
	on another CPU, something does e.g. d_prune_aliases() (or finds the
	sucker in hash and proceeds to unhash and dput(), etc.) - anything
	that evicts that dentry.  It is marked with DCACHE_MAY_FREE and left
	alone.  The parent, OTOH, is dropped and freeing gets scheduled as
	soon as RCU allows.
		we grab parent->d_lock
		we verify that dentry->d_parent is still the same (it is)
		we do rcu_read_unlock()
		we grab dentry->d_lock
		we return parent
At that point we are fucked - there's nothing to prevent parent from being
freed at any point.  And we assume that its ->d_lock is held and needs to
be dropped.

The call site in d_prune_aliases() avoids the same scenario, since there we
are already holding ->i_lock and another thread won't get to __dentry_kill()
until we are done with lock_parent().

Unless I'm missing something, that's a (narrow) memory corruptor.  The window is
narrow, but not impossibly so - if that other thread had been spinning on attempt
to grab dentry->d_lock in d_prune_alias(), it has to squeeze through
                if (!dentry->d_lockref.count) {
and then in lock_parent() called there - through
        if (IS_ROOT(dentry))
                return NULL;
        if (unlikely(dentry->d_lockref.count < 0))
                return NULL;
        if (likely(spin_trylock(&parent->d_lock)))
before the first CPU gets through
        parent = READ_ONCE(dentry->d_parent);
        spin_lock(&parent->d_lock);

The first CPU can't be preempted, but there's nothing to prevent an IRQ arriving
at that point, letting the second one win the race.

Comments?

I think the (untested) patch below is -stable fodder:

lock_parent() needs to recheck if dentry got __dentry_kill'ed under it

In case when dentry passed to lock_parent() is protected from freeing only
by the fact that it's on a shrink list and trylock of parent fails, we
could get hit by __dentry_kill() (and subsequent dentry_kill(parent))
between unlocking dentry and locking presumed parent.  We need to recheck
that dentry is alive once we lock both it and parent *and* postpone
rcu_read_unlock() until after that point.  Otherwise we could return
a pointer to struct dentry that already is rcu-scheduled for freeing, with
->d_lock held on it; caller's subsequent attempt to unlock it can end
up with memory corruption.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/fs/dcache.c b/fs/dcache.c
index 7c38f39958bc..32aaab21e648 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -647,11 +647,16 @@ static inline struct dentry *lock_parent(struct dentry *dentry)
 		spin_unlock(&parent->d_lock);
 		goto again;
 	}
-	rcu_read_unlock();
-	if (parent != dentry)
+	if (parent != dentry) {
 		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
-	else
+		if (unlikely(dentry->d_lockref.count < 0)) {
+			spin_unlock(&parent->d_lock);
+			parent = NULL;
+		}
+	} else {
 		parent = NULL;
+	}
+	rcu_read_unlock();
 	return parent;
 }
 

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [BUG] lock_parent() breakage when used from shrink_dentry_list() (was Re: [PATCH v2 6/6] fs/dcache: Avoid remaining try_lock loop in shrink_dentry_list())
  2018-02-23 20:13             ` [BUG] lock_parent() breakage when used from shrink_dentry_list() (was Re: [PATCH v2 6/6] fs/dcache: Avoid remaining try_lock loop in shrink_dentry_list()) Al Viro
@ 2018-02-23 21:35               ` Linus Torvalds
  2018-02-24  0:22                 ` Al Viro
  0 siblings, 1 reply; 46+ messages in thread
From: Linus Torvalds @ 2018-02-23 21:35 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, Christoph Hellwig, Thomas Gleixner,
	Peter Zijlstra, Sebastian Andrzej Siewior,
	Linux Kernel Mailing List, John Ogness

On Fri, Feb 23, 2018 at 12:13 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> Look:
>         dentry placed on a shrink list
>         we pick the fucker from the list and lock it.
>         we call lock_parent() on it.
>                 dentry is not a root and it's not deleted, so we proceed.
>                 trylock fails.
>                 we grab rcu_read_lock()
>                 we drop dentry->d_lock

[ deleted the bad things ]

Should we just instead get the ref to the dentry before dropping the
lock, so that nobody else can get to dentry_kill?

This is too subtle, and your fix to check d_lockref.count < 0 sounds
wrong to me. If it's really gone, maybe it has been reused and the
refcount is positive again, but it's something else than a dentry
entirely?

Hmm.

No, you extended the rcu read section, so I guess your patch is fine.
And lock_parent already has that pattern, soiit's not new.

Ok, I agree, looks like lock_parent should just re-check that thing
that it already checked earler, but that now might be true again
because of we dropped d_lock.

              Linus

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [BUG] lock_parent() breakage when used from shrink_dentry_list() (was Re: [PATCH v2 6/6] fs/dcache: Avoid remaining try_lock loop in shrink_dentry_list())
  2018-02-23 21:35               ` Linus Torvalds
@ 2018-02-24  0:22                 ` Al Viro
  2018-02-25  7:40                   ` Al Viro
  0 siblings, 1 reply; 46+ messages in thread
From: Al Viro @ 2018-02-24  0:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-fsdevel, Christoph Hellwig, Thomas Gleixner,
	Peter Zijlstra, Sebastian Andrzej Siewior,
	Linux Kernel Mailing List, John Ogness

On Fri, Feb 23, 2018 at 01:35:52PM -0800, Linus Torvalds wrote:

> This is too subtle, and your fix to check d_lockref.count < 0 sounds
> wrong to me. If it's really gone, maybe it has been reused and the
> refcount is positive again, but it's something else than a dentry
> entirely?
> 
> Hmm.
> 
> No, you extended the rcu read section, so I guess your patch is fine.
> And lock_parent already has that pattern, soiit's not new.
> 
> Ok, I agree, looks like lock_parent should just re-check that thing
> that it already checked earler, but that now might be true again
> because of we dropped d_lock.

IMO that's the right thing for backports; whether we keep it after
the getting rid of trylock loops is a different question.  Note that
the only case where we do not have __dentry_kill() prevention
guaranteed by the caller (either by holding a reference, or by
holding onto ->i_lock all along) is in shrink_dentry_list().
And there we have more than enough of other subtle crap.

Moreover, there we have a good reason to treat "it had been moved"
as "kick it off the shrink list and free if it's already dead",
which might simplify the things.  Below is a stab at that:

/*
 * ONLY for shrink_dentry_list() - it returns false if it finds
 * dentry grabbed, moved or killed, which is fine there but not
 * anywhere else.  OTOH, nobody else needs to deal with dentries
 * getting killed under them.
 */
static bool shrink_lock_for_kill(struct dentry *dentry)
{
	if (dentry->d_lockref.count)
		return false;

	inode = dentry->d_inode;
	if (inode && unlikely(!spin_trylock(&inode->i_lock))) {
		rcu_read_lock();	/* to protect inode */
		spin_unlock(&dentry->d_lock);
		spin_lock(&inode->i_lock);
		spin_lock(&dentry->d_lock);
		if (unlikely(dentry->d_lockref.count))
			goto out;
		/* changed inode means that somebody had grabbed it */
		if (unlikely(inode != dentry->d_inode))
			goto out;
		rcu_read_unlock();
	}

	parent = dentry->d_parent;
	if (IS_ROOT(dentry) || likely(spin_trylock(&parent->d_lock)))
		return true;
	
	rcu_read_lock();		/* to protect parent */
	spin_unlock(&dentry->d_lock);
	parent = READ_ONCE(dentry->d_parent);
	spin_lock(&parent->d_lock);
	if (unlikely(parent != dentry->d_parent)) {
		spin_unlock(&parent->d_lock);
		spin_lock(&dentry->d_lock);
		goto out;
	}
	spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
	if (likely(!dentry->d_lockref.count)) {
		rcu_read_unlock();
		return true;
	}
	spin_unlock(&parent->d_lock);
out:
	spin_unlock(&inode->i_lock);
	rcu_read_unlock();
	return false;
}

static void shrink_dentry_list(struct list_head *list)
{
	struct dentry *dentry, *parent;

	while (!list_empty(list)) {
		struct inode *inode;
		dentry = list_entry(list->prev, struct dentry, d_lru);
		spin_lock(&dentry->d_lock);
		if (!shrink_lock_for_kill(dentry)) {
			bool can_free = false;
			d_shrink_del(dentry);
			if (dentry->d_lockref.count < 0)
				can_free = dentry->d_flags & DCACHE_MAY_FREE;
			spin_unlock(&dentry->d_lock);
			if (can_free)
				dentry_free(dentry);
			continue;
		}
		d_shrink_del(dentry);
		parent = dentry->d_parent;
		__dentry_kill(dentry);
		if (dentry == parent)
			continue;
		dentry = parent;
		.... 
		same as now
		....
	}
}

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [BUG] lock_parent() breakage when used from shrink_dentry_list() (was Re: [PATCH v2 6/6] fs/dcache: Avoid remaining try_lock loop in shrink_dentry_list())
  2018-02-24  0:22                 ` Al Viro
@ 2018-02-25  7:40                   ` Al Viro
  2018-02-27  5:16                     ` dcache: remove trylock loops (was Re: [BUG] lock_parent() breakage when used from shrink_dentry_list()) John Ogness
  2018-03-02  9:04                     ` [BUG] lock_parent() breakage when used from shrink_dentry_list() (was Re: [PATCH v2 6/6] fs/dcache: Avoid remaining try_lock loop in shrink_dentry_list()) Sebastian Andrzej Siewior
  0 siblings, 2 replies; 46+ messages in thread
From: Al Viro @ 2018-02-25  7:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-fsdevel, Christoph Hellwig, Thomas Gleixner,
	Peter Zijlstra, Sebastian Andrzej Siewior,
	Linux Kernel Mailing List, John Ogness

On Sat, Feb 24, 2018 at 12:22:48AM +0000, Al Viro wrote:
> On Fri, Feb 23, 2018 at 01:35:52PM -0800, Linus Torvalds wrote:
> 
> > This is too subtle, and your fix to check d_lockref.count < 0 sounds
> > wrong to me. If it's really gone, maybe it has been reused and the
> > refcount is positive again, but it's something else than a dentry
> > entirely?
> > 
> > Hmm.
> > 
> > No, you extended the rcu read section, so I guess your patch is fine.
> > And lock_parent already has that pattern, soiit's not new.
> > 
> > Ok, I agree, looks like lock_parent should just re-check that thing
> > that it already checked earler, but that now might be true again
> > because of we dropped d_lock.
> 
> IMO that's the right thing for backports; whether we keep it after
> the getting rid of trylock loops is a different question.  Note that
> the only case where we do not have __dentry_kill() prevention
> guaranteed by the caller (either by holding a reference, or by
> holding onto ->i_lock all along) is in shrink_dentry_list().
> And there we have more than enough of other subtle crap.
> 
> Moreover, there we have a good reason to treat "it had been moved"
> as "kick it off the shrink list and free if it's already dead",
> which might simplify the things.  Below is a stab at that:

FWIW, a variant of that series is in #work.dcache; it's
almost certainly not the final (I want to clean the things up
and probably reorder them as well) and it needs one hell of
profiling and review.  It seems to work, and I don't see any
obvious performance regressions (everything seems to be within
normal noise), but I'd like it beaten up a lot more.

Parts are from John's series, parts are rewritten as discussed
upthread.  As the result, trylock loops are gone and no new
retry loops had been added in their place - lock_parent() still
has one, but that's it.

I'll play with cleaning up and reordering tomorrow after I get
some sleep.  In the meanwhile, the current state of that set is at
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.dcache

^ permalink raw reply	[flat|nested] 46+ messages in thread

* dcache: remove trylock loops (was Re: [BUG] lock_parent() breakage when used from shrink_dentry_list())
  2018-02-25  7:40                   ` Al Viro
@ 2018-02-27  5:16                     ` John Ogness
  2018-03-12 19:13                       ` Al Viro
  2018-03-02  9:04                     ` [BUG] lock_parent() breakage when used from shrink_dentry_list() (was Re: [PATCH v2 6/6] fs/dcache: Avoid remaining try_lock loop in shrink_dentry_list()) Sebastian Andrzej Siewior
  1 sibling, 1 reply; 46+ messages in thread
From: John Ogness @ 2018-02-27  5:16 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, linux-fsdevel, Christoph Hellwig,
	Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior,
	Linux Kernel Mailing List

On 2018-02-25, Al Viro <viro@ZenIV.linux.org.uk> wrote:
> I'll play with cleaning up and reordering tomorrow after I get some
> sleep.  In the meanwhile, the current state of that set is at
> git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.dcache

I have one comment on your new dentry_kill()...

> From e77db95d1ad189c76bcda7b90ac2225e98d776a2 Mon Sep 17 00:00:00 2001
> From: Al Viro <viro@zeniv.linux.org.uk>
> Date: Fri, 23 Feb 2018 21:25:42 -0500
> Subject: [PATCH] get rid of trylock loop around dentry_kill()
>
> In case when trylock in there fails, deal with it directly in
> dentry_kill().  Note that in cases when we drop and retake
> ->d_lock, we need to recheck whether to retain the dentry.
> Another thing is that dropping/retaking ->d_lock might have
> ended up with negative dentry turning into positive; that,
> of course, can happen only once...
>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  fs/dcache.c | 34 +++++++++++++++++++++++++++-------
>  1 file changed, 27 insertions(+), 7 deletions(-)
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 449c1a5..c1f082d 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -657,23 +657,43 @@ static struct dentry *dentry_kill(struct dentry *dentry)
>  	struct dentry *parent = NULL;
>  
>  	if (inode && unlikely(!spin_trylock(&inode->i_lock)))
> -		goto failed;
> +		goto slow_positive;
>  
>  	if (!IS_ROOT(dentry)) {
>  		parent = dentry->d_parent;
>  		if (unlikely(!spin_trylock(&parent->d_lock))) {
> -			if (inode)
> -				spin_unlock(&inode->i_lock);
> -			goto failed;
> +			parent = __lock_parent(dentry);
> +			if (likely(inode || !dentry->d_inode))
> +				goto got_locks;
> +			/* negative that became positive */
> +			if (parent)
> +				spin_unlock(&parent->d_lock);
> +			inode = dentry->d_inode;
> +			goto slow_positive;
>  		}
>  	}
> -
>  	__dentry_kill(dentry);
>  	return parent;
>  
> -failed:
> +slow_positive:
> +	spin_unlock(&dentry->d_lock);
> +	spin_lock(&inode->i_lock);
> +	spin_lock(&dentry->d_lock);
> +	parent = lock_parent(dentry);
> +got_locks:
> +	if (likely(dentry->d_lockref.count == 1 && !retain_dentry(dentry))) {
> +		__dentry_kill(dentry);
> +		return parent;
> +	}
> +	/* we are keeping it, after all */
> +	if (inode)
> +		spin_unlock(&inode->i_lock);
> +	if (parent)
> +		spin_unlock(&parent->d_lock);

If someone else has grabbed a reference, it shouldn't be added to the
lru list. Only decremented.

if (entry->d_lockref.count == 1)

> +	dentry_lru_add(dentry);
> +	dentry->d_lockref.count--;
>  	spin_unlock(&dentry->d_lock);
> -	return dentry; /* try again with same dentry */
> +	return NULL;
>  }
>  
>  /*

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [BUG] lock_parent() breakage when used from shrink_dentry_list() (was Re: [PATCH v2 6/6] fs/dcache: Avoid remaining try_lock loop in shrink_dentry_list())
  2018-02-25  7:40                   ` Al Viro
  2018-02-27  5:16                     ` dcache: remove trylock loops (was Re: [BUG] lock_parent() breakage when used from shrink_dentry_list()) John Ogness
@ 2018-03-02  9:04                     ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 46+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-03-02  9:04 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, linux-fsdevel, Christoph Hellwig,
	Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List,
	John Ogness

On 2018-02-25 07:40:05 [+0000], Al Viro wrote:
> FWIW, a variant of that series is in #work.dcache; it's
> almost certainly not the final (I want to clean the things up
> and probably reorder them as well) and it needs one hell of
> profiling and review.  It seems to work, and I don't see any
> obvious performance regressions (everything seems to be within
> normal noise), but I'd like it beaten up a lot more.

If you have something in mind, I could fire up something.

> I'll play with cleaning up and reordering tomorrow after I get
> some sleep.  In the meanwhile, the current state of that set is at
> git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.dcache

I've put it in -RT, let it do some upgrades and other things and nothing
bad happened so far.

Sebastian

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: dcache: remove trylock loops (was Re: [BUG] lock_parent() breakage when used from shrink_dentry_list())
  2018-02-27  5:16                     ` dcache: remove trylock loops (was Re: [BUG] lock_parent() breakage when used from shrink_dentry_list()) John Ogness
@ 2018-03-12 19:13                       ` Al Viro
  2018-03-12 20:05                         ` Al Viro
                                           ` (3 more replies)
  0 siblings, 4 replies; 46+ messages in thread
From: Al Viro @ 2018-03-12 19:13 UTC (permalink / raw)
  To: John Ogness
  Cc: Linus Torvalds, linux-fsdevel, Christoph Hellwig,
	Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior,
	Linux Kernel Mailing List, Eric Biederman

On Tue, Feb 27, 2018 at 06:16:28AM +0100, John Ogness wrote:

> If someone else has grabbed a reference, it shouldn't be added to the
> lru list. Only decremented.
> 
> if (entry->d_lockref.count == 1)

Nah, better handle that in retain_dentry() itself.  See updated
#work.dcache.

Note: another potentially fun thing in that branch is that I've
finally decided to bite the bullet and make __d_move() preserve
->d_parent of target.

Mainline:
al@sonny:/tmp$ touch d
al@sonny:/tmp$ sleep 100 >/tmp/a/b/c &
[1] 16487
al@sonny:/tmp$ ls -l /proc/16487/fd
total 0
lrwx------ 1 al al 64 Mar 12 11:33 0 -> /dev/pts/13
l-wx------ 1 al al 64 Mar 12 11:33 1 -> /tmp/a/b/c
lrwx------ 1 al al 64 Mar 12 11:33 2 -> /dev/pts/13
al@sonny:/tmp$ mv /tmp/d /tmp/a/b/c 
al@sonny:/tmp$ ls -l /proc/16487/fd
total 0
lrwx------ 1 al al 64 Mar 12 11:33 0 -> /dev/pts/13
l-wx------ 1 al al 64 Mar 12 11:33 1 -> /tmp/c (deleted)
lrwx------ 1 al al 64 Mar 12 11:33 2 -> /dev/pts/13

With that branch:
root@kvm1:/tmp# touch d
root@kvm1:/tmp# sleep 100 >/tmp/a/b/c &
[1] 2263
root@kvm1:/tmp# ls -l /proc/2263/fd
total 0
lrwx------ 1 root root 64 Mar 12 11:33 0 -> /dev/pts/0
l-wx------ 1 root root 64 Mar 12 11:33 1 -> /tmp/a/b/c
lrwx------ 1 root root 64 Mar 12 11:33 2 -> /dev/pts/0
root@kvm1:/tmp# mv /tmp/d /tmp/a/b/c
root@kvm1:/tmp# ls -l /proc/2263/fd
total 0
lrwx------ 1 root root 64 Mar 12 11:33 0 -> /dev/pts/0
l-wx------ 1 root root 64 Mar 12 11:33 1 -> '/tmp/a/b/c (deleted)'
lrwx------ 1 root root 64 Mar 12 11:33 2 -> /dev/pts/0

It doesn't come quite for free; cross-directory d_move()
and d_exchange() callers are responsible for having both
parents pinned (all of them do that, mostly since the usual
sequence is "look parents up, lock_rename(), *then* look
children up, then do renaming"; those that are not part of
rename(2) are also OK) and d_splice_alias() has become potentially
blocking in one case.  AFAICS, none of the callers is in
locking environment that would not allow that.  Survives
the local beating and doesn't seem to cause any performance
regressions.

What we get out of that is
	a) much saner semantics for d_move() et.al.
	b) saner behaviour of d_path() (see above)
	c) dentry can be IS_ROOT only if it has been
such all along; that simplifies the hell out of analysis.

FWIW, there's another trylock loop on dentries - one in
autofs get_next_positive_dentry().  Any plans re dealing
with that one?

I'd spent the last couple of weeks (when not being too sick
for any work) going through dcache.c and related code; hopefully
this time I will get the documentation into postable shape ;-/

There's an unpleasant area around the ->s_root vs. NFS.  There's
code that makes assumptions about ->s_root that are simply not true
for NFS.  Is path_connected() correct wrt NFS multiple imports from
the same server?  Ditto for mnt_already_visible() (that one might
be mitigated at the moment, but probably won't last).  Eric, am
I missing something subtle in there?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: dcache: remove trylock loops (was Re: [BUG] lock_parent() breakage when used from shrink_dentry_list())
  2018-03-12 19:13                       ` Al Viro
@ 2018-03-12 20:05                         ` Al Viro
  2018-03-12 20:33                           ` Al Viro
  2018-03-13  1:12                           ` NeilBrown
  2018-03-12 20:23                         ` Eric W. Biederman
                                           ` (2 subsequent siblings)
  3 siblings, 2 replies; 46+ messages in thread
From: Al Viro @ 2018-03-12 20:05 UTC (permalink / raw)
  To: NeilBrown
  Cc: Linus Torvalds, linux-fsdevel, Christoph Hellwig,
	Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior,
	Linux Kernel Mailing List, Eric Biederman

On Mon, Mar 12, 2018 at 07:13:51PM +0000, Al Viro wrote:

> There's an unpleasant area around the ->s_root vs. NFS.  There's
> code that makes assumptions about ->s_root that are simply not true
> for NFS.  Is path_connected() correct wrt NFS multiple imports from
> the same server?  Ditto for mnt_already_visible() (that one might
> be mitigated at the moment, but probably won't last).  Eric, am
> I missing something subtle in there?

BTW, another thing spotted while trying to document the stuff:
Neil's "VFS: don't keep disconnected dentries on d_anon" has a subtle
side effect I hadn't spotted when applying.  Namely, for
DCACHE_DISCONNECTED non-directory IS_ROOT dentries we now have
d_unhashed() true.  That makes d_find_alias() logics around
discon_alias dead code:
                if (S_ISDIR(inode->i_mode) || !d_unhashed(alias)) {
                        if (IS_ROOT(alias) &&
                            (alias->d_flags & DCACHE_DISCONNECTED)) {
                                discon_alias = alias;
                        } else {
                                __dget_dlock(alias);
                                spin_unlock(&alias->d_lock);
                                return alias;
                        }
                }

Directory case is a red herring - we'll end up picking one and only
alias, whether it's IS_ROOT/disconnected or not.  For non-directories,
though, IS_ROOT and disconnected implies d_unhashed() now, so we might
as well turn that into
	if directory
		return d_find_any_alias
	else
		return the first hashed alias

Does any of the d_find_alias() callers want those dentries?  That is,
non-directories from d_obtain_alias() still not attached to a parent;
note that exportfs_decode_fh() is *not* one of such places - we don't
use d_find_alias() there at all.  If there's no such caller, we can
bloody well just drop the discon_alias logics and be done with that;
if there is, that commit has introduced a bug.  I might have missed
a part of threads related to that patch, so my apologies if it had
been discussed.

Neil, what's the situation there?  A lot of those callers clearly treat
the "only disconnected IS_ROOT alias exist" same as "no aliases at all";
it looks like the change might have been the right thing, but it sure
as hell shouldn't be an undocumented side effect...

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: dcache: remove trylock loops (was Re: [BUG] lock_parent() breakage when used from shrink_dentry_list())
  2018-03-12 19:13                       ` Al Viro
  2018-03-12 20:05                         ` Al Viro
@ 2018-03-12 20:23                         ` Eric W. Biederman
  2018-03-12 20:39                           ` Al Viro
  2018-03-12 22:14                         ` Thomas Gleixner
  2018-03-13 20:46                         ` John Ogness
  3 siblings, 1 reply; 46+ messages in thread
From: Eric W. Biederman @ 2018-03-12 20:23 UTC (permalink / raw)
  To: Al Viro
  Cc: John Ogness, Linus Torvalds, linux-fsdevel, Christoph Hellwig,
	Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior,
	Linux Kernel Mailing List

Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Tue, Feb 27, 2018 at 06:16:28AM +0100, John Ogness wrote:
>
>> If someone else has grabbed a reference, it shouldn't be added to the
>> lru list. Only decremented.
>> 
>> if (entry->d_lockref.count == 1)
>
> Nah, better handle that in retain_dentry() itself.  See updated
> #work.dcache.
>
> Note: another potentially fun thing in that branch is that I've
> finally decided to bite the bullet and make __d_move() preserve
> ->d_parent of target.
>
> Mainline:
> al@sonny:/tmp$ touch d
> al@sonny:/tmp$ sleep 100 >/tmp/a/b/c &
> [1] 16487
> al@sonny:/tmp$ ls -l /proc/16487/fd
> total 0
> lrwx------ 1 al al 64 Mar 12 11:33 0 -> /dev/pts/13
> l-wx------ 1 al al 64 Mar 12 11:33 1 -> /tmp/a/b/c
> lrwx------ 1 al al 64 Mar 12 11:33 2 -> /dev/pts/13
> al@sonny:/tmp$ mv /tmp/d /tmp/a/b/c 
> al@sonny:/tmp$ ls -l /proc/16487/fd
> total 0
> lrwx------ 1 al al 64 Mar 12 11:33 0 -> /dev/pts/13
> l-wx------ 1 al al 64 Mar 12 11:33 1 -> /tmp/c (deleted)
> lrwx------ 1 al al 64 Mar 12 11:33 2 -> /dev/pts/13
>
> With that branch:
> root@kvm1:/tmp# touch d
> root@kvm1:/tmp# sleep 100 >/tmp/a/b/c &
> [1] 2263
> root@kvm1:/tmp# ls -l /proc/2263/fd
> total 0
> lrwx------ 1 root root 64 Mar 12 11:33 0 -> /dev/pts/0
> l-wx------ 1 root root 64 Mar 12 11:33 1 -> /tmp/a/b/c
> lrwx------ 1 root root 64 Mar 12 11:33 2 -> /dev/pts/0
> root@kvm1:/tmp# mv /tmp/d /tmp/a/b/c
> root@kvm1:/tmp# ls -l /proc/2263/fd
> total 0
> lrwx------ 1 root root 64 Mar 12 11:33 0 -> /dev/pts/0
> l-wx------ 1 root root 64 Mar 12 11:33 1 -> '/tmp/a/b/c (deleted)'
> lrwx------ 1 root root 64 Mar 12 11:33 2 -> /dev/pts/0
>
> It doesn't come quite for free; cross-directory d_move()
> and d_exchange() callers are responsible for having both
> parents pinned (all of them do that, mostly since the usual
> sequence is "look parents up, lock_rename(), *then* look
> children up, then do renaming"; those that are not part of
> rename(2) are also OK) and d_splice_alias() has become potentially
> blocking in one case.  AFAICS, none of the callers is in
> locking environment that would not allow that.  Survives
> the local beating and doesn't seem to cause any performance
> regressions.
>
> What we get out of that is
> 	a) much saner semantics for d_move() et.al.
> 	b) saner behaviour of d_path() (see above)
> 	c) dentry can be IS_ROOT only if it has been
> such all along; that simplifies the hell out of analysis.
>
> FWIW, there's another trylock loop on dentries - one in
> autofs get_next_positive_dentry().  Any plans re dealing
> with that one?
>
> I'd spent the last couple of weeks (when not being too sick
> for any work) going through dcache.c and related code; hopefully
> this time I will get the documentation into postable shape ;-/
>
> There's an unpleasant area around the ->s_root vs. NFS.  There's
> code that makes assumptions about ->s_root that are simply not true
> for NFS.  Is path_connected() correct wrt NFS multiple imports from
> the same server? Ditto for mnt_already_visible() (that one might
> be mitigated at the moment, but probably won't last).  Eric, am
> I missing something subtle in there?

I don't have the entire context in my head.  But I don't think we
have problems today.

NFS before it uses paths from an unconnected root in the rest of
the vfs walks those paths backwards and makes the paths connect.
I don't remember where all of that code that performs those connections
but I do remember the code in fs/fhandle.c shares that code with nfs, to
perform the same operation in open_by_handle_at.

So I don't think the nfs peculiarities are actually relevant to
anything on an ordinary code path.


Of the two code paths you are concert about:

For path path_connected looking at s_root is a heuristic to avoid
calling is_subdir every time we need to do that check.  If the heuristic
fails we still have is_subdir which should remain accurate.  If
is_subdir fails the path is genuinely not connected at that moment
and failing is the correct thing to do.


For mnt_too_revealing the only filesystems under consideration are
proc and sysfs.   So nfs oddities are of no consequence.
mnt_too_revealing probably won't be extended to other filesystems.
Certainly nfs is not a candidate for having setting SB_I_USERNS_VISIBLE.


Al is that sufficient to address your concerns?

Eric

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: dcache: remove trylock loops (was Re: [BUG] lock_parent() breakage when used from shrink_dentry_list())
  2018-03-12 20:05                         ` Al Viro
@ 2018-03-12 20:33                           ` Al Viro
  2018-03-13  1:12                           ` NeilBrown
  1 sibling, 0 replies; 46+ messages in thread
From: Al Viro @ 2018-03-12 20:33 UTC (permalink / raw)
  To: NeilBrown
  Cc: Linus Torvalds, linux-fsdevel, Christoph Hellwig,
	Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior,
	Linux Kernel Mailing List, Eric Biederman

On Mon, Mar 12, 2018 at 08:05:41PM +0000, Al Viro wrote:

> Does any of the d_find_alias() callers want those dentries?  That is,
> non-directories from d_obtain_alias() still not attached to a parent;
> note that exportfs_decode_fh() is *not* one of such places - we don't
> use d_find_alias() there at all.  If there's no such caller, we can
> bloody well just drop the discon_alias logics and be done with that;
> if there is, that commit has introduced a bug.  I might have missed
> a part of threads related to that patch, so my apologies if it had
> been discussed.
> 
> Neil, what's the situation there?  A lot of those callers clearly treat
> the "only disconnected IS_ROOT alias exist" same as "no aliases at all";
> it looks like the change might have been the right thing, but it sure
> as hell shouldn't be an undocumented side effect...

FWIW, callers are
drivers/staging/lustre/lustre/llite/llite_lib.c:2448:           dentry = d_find_alias(page->mapping->host);
	definitely doesn't want them
fs/ceph/caps.c:2945:    while ((dn = d_find_alias(inode))) {
fs/ceph/mds_client.c:1930:      dentry = d_find_alias(inode);
	ditto - it's building a pathname.
fs/ceph/mds_client.c:2910:      dentry = d_find_alias(inode);
	ditto.
fs/ceph/mds_client.c:3374:      parent = d_find_alias(inode);
	clearly at least expects a directory (feeds to d_lookup(), etc.)
fs/coda/cache.c:112:    alias_de = d_find_alias(inode);
	directory-only
fs/fat/namei_vfat.c:736:        alias = d_find_alias(inode);
	we will reject any IS_ROOT there anyway
fs/fs-writeback.c:2071:         dentry = d_find_alias(inode);
	wants a name, these don't have one.
fs/fuse/dir.c:966:      dir = d_find_alias(parent);
	directory-only
fs/hostfs/hostfs_kern.c:129:    dentry = d_find_alias(ino);
	wants a pathname
fs/nilfs2/super.c:931:          dentry = d_find_alias(inode);
	directory-only
fs/nilfs2/super.c:1025:                 dentry = d_find_alias(inode);
	bloody better be a directory (root one, at that)
fs/xfs/xfs_filestream.c:293:    dentry = d_find_alias(inode);
	no parent, no love.
mm/shmem.c:3435:                dentry = d_find_alias(inode);
	no d_obtain_alias() on that one
security/commoncap.c:391:       dentry = d_find_alias(inode);
security/lsm_audit.c:295:               dentry = d_find_alias(inode);
	wants a name
security/selinux/hooks.c:1536:                  dentry = d_find_alias(inode);
security/selinux/hooks.c:1646:                          dentry = d_find_alias(inode);

So all but four of them clearly do *not* want those suckers.  And remaining
ones are in
	* ceph invalidate_aliases()
	* cap_inode_getsecurity()
	* selinux inode_doinit_with_dentry() (two call sites, very alike)
Do any of those want that stuff?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: dcache: remove trylock loops (was Re: [BUG] lock_parent() breakage when used from shrink_dentry_list())
  2018-03-12 20:23                         ` Eric W. Biederman
@ 2018-03-12 20:39                           ` Al Viro
  2018-03-12 23:28                             ` Eric W. Biederman
  0 siblings, 1 reply; 46+ messages in thread
From: Al Viro @ 2018-03-12 20:39 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: John Ogness, Linus Torvalds, linux-fsdevel, Christoph Hellwig,
	Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior,
	Linux Kernel Mailing List

On Mon, Mar 12, 2018 at 03:23:44PM -0500, Eric W. Biederman wrote:

> Of the two code paths you are concert about:
> 
> For path path_connected looking at s_root is a heuristic to avoid
> calling is_subdir every time we need to do that check.  If the heuristic
> fails we still have is_subdir which should remain accurate.  If
> is_subdir fails the path is genuinely not connected at that moment
> and failing is the correct thing to do.
 
Umm...  That might be not good enough - the logics is "everything's
reachable from ->s_root anyway, so we might as well not bother checking".
For NFS it's simply not true.

We can mount server:/foo/bar/baz on /tmp/a, then server:/foo on /tmp/b
and we'll have ->s_root pointing to a subtree of what's reachable at
/tmp/b.  Play with renames under /tmp/b and you just might end up with
a problem.  And mount on /tmp/a will be (mistakenly) considered to
be safe, since it satisfies the heuristics in path_connected().

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: dcache: remove trylock loops (was Re: [BUG] lock_parent() breakage when used from shrink_dentry_list())
  2018-03-12 19:13                       ` Al Viro
  2018-03-12 20:05                         ` Al Viro
  2018-03-12 20:23                         ` Eric W. Biederman
@ 2018-03-12 22:14                         ` Thomas Gleixner
  2018-03-13 20:46                         ` John Ogness
  3 siblings, 0 replies; 46+ messages in thread
From: Thomas Gleixner @ 2018-03-12 22:14 UTC (permalink / raw)
  To: Al Viro
  Cc: John Ogness, Linus Torvalds, linux-fsdevel, Christoph Hellwig,
	Peter Zijlstra, Sebastian Andrzej Siewior,
	Linux Kernel Mailing List, Eric Biederman

On Mon, 12 Mar 2018, Al Viro wrote:
> On Tue, Feb 27, 2018 at 06:16:28AM +0100, John Ogness wrote:
> What we get out of that is
> 	a) much saner semantics for d_move() et.al.
> 	b) saner behaviour of d_path() (see above)
> 	c) dentry can be IS_ROOT only if it has been
> such all along; that simplifies the hell out of analysis.

Nice.

> FWIW, there's another trylock loop on dentries - one in
> autofs get_next_positive_dentry().  Any plans re dealing
> with that one?

It's on our list and we wanted to wait for the final resolution of dcache
before tackling it. Do you have any immediate ide about that?

> I'd spent the last couple of weeks (when not being too sick
> for any work) going through dcache.c and related code; hopefully
> this time I will get the documentation into postable shape ;-/

Documentation is surely welcome. Paging that code back in is a mindboggling
exercise.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: dcache: remove trylock loops (was Re: [BUG] lock_parent() breakage when used from shrink_dentry_list())
  2018-03-12 20:39                           ` Al Viro
@ 2018-03-12 23:28                             ` Eric W. Biederman
  2018-03-12 23:52                               ` Eric W. Biederman
  2018-03-13  0:36                               ` dcache: remove trylock loops (was Re: [BUG] lock_parent() breakage when used from shrink_dentry_list()) Al Viro
  0 siblings, 2 replies; 46+ messages in thread
From: Eric W. Biederman @ 2018-03-12 23:28 UTC (permalink / raw)
  To: Al Viro
  Cc: John Ogness, Linus Torvalds, linux-fsdevel, Christoph Hellwig,
	Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior,
	Linux Kernel Mailing List

Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Mon, Mar 12, 2018 at 03:23:44PM -0500, Eric W. Biederman wrote:
>
>> Of the two code paths you are concert about:
>> 
>> For path path_connected looking at s_root is a heuristic to avoid
>> calling is_subdir every time we need to do that check.  If the heuristic
>> fails we still have is_subdir which should remain accurate.  If
>> is_subdir fails the path is genuinely not connected at that moment
>> and failing is the correct thing to do.
>  
> Umm...  That might be not good enough - the logics is "everything's
> reachable from ->s_root anyway, so we might as well not bother checking".
> For NFS it's simply not true.

If I am parsing the code correctly path_connected is broken for nfsv2
and nfsv3 when NFS_MOUNT_UNSHARED is not set.   nfsv4 appears to make
a kernel mount of the real root of the filesystem properly setting
s_root and then finds the child it is mounting.

> We can mount server:/foo/bar/baz on /tmp/a, then server:/foo on /tmp/b
> and we'll have ->s_root pointing to a subtree of what's reachable at
> /tmp/b.  Play with renames under /tmp/b and you just might end up with
> a problem.  And mount on /tmp/a will be (mistakenly) considered to
> be safe, since it satisfies the heuristics in path_connected().

Agreed.

Which means that if you mount server:/foo/bar/baz first and then
mount server:/foo with an appropriate rename you might be able
to see all of server:/foo or possibly server:/

Hmm..

Given that nfs_kill_super calls generic_shutdown_super and
generic_shutdown_super calls shrink_dcache_for_umount
I would argue that nfsv2 and nfsv3 are buggy in the same
case, as shrink_dcache_for_umount is called on something
that is not the root of the filesystem's dentry tree.

Eric

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: dcache: remove trylock loops (was Re: [BUG] lock_parent() breakage when used from shrink_dentry_list())
  2018-03-12 23:28                             ` Eric W. Biederman
@ 2018-03-12 23:52                               ` Eric W. Biederman
  2018-03-13  0:37                                 ` Al Viro
  2018-03-13  0:36                               ` dcache: remove trylock loops (was Re: [BUG] lock_parent() breakage when used from shrink_dentry_list()) Al Viro
  1 sibling, 1 reply; 46+ messages in thread
From: Eric W. Biederman @ 2018-03-12 23:52 UTC (permalink / raw)
  To: Al Viro
  Cc: John Ogness, Linus Torvalds, linux-fsdevel, Christoph Hellwig,
	Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior,
	Linux Kernel Mailing List

ebiederm@xmission.com (Eric W. Biederman) writes:

> Al Viro <viro@ZenIV.linux.org.uk> writes:
>
>> On Mon, Mar 12, 2018 at 03:23:44PM -0500, Eric W. Biederman wrote:
>>
>>> Of the two code paths you are concert about:
>>> 
>>> For path path_connected looking at s_root is a heuristic to avoid
>>> calling is_subdir every time we need to do that check.  If the heuristic
>>> fails we still have is_subdir which should remain accurate.  If
>>> is_subdir fails the path is genuinely not connected at that moment
>>> and failing is the correct thing to do.
>>  
>> Umm...  That might be not good enough - the logics is "everything's
>> reachable from ->s_root anyway, so we might as well not bother checking".
>> For NFS it's simply not true.
>
> If I am parsing the code correctly path_connected is broken for nfsv2
> and nfsv3 when NFS_MOUNT_UNSHARED is not set.   nfsv4 appears to make
> a kernel mount of the real root of the filesystem properly setting
> s_root and then finds the child it is mounting.
>
>> We can mount server:/foo/bar/baz on /tmp/a, then server:/foo on /tmp/b
>> and we'll have ->s_root pointing to a subtree of what's reachable at
>> /tmp/b.  Play with renames under /tmp/b and you just might end up with
>> a problem.  And mount on /tmp/a will be (mistakenly) considered to
>> be safe, since it satisfies the heuristics in path_connected().
>
> Agreed.
>
> Which means that if you mount server:/foo/bar/baz first and then
> mount server:/foo with an appropriate rename you might be able
> to see all of server:/foo or possibly server:/
>
> Hmm..
>
> Given that nfs_kill_super calls generic_shutdown_super and
> generic_shutdown_super calls shrink_dcache_for_umount
> I would argue that nfsv2 and nfsv3 are buggy in the same
> case, as shrink_dcache_for_umount is called on something
> that is not the root of the filesystem's dentry tree.

Ah.  I see now there is now the s_roots list that handles
that bit of strangeness.

So one path is to simply remove the heuristic from
path_connected.

Another path is to have nfsv2 and nfsv3 not set s_root at all.
Leaving the heuristic working for the rest of the filesystems,
and generally simplifying the code.

Something like the diff below I should think.

Eric

 fs/dcache.c      |  3 ++-
 fs/nfs/getroot.c | 35 +----------------------------------
 fs/nfs/super.c   |  2 --
 3 files changed, 3 insertions(+), 37 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 7c38f39958bc..ebe63ca026da 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1501,7 +1501,8 @@ void shrink_dcache_for_umount(struct super_block *sb)
 
 	dentry = sb->s_root;
 	sb->s_root = NULL;
-	do_one_tree(dentry);
+	if (dentry)
+		do_one_tree(dentry);
 
 	while (!hlist_bl_empty(&sb->s_roots)) {
 		dentry = dget(hlist_bl_entry(hlist_bl_first(&sb->s_roots), struct dentry, d_hash));
diff --git a/fs/nfs/getroot.c b/fs/nfs/getroot.c
index 391dafaf9182..f609d583fd2b 100644
--- a/fs/nfs/getroot.c
+++ b/fs/nfs/getroot.c
@@ -36,35 +36,6 @@
 
 #define NFSDBG_FACILITY		NFSDBG_CLIENT
 
-/*
- * Set the superblock root dentry.
- * Note that this function frees the inode in case of error.
- */
-static int nfs_superblock_set_dummy_root(struct super_block *sb, struct inode *inode)
-{
-	/* The mntroot acts as the dummy root dentry for this superblock */
-	if (sb->s_root == NULL) {
-		sb->s_root = d_make_root(inode);
-		if (sb->s_root == NULL)
-			return -ENOMEM;
-		ihold(inode);
-		/*
-		 * Ensure that this dentry is invisible to d_find_alias().
-		 * Otherwise, it may be spliced into the tree by
-		 * d_splice_alias if a parent directory from the same
-		 * filesystem gets mounted at a later time.
-		 * This again causes shrink_dcache_for_umount_subtree() to
-		 * Oops, since the test for IS_ROOT() will fail.
-		 */
-		spin_lock(&d_inode(sb->s_root)->i_lock);
-		spin_lock(&sb->s_root->d_lock);
-		hlist_del_init(&sb->s_root->d_u.d_alias);
-		spin_unlock(&sb->s_root->d_lock);
-		spin_unlock(&d_inode(sb->s_root)->i_lock);
-	}
-	return 0;
-}
-
 /*
  * get an NFS2/NFS3 root dentry from the root filehandle
  */
@@ -102,11 +73,7 @@ struct dentry *nfs_get_root(struct super_block *sb, struct nfs_fh *mntfh,
 		goto out;
 	}
 
-	error = nfs_superblock_set_dummy_root(sb, inode);
-	if (error != 0) {
-		ret = ERR_PTR(error);
-		goto out;
-	}
+	/* Leave nfsv2 and nfsv3 s_root == NULL */
 
 	/* root dentries normally start off anonymous and get spliced in later
 	 * if the dentry tree reaches them; however if the dentry already
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 29bacdc56f6a..30b45ea4dfe9 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -2625,9 +2625,7 @@ struct dentry *nfs_fs_mount_common(struct nfs_server *server,
 		}
 		s->s_bdi->ra_pages = server->rpages * NFS_MAX_READAHEAD;
 		server->super = s;
-	}
 
-	if (!s->s_root) {
 		/* initial superblock/root creation */
 		mount_info->fill_super(s, mount_info);
 		nfs_get_cache_cookie(s, mount_info->parsed, mount_info->cloned);

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: dcache: remove trylock loops (was Re: [BUG] lock_parent() breakage when used from shrink_dentry_list())
  2018-03-12 23:28                             ` Eric W. Biederman
  2018-03-12 23:52                               ` Eric W. Biederman
@ 2018-03-13  0:36                               ` Al Viro
  1 sibling, 0 replies; 46+ messages in thread
From: Al Viro @ 2018-03-13  0:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: John Ogness, Linus Torvalds, linux-fsdevel, Christoph Hellwig,
	Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior,
	Linux Kernel Mailing List

On Mon, Mar 12, 2018 at 06:28:30PM -0500, Eric W. Biederman wrote:

> Given that nfs_kill_super calls generic_shutdown_super and
> generic_shutdown_super calls shrink_dcache_for_umount
> I would argue that nfsv2 and nfsv3 are buggy in the same
> case, as shrink_dcache_for_umount is called on something
> that is not the root of the filesystem's dentry tree.

That's not a problem - it will simply evict a subtree first,
then go through ->s_roots and kill each piece.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: dcache: remove trylock loops (was Re: [BUG] lock_parent() breakage when used from shrink_dentry_list())
  2018-03-12 23:52                               ` Eric W. Biederman
@ 2018-03-13  0:37                                 ` Al Viro
  2018-03-13  0:50                                   ` Al Viro
  0 siblings, 1 reply; 46+ messages in thread
From: Al Viro @ 2018-03-13  0:37 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: John Ogness, Linus Torvalds, linux-fsdevel, Christoph Hellwig,
	Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior,
	Linux Kernel Mailing List

On Mon, Mar 12, 2018 at 06:52:31PM -0500, Eric W. Biederman wrote:

> Ah.  I see now there is now the s_roots list that handles
> that bit of strangeness.
> 
> So one path is to simply remove the heuristic from
> path_connected.
> 
> Another path is to have nfsv2 and nfsv3 not set s_root at all.
> Leaving the heuristic working for the rest of the filesystems,
> and generally simplifying the code.
> 
> Something like the diff below I should think.

> +	/* Leave nfsv2 and nfsv3 s_root == NULL */

Now, grep fs/super.c for s_root.  Or try to boot it, for that
matter...

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: dcache: remove trylock loops (was Re: [BUG] lock_parent() breakage when used from shrink_dentry_list())
  2018-03-13  0:37                                 ` Al Viro
@ 2018-03-13  0:50                                   ` Al Viro
  2018-03-13  4:02                                     ` Eric W. Biederman
  2018-03-14 23:20                                     ` [PATCH] fs: Teach path_connected to handle nfs filesystems with multiple roots Eric W. Biederman
  0 siblings, 2 replies; 46+ messages in thread
From: Al Viro @ 2018-03-13  0:50 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: John Ogness, Linus Torvalds, linux-fsdevel, Christoph Hellwig,
	Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior,
	Linux Kernel Mailing List

On Tue, Mar 13, 2018 at 12:37:51AM +0000, Al Viro wrote:
> On Mon, Mar 12, 2018 at 06:52:31PM -0500, Eric W. Biederman wrote:
> 
> > Ah.  I see now there is now the s_roots list that handles
> > that bit of strangeness.
> > 
> > So one path is to simply remove the heuristic from
> > path_connected.
> > 
> > Another path is to have nfsv2 and nfsv3 not set s_root at all.
> > Leaving the heuristic working for the rest of the filesystems,
> > and generally simplifying the code.
> > 
> > Something like the diff below I should think.
> 
> > +	/* Leave nfsv2 and nfsv3 s_root == NULL */
> 
> Now, grep fs/super.c for s_root.  Or try to boot it, for that
> matter...

BTW, if rename happens on server and we step into directory
we'd already seen in one subtree while doing a lookup in
another, we will get it moved around.  Without having the
subtrees ever connected in dcache on client.  So adding
&& IS_ROOT(sb->s_root) to the test also won't work.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: dcache: remove trylock loops (was Re: [BUG] lock_parent() breakage when used from shrink_dentry_list())
  2018-03-12 20:05                         ` Al Viro
  2018-03-12 20:33                           ` Al Viro
@ 2018-03-13  1:12                           ` NeilBrown
  2018-04-28  0:10                             ` Al Viro
  1 sibling, 1 reply; 46+ messages in thread
From: NeilBrown @ 2018-03-13  1:12 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, linux-fsdevel, Christoph Hellwig,
	Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior,
	Linux Kernel Mailing List, Eric Biederman

[-- Attachment #1: Type: text/plain, Size: 4152 bytes --]

On Mon, Mar 12 2018, Al Viro wrote:

> On Mon, Mar 12, 2018 at 07:13:51PM +0000, Al Viro wrote:
>
>> There's an unpleasant area around the ->s_root vs. NFS.  There's
>> code that makes assumptions about ->s_root that are simply not true
>> for NFS.  Is path_connected() correct wrt NFS multiple imports from
>> the same server?  Ditto for mnt_already_visible() (that one might
>> be mitigated at the moment, but probably won't last).  Eric, am
>> I missing something subtle in there?
>
> BTW, another thing spotted while trying to document the stuff:
> Neil's "VFS: don't keep disconnected dentries on d_anon" has a subtle
> side effect I hadn't spotted when applying.  Namely, for
> DCACHE_DISCONNECTED non-directory IS_ROOT dentries we now have
> d_unhashed() true.  That makes d_find_alias() logics around
> discon_alias dead code:
>                 if (S_ISDIR(inode->i_mode) || !d_unhashed(alias)) {
>                         if (IS_ROOT(alias) &&
>                             (alias->d_flags & DCACHE_DISCONNECTED)) {
>                                 discon_alias = alias;
>                         } else {
>                                 __dget_dlock(alias);
>                                 spin_unlock(&alias->d_lock);
>                                 return alias;
>                         }
>                 }

I agree that this code is now more complex than needed.

>
> Directory case is a red herring - we'll end up picking one and only
> alias, whether it's IS_ROOT/disconnected or not.  For non-directories,
> though, IS_ROOT and disconnected implies d_unhashed() now, so we might
> as well turn that into
> 	if directory
> 		return d_find_any_alias
> 	else
> 		return the first hashed alias

Yes, this looks like a good simplification.

>
> Does any of the d_find_alias() callers want those dentries?  That is,
> non-directories from d_obtain_alias() still not attached to a parent;
> note that exportfs_decode_fh() is *not* one of such places - we don't
> use d_find_alias() there at all.  If there's no such caller, we can
> bloody well just drop the discon_alias logics and be done with that;
> if there is, that commit has introduced a bug.  I might have missed
> a part of threads related to that patch, so my apologies if it had
> been discussed.
>
> Neil, what's the situation there?  A lot of those callers clearly treat
> the "only disconnected IS_ROOT alias exist" same as "no aliases at all";
> it looks like the change might have been the right thing, but it sure
> as hell shouldn't be an undocumented side effect...

Many of the callers really want a name, which these disconnected
dentries didn't have, so the new code is more correct in those cases.

d_find_alias() is documented as returning a hashed dentry (for
non-directories) and I suspect most people would be surprised to find
that a disconnected dentry with no name was considered to be "hashed",
so I think the patch could be seen as fixing the documentation.

Of the three users that you didn't classify as "clearly" not wanting the
disconnected dentries:

>	     * ceph invalidate_aliases()

This wants to ensure the dentries are discarded on final dput().  This
is already the case for disconnected dentries, so it doesn't need to be
told about them.  Code is correct.


>	     * cap_inode_getsecurity()

If this gets invoked when the nfsd server calls vfs_getxattr() on a
disconnected dentry, it will return -EINVAL.  This looks like a
potential bug.
nfsd only calls vfs_getxattr() in nfsd4_is_junction() so this bug would
not affect many use cases.
I think cap_inode_getsecurity() should really be calling
d_find_any_alias().

>	     * selinux inode_doinit_with_dentry() (two call sites, very alike)

I'm less sure about this one, but I think it probably wants
d_find_any_alias() as well.  Like cap_inode_getsecurity() it only wants
a dentry so that it can pass something to __vfs_getxattr(),
and that only wants a dentry so it can pass something to ->get.

Possibly we should rename d_find_alias() to d_find_hashed_alias() so that
people need to make a conscious choice between d_find_hashed_alias() and
d_find_any_alias() ??

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: dcache: remove trylock loops (was Re: [BUG] lock_parent() breakage when used from shrink_dentry_list())
  2018-03-13  0:50                                   ` Al Viro
@ 2018-03-13  4:02                                     ` Eric W. Biederman
  2018-03-14 23:20                                     ` [PATCH] fs: Teach path_connected to handle nfs filesystems with multiple roots Eric W. Biederman
  1 sibling, 0 replies; 46+ messages in thread
From: Eric W. Biederman @ 2018-03-13  4:02 UTC (permalink / raw)
  To: Al Viro
  Cc: John Ogness, Linus Torvalds, linux-fsdevel, Christoph Hellwig,
	Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior,
	Linux Kernel Mailing List

Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Tue, Mar 13, 2018 at 12:37:51AM +0000, Al Viro wrote:
>> On Mon, Mar 12, 2018 at 06:52:31PM -0500, Eric W. Biederman wrote:
>> 
>> > Ah.  I see now there is now the s_roots list that handles
>> > that bit of strangeness.
>> > 
>> > So one path is to simply remove the heuristic from
>> > path_connected.
>> > 
>> > Another path is to have nfsv2 and nfsv3 not set s_root at all.
>> > Leaving the heuristic working for the rest of the filesystems,
>> > and generally simplifying the code.
>> > 
>> > Something like the diff below I should think.
>> 
>> > +	/* Leave nfsv2 and nfsv3 s_root == NULL */
>> 
>> Now, grep fs/super.c for s_root.  Or try to boot it, for that
>> matter...
>
> BTW, if rename happens on server and we step into directory
> we'd already seen in one subtree while doing a lookup in
> another, we will get it moved around.  Without having the
> subtrees ever connected in dcache on client.  So adding
> && IS_ROOT(sb->s_root) to the test also won't work.

Nope.  We fundamentally need to call is_subdir in the nfs case
to ensure we don't have crazy problems.

I believe below is the obviously correct fix (that still preserves some
caching).  I need to look at nilfs as it also calls d_obtain_root.  I am
also wondering if some of the other network filesystems might be
susceptible to problems caused by renames on the server.

It is tempting to be more clever and not consider NFS_MOUNT_UNSHARED
mounts or mounts without mulitple s_roots but there can be renames
on the server that should trip up even those cases.  At least if
anything figures out how to trigger the dentries in the path to ever get
revalidated.

Eric

diff --git a/fs/namei.c b/fs/namei.c
index 921ae32dbc80..cafa365eeb70 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -559,9 +559,10 @@ static int __nd_alloc_stack(struct nameidata *nd)
 static bool path_connected(const struct path *path)
 {
 	struct vfsmount *mnt = path->mnt;
+	struct super_block *sb = mnt->mnt_sb;
 
-	/* Only bind mounts can have disconnected paths */
-	if (mnt->mnt_root == mnt->mnt_sb->s_root)
+	/* Bind mounts and multi-root filesystems can have disconnected paths */
+	if (!(sb->s_iflags & SB_I_MULTIROOT) && (mnt->mnt_root == sb->s_root))
 		return true;
 
 	return is_subdir(path->dentry, mnt->mnt_root);
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 29bacdc56f6a..64129a72f312 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -2631,6 +2631,7 @@ struct dentry *nfs_fs_mount_common(struct nfs_server *server,
 		/* initial superblock/root creation */
 		mount_info->fill_super(s, mount_info);
 		nfs_get_cache_cookie(s, mount_info->parsed, mount_info->cloned);
+		s->s_iflags |= SB_I_MULTIROOT;
 	}
 
 	mntroot = nfs_get_root(s, mount_info->mntfh, dev_name);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2a815560fda0..0430e03febaa 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1317,6 +1317,7 @@ extern int send_sigurg(struct fown_struct *fown);
 #define SB_I_CGROUPWB	0x00000001	/* cgroup-aware writeback enabled */
 #define SB_I_NOEXEC	0x00000002	/* Ignore executables on this fs */
 #define SB_I_NODEV	0x00000004	/* Ignore devices on this fs */
+#define SB_I_MULTIROOT	0x00000008	/* Multiple roots to the dentry tree */
 
 /* sb->s_iflags to limit user namespace mounts */
 #define SB_I_USERNS_VISIBLE		0x00000010 /* fstype already mounted */

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: dcache: remove trylock loops (was Re: [BUG] lock_parent() breakage when used from shrink_dentry_list())
  2018-03-12 19:13                       ` Al Viro
                                           ` (2 preceding siblings ...)
  2018-03-12 22:14                         ` Thomas Gleixner
@ 2018-03-13 20:46                         ` John Ogness
  2018-03-13 21:05                           ` John Ogness
  3 siblings, 1 reply; 46+ messages in thread
From: John Ogness @ 2018-03-13 20:46 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, linux-fsdevel, Christoph Hellwig,
	Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior,
	Linux Kernel Mailing List, Eric Biederman

On 2018-03-12, Al Viro <viro@ZenIV.linux.org.uk> wrote:
>> If someone else has grabbed a reference, it shouldn't be added to the
>> lru list. Only decremented.
>> 
>> if (entry->d_lockref.count == 1)
>
> Nah, better handle that in retain_dentry() itself.  See updated
> #work.dcache.
>
> +	if (unlikely(dentry->d_lockref.count != 1)) {
> +		dentry->d_lockref.count--;
> +	} else if (likely(!retain_dentry(dentry))) {
> +		__dentry_kill(dentry);
> +		return parent;
> +	}

Although the updated version is correct (and saves on lines of code), I
find putting the deref and lru_add code in the "true" case of
retain_dentry() to be pretty tricky. IMHO the code is easier to
understand if it looks like this:

	if (unlikely(dentry->d_lockref.count != 1)) {
		dentry->d_lockref.count--;
	} else if (likely(!retain_dentry(dentry))) {
		__dentry_kill(dentry);
		return parent;
	} else {
		dentry->d_lockref.count--;
		dentry_lru_add(dentry);
	}

This is what your version is doing, but that final else is hiding in the
retain_dentry() "true" case.

My suggestion is to revert 7479f57fecd2a4837b5c79ce1cf0dcf284db54be (and
then fixup dput() to deref before calling dentry_lru_add()).

> FWIW, there's another trylock loop on dentries - one in
> autofs get_next_positive_dentry().  Any plans re dealing
> with that one?

I will need to dig into it a bit deeper (I am unfamiliar with autofs),
but it looks like it is trying to do basically the same thing as the
ascend loop in d_walk().

> I'd spent the last couple of weeks (when not being too sick
> for any work) going through dcache.c and related code; hopefully
> this time I will get the documentation into postable shape ;-/

Thank you for all your help in getting these changes cleaned and
correctly implemented so quickly.

I've reviewed your latest trylock loop removal patches and found only 1
minor issue. I'll post that separately.

John Ogness

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: dcache: remove trylock loops (was Re: [BUG] lock_parent() breakage when used from shrink_dentry_list())
  2018-03-13 20:46                         ` John Ogness
@ 2018-03-13 21:05                           ` John Ogness
  2018-03-13 23:59                             ` Al Viro
  0 siblings, 1 reply; 46+ messages in thread
From: John Ogness @ 2018-03-13 21:05 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, linux-fsdevel, Christoph Hellwig,
	Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior,
	Linux Kernel Mailing List, Eric Biederman

Hi Al,

1 minor issue on the new shrink_lock_dentry()...

> From 121a8e0834862d2c5d88c95f8e6bc8984f195abf Mon Sep 17 00:00:00 2001
> From: Al Viro <viro@zeniv.linux.org.uk>
> Date: Fri, 23 Feb 2018 21:54:18 -0500
> Subject: [PATCH] get rid of trylock loop in locking dentries on shrink
>  list
>
> In case of trylock failure don't re-add to the list - drop the locks
> and carefully get them in the right order.  For shrink_dentry_list(),
> somebody having grabbed a reference to dentry means that we can
> kick it off-list, so if we find dentry being modified under us we
> don't need to play silly buggers with retries anyway - off the list
> it is.
>
> The locking logics taken out into a helper of its own; lock_parent()
> is no longer used for dentries that can be killed under us.
>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  fs/dcache.c | 103 ++++++++++++++++++++++++++++++++++++++----------------------
>  1 file changed, 66 insertions(+), 37 deletions(-)
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 1684b6b..58097fd 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -974,56 +974,85 @@ void d_prune_aliases(struct inode *inode)
>  }
>  EXPORT_SYMBOL(d_prune_aliases);
>  
> -static void shrink_dentry_list(struct list_head *list)
> +/*
> + * Lock a dentry from shrink list.
> + * Note that dentry is *not* protected from concurrent dentry_kill(),
> + * d_delete(), etc.  It is protected from freeing (by the fact of
> + * being on a shrink list), but everything else is fair game.
> + * Return false if dentry has been disrupted or grabbed, leaving
> + * the caller to kick it off-list.  Otherwise, return true and have
> + * that dentry's inode and parent both locked.
> + */
> +static bool shrink_lock_dentry(struct dentry *dentry)
>  {
> -	struct dentry *dentry, *parent;
> +	struct inode *inode;
> +	struct dentry *parent;
>  
> -	while (!list_empty(list)) {
> -		struct inode *inode;
> -		dentry = list_entry(list->prev, struct dentry, d_lru);
> +	if (dentry->d_lockref.count)
> +		return false;
> +
> +	inode = dentry->d_inode;
> +	if (inode && unlikely(!spin_trylock(&inode->i_lock))) {
> +		rcu_read_lock();	/* to protect inode */
> +		spin_unlock(&dentry->d_lock);
> +		spin_lock(&inode->i_lock);
>  		spin_lock(&dentry->d_lock);
> -		parent = lock_parent(dentry);
> +		if (unlikely(dentry->d_lockref.count))
> +			goto out;
> +		/* changed inode means that somebody had grabbed it */
> +		if (unlikely(inode != dentry->d_inode))
> +			goto out;
> +		rcu_read_unlock();
> +	}
>  
> -		/*
> -		 * The dispose list is isolated and dentries are not accounted
> -		 * to the LRU here, so we can simply remove it from the list
> -		 * here regardless of whether it is referenced or not.
> -		 */
> -		d_shrink_del(dentry);
> +	parent = dentry->d_parent;
> +	if (IS_ROOT(dentry) || likely(spin_trylock(&parent->d_lock)))
> +		return true;
>  
> -		/*
> -		 * We found an inuse dentry which was not removed from
> -		 * the LRU because of laziness during lookup. Do not free it.
> -		 */
> -		if (dentry->d_lockref.count > 0) {
> -			spin_unlock(&dentry->d_lock);
> -			if (parent)
> -				spin_unlock(&parent->d_lock);
> -			continue;
> -		}
> +	rcu_read_lock();		/* to protect parent */
> +	spin_unlock(&dentry->d_lock);
> +	parent = READ_ONCE(dentry->d_parent);

The preceeding line should be removed. We already have a "parent" from
before we did the most recent trylock().

> +	spin_lock(&parent->d_lock);
> +	if (unlikely(parent != dentry->d_parent)) {
> +		spin_unlock(&parent->d_lock);
> +		spin_lock(&dentry->d_lock);
> +		goto out;
> +	}
> +	spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
> +	if (likely(!dentry->d_lockref.count)) {
> +		rcu_read_unlock();
> +		return true;
> +	}
> +	spin_unlock(&parent->d_lock);
> +out:
> +	spin_unlock(&inode->i_lock);
> +	rcu_read_unlock();
> +	return false;
> +}
>  
> +static void shrink_dentry_list(struct list_head *list)
> +{
> +	while (!list_empty(list)) {
> +		struct dentry *dentry, *parent;
> +		struct inode *inode;
>  
> -		if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
> -			bool can_free = dentry->d_flags & DCACHE_MAY_FREE;
> +		dentry = list_entry(list->prev, struct dentry, d_lru);
> +		spin_lock(&dentry->d_lock);
> +		if (!shrink_lock_dentry(dentry)) {
> +			bool can_free = false;
> +			d_shrink_del(dentry);
> +			if (dentry->d_lockref.count < 0)
> +				can_free = dentry->d_flags & DCACHE_MAY_FREE;
>  			spin_unlock(&dentry->d_lock);
> -			if (parent)
> -				spin_unlock(&parent->d_lock);
>  			if (can_free)
>  				dentry_free(dentry);
>  			continue;
>  		}
> -
> -		inode = dentry->d_inode;
> -		if (inode && unlikely(!spin_trylock(&inode->i_lock))) {
> -			d_shrink_add(dentry, list);
> -			spin_unlock(&dentry->d_lock);
> -			if (parent)
> -				spin_unlock(&parent->d_lock);
> -			continue;
> -		}
> -
> +		d_shrink_del(dentry);
> +		parent = dentry->d_parent;
>  		__dentry_kill(dentry);
> -
> +		if (parent == dentry)
> +			continue;
>  		/*
>  		 * We need to prune ancestors too. This is necessary to prevent
>  		 * quadratic behavior of shrink_dcache_parent(), but is also

John Ogness

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: dcache: remove trylock loops (was Re: [BUG] lock_parent() breakage when used from shrink_dentry_list())
  2018-03-13 21:05                           ` John Ogness
@ 2018-03-13 23:59                             ` Al Viro
  2018-03-14  2:58                               ` Matthew Wilcox
  2018-03-14  8:18                               ` John Ogness
  0 siblings, 2 replies; 46+ messages in thread
From: Al Viro @ 2018-03-13 23:59 UTC (permalink / raw)
  To: John Ogness
  Cc: Linus Torvalds, linux-fsdevel, Christoph Hellwig,
	Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior,
	Linux Kernel Mailing List, Eric Biederman

On Tue, Mar 13, 2018 at 10:05:55PM +0100, John Ogness wrote:

> > +	rcu_read_lock();		/* to protect parent */
> > +	spin_unlock(&dentry->d_lock);
> > +	parent = READ_ONCE(dentry->d_parent);
> 
> The preceeding line should be removed. We already have a "parent" from
> before we did the most recent trylock().

Nope.  We have parent, yes, but it had been fetched outside of rcu_read_lock().
So the object it used to point to might have been already freed and we
can't do this:

> > +	spin_lock(&parent->d_lock);

To get rid of that reread we'd need to do this:
	rcu_read_lock();
        parent = dentry->d_parent;
        if (IS_ROOT(dentry) || likely(spin_trylock(&parent->d_lock))) {
		rcu_read_unlock();
                return true;
	}
        spin_unlock(&dentry->d_lock);
        spin_lock(&parent->d_lock);
        if (unlikely(parent != dentry->d_parent)) {
		....

Come to think of that, it might make sense to lift rcu_read_lock() all the
way out of that sucker.  Objections?  Below is the incremental I'd fold into
that commit:

diff --git a/fs/dcache.c b/fs/dcache.c
index f0e73c93182b..0d1dac750c0a 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1000,7 +1000,6 @@ static bool shrink_lock_dentry(struct dentry *dentry)
 
 	inode = dentry->d_inode;
 	if (inode && unlikely(!spin_trylock(&inode->i_lock))) {
-		rcu_read_lock();	/* to protect inode */
 		spin_unlock(&dentry->d_lock);
 		spin_lock(&inode->i_lock);
 		spin_lock(&dentry->d_lock);
@@ -1009,16 +1008,14 @@ static bool shrink_lock_dentry(struct dentry *dentry)
 		/* changed inode means that somebody had grabbed it */
 		if (unlikely(inode != dentry->d_inode))
 			goto out;
-		rcu_read_unlock();
 	}
 
 	parent = dentry->d_parent;
+	/* parent will stay allocated until we drop rcu_read_lock */
 	if (IS_ROOT(dentry) || likely(spin_trylock(&parent->d_lock)))
 		return true;
 
-	rcu_read_lock();		/* to protect parent */
 	spin_unlock(&dentry->d_lock);
-	parent = READ_ONCE(dentry->d_parent);
 	spin_lock(&parent->d_lock);
 	if (unlikely(parent != dentry->d_parent)) {
 		spin_unlock(&parent->d_lock);
@@ -1026,14 +1023,11 @@ static bool shrink_lock_dentry(struct dentry *dentry)
 		goto out;
 	}
 	spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
-	if (likely(!dentry->d_lockref.count)) {
-		rcu_read_unlock();
+	if (likely(!dentry->d_lockref.count))
 		return true;
-	}
 	spin_unlock(&parent->d_lock);
 out:
 	spin_unlock(&inode->i_lock);
-	rcu_read_unlock();
 	return false;
 }
 
@@ -1044,8 +1038,10 @@ static void shrink_dentry_list(struct list_head *list)
 
 		dentry = list_entry(list->prev, struct dentry, d_lru);
 		spin_lock(&dentry->d_lock);
+		rcu_read_lock();
 		if (!shrink_lock_dentry(dentry)) {
 			bool can_free = false;
+			rcu_read_unlock();
 			d_shrink_del(dentry);
 			if (dentry->d_lockref.count < 0)
 				can_free = dentry->d_flags & DCACHE_MAY_FREE;
@@ -1054,6 +1050,7 @@ static void shrink_dentry_list(struct list_head *list)
 				dentry_free(dentry);
 			continue;
 		}
+		rcu_read_unlock();
 		d_shrink_del(dentry);
 		parent = dentry->d_parent;
 		__dentry_kill(dentry);

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: dcache: remove trylock loops (was Re: [BUG] lock_parent() breakage when used from shrink_dentry_list())
  2018-03-13 23:59                             ` Al Viro
@ 2018-03-14  2:58                               ` Matthew Wilcox
  2018-03-14  8:18                               ` John Ogness
  1 sibling, 0 replies; 46+ messages in thread
From: Matthew Wilcox @ 2018-03-14  2:58 UTC (permalink / raw)
  To: Al Viro
  Cc: John Ogness, Linus Torvalds, linux-fsdevel, Christoph Hellwig,
	Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior,
	Linux Kernel Mailing List, Eric Biederman

On Tue, Mar 13, 2018 at 11:59:30PM +0000, Al Viro wrote:
> Come to think of that, it might make sense to lift rcu_read_lock() all the
> way out of that sucker.  Objections?  Below is the incremental I'd fold into
> that commit:

Looks clearer to me.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: dcache: remove trylock loops (was Re: [BUG] lock_parent() breakage when used from shrink_dentry_list())
  2018-03-13 23:59                             ` Al Viro
  2018-03-14  2:58                               ` Matthew Wilcox
@ 2018-03-14  8:18                               ` John Ogness
  1 sibling, 0 replies; 46+ messages in thread
From: John Ogness @ 2018-03-14  8:18 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, linux-fsdevel, Christoph Hellwig,
	Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior,
	Linux Kernel Mailing List, Eric Biederman

On 2018-03-14, Al Viro <viro@ZenIV.linux.org.uk> wrote:
>>> +	rcu_read_lock();		/* to protect parent */
>>> +	spin_unlock(&dentry->d_lock);
>>> +	parent = READ_ONCE(dentry->d_parent);
>> 
>> The preceeding line should be removed. We already have a "parent"
>> from before we did the most recent trylock().
>
> Nope.  We have parent, yes, but it had been fetched outside of
> rcu_read_lock().  So the object it used to point to might have been
> already freed and we can't do this:
>
>>> +	spin_lock(&parent->d_lock);

When rcu_read_lock() is called, we are still holding dentry->d_lock. At
that point dentry->d_parent cannot have changed and cannot have been
freed. So the parent fetched outside of rcu_read_lock() is also
protected from freeing inside that rcu_read_lock().

> Come to think of that, it might make sense to lift rcu_read_lock() all
> the way out of that sucker.

Agreed.

> Objections?  Below is the incremental I'd fold into that commit:
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index f0e73c93182b..0d1dac750c0a 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -1000,7 +1000,6 @@ static bool shrink_lock_dentry(struct dentry *dentry)
>  
>  	inode = dentry->d_inode;
>  	if (inode && unlikely(!spin_trylock(&inode->i_lock))) {
> -		rcu_read_lock();	/* to protect inode */
>  		spin_unlock(&dentry->d_lock);
>  		spin_lock(&inode->i_lock);
>  		spin_lock(&dentry->d_lock);
> @@ -1009,16 +1008,14 @@ static bool shrink_lock_dentry(struct dentry *dentry)
>  		/* changed inode means that somebody had grabbed it */
>  		if (unlikely(inode != dentry->d_inode))
>  			goto out;
> -		rcu_read_unlock();
>  	}
>  
>  	parent = dentry->d_parent;
> +	/* parent will stay allocated until we drop rcu_read_lock */

I think this comment is not necessary since this function no longer
deals with dropping rcu_read_lock. But if we keep it, it should be added
for the inode above as well.

>  	if (IS_ROOT(dentry) || likely(spin_trylock(&parent->d_lock)))
>  		return true;
>  
> -	rcu_read_lock();		/* to protect parent */
>  	spin_unlock(&dentry->d_lock);
> -	parent = READ_ONCE(dentry->d_parent);
>  	spin_lock(&parent->d_lock);
>  	if (unlikely(parent != dentry->d_parent)) {
>  		spin_unlock(&parent->d_lock);
> @@ -1026,14 +1023,11 @@ static bool shrink_lock_dentry(struct dentry *dentry)
>  		goto out;
>  	}
>  	spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
> -	if (likely(!dentry->d_lockref.count)) {
> -		rcu_read_unlock();
> +	if (likely(!dentry->d_lockref.count))
>  		return true;
> -	}
>  	spin_unlock(&parent->d_lock);
>  out:
>  	spin_unlock(&inode->i_lock);
> -	rcu_read_unlock();
>  	return false;
>  }
>  
> @@ -1044,8 +1038,10 @@ static void shrink_dentry_list(struct list_head *list)
>  
>  		dentry = list_entry(list->prev, struct dentry, d_lru);
>  		spin_lock(&dentry->d_lock);
> +		rcu_read_lock();
>  		if (!shrink_lock_dentry(dentry)) {
>  			bool can_free = false;
> +			rcu_read_unlock();
>  			d_shrink_del(dentry);
>  			if (dentry->d_lockref.count < 0)
>  				can_free = dentry->d_flags & DCACHE_MAY_FREE;
> @@ -1054,6 +1050,7 @@ static void shrink_dentry_list(struct list_head *list)
>  				dentry_free(dentry);
>  			continue;
>  		}
> +		rcu_read_unlock();
>  		d_shrink_del(dentry);
>  		parent = dentry->d_parent;
>  		__dentry_kill(dentry);

John Ogness

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH] fs: Teach path_connected to handle nfs filesystems with multiple roots.
  2018-03-13  0:50                                   ` Al Viro
  2018-03-13  4:02                                     ` Eric W. Biederman
@ 2018-03-14 23:20                                     ` Eric W. Biederman
  2018-03-15 22:34                                       ` Al Viro
  1 sibling, 1 reply; 46+ messages in thread
From: Eric W. Biederman @ 2018-03-14 23:20 UTC (permalink / raw)
  To: Al Viro
  Cc: John Ogness, Linus Torvalds, linux-fsdevel, Christoph Hellwig,
	Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior,
	Linux Kernel Mailing List


On nfsv2 and nfsv3 the nfs server can export subsets of the same
filesystem and report the same filesystem identifier, so that the nfs
client can know they are the same filesystem.  The subsets can be from
disjoint directory trees.  The nfsv2 and nfsv3 filesystems provides no
way to find the common root of all directory trees exported form the
server with the same filesystem identifier.

The practical result is that in struct super s_root for nfs s_root is
not necessarily the root of the filesystem.  The nfs mount code sets
s_root to the root of the first subset of the nfs filesystem that the
kernel mounts.

This effects the dcache invalidation code in generic_shutdown_super
currently called shrunk_dcache_for_umount and that code for years
has gone through an additional list of dentries that might be dentry
trees that need to be freed to accomodate nfs.

When I wrote path_connected I did not realize nfs was so special, and
it's hueristic for avoiding calling is_subdir can fail.

The practical case where this fails is when there is a move of a
directory from the subtree exposed by one nfs mount to the subtree
exposed by another nfs mount.  This move can happen either locally or
remotely.  With the remote case requiring that the move directory be cached
before the move and that after the move someone walks the path
to where the move directory now exists and in so doing causes the
already cached directory to be moved in the dcache through the magic
of d_splice_alias.

If someone whose working directory is in the move directory or a
subdirectory and now starts calling .. from the initial mount of nfs
(where s_root == mnt_root), then path_connected as a heuristic will
not bother with the is_subdir check.  As s_root really is not the root
of the nfs filesystem this heuristic is wrong, and the path may
actually not be connected and path_connected can fail.

The is_subdir function might be cheap enough that we can call it
unconditionally.  Verifying that will take some benchmarking and
the result may not be the same on all kernels this fix needs
to be backported to.  So I am avoiding that for now.

Filesystems with snapshots such as nilfs and btrfs do something
similar.  But as the directory tree of the snapshots are disjoint
from one another and from the main directory tree rename won't move
things between them and this problem will not occur.

Cc: stable@vger.kernel.org
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Fixes: 397d425dc26d ("vfs: Test for and handle paths that are unreachable from their mnt_root")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---

Al do you want to push this one to Linus or shall I?

 fs/namei.c         | 5 +++--
 fs/nfs/super.c     | 2 ++
 include/linux/fs.h | 1 +
 3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 921ae32dbc80..cafa365eeb70 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -559,9 +559,10 @@ static int __nd_alloc_stack(struct nameidata *nd)
 static bool path_connected(const struct path *path)
 {
 	struct vfsmount *mnt = path->mnt;
+	struct super_block *sb = mnt->mnt_sb;
 
-	/* Only bind mounts can have disconnected paths */
-	if (mnt->mnt_root == mnt->mnt_sb->s_root)
+	/* Bind mounts and multi-root filesystems can have disconnected paths */
+	if (!(sb->s_iflags & SB_I_MULTIROOT) && (mnt->mnt_root == sb->s_root))
 		return true;
 
 	return is_subdir(path->dentry, mnt->mnt_root);
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 29bacdc56f6a..5e470e233c83 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -2631,6 +2631,8 @@ struct dentry *nfs_fs_mount_common(struct nfs_server *server,
 		/* initial superblock/root creation */
 		mount_info->fill_super(s, mount_info);
 		nfs_get_cache_cookie(s, mount_info->parsed, mount_info->cloned);
+		if (!(server->flags & NFS_MOUNT_UNSHARED))
+			s->s_iflags |= SB_I_MULTIROOT;
 	}
 
 	mntroot = nfs_get_root(s, mount_info->mntfh, dev_name);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2a815560fda0..0430e03febaa 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1317,6 +1317,7 @@ extern int send_sigurg(struct fown_struct *fown);
 #define SB_I_CGROUPWB	0x00000001	/* cgroup-aware writeback enabled */
 #define SB_I_NOEXEC	0x00000002	/* Ignore executables on this fs */
 #define SB_I_NODEV	0x00000004	/* Ignore devices on this fs */
+#define SB_I_MULTIROOT	0x00000008	/* Multiple roots to the dentry tree */
 
 /* sb->s_iflags to limit user namespace mounts */
 #define SB_I_USERNS_VISIBLE		0x00000010 /* fstype already mounted */
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH] fs: Teach path_connected to handle nfs filesystems with multiple roots.
  2018-03-14 23:20                                     ` [PATCH] fs: Teach path_connected to handle nfs filesystems with multiple roots Eric W. Biederman
@ 2018-03-15 22:34                                       ` Al Viro
  0 siblings, 0 replies; 46+ messages in thread
From: Al Viro @ 2018-03-15 22:34 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: John Ogness, Linus Torvalds, linux-fsdevel, Christoph Hellwig,
	Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior,
	Linux Kernel Mailing List

On Wed, Mar 14, 2018 at 06:20:29PM -0500, Eric W. Biederman wrote:
> 
> On nfsv2 and nfsv3 the nfs server can export subsets of the same
> filesystem and report the same filesystem identifier, so that the nfs
> client can know they are the same filesystem.  The subsets can be from
> disjoint directory trees.  The nfsv2 and nfsv3 filesystems provides no
> way to find the common root of all directory trees exported form the
> server with the same filesystem identifier.
> 
> The practical result is that in struct super s_root for nfs s_root is
> not necessarily the root of the filesystem.  The nfs mount code sets
> s_root to the root of the first subset of the nfs filesystem that the
> kernel mounts.
> 
> This effects the dcache invalidation code in generic_shutdown_super
> currently called shrunk_dcache_for_umount and that code for years
> has gone through an additional list of dentries that might be dentry
> trees that need to be freed to accomodate nfs.
> 
> When I wrote path_connected I did not realize nfs was so special, and
> it's hueristic for avoiding calling is_subdir can fail.
> 
> The practical case where this fails is when there is a move of a
> directory from the subtree exposed by one nfs mount to the subtree
> exposed by another nfs mount.  This move can happen either locally or
> remotely.  With the remote case requiring that the move directory be cached
> before the move and that after the move someone walks the path
> to where the move directory now exists and in so doing causes the
> already cached directory to be moved in the dcache through the magic
> of d_splice_alias.
> 
> If someone whose working directory is in the move directory or a
> subdirectory and now starts calling .. from the initial mount of nfs
> (where s_root == mnt_root), then path_connected as a heuristic will
> not bother with the is_subdir check.  As s_root really is not the root
> of the nfs filesystem this heuristic is wrong, and the path may
> actually not be connected and path_connected can fail.
> 
> The is_subdir function might be cheap enough that we can call it
> unconditionally.  Verifying that will take some benchmarking and
> the result may not be the same on all kernels this fix needs
> to be backported to.  So I am avoiding that for now.
> 
> Filesystems with snapshots such as nilfs and btrfs do something
> similar.  But as the directory tree of the snapshots are disjoint
> from one another and from the main directory tree rename won't move
> things between them and this problem will not occur.
> 
> Cc: stable@vger.kernel.org
> Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
> Fixes: 397d425dc26d ("vfs: Test for and handle paths that are unreachable from their mnt_root")
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
> 
> Al do you want to push this one to Linus or shall I?

Applied; I think there might be a helper lurking in there, but for now
that'll do.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: dcache: remove trylock loops (was Re: [BUG] lock_parent() breakage when used from shrink_dentry_list())
  2018-03-13  1:12                           ` NeilBrown
@ 2018-04-28  0:10                             ` Al Viro
  0 siblings, 0 replies; 46+ messages in thread
From: Al Viro @ 2018-04-28  0:10 UTC (permalink / raw)
  To: NeilBrown
  Cc: Linus Torvalds, linux-fsdevel, Christoph Hellwig,
	Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior,
	Linux Kernel Mailing List, Eric Biederman

On Tue, Mar 13, 2018 at 12:12:51PM +1100, NeilBrown wrote:

> >	     * selinux inode_doinit_with_dentry() (two call sites, very alike)
> 
> I'm less sure about this one, but I think it probably wants
> d_find_any_alias() as well.  Like cap_inode_getsecurity() it only wants
> a dentry so that it can pass something to __vfs_getxattr(),
> and that only wants a dentry so it can pass something to ->get.
> 
> Possibly we should rename d_find_alias() to d_find_hashed_alias() so that
> people need to make a conscious choice between d_find_hashed_alias() and
> d_find_any_alias() ??

FWIW, it *is* a bug; this
                        /*
                         * this is can be hit on boot when a file is accessed
                         * before the policy is loaded.  When we load policy we
                         * may find inodes that have no dentry on the
                         * sbsec->isec_head list.  No reason to complain as these
                         * will get fixed up the next time we go through
                         * inode_doinit with a dentry, before these inodes could
                         * be used again by userspace.
                         */
in selinux/hooks.c is flat-out wrong now.  Sure, if you first load selinux
policy after exporting something over NFS or letting attacker play with
open-by-fhandle, you are past any help, but still...

I disagree about going for d_find_any_alias() from the very beginning, BTW -
we need to try it in case of d_find_alias() failure, but sufficiently crappy
filesystem can bloody well fail to access xattrs via disconnected dentry.

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2018-04-28  0:10 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-22 23:50 [PATCH v2 0/6] fs/dcache: avoid trylock loops John Ogness
2018-02-22 23:50 ` [PATCH v2 1/6] fs/dcache: Remove stale comment from dentry_kill() John Ogness
2018-02-22 23:50 ` [PATCH v2 2/6] fs/dcache: Move dentry_kill() below lock_parent() John Ogness
2018-02-22 23:50 ` [PATCH v2 3/6] fs/dcache: Avoid the try_lock loop in d_delete() John Ogness
2018-02-23  2:08   ` Al Viro
2018-02-22 23:50 ` [PATCH v2 4/6] fs/dcache: Avoid the try_lock loops in dentry_kill() John Ogness
2018-02-23  2:22   ` Al Viro
2018-02-23  3:12     ` Al Viro
2018-02-23  3:16       ` Al Viro
2018-02-23  5:46       ` Al Viro
2018-02-22 23:50 ` [PATCH v2 5/6] fs/dcache: Avoid a try_lock loop in shrink_dentry_list() John Ogness
2018-02-23  3:48   ` Al Viro
2018-02-22 23:50 ` [PATCH v2 6/6] fs/dcache: Avoid remaining " John Ogness
2018-02-23  3:58   ` Al Viro
2018-02-23  4:08     ` Al Viro
2018-02-23 13:57       ` John Ogness
2018-02-23 15:09         ` Al Viro
2018-02-23 17:42           ` Al Viro
2018-02-23 20:13             ` [BUG] lock_parent() breakage when used from shrink_dentry_list() (was Re: [PATCH v2 6/6] fs/dcache: Avoid remaining try_lock loop in shrink_dentry_list()) Al Viro
2018-02-23 21:35               ` Linus Torvalds
2018-02-24  0:22                 ` Al Viro
2018-02-25  7:40                   ` Al Viro
2018-02-27  5:16                     ` dcache: remove trylock loops (was Re: [BUG] lock_parent() breakage when used from shrink_dentry_list()) John Ogness
2018-03-12 19:13                       ` Al Viro
2018-03-12 20:05                         ` Al Viro
2018-03-12 20:33                           ` Al Viro
2018-03-13  1:12                           ` NeilBrown
2018-04-28  0:10                             ` Al Viro
2018-03-12 20:23                         ` Eric W. Biederman
2018-03-12 20:39                           ` Al Viro
2018-03-12 23:28                             ` Eric W. Biederman
2018-03-12 23:52                               ` Eric W. Biederman
2018-03-13  0:37                                 ` Al Viro
2018-03-13  0:50                                   ` Al Viro
2018-03-13  4:02                                     ` Eric W. Biederman
2018-03-14 23:20                                     ` [PATCH] fs: Teach path_connected to handle nfs filesystems with multiple roots Eric W. Biederman
2018-03-15 22:34                                       ` Al Viro
2018-03-13  0:36                               ` dcache: remove trylock loops (was Re: [BUG] lock_parent() breakage when used from shrink_dentry_list()) Al Viro
2018-03-12 22:14                         ` Thomas Gleixner
2018-03-13 20:46                         ` John Ogness
2018-03-13 21:05                           ` John Ogness
2018-03-13 23:59                             ` Al Viro
2018-03-14  2:58                               ` Matthew Wilcox
2018-03-14  8:18                               ` John Ogness
2018-03-02  9:04                     ` [BUG] lock_parent() breakage when used from shrink_dentry_list() (was Re: [PATCH v2 6/6] fs/dcache: Avoid remaining try_lock loop in shrink_dentry_list()) Sebastian Andrzej Siewior
2018-02-23  0:59 ` [PATCH v2 0/6] fs/dcache: avoid trylock loops Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).