From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753517AbcFFXvL (ORCPT );
	Mon, 6 Jun 2016 19:51:11 -0400
Received: from mail-pf0-f194.google.com ([209.85.192.194]:33115 "EHLO
	mail-pf0-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751805AbcFFXvH (ORCPT );
	Mon, 6 Jun 2016 19:51:07 -0400
Date: Mon, 6 Jun 2016 16:50:59 -0700 (PDT)
From: Linus Torvalds
X-X-Sender: torvalds@i7
To: Al Viro
cc: Dave Hansen , "Chen, Tim C" , Ingo Molnar ,
	Davidlohr Bueso , "Peter Zijlstra (Intel)" , Jason Low ,
	Michel Lespinasse , "Paul E. McKenney" , Waiman Long , LKML
Subject: Re: performance delta after VFS i_mutex=>i_rwsem conversion
In-Reply-To: <20160606220753.GG14480@ZenIV.linux.org.uk>
Message-ID:
References: <5755D671.9070908@intel.com> <20160606211522.GF14480@ZenIV.linux.org.uk>
	<20160606220753.GG14480@ZenIV.linux.org.uk>
User-Agent: Alpine 2.20 (LFD 67 2015-01-07)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8BIT
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, 6 Jun 2016, Al Viro wrote:
>
> True in general, but here we really do a lot under that ->d_lock - all
> list traversals are under it. So I suspect that contention on nested
> lock is not an issue in that particular load. It's certainly a separate
> commit, so we'll see how much does it give on its own, but I doubt that
> it'll be anywhere near enough.

Hmm. Maybe.  But at least we can try to minimize everything that happens
under the dentry->d_lock spinlock.  So how about this patch?

It's entirely untested, but it rewrites that readdir() function to try
to do the minimum possible under the d_lock spinlock.

I say "rewrite", because it really is totally different. It's not just
that the nested "next" locking is gone, it also treats the cursor very
differently and tries to avoid doing any unnecessary cursor list
operations.

So instead of "list_move()" at the beginning of traversal, and another
for each successful dir_emit(), it just updates a "move target". The
target is stable because it will only ever be a positive directory
entry that we emitted, which is stable due to having the directory
inode held for reading.

So now there is only one final "list_move()" at the end, and it's only
done if we actually emitted a directory entry.

That makes the whole actual loop when we hold the d_lock very tight. So
we hold the d_lock for truly the minimal possible case, namely the
actual list traversal.

And sure, we keep dropping and re-taking it, so we'll get a fair amount
of cacheline ping-pong if we have several CPU's doing this on the same
directory at the same time. But minimizing the work we do inside the
spinlock should hopefully mean that we get very little actual spinning.

And I do want to repeat that the patch is entirely untested. It
compiles. I looked at the assembly it generated. It looks fine to me,
but I might have had a brainfart and done something completely broken.
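Just to make the shape of it explicit, here is a rough userspace sketch
of the same "do the slow part unlocked" pattern that the
dcache_emit_entry() helper in the patch below uses. The pthread spinlock
and the toy_* names are made up purely for illustration, this is not the
kernel API, and like the patch it assumes entries cannot go away while
we walk the list:

/*
 * Userspace analogue of "hold the spinlock only for list traversal,
 * drop it around the (possibly slow) emit callback".
 * Assumes, like the patch, that list entries are pinned for the whole
 * walk (in the kernel: the directory rwsem held at least shared).
 */
#include <pthread.h>
#include <stdbool.h>

struct toy_entry {
	struct toy_entry *next;
	bool positive;		/* analogue of simple_positive() */
	int payload;
};

struct toy_dir {
	pthread_spinlock_t lock;
	struct toy_entry *head;
};

/* Drop the lock around the callback, re-take it before returning. */
static bool toy_emit_one(struct toy_dir *dir, struct toy_entry *e,
			 bool (*emit)(int payload))
{
	bool ok;

	pthread_spin_unlock(&dir->lock);
	ok = emit(e->payload);
	pthread_spin_lock(&dir->lock);
	return ok;
}

static void toy_walk(struct toy_dir *dir, bool (*emit)(int payload))
{
	struct toy_entry *e;

	pthread_spin_lock(&dir->lock);
	for (e = dir->head; e; e = e->next) {
		if (!e->positive)
			continue;
		/* drops and re-takes dir->lock */
		if (!toy_emit_one(dir, e, emit))
			break;
	}
	pthread_spin_unlock(&dir->lock);
}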
              Linus

---
 fs/libfs.c | 55 ++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 32 insertions(+), 23 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index 3db2721144c2..00a0b3a6b23e 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -132,45 +132,54 @@ static inline unsigned char dt_type(struct inode *inode)
 	return (inode->i_mode >> 12) & 15;
 }
 
+static inline int dcache_emit_entry(struct dir_context *ctx, struct dentry *dir, struct dentry *child)
+{
+	int ret;
+	struct inode *inode;
+
+	spin_unlock(&dir->d_lock);
+	inode = d_inode(child);
+	ret = dir_emit(ctx, child->d_name.name, child->d_name.len, inode->i_ino, dt_type(inode));
+	spin_lock(&dir->d_lock);
+	return ret;
+}
+
 /*
  * Directory is locked and all positive dentries in it are safe, since
  * for ramfs-type trees they can't go away without unlink() or rmdir(),
  * both impossible due to the lock on directory.
  */
-
 int dcache_readdir(struct file *file, struct dir_context *ctx)
 {
-	struct dentry *dentry = file->f_path.dentry;
-	struct dentry *cursor = file->private_data;
-	struct list_head *p, *q = &cursor->d_child;
+	struct dentry *dentry, *cursor;
+	struct list_head *entry, *target;
 
 	if (!dir_emit_dots(file, ctx))
 		return 0;
+
+	dentry = file->f_path.dentry;
 	spin_lock(&dentry->d_lock);
-	if (ctx->pos == 2)
-		list_move(q, &dentry->d_subdirs);
-
-	for (p = q->next; p != &dentry->d_subdirs; p = p->next) {
-		struct dentry *next = list_entry(p, struct dentry, d_child);
-		spin_lock_nested(&next->d_lock, DENTRY_D_LOCK_NESTED);
-		if (!simple_positive(next)) {
-			spin_unlock(&next->d_lock);
+
+	cursor = file->private_data;
+	entry = (ctx->pos == 2) ? &dentry->d_subdirs : &cursor->d_child;
+
+	target = NULL;
+	while ((entry = entry->next) != &dentry->d_subdirs) {
+		struct dentry *child = list_entry(entry, struct dentry, d_child);
+
+		if (!simple_positive(child))
 			continue;
-		}
-		spin_unlock(&next->d_lock);
-		spin_unlock(&dentry->d_lock);
-		if (!dir_emit(ctx, next->d_name.name, next->d_name.len,
-			      d_inode(next)->i_ino, dt_type(d_inode(next))))
+		/* This will drop and re-take the dentry lock .. */
+		if (!dcache_emit_entry(ctx, dentry, child))
 			return 0;
-		spin_lock(&dentry->d_lock);
-		spin_lock_nested(&next->d_lock, DENTRY_D_LOCK_NESTED);
-		/* next is still alive */
-		list_move(q, p);
-		spin_unlock(&next->d_lock);
-		p = q;
+
+		/* .. but 'entry' is stable because of the shared rwsem */
+		target = entry;
 		ctx->pos++;
 	}
+	if (target)
+		list_move(&cursor->d_child, target);
 	spin_unlock(&dentry->d_lock);
 	return 0;
 }
 
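As a standalone illustration of the deferred-cursor idea in the patch,
here is the same "remember the last emitted entry, splice the cursor
after it once at the end" shape in plain userspace C. The toy_* names
and the toy circular list are made up for illustration, not the kernel
list API:

/*
 * Sketch: one cursor splice per readdir pass instead of a list_move()
 * per emitted entry.
 */
#include <stdbool.h>
#include <stddef.h>

struct toy_node {
	struct toy_node *prev, *next;
};

/* unlink @e and re-insert it right after @after (like list_move()) */
static void toy_move(struct toy_node *e, struct toy_node *after)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	e->prev = after;
	e->next = after->next;
	after->next->prev = e;
	after->next = e;
}

struct toy_dentry {
	struct toy_node child;		/* link in the parent's list */
	bool positive;
	int ino;
};

#define toy_entry(ptr) \
	((struct toy_dentry *)((char *)(ptr) - offsetof(struct toy_dentry, child)))

static void toy_readdir(struct toy_node *head, struct toy_node *cursor,
			bool resuming, bool (*emit)(int ino))
{
	struct toy_node *pos = resuming ? cursor : head;
	struct toy_node *target = NULL;

	while ((pos = pos->next) != head) {
		if (pos == cursor || !toy_entry(pos)->positive)
			continue;
		if (!emit(toy_entry(pos)->ino))
			break;
		target = pos;		/* last entry we managed to emit */
	}
	if (target)
		toy_move(cursor, target);	/* single splice at the end */
}

A caller would keep 'cursor' linked into the parent's list across calls
(that is what file->private_data holds in the real dcache_readdir code)
and pass resuming on anything but the first pass. Unlike the patch, this
sketch still does the final splice when emit() stops early; the only
point being illustrated is that there is one list move per pass rather
than one per entry.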