From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753517AbcFFXvL (ORCPT );
	Mon, 6 Jun 2016 19:51:11 -0400
Received: from mail-pf0-f194.google.com ([209.85.192.194]:33115 "EHLO
	mail-pf0-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751805AbcFFXvH (ORCPT );
	Mon, 6 Jun 2016 19:51:07 -0400
Date: Mon, 6 Jun 2016 16:50:59 -0700 (PDT)
From: Linus Torvalds
X-X-Sender: torvalds@i7
To: Al Viro
cc: Dave Hansen , "Chen, Tim C" , Ingo Molnar ,
	Davidlohr Bueso , "Peter Zijlstra (Intel)" , Jason Low ,
	Michel Lespinasse , "Paul E. McKenney" , Waiman Long , LKML
Subject: Re: performance delta after VFS i_mutex=>i_rwsem conversion
In-Reply-To: <20160606220753.GG14480@ZenIV.linux.org.uk>
Message-ID:
References: <5755D671.9070908@intel.com> <20160606211522.GF14480@ZenIV.linux.org.uk>
	<20160606220753.GG14480@ZenIV.linux.org.uk>
User-Agent: Alpine 2.20 (LFD 67 2015-01-07)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8BIT
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, 6 Jun 2016, Al Viro wrote:
>
> True in general, but here we really do a lot under that ->d_lock - all
> list traversals are under it. So I suspect that contention on nested
> lock is not an issue in that particular load. It's certainly a separate
> commit, so we'll see how much does it give on its own, but I doubt that
> it'll be anywhere near enough.

Hmm. Maybe.  But at least we can try to minimize everything that happens
under the dentry->d_lock spinlock.  So how about this patch?

It's entirely untested, but it rewrites that readdir() function to try
to do the minimum possible under the d_lock spinlock.

I say "rewrite", because it really is totally different. It's not just
that the nested "next" locking is gone, it also treats the cursor very
differently and tries to avoid doing any unnecessary cursor list
operations.

So instead of "list_move()" at the beginning of traversal, and another
for each successful dir_emit(), it just updates a "move target". The
target is stable because it will only ever be a positive directory
entry that we emitted, which is stable due to having the directory
inode held for reading.

So now there is only one final "list_move()" at the end, and it's only
done if we actually emitted a directory entry.

That makes the whole actual loop when we hold the d_lock very tight. So
we hold the d_lock for truly the minimal possible case, namely the
actual list traversal.

And sure, we keep dropping and re-taking it, so we'll get a fair amount
of cacheline ping-pong if we have several CPU's doing this on the same
directory at the same time. But minimizing the work we do inside the
spinlock should hopefully mean that we get very little actual spinning.

And I do want to repeat that the patch is entirely untested. It
compiles. I looked at the assembly it generated. It looks fine to me,
but I might have had a brainfart and done something completely broken.
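Just to make the shape of it explicit, here is a rough userspace sketch
of the same "do the slow part unlocked" pattern that the
dcache_emit_entry() helper in the patch below uses. The pthread spinlock
and the toy_* names are made up purely for illustration, this is not the
kernel API, and like the patch it assumes entries cannot go away while
we walk the list:

/*
 * Userspace analogue of "hold the spinlock only for list traversal,
 * drop it around the (possibly slow) emit callback".
 * Assumes, like the patch, that list entries are pinned for the whole
 * walk (in the kernel: the directory rwsem held at least shared).
 */
#include <pthread.h>
#include <stdbool.h>

struct toy_entry {
	struct toy_entry *next;
	bool positive;		/* analogue of simple_positive() */
	int payload;
};

struct toy_dir {
	pthread_spinlock_t lock;
	struct toy_entry *head;
};

/* Drop the lock around the callback, re-take it before returning. */
static bool toy_emit_one(struct toy_dir *dir, struct toy_entry *e,
			 bool (*emit)(int payload))
{
	bool ok;

	pthread_spin_unlock(&dir->lock);
	ok = emit(e->payload);
	pthread_spin_lock(&dir->lock);
	return ok;
}

static void toy_walk(struct toy_dir *dir, bool (*emit)(int payload))
{
	struct toy_entry *e;

	pthread_spin_lock(&dir->lock);
	for (e = dir->head; e; e = e->next) {
		if (!e->positive)
			continue;
		/* drops and re-takes dir->lock */
		if (!toy_emit_one(dir, e, emit))
			break;
	}
	pthread_spin_unlock(&dir->lock);
}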
              Linus

---
 fs/libfs.c | 55 ++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 32 insertions(+), 23 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index 3db2721144c2..00a0b3a6b23e 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -132,45 +132,54 @@ static inline unsigned char dt_type(struct inode *inode)
 	return (inode->i_mode >> 12) & 15;
 }
 
+static inline int dcache_emit_entry(struct dir_context *ctx, struct dentry *dir, struct dentry *child)
+{
+	int ret;
+	struct inode *inode;
+
+	spin_unlock(&dir->d_lock);
+	inode = d_inode(child);
+	ret = dir_emit(ctx, child->d_name.name, child->d_name.len, inode->i_ino, dt_type(inode));
+	spin_lock(&dir->d_lock);
+	return ret;
+}
+
 /*
  * Directory is locked and all positive dentries in it are safe, since
  * for ramfs-type trees they can't go away without unlink() or rmdir(),
  * both impossible due to the lock on directory.
  */
-
 int dcache_readdir(struct file *file, struct dir_context *ctx)
 {
-	struct dentry *dentry = file->f_path.dentry;
-	struct dentry *cursor = file->private_data;
-	struct list_head *p, *q = &cursor->d_child;
+	struct dentry *dentry, *cursor;
+	struct list_head *entry, *target;
 
 	if (!dir_emit_dots(file, ctx))
 		return 0;
+
+	dentry = file->f_path.dentry;
 	spin_lock(&dentry->d_lock);
-	if (ctx->pos == 2)
-		list_move(q, &dentry->d_subdirs);
-
-	for (p = q->next; p != &dentry->d_subdirs; p = p->next) {
-		struct dentry *next = list_entry(p, struct dentry, d_child);
-		spin_lock_nested(&next->d_lock, DENTRY_D_LOCK_NESTED);
-		if (!simple_positive(next)) {
-			spin_unlock(&next->d_lock);
+
+	cursor = file->private_data;
+	entry = (ctx->pos == 2) ? &dentry->d_subdirs : &cursor->d_child;
+
+	target = NULL;
+	while ((entry = entry->next) != &dentry->d_subdirs) {
+		struct dentry *child = list_entry(entry, struct dentry, d_child);
+
+		if (!simple_positive(child))
 			continue;
-		}
-		spin_unlock(&next->d_lock);
-		spin_unlock(&dentry->d_lock);
-		if (!dir_emit(ctx, next->d_name.name, next->d_name.len,
-			      d_inode(next)->i_ino, dt_type(d_inode(next))))
+		/* This will drop and re-take the dentry lock .. */
+		if (!dcache_emit_entry(ctx, dentry, child))
 			return 0;
-		spin_lock(&dentry->d_lock);
-		spin_lock_nested(&next->d_lock, DENTRY_D_LOCK_NESTED);
-		/* next is still alive */
-		list_move(q, p);
-		spin_unlock(&next->d_lock);
-		p = q;
+
+		/* .. but 'entry' is stable because of the shared rwsem */
+		target = entry;
 		ctx->pos++;
 	}
+	if (target)
+		list_move(&cursor->d_child, target);
 	spin_unlock(&dentry->d_lock);
 	return 0;
 }
 
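As a standalone illustration of the deferred-cursor idea in the patch,
here is the same "remember the last emitted entry, splice the cursor
after it once at the end" shape in plain userspace C. The toy_* names
and the toy circular list are made up for illustration, not the kernel
list API:

/*
 * Sketch: one cursor splice per readdir pass instead of a list_move()
 * per emitted entry.
 */
#include <stdbool.h>
#include <stddef.h>

struct toy_node {
	struct toy_node *prev, *next;
};

/* unlink @e and re-insert it right after @after (like list_move()) */
static void toy_move(struct toy_node *e, struct toy_node *after)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	e->prev = after;
	e->next = after->next;
	after->next->prev = e;
	after->next = e;
}

struct toy_dentry {
	struct toy_node child;		/* link in the parent's list */
	bool positive;
	int ino;
};

#define toy_entry(ptr) \
	((struct toy_dentry *)((char *)(ptr) - offsetof(struct toy_dentry, child)))

static void toy_readdir(struct toy_node *head, struct toy_node *cursor,
			bool resuming, bool (*emit)(int ino))
{
	struct toy_node *pos = resuming ? cursor : head;
	struct toy_node *target = NULL;

	while ((pos = pos->next) != head) {
		if (pos == cursor || !toy_entry(pos)->positive)
			continue;
		if (!emit(toy_entry(pos)->ino))
			break;
		target = pos;		/* last entry we managed to emit */
	}
	if (target)
		toy_move(cursor, target);	/* single splice at the end */
}

A caller would keep 'cursor' linked into the parent's list across calls
(that is what file->private_data holds in the real dcache_readdir code)
and pass resuming on anything but the first pass. Unlike the patch, this
sketch still does the final splice when emit() stops early; the only
point being illustrated is that there is one list move per pass rather
than one per entry.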